Aquileo | R - yoshidk6’s blog2024-12-09T10:06:17+09:00yoshidk6Hatena::Bloghatenablog://blog/10328749687250563788Aquileo | Simulating parametric survival model with parametric bootstrap to capture uncertaintyhatenablog://entry/260066134857151362020-01-28T00:40:18+09:002020-01-27T15:40:18+09:00I recently released an R package on CRAN calledsurvParamSim for parametric survival simulation, and here want to describe a bit more on details & motivations behind developing this package. Parametric Survival Simulation with Parameter Uncertainty • survParamSim The purpose of survParamSim is to pac…<p>I recently released an R package on CRAN called<code>survParamSim</code> for parametric survival simulation, and here want to describe a bit more on details &amp; motivations behind developing this package.</p> <p><a href="https://yoshidk6.github.io/survParamSim/index.html">Parametric Survival Simulation with Parameter Uncertainty &bull; survParamSim</a></p> <p>The purpose of <code>survParamSim</code> is to package the R functions developed by a great scientist, the late Laurent Claret, and streamline the function definition/APIs to make these widely available to the public. Laurent developed these functions for his pioneering works in applying tumor growth inhibition models for clinical outcome predictions of oncology studies, such as the one below. <iframe src="https://hatenablog-parts.com/embed?url=https%3A%2F%2Fclincancerres.aacrjournals.org%2Fcontent%2Fearly%2F2018%2F05%2F30%2F1078-0432.CCR-17-3662" title="A Model of Overall Survival Predicts Treatment Outcomes with Atezolizumab versus Chemotherapy in Non–Small Cell Lung Cancer Based on Early Tumor Kinetics" class="embed-card embed-webcard" scrolling="no" frameborder="0" style="display: block; width: 100%; height: 155px; max-width: 500px; margin: 10px 0px;"></iframe><cite class="hatena-citation"><a href="https://clincancerres.aacrjournals.org/content/early/2018/05/30/1078-0432.CCR-17-3662">clincancerres.aacrjournals.org</a></cite></p> <p>Explaining the theory behind his works is well beyond the scope of this short blog post, but the key technical highlight is that we are using parametric survival models with various covariates for predicting survival profiles and treatment benefits. In applying these workflows in the context of drug development (or maybe in most of other contexts), we want to make a prediction of not only the mean survival profiles but also the uncertainties associated with model predictions.</p> <p>There are mainly two sources of uncertainty for such survival predictions. The first one is due to the nature of the survival prediction: because survival (or rather failure) is modeled as probabilistic processes, a predicted survival curve for a group will never be identical after multiple simulations. There is another source of uncertainty, which is the uncertainty associated with model parameters, such as coefficients associated with covariates in the model.</p> <p>The first element is fairly straightforward to evaluate - we can run simulations many times and quantify how variable the survival curves can be by looking at prediction intervals. For example, we expect the prediction interval to generally get narrower if we make a prediction with larger number of patients. However, for the second part, there is not a readily available method for exploration to my (limited) knowledge. <code>survival::predict.survreg()</code> has <code>se.fit</code> option, but it is not straightforward to carry the output from this option for making predicted survival profiles of a group.</p> <p>These are the main reasons for putting together this package - to enable the incorporation of above two sources of uncertainty in survival predictions, using parametric models obtained with <code>survival::survreg()</code>. For the second element of the uncertainty, we are using parametric bootstrap from variance-covariance matrix of the model parameter estimates.</p> <h1>Example</h1> <p>The code below is essentially the same as the one on <a href="https://yoshidk6.github.io/survParamSim/">GitHub pages</a>, just showing a simple work flow to give you a sense of what to expect in terms of outputs. <a href="https://yoshidk6.github.io/survParamSim/articles/survParamSim.html">Vignette here</a> has a bit more explanations on the workflow.</p> <p>For this example, we use <code>colon</code> dataset in <code>survival</code> package, where the treatment benefit of Levamisole+5-FU was compared with control (called as Obs). Various covariates were further explored, such as extent of tumor local spread.</p> <p>First, you would run <code>survreg</code> to fit parametric survival model:</p> <pre class="code {r prep}" data-lang="{r prep}" data-unlink>library(tidyverse) library(survival) library(survParamSim) set.seed(12345) # ref for dataset https://vincentarelbundock.github.io/Rdatasets/doc/survival/colon.html colon2 &lt;- as_tibble(colon) %&gt;% # recurrence only and not including Lev alone arm filter(rx != &#34;Lev&#34;, etype == 1) %&gt;% # Same definition as Lin et al, 1994 mutate(rx = factor(rx, levels = c(&#34;Obs&#34;, &#34;Lev+5FU&#34;)), depth = as.numeric(extent &lt;= 2))</pre> <p>Here we have two covariates, <code>node4</code> and <code>depth</code>, in addition to the treatment assignment <code>rx</code>.</p> <pre class="code {r fit}" data-lang="{r fit}" data-unlink>fit.colon &lt;- survreg(Surv(time, status) ~ rx + node4 + depth, data = colon2, dist = &#34;lognormal&#34;)</pre> <p>Next, parametric bootstrap simulation can be run with <code>surv_param_sim()</code> function from this package:</p> <pre class="code {r sim}" data-lang="{r sim}" data-unlink>sim &lt;- surv_param_sim(object = fit.colon, newdata = colon2, censor.dur = c(1800, 3000), # Simulating only 100 times to make the example go fast n.rep = 100)</pre> <p>The output object merely has raw simulated survival data with 100 repetition (defined by <code>n.rep</code> argument). You can find what options you have on this output by simply printing the object.</p> <pre class="code {r}" data-lang="{r}" data-unlink>sim #&gt; ---- Simulated survival data with the following model ---- #&gt; survreg(formula = Surv(time, status) ~ rx + node4 + depth, data = colon2, #&gt; dist = &#34;lognormal&#34;) #&gt; #&gt; * Use `extract_sim()` function to extract individual simulated survivals #&gt; * Use `calc_km_pi()` function to get survival curves and median survival time #&gt; * Use `calc_hr_pi()` function to get hazard ratio #&gt; #&gt; * Settings: #&gt; #simulations: 100 #&gt; #subjects: 619 (without NA in model variables)</pre> <p>To visualize simulated survival curves with uncertainly, you can use <code>calc_km_pi()</code> and <code>plot_km_pi()</code> functions sequentially. You can also specify faceting/grouping with <code>trt</code> and <code>group</code> input arguments.</p> <pre class="code {r km_pi_group}" data-lang="{r km_pi_group}" data-unlink>km.pi &lt;- calc_km_pi(sim, trt = &#34;rx&#34;, group = c(&#34;node4&#34;, &#34;depth&#34;)) km.pi #&gt; ---- Simulated and observed (if calculated) survival curves ---- #&gt; * Use `extract_median_surv()` to extract median survival times #&gt; * Use `extract_km_pi()` to extract prediction intervals of K-M curves #&gt; * Use `plot_km_pi()` to draw survival curves #&gt; #&gt; * Settings: #&gt; trt: rx #&gt; group: node4 #&gt; pi.range: 0.95 #&gt; calc.obs: TRUE plot_km_pi(km.pi) + theme(legend.position = &#34;bottom&#34;) + labs(y = &#34;Recurrence free rate&#34;) + expand_limits(y = 0)</pre> <p><figure class="figure-image figure-image-fotolife" title="Predicted (shaded area) and observed (line) survival curves"><span itemscope itemtype="http://schema.org/Photograph"><img src="https://cdn-ak.f.st-hatena.com/images/fotolife/y/yoshidk6/20200124/20200124163724.png" alt="f:id:yoshidk6:20200124163724p:plain" title="f:id:yoshidk6:20200124163724p:plain" class="hatena-fotolife" itemprop="image"></span><figcaption>Predicted (shaded area) and observed (line) survival curves, per depth and node4 groups</figcaption></figure></p> <p>Similarly, you can visualize hazard ratios and prediction intervals with <code>calc_hr_pi()</code> and <code>plot_hr_pi()</code> functions.</p> <pre class="code {r hr_pi}" data-lang="{r hr_pi}" data-unlink>hr.pi &lt;- calc_hr_pi(sim, trt = &#34;rx&#34;, group = c(&#34;depth&#34;)) hr.pi #&gt; ---- Simulated and observed (if calculated) hazard ratio ---- #&gt; * Use `extract_hr_pi()` to extract prediction intervals and observed HR #&gt; * Use `extract_hr()` to extract individual simulated HRs #&gt; * Use `plot_hr_pi()` to draw histogram of predicted HR #&gt; #&gt; * Settings: #&gt; trt: rx #&gt; (Lev+5FU as test trt, Obs as control) #&gt; group: depth #&gt; pi.range: 0.95 #&gt; calc.obs: TRUE plot_hr_pi(hr.pi)</pre> <p><figure class="figure-image figure-image-fotolife" title="Histogram of predicted hazard ratios, prediction intervals (dashed lines), and observed hazard ratios (red line), per depth groups"><span itemscope itemtype="http://schema.org/Photograph"><img src="https://cdn-ak.f.st-hatena.com/images/fotolife/y/yoshidk6/20200124/20200124163843.png" alt="f:id:yoshidk6:20200124163843p:plain" title="f:id:yoshidk6:20200124163843p:plain" class="hatena-fotolife" itemprop="image"></span><figcaption>Histogram of predicted hazard ratios, prediction intervals (dashed lines), and observed hazard ratios (red line), per depth groups</figcaption></figure></p> yoshidk6Aquileo | Using purrr's map family functions in dplyr::mutatehatenablog://entry/102578461326190051322018-09-05T22:22:48+09:002018-09-05T13:22:48+09:00map family functions of the purrr package are very useful for using non-vectorized functions in dplyr::mutate chain (see GitHub - jennybc/row-oriented-workflows: Row-oriented workflows in R with the tidyverse or https://www.jimhester.com/2018/04/12/vectorize/). I encounter the needs for this especia…<p><code>map</code> family functions of the <code>purrr</code> package are very useful for using non-vectorized functions in <code>dplyr::mutate</code> chain (see <a href="https://github.com/jennybc/row-oriented-workflows">GitHub - jennybc/row-oriented-workflows: Row-oriented workflows in R with the tidyverse</a> or <a href="https://www.jimhester.com/2018/04/12/vectorize/">https://www.jimhester.com/2018/04/12/vectorize/</a>).<br/> I encounter the needs for this especially when dealing with nested data frames.</p> <p>One of the drawbacks is that name/input argument assignments become confusing when you want to use more than two columns of your data frames (and using <code>pmap</code> family) for the function of interest. This post first briefly review how <code>mutate</code> works in combination with <code>map</code> or <code>map2</code>, then provide two approaches to avoid confusions around name assignments when using <code>pmap</code>.</p> <ul class="table-of-contents"> <li><a href="#How-mutate-works-with-vectorized-functions">How mutate works with vectorized functions</a></li> <li><a href="#Non-vectorized-function-with-one-or-two-input-arguments-map-or-map2">Non-vectorized function with one or two input arguments (map or map2)</a></li> <li><a href="#Non-vectorized-function-with-three-or-more-input-arguments-pmap">Non-vectorized function with three or more input arguments (pmap)</a><ul> <li><a href="#Case-example">Case example</a></li> <li><a href="#A-simple-case">A simple case</a></li> <li><a href="#Number-of-columns--Number-of-input-arguments">Number of columns > Number of input arguments</a><ul> <li><a href="#Make-a-small-list-on-the-fly">Make a small list on the fly</a></li> <li><a href="#Use--to-ignore-unused-columns">Use ... to ignore unused columns</a></li> </ul> </li> <li><a href="#Column-names-are-different-from-the-input-argument-names">Column names are different from the input argument names</a></li> </ul> </li> </ul> <h1 id="How-mutate-works-with-vectorized-functions">How <code>mutate</code> works with vectorized functions</h1> <p>In most cases, the processes you want to do in <code>mutate</code> is vectorized and there is no need to use <code>map</code> family function. This works because the output from the function of interest (<code>c</code> in the example below) has the same length as the original data frame, and <code>mutate</code> only need to append one column to the data frame.</p> <pre class="code {r}" data-lang="{r}" data-unlink>library(tidyverse) df0 &lt;- tibble(a = 1:3, b = 4:6) df0 %&gt;% mutate(c = a + b)</pre> <h1 id="Non-vectorized-function-with-one-or-two-input-arguments-map-or-map2">Non-vectorized function with one or two input arguments (<code>map</code> or <code>map2</code>)</h1> <p>Imagine that we want to create a new column containing arithmetic progressions in each row [<a href="https://notchained.hatenablog.com/entry/2017/11/15/212117">ref (in Japanese)</a>]. Since <code>seq</code> function is not vectorized, we cannot directly use this in <code>mutate</code> chain.</p> <pre class="code {r}" data-lang="{r}" data-unlink>df1 &lt;- tibble(a = c(1, 2), b = c(3, 6), c = c(8, 10)) df1 %&gt;% mutate(d = seq(a, b)) # Error in mutate_impl(.data, dots) : Evaluation error: &#39;from&#39; must be of length 1.</pre> <p>Instead, we can use <code>map</code> family function here. <code>map</code> family function take list(s) as input arguments and apply the function of interest using each element of the given lists. Because each column of data frames in R is a list, <code>map</code> works very well in combination.</p> <p>In this example, we want to provide two input arguments to the <code>seq</code> function, <code>from</code> and <code>to</code>. <code>map2</code> is the appropriate function for this.</p> <pre class="code {r}" data-lang="{r}" data-unlink>df2 &lt;- df1 %&gt;% mutate(d = map2(a, b, seq)) as.data.frame(df2) # a b c d #1 1 3 8 1, 2, 3 #2 2 6 10 2, 3, 4, 5, 6</pre> <p>The figure below shows how <code>map</code> function handles this process in <code>mutate</code> chain.</p> <p><span itemscope itemtype="http://schema.org/Photograph"><img src="https://cdn-ak.f.st-hatena.com/images/fotolife/y/yoshidk6/20180806/20180806153334.png" alt="f:id:yoshidk6:20180806153334p:plain" title="f:id:yoshidk6:20180806153334p:plain" class="hatena-fotolife" itemprop="image"></span></p> <p>Like you do with <code>map</code> function outside <code>mutate</code>, we can use <code>map_dbl</code> or <code>map_chr</code> to create columns with <code>double</code> or <code>character</code> types.</p> <p>If we want to explicitly specify names of the argument, <code>.x</code> and <code>.y</code> can be used. See what happens with this:</p> <pre class="code {r}" data-lang="{r}" data-unlink>df2 &lt;- mutate(df1, d = map2(a, b, ~seq(.y, .x))) as.data.frame(df2)</pre> <h1 id="Non-vectorized-function-with-three-or-more-input-arguments-pmap">Non-vectorized function with three or more input arguments (<code>pmap</code>)</h1> <p>Assignment of column names become confusing when using three or more columns, because we don't have shorthand like <code>.x</code> or <code>.y</code> any more. Let's take a look at the following example using <code>rnorm</code> function.</p> <h2 id="Case-example">Case example</h2> <blockquote><p>Generate a list of random numbers for each row with <code>rnorm</code> function. Each row of the original data frame contain different value of <code>mean</code>, <code>sd</code>, <code>n</code>.</p></blockquote> <p>We will first prepare a data frame with columns corresponding to <code>mean</code>, <code>sd</code>, <code>n</code>, and apply <code>rnorm</code> function for each row using <code>pmap</code>. Each element of the new column <code>data</code> contains a vector of random samples <a href="#f-b3737115" name="fn-b3737115" title="In the examples below (and above), we further use as.data.frame function to exposure actual numbers of vectors">*1</a>. This type of structure is called as "nested data frames" and there are many resources on this, such as <a href="http://r4ds.had.co.nz/many-models.html">25 Many models | R for Data Science</a>.</p> <h2 id="A-simple-case">A simple case</h2> <p>If your data frame has <u><span style="color: #d32f2f">the exact same names and numbers of columns</span></u> to the input arguments of the function of interest, a simple syntax like the one below works <a href="#f-76427676" name="fn-76427676" title="This works even if the order of the columns is different from the order of input arguments">*2</a>.</p> <pre class="code {r}" data-lang="{r}" data-unlink>df4 &lt;- tribble(~mean, ~sd, ~n, 1, 0.03, 2, 10, 0.1, 4, 5, 0.1, 4) df4.2 &lt;- df4 %&gt;% mutate(data = pmap(., rnorm)) as.data.frame(df4.2)</pre> <p>One caution is that the syntax like the one below doesn't work. <code>pmap</code> thinks that you are calling <code>rnorm(df4$n, df4$mean, df4$sd)</code> for each row, and each element of the new column contain three random samples from the same list of <code>mean</code> and <code>sd</code>. <a href="#f-1cfadec7" name="fn-1cfadec7" title="This happens because rnorm is actually vectorized. See ?rnorm: If length(n) &gt; 1, the length is taken to be the number required.">*3</a></p> <pre class="code {r}" data-lang="{r}" data-unlink>df4 %&gt;% mutate(data = pmap(., ~rnorm(n, mean, sd))) %&gt;% as.data.frame() # Wrong answer</pre> <h2 id="Number-of-columns--Number-of-input-arguments">Number of columns > Number of input arguments</h2> <p>In most cases, however, you will have more columns than the input arguments. <code>pmap</code> complains in this case, saying that you have unused argument.</p> <pre class="code {r}" data-lang="{r}" data-unlink>df5 &lt;- tribble(~mean, ~sd, ~dummy, ~n, 1, 0.03, &#34;a&#34;, 2, 10, 0.1, &#34;b&#34;, 4, 5, 0.1, &#34;c&#34;, 4) df5 %&gt;% mutate(data = pmap(., rnorm))  # Error</pre> <p>There are two ways to avoid this error.</p> <h3 id="Make-a-small-list-on-the-fly">Make a small list on the fly</h3> <p>The first method is to create a small list that only contains the necessary columns (Ref: <a href="https://community.rstudio.com/t/dplyr-alternatives-to-rowwise/8071/29">Dplyr: Alternatives to rowwise - tidyverse - RStudio Community</a> )</p> <pre class="code {r}" data-lang="{r}" data-unlink>df5.2 &lt;- df5 %&gt;% mutate(data = pmap(list(n=n, mean=mean, sd=sd), rnorm)) as.data.frame(df5.2)</pre> <p>Here, <code>list(n=n, mean=mean, sd=sd)</code> create a new list with three vectors named <code>n</code>, <code>mean</code>, and <code>sd</code>, which serves the same purpose as the <code>df4</code> data frame in the above example.</p> <p>Mind that if you don't give names to the elements of the new list, the order of the list items will be used to associate with input arguments of <code>rnorm</code>. My recommendation is to always assign names to the list elements.</p> <pre class="code {r}" data-lang="{r}" data-unlink>df5 %&gt;% mutate(data = pmap(list(n, mean, sd), rnorm)) # Correct but not recommended df5 %&gt;% mutate(data = pmap(list(mean, sd, n), rnorm)) # Wrong answer</pre> <h3 id="Use--to-ignore-unused-columns">Use <code>...</code> to ignore unused columns</h3> <p>The second method is to absorb unused columns with <code>...</code> (Ref: <a href="https://purrr.tidyverse.org/reference/map2.html">Map over multiple inputs simultaneously. &mdash; map2 &bull; purrr</a>). A syntax like the one below works because <code>pmap</code> automatically associate names of the input list and names in <code>function()</code>. In other word, columns names of the data frame must match the variable names in the <code>function()</code>.</p> <pre class="code {r}" data-lang="{r}" data-unlink>df5.3 &lt;- df5 %&gt;% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n=n, mean=mean, sd=sd))) as.data.frame(df5.3)</pre> <p>Input arguments of <code>function()</code> and <code>rnorm()</code> are <u><span style="color: #d32f2f">not</span></u> automatically associated with names. It is recommended to explicitly associate input argument name for the function of interest (<code>rnorm</code> in this case).</p> <pre class="code {r}" data-lang="{r}" data-unlink>df5 %&gt;% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(n, mean, sd))) # Correct but not recommended df5 %&gt;% mutate(data = pmap(., function(n, mean, sd, ...) rnorm(mean, sd, n))) # Wrong answer df5 %&gt;% mutate(data = pmap(., function(mean, sd, n, ...) rnorm(mean, sd, n))) # Wrong answer</pre> <p>A syntax like the one below gives unexpected outputs, as you saw in the <code>df4</code> example.</p> <pre class="code {r}" data-lang="{r}" data-unlink>df5 %&gt;% mutate(data = pmap(., function(...) rnorm(n=n, mean=mean, sd=sd))) # Wrong answer</pre> <h2 id="Column-names-are-different-from-the-input-argument-names">Column names are different from the input argument names</h2> <p>You can use either of the two approaches above.</p> <pre class="code {r}" data-lang="{r}" data-unlink>df6 &lt;- tribble(~mean1, ~sd1, ~dummy, ~n1, 1, 0.03, &#34;a&#34;, 2, 10, 0.1, &#34;b&#34;, 4, 5, 0.1, &#34;c&#34;, 4) df6.2 &lt;- df6 %&gt;% mutate(data = pmap(list(mean=mean1, sd=sd1, n=n1), rnorm)) as.data.frame(df6.2) df6.3 &lt;- df6 %&gt;% mutate(data = pmap(., function(n1, mean1, sd1, ...) rnorm(n=n1, mean=mean1, sd=sd1))) as.data.frame(df6.3) </pre> <div class="footnote"> <p class="footnote"><a href="#fn-b3737115" name="f-b3737115" class="footnote-number">*1</a><span class="footnote-delimiter">:</span><span class="footnote-text">In the examples below (and above), we further use as.data.frame function to exposure actual numbers of vectors</span></p> <p class="footnote"><a href="#fn-76427676" name="f-76427676" class="footnote-number">*2</a><span class="footnote-delimiter">:</span><span class="footnote-text">This works even if the order of the columns is different from the order of input arguments</span></p> <p class="footnote"><a href="#fn-1cfadec7" name="f-1cfadec7" class="footnote-number">*3</a><span class="footnote-delimiter">:</span><span class="footnote-text">This happens because rnorm is actually vectorized. See ?rnorm: <i>If length(n) > 1, the length is taken to be the number required.</i></span></p> </div>yoshidk6