Another example of something I do a lot and forget how to do.
I often have a situation where I am testing many hypotheses.
How I tested many models in R in the past was to use lapply
or a loop. I don’t think there is any problem with that approach, I just really like using a pattern these days with grouped data using tidyverse packages.
The general pattern is
There is a great vignette on this topic from the {broom} package.
First, load some packages.
library(tidyr, quietly = TRUE) # manipulating data and nesting
library(dplyr, quietly = TRUE) # general data and piping
library(purrr, quietly = TRUE) # i will use purrr::map
library(broom) # very good at reformatting model objects
library(palmerpenguins) # for more fun data
library(survival) # for time to event models
library(ggplot2) # to make our plots
theme_set(theme_minimal(base_family = "Avenir")) # for plot appearance
Let’s say we are interested in the association between all the numeric variables in the {palmerpenguins} penguin dataset and the species.
If you are not familiar with this dataset, the three penguin species have different features like bill depth, bill length, body mass, and flipper length.
head(penguins, 4)
# A tibble: 4 x 8
species island bill_length_mm bill_depth_mm flipper_length_mm
<fct> <fct> <dbl> <dbl> <int>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
# … with 3 more variables: body_mass_g <int>, sex <fct>, year <int>
Here, we can see that Gentoo are some big penguins and that Adelie penguins have shorter bill length.
penguins %>%
pivot_longer(cols = where(is.numeric)) %>%
mutate(name = stringr::str_replace_all(name, "_", " "),
name = stringr::str_wrap(name, width = 10)) %>%
ggplot(aes(x = species, y = value, fill = species)) +
geom_boxplot() +
scale_fill_manual(values = c("#0E89BA","#85BAA1","#C16E70")) +
facet_wrap(~name, scales = "free", nrow = 1) +
labs(title = "The penguin species are different",
x = NULL) +
theme(
legend.position = "none",
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
)
As a reminder, I will follow the same general pattern above
Below, I calculated p-values and R-squared values using
stats::lm()
stats::anova()
broom::tidy()
penguins %>%
# tidyr functions to select all the numeric columns and
# create a `name` and `value` column
pivot_longer(cols = where(is.numeric)) %>%
group_by(name) %>%
# tidyr::nest to create a data frame where each level of the
# grouped variable has a single row, and all the other
# rows and columns are now in a single nested column, `data`
nest() %>%
# use purrr::map to create new nested columns with the objects
# returned from `lm`, `anova`, `broom::tidy`
mutate(lm_fit = map(data,
~ lm(value ~ species, data = .x)),
r2 = map_dbl(lm_fit, ~summary(.x)$r.squared),
anova = map(lm_fit, anova),
tidied = map(anova, tidy)) %>%
unnest(tidied) %>%
# this filter removes the rows with "Residuals"
filter(term == "species") %>%
select(-data, -lm_fit, -anova) %>%
knitr::kable(digits = 3)
name | r2 | term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|---|---|
bill_length_mm | 0.708 | species | 2 | 7.194317e+03 | 3597.159 | 410.600 | 0.00 |
bill_depth_mm | 0.680 | species | 2 | 9.039670e+02 | 451.984 | 359.789 | 0.00 |
flipper_length_mm | 0.778 | species | 2 | 5.247328e+04 | 26236.642 | 594.802 | 0.00 |
body_mass_g | 0.670 | species | 2 | 1.468642e+08 | 73432107.078 | 343.626 | 0.00 |
year | 0.003 | species | 2 | 6.010000e-01 | 0.300 | 0.447 | 0.64 |
We could also do this with group_modify
in dplyr.
From the documentation:
group_map(), group_modify() and group_walk() are purrr-style functions that can be used to iterate on grouped tibbles.
penguins %>%
pivot_longer(cols = where(is.numeric)) %>%
group_by(name) %>%
# there is a litle extra work here to return r.squared
# group_modify needs the returned value to be a data.frame!
# so you need to create one
group_modify( ~cbind(tibble(summary(lm(value ~ species, data = .))$r.squared,
.name_repair = ~c("r2")),
tidy(anova(lm(value ~ species, data = .))) %>%
filter(term == "species"))) %>%
knitr::kable(digits = 3)
name | r2 | term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|---|---|
bill_depth_mm | 0.680 | species | 2 | 9.039670e+02 | 451.984 | 359.789 | 0.00 |
bill_length_mm | 0.708 | species | 2 | 7.194317e+03 | 3597.159 | 410.600 | 0.00 |
body_mass_g | 0.670 | species | 2 | 1.468642e+08 | 73432107.078 | 343.626 | 0.00 |
flipper_length_mm | 0.778 | species | 2 | 5.247328e+04 | 26236.642 | 594.802 | 0.00 |
year | 0.003 | species | 2 | 6.010000e-01 | 0.300 | 0.447 | 0.64 |
I often work with time to event models (survival models). You can also follow this same pattern.
Take for example the survival::lung
dataset that has some variables like age, sex, performance status (ECOG and Karnofsky), etc.
glimpse(lung)
Rows: 228
Columns: 10
$ inst <dbl> 3, 3, 3, 5, 1, 12, 7, 11, 1, 7, 6, 16, 11, 21, 12,…
$ time <dbl> 306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166…
$ status <dbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ age <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68…
$ sex <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
$ ph.ecog <dbl> 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 1, 2, 1, NA, 1, 1, 1…
$ ph.karno <dbl> 90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 9…
$ pat.karno <dbl> 100, 90, 90, 60, 90, 80, 60, 80, 80, 70, 80, 70, 9…
$ meal.cal <dbl> 1175, 1225, NA, 1150, NA, 513, 384, 538, 825, 271,…
$ wt.loss <dbl> NA, 15, 15, 11, 0, 0, 10, 1, 16, 34, 27, 23, 5, 32…
We can repeat the same pattern to test cox proportional hazards models for these variables individually in univariate models.
lung %>%
# tidyr to make the data long
pivot_longer(cols = -c(status, time)) %>%
group_by(name) %>%
# group the data
nest() %>%
# use purrr::map to create new nested columns with the objects
mutate(cox_fit = map(data,
~ coxph(Surv(time, status) ~ value, data = .x)),
tidied = map(cox_fit, tidy, conf.int = TRUE)) %>%
unnest(tidied) %>%
select(-data, -cox_fit) %>%
knitr::kable(digits = 3)
name | term | estimate | std.error | statistic | p.value | conf.low | conf.high |
---|---|---|---|---|---|---|---|
inst | value | -0.010 | 0.010 | -0.942 | 0.346 | -0.030 | 0.010 |
age | value | 0.019 | 0.009 | 2.035 | 0.042 | 0.001 | 0.037 |
sex | value | -0.531 | 0.167 | -3.176 | 0.001 | -0.859 | -0.203 |
ph.ecog | value | 0.476 | 0.113 | 4.198 | 0.000 | 0.254 | 0.698 |
ph.karno | value | -0.016 | 0.006 | -2.810 | 0.005 | -0.028 | -0.005 |
pat.karno | value | -0.020 | 0.005 | -3.631 | 0.000 | -0.031 | -0.009 |
meal.cal | value | 0.000 | 0.000 | -0.535 | 0.593 | -0.001 | 0.000 |
wt.loss | value | 0.001 | 0.006 | 0.217 | 0.828 | -0.011 | 0.013 |
Note: edited on 2022-04-08 to fix mistake with group_modify()
pander::pander(sessionInfo())
R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: ggplot2(v.3.3.5), survival(v.3.2-10), palmerpenguins(v.0.1.0), broom(v.0.7.6), purrr(v.0.3.4), dplyr(v.1.0.5) and tidyr(v.1.1.3)
loaded via a namespace (and not attached): tidyselect(v.1.1.0), xfun(v.0.30), bslib(v.0.2.5.1), pander(v.0.6.3), splines(v.4.0.5), lattice(v.0.20-41), colorspace(v.2.0-0), vctrs(v.0.3.7), generics(v.0.1.0), htmltools(v.0.5.1.1), yaml(v.2.2.1), utf8(v.1.2.1), rlang(v.0.4.10), jquerylib(v.0.1.4), pillar(v.1.6.0), glue(v.1.4.2), withr(v.2.4.2), DBI(v.1.1.1), lifecycle(v.1.0.0), stringr(v.1.4.0), munsell(v.0.5.0), gtable(v.0.3.0), memoise(v.2.0.0), evaluate(v.0.14), labeling(v.0.4.2), knitr(v.1.37), fastmap(v.1.1.0), fansi(v.0.4.2), highr(v.0.9), Rcpp(v.1.0.7), backports(v.1.2.1), scales(v.1.1.1), cachem(v.1.0.5), jsonlite(v.1.7.2), farver(v.2.1.0), distill(v.1.3), digest(v.0.6.29), stringi(v.1.5.3), grid(v.4.0.5), cli(v.3.1.0), tools(v.4.0.5), magrittr(v.2.0.1), sass(v.0.4.0), tibble(v.3.1.0), crayon(v.1.4.1), pkgconfig(v.2.0.3), downlit(v.0.4.0), ellipsis(v.0.3.1), Matrix(v.1.3-2), assertthat(v.0.2.1), rmarkdown(v.2.11), rstudioapi(v.0.13), R6(v.2.5.0) and compiler(v.4.0.5)
If you see mistakes or want to suggest changes, please create an issue on the source repository.
For attribution, please cite this work as
Alice (2022, March 20). Alice Walsh: Testing many models with grouped data. Retrieved from https://awalsh17.github.io/posts/2022-03-20-more-models/
BibTeX citation
@misc{alice2022testing, author = {Alice, }, title = {Alice Walsh: Testing many models with grouped data}, url = {https://awalsh17.github.io/posts/2022-03-20-more-models/}, year = {2022} }