Alice Walsh: Testing many models with grouped data

Alice

Many models!

I often have a situation where I am testing many hypotheses.

How I tested many models in R in the past was to use lapply or a loop. I don’t think there is any problem with that approach, I just really like using a pattern these days with grouped data using tidyverse packages.

The general pattern is

Make the data long (if not already long)
Group and nest
Mutate to calculate your statistics
Un-nest, filter, or select to get your desired output

There is a great vignette on this topic from the {broom} package.

First, load some packages.

library(tidyr, quietly = TRUE) # manipulating data and nesting
library(dplyr, quietly = TRUE) # general data and piping
library(purrr, quietly = TRUE) # i will use purrr::map
library(broom)                 # very good at reformatting model objects
library(palmerpenguins)        # for more fun data
library(survival)              # for time to event models
library(ggplot2)               # to make our plots
theme_set(theme_minimal(base_family = "Avenir")) # for plot appearance

An example: penguin linear models

Let’s say we are interested in the association between all the numeric variables in the {palmerpenguins} penguin dataset and the species.

If you are not familiar with this dataset, the three penguin species have different features like bill depth, bill length, body mass, and flipper length.

head(penguins, 4)

# A tibble: 4 x 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm
  <fct>   <fct>              <dbl>         <dbl>             <int>
1 Adelie  Torgersen           39.1          18.7               181
2 Adelie  Torgersen           39.5          17.4               186
3 Adelie  Torgersen           40.3          18                 195
4 Adelie  Torgersen           NA            NA                  NA
# … with 3 more variables: body_mass_g <int>, sex <fct>, year <int>

Here, we can see that Gentoo are some big penguins and that Adelie penguins have shorter bill length.

Show code

penguins %>% 
  pivot_longer(cols = where(is.numeric)) %>% 
  mutate(name = stringr::str_replace_all(name, "_", " "),
         name = stringr::str_wrap(name, width = 10)) %>%
  ggplot(aes(x = species, y = value, fill = species)) + 
  geom_boxplot() + 
  scale_fill_manual(values = c("#0E89BA","#85BAA1","#C16E70")) +
  facet_wrap(~name, scales = "free", nrow = 1) + 
  labs(title = "The penguin species are different",
       x = NULL) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
  )

As a reminder, I will follow the same general pattern above

Make the data long - pivot all the numeric variables
Group and nest - group by the variable name
Mutate to calculate your statistics - variable ~ species
Un-nest, filter, or select to get your desired output

Below, I calculated p-values and R-squared values using

stats::lm()
stats::anova()
broom::tidy()

Note that you could also use broom::glance() to get R-squared

penguins %>% 
  # tidyr functions to select all the numeric columns and 
  # create a `name` and `value` column
  pivot_longer(cols = where(is.numeric)) %>% 
  group_by(name) %>% 
  # tidyr::nest to create a data frame where each level of the 
  # grouped variable has a single row, and all the other
  # rows and columns are now in a single nested column, `data`
  nest() %>% 
  # use purrr::map to create new nested columns with the objects
  # returned from `lm`, `anova`, `broom::tidy`
  mutate(lm_fit = map(data, 
                      ~ lm(value ~ species, data = .x)),
         r2 = map_dbl(lm_fit, ~summary(.x)$r.squared),
         anova = map(lm_fit, anova),
         tidied = map(anova, tidy)) %>% 
  unnest(tidied) %>%
  # this filter removes the rows with "Residuals"
  filter(term == "species") %>%
  select(-data, -lm_fit, -anova) %>% 
  knitr::kable(digits = 3)

name	r2	term	df	sumsq	meansq	statistic	p.value
bill_length_mm	0.708	species	2	7.194317e+03	3597.159	410.600	0.00
bill_depth_mm	0.680	species	2	9.039670e+02	451.984	359.789	0.00
flipper_length_mm	0.778	species	2	5.247328e+04	26236.642	594.802	0.00
body_mass_g	0.670	species	2	1.468642e+08	73432107.078	343.626	0.00
year	0.003	species	2	6.010000e-01	0.300	0.447	0.64

We could also do this with group_modify in dplyr.

From the documentation:

group_map(), group_modify() and group_walk() are purrr-style functions that can be used to iterate on grouped tibbles.

penguins %>% 
  pivot_longer(cols = where(is.numeric)) %>% 
  group_by(name) %>% 
  # there is a litle extra work here to return r.squared
  # group_modify needs the returned value to be a data.frame!
  # so you need to create one
  group_modify( ~cbind(tibble(summary(lm(value ~ species, data = .))$r.squared,
                              .name_repair = ~c("r2")),
                       tidy(anova(lm(value ~ species, data = .))) %>%
                         filter(term == "species"))) %>%
  knitr::kable(digits = 3)

name	r2	term	df	sumsq	meansq	statistic	p.value
bill_depth_mm	0.680	species	2	9.039670e+02	451.984	359.789	0.00
bill_length_mm	0.708	species	2	7.194317e+03	3597.159	410.600	0.00
body_mass_g	0.670	species	2	1.468642e+08	73432107.078	343.626	0.00
flipper_length_mm	0.778	species	2	5.247328e+04	26236.642	594.802	0.00
year	0.003	species	2	6.010000e-01	0.300	0.447	0.64

Another example: survival models

I often work with time to event models (survival models). You can also follow this same pattern.

Take for example the survival::lung dataset that has some variables like age, sex, performance status (ECOG and Karnofsky), etc.

glimpse(lung)

Rows: 228
Columns: 10
$ inst      <dbl> 3, 3, 3, 5, 1, 12, 7, 11, 1, 7, 6, 16, 11, 21, 12,…
$ time      <dbl> 306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166…
$ status    <dbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ age       <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68…
$ sex       <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
$ ph.ecog   <dbl> 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 1, 2, 1, NA, 1, 1, 1…
$ ph.karno  <dbl> 90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 9…
$ pat.karno <dbl> 100, 90, 90, 60, 90, 80, 60, 80, 80, 70, 80, 70, 9…
$ meal.cal  <dbl> 1175, 1225, NA, 1150, NA, 513, 384, 538, 825, 271,…
$ wt.loss   <dbl> NA, 15, 15, 11, 0, 0, 10, 1, 16, 34, 27, 23, 5, 32…

We can repeat the same pattern to test cox proportional hazards models for these variables individually in univariate models.

lung %>% 
  # tidyr to make the data long
  pivot_longer(cols = -c(status, time)) %>% 
  group_by(name) %>% 
  # group the data
  nest() %>% 
  # use purrr::map to create new nested columns with the objects
  mutate(cox_fit = map(data, 
                      ~ coxph(Surv(time, status) ~ value, data = .x)),
         tidied = map(cox_fit, tidy, conf.int = TRUE)) %>% 
  unnest(tidied) %>% 
  select(-data, -cox_fit) %>% 
  knitr::kable(digits = 3)

name	term	estimate	std.error	statistic	p.value	conf.low	conf.high
inst	value	-0.010	0.010	-0.942	0.346	-0.030	0.010
age	value	0.019	0.009	2.035	0.042	0.001	0.037
sex	value	-0.531	0.167	-3.176	0.001	-0.859	-0.203
ph.ecog	value	0.476	0.113	4.198	0.000	0.254	0.698
ph.karno	value	-0.016	0.006	-2.810	0.005	-0.028	-0.005
pat.karno	value	-0.020	0.005	-3.631	0.000	-0.031	-0.009
meal.cal	value	0.000	0.000	-0.535	0.593	-0.001	0.000
wt.loss	value	0.001	0.006	0.217	0.828	-0.011	0.013

You could easily add a mutate() here to calculate p-values adjusted for multiple comparisons.

Note: edited on 2022-04-08 to fix mistake with group_modify()

sessionInfo

pander::pander(sessionInfo())

R version 4.0.5 (2021-03-31)

Platform: x86_64-apple-darwin17.0 (64-bit)

locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: ggplot2(v.3.3.5), survival(v.3.2-10), palmerpenguins(v.0.1.0), broom(v.0.7.6), purrr(v.0.3.4), dplyr(v.1.0.5) and tidyr(v.1.1.3)

loaded via a namespace (and not attached): tidyselect(v.1.1.0), xfun(v.0.30), bslib(v.0.2.5.1), pander(v.0.6.3), splines(v.4.0.5), lattice(v.0.20-41), colorspace(v.2.0-0), vctrs(v.0.3.7), generics(v.0.1.0), htmltools(v.0.5.1.1), yaml(v.2.2.1), utf8(v.1.2.1), rlang(v.0.4.10), jquerylib(v.0.1.4), pillar(v.1.6.0), glue(v.1.4.2), withr(v.2.4.2), DBI(v.1.1.1), lifecycle(v.1.0.0), stringr(v.1.4.0), munsell(v.0.5.0), gtable(v.0.3.0), memoise(v.2.0.0), evaluate(v.0.14), labeling(v.0.4.2), knitr(v.1.37), fastmap(v.1.1.0), fansi(v.0.4.2), highr(v.0.9), Rcpp(v.1.0.7), backports(v.1.2.1), scales(v.1.1.1), cachem(v.1.0.5), jsonlite(v.1.7.2), farver(v.2.1.0), distill(v.1.3), digest(v.0.6.29), stringi(v.1.5.3), grid(v.4.0.5), cli(v.3.1.0), tools(v.4.0.5), magrittr(v.2.0.1), sass(v.0.4.0), tibble(v.3.1.0), crayon(v.1.4.1), pkgconfig(v.2.0.3), downlit(v.0.4.0), ellipsis(v.0.3.1), Matrix(v.1.3-2), assertthat(v.0.2.1), rmarkdown(v.2.11), rstudioapi(v.0.13), R6(v.2.5.0) and compiler(v.4.0.5)

Testing many models with grouped data

Many models!

An example: penguin linear models

Another example: survival models

sessionInfo

Corrections

Citation