Testing many models with grouped data

Another example of something I do a lot and forget how to do.

Alice true
2022-03-20

Many models!

I often have a situation where I am testing many hypotheses.

How I tested many models in R in the past was to use lapply or a loop. I don’t think there is any problem with that approach, I just really like using a pattern these days with grouped data using tidyverse packages.

The general pattern is

  1. Make the data long (if not already long)
  2. Group and nest
  3. Mutate to calculate your statistics
  4. Un-nest, filter, or select to get your desired output


There is a great vignette on this topic from the {broom} package.

First, load some packages.

library(tidyr, quietly = TRUE) # manipulating data and nesting
library(dplyr, quietly = TRUE) # general data and piping
library(purrr, quietly = TRUE) # i will use purrr::map
library(broom)                 # very good at reformatting model objects
library(palmerpenguins)        # for more fun data
library(survival)              # for time to event models
library(ggplot2)               # to make our plots
theme_set(theme_minimal(base_family = "Avenir")) # for plot appearance

An example: penguin linear models

Let’s say we are interested in the association between all the numeric variables in the {palmerpenguins} penguin dataset and the species.

If you are not familiar with this dataset, the three penguin species have different features like bill depth, bill length, body mass, and flipper length.

head(penguins, 4)
# A tibble: 4 x 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm
  <fct>   <fct>              <dbl>         <dbl>             <int>
1 Adelie  Torgersen           39.1          18.7               181
2 Adelie  Torgersen           39.5          17.4               186
3 Adelie  Torgersen           40.3          18                 195
4 Adelie  Torgersen           NA            NA                  NA
# … with 3 more variables: body_mass_g <int>, sex <fct>, year <int>

Here, we can see that Gentoo are some big penguins and that Adelie penguins have shorter bill length.

Show code
penguins %>% 
  pivot_longer(cols = where(is.numeric)) %>% 
  mutate(name = stringr::str_replace_all(name, "_", " "),
         name = stringr::str_wrap(name, width = 10)) %>%
  ggplot(aes(x = species, y = value, fill = species)) + 
  geom_boxplot() + 
  scale_fill_manual(values = c("#0E89BA","#85BAA1","#C16E70")) +
  facet_wrap(~name, scales = "free", nrow = 1) + 
  labs(title = "The penguin species are different",
       x = NULL) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
  )

As a reminder, I will follow the same general pattern above

  1. Make the data long - pivot all the numeric variables
  2. Group and nest - group by the variable name
  3. Mutate to calculate your statistics - variable ~ species
  4. Un-nest, filter, or select to get your desired output


Below, I calculated p-values and R-squared values using

  1. stats::lm()
  2. stats::anova()
  3. broom::tidy()
penguins %>% 
  # tidyr functions to select all the numeric columns and 
  # create a `name` and `value` column
  pivot_longer(cols = where(is.numeric)) %>% 
  group_by(name) %>% 
  # tidyr::nest to create a data frame where each level of the 
  # grouped variable has a single row, and all the other
  # rows and columns are now in a single nested column, `data`
  nest() %>% 
  # use purrr::map to create new nested columns with the objects
  # returned from `lm`, `anova`, `broom::tidy`
  mutate(lm_fit = map(data, 
                      ~ lm(value ~ species, data = .x)),
         r2 = map_dbl(lm_fit, ~summary(.x)$r.squared),
         anova = map(lm_fit, anova),
         tidied = map(anova, tidy)) %>% 
  unnest(tidied) %>%
  # this filter removes the rows with "Residuals"
  filter(term == "species") %>%
  select(-data, -lm_fit, -anova) %>% 
  knitr::kable(digits = 3)
name r2 term df sumsq meansq statistic p.value
bill_length_mm 0.708 species 2 7.194317e+03 3597.159 410.600 0.00
bill_depth_mm 0.680 species 2 9.039670e+02 451.984 359.789 0.00
flipper_length_mm 0.778 species 2 5.247328e+04 26236.642 594.802 0.00
body_mass_g 0.670 species 2 1.468642e+08 73432107.078 343.626 0.00
year 0.003 species 2 6.010000e-01 0.300 0.447 0.64

We could also do this with group_modify in dplyr.

From the documentation:

group_map(), group_modify() and group_walk() are purrr-style functions that can be used to iterate on grouped tibbles.

penguins %>% 
  pivot_longer(cols = where(is.numeric)) %>% 
  group_by(name) %>% 
  # there is a litle extra work here to return r.squared
  # group_modify needs the returned value to be a data.frame!
  # so you need to create one
  group_modify( ~cbind(tibble(summary(lm(value ~ species, data = .))$r.squared,
                              .name_repair = ~c("r2")),
                       tidy(anova(lm(value ~ species, data = .))) %>%
                         filter(term == "species"))) %>%
  knitr::kable(digits = 3)
name r2 term df sumsq meansq statistic p.value
bill_depth_mm 0.680 species 2 9.039670e+02 451.984 359.789 0.00
bill_length_mm 0.708 species 2 7.194317e+03 3597.159 410.600 0.00
body_mass_g 0.670 species 2 1.468642e+08 73432107.078 343.626 0.00
flipper_length_mm 0.778 species 2 5.247328e+04 26236.642 594.802 0.00
year 0.003 species 2 6.010000e-01 0.300 0.447 0.64

Another example: survival models

I often work with time to event models (survival models). You can also follow this same pattern.

Take for example the survival::lung dataset that has some variables like age, sex, performance status (ECOG and Karnofsky), etc.

glimpse(lung)
Rows: 228
Columns: 10
$ inst      <dbl> 3, 3, 3, 5, 1, 12, 7, 11, 1, 7, 6, 16, 11, 21, 12,…
$ time      <dbl> 306, 455, 1010, 210, 883, 1022, 310, 361, 218, 166…
$ status    <dbl> 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ age       <dbl> 74, 68, 56, 57, 60, 74, 68, 71, 53, 61, 57, 68, 68…
$ sex       <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
$ ph.ecog   <dbl> 1, 0, 0, 1, 0, 1, 2, 2, 1, 2, 1, 2, 1, NA, 1, 1, 1…
$ ph.karno  <dbl> 90, 90, 90, 90, 100, 50, 70, 60, 70, 70, 80, 70, 9…
$ pat.karno <dbl> 100, 90, 90, 60, 90, 80, 60, 80, 80, 70, 80, 70, 9…
$ meal.cal  <dbl> 1175, 1225, NA, 1150, NA, 513, 384, 538, 825, 271,…
$ wt.loss   <dbl> NA, 15, 15, 11, 0, 0, 10, 1, 16, 34, 27, 23, 5, 32…

We can repeat the same pattern to test cox proportional hazards models for these variables individually in univariate models.

lung %>% 
  # tidyr to make the data long
  pivot_longer(cols = -c(status, time)) %>% 
  group_by(name) %>% 
  # group the data
  nest() %>% 
  # use purrr::map to create new nested columns with the objects
  mutate(cox_fit = map(data, 
                      ~ coxph(Surv(time, status) ~ value, data = .x)),
         tidied = map(cox_fit, tidy, conf.int = TRUE)) %>% 
  unnest(tidied) %>% 
  select(-data, -cox_fit) %>% 
  knitr::kable(digits = 3)
name term estimate std.error statistic p.value conf.low conf.high
inst value -0.010 0.010 -0.942 0.346 -0.030 0.010
age value 0.019 0.009 2.035 0.042 0.001 0.037
sex value -0.531 0.167 -3.176 0.001 -0.859 -0.203
ph.ecog value 0.476 0.113 4.198 0.000 0.254 0.698
ph.karno value -0.016 0.006 -2.810 0.005 -0.028 -0.005
pat.karno value -0.020 0.005 -3.631 0.000 -0.031 -0.009
meal.cal value 0.000 0.000 -0.535 0.593 -0.001 0.000
wt.loss value 0.001 0.006 0.217 0.828 -0.011 0.013

Note: edited on 2022-04-08 to fix mistake with group_modify()

sessionInfo

pander::pander(sessionInfo())

R version 4.0.5 (2021-03-31)

Platform: x86_64-apple-darwin17.0 (64-bit)

locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: ggplot2(v.3.3.5), survival(v.3.2-10), palmerpenguins(v.0.1.0), broom(v.0.7.6), purrr(v.0.3.4), dplyr(v.1.0.5) and tidyr(v.1.1.3)

loaded via a namespace (and not attached): tidyselect(v.1.1.0), xfun(v.0.30), bslib(v.0.2.5.1), pander(v.0.6.3), splines(v.4.0.5), lattice(v.0.20-41), colorspace(v.2.0-0), vctrs(v.0.3.7), generics(v.0.1.0), htmltools(v.0.5.1.1), yaml(v.2.2.1), utf8(v.1.2.1), rlang(v.0.4.10), jquerylib(v.0.1.4), pillar(v.1.6.0), glue(v.1.4.2), withr(v.2.4.2), DBI(v.1.1.1), lifecycle(v.1.0.0), stringr(v.1.4.0), munsell(v.0.5.0), gtable(v.0.3.0), memoise(v.2.0.0), evaluate(v.0.14), labeling(v.0.4.2), knitr(v.1.37), fastmap(v.1.1.0), fansi(v.0.4.2), highr(v.0.9), Rcpp(v.1.0.7), backports(v.1.2.1), scales(v.1.1.1), cachem(v.1.0.5), jsonlite(v.1.7.2), farver(v.2.1.0), distill(v.1.3), digest(v.0.6.29), stringi(v.1.5.3), grid(v.4.0.5), cli(v.3.1.0), tools(v.4.0.5), magrittr(v.2.0.1), sass(v.0.4.0), tibble(v.3.1.0), crayon(v.1.4.1), pkgconfig(v.2.0.3), downlit(v.0.4.0), ellipsis(v.0.3.1), Matrix(v.1.3-2), assertthat(v.0.2.1), rmarkdown(v.2.11), rstudioapi(v.0.13), R6(v.2.5.0) and compiler(v.4.0.5)

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Citation

For attribution, please cite this work as

Alice (2022, March 20). Alice Walsh: Testing many models with grouped data. Retrieved from https://awalsh17.github.io/posts/2022-03-20-more-models/

BibTeX citation

@misc{alice2022testing,
  author = {Alice, },
  title = {Alice Walsh: Testing many models with grouped data},
  url = {https://awalsh17.github.io/posts/2022-03-20-more-models/},
  year = {2022}
}