Clinical Trial Visualizations

Re-analysis of publicly-available clinical trial data

Alice Walsh true
2022-10-02

I work in drug development, where we often discuss two things:

This observation is sometimes referred to as Eroom’s law (Moore’s law backward).

In this post, I look at clinical trial data and make a couple of visualizations.

Background

Many researchers have studied drug development trends and examined possible approaches to increase the probability of success and decrease the time and costs of bringing a good drug to market.

For example:

The overall likelihood of approval (LOA) from Phase I for all developmental candidates was 9.6%, and 11.9% for all indications outside of Oncology.

The overall likelihood of approval (LOA) from Phase I for all developmental candidates over 2011–2020 was 7.9%. On average, it takes 10.5 years for a Phase I asset to progress to regulatory approval. Disease areas with above-average LOAs tend to have the shortest development timelines.

What data is available?

Most of the papers and reports cited above use data from proprietary databases that are not available for free. Or they performed some challenging (or even heroic) data integration.

I wanted to explore what was publicly available and at least re-create the plots in those reports.

How long does each phase of development take?

I will use data from clinicaltrials.gov. There are many ways to interact with this data, such as the API. Some R packages exist to work with this data in R, but I have not tested them.

Here, I downloaded a snapshot of the data from the AACT database, which provides a relational database that downloads data from clinicaltrials.gov daily. These snapshots are relatively large (~1.4 GB).

Load packages

library(dplyr)
library(readr)
library(ggplot2)

# set up plot preferences
theme_set(theme_minimal(base_family = "Avenir Next"))
palette_colors <- c("#c1e7ff", "#6996b3", "#004c6d")

Load clinicaltrials.gov data

I loaded a small set of tables (out of 50 total) that I need from the 2022-08-30 snapshot as a list of data.frames in R.

path_to_data <- "~/AACT/20220830/"
tables_to_read <- rlang::set_names(c("studies", "interventions", "conditions",
                                     "sponsors", "designs"))
db <- lapply(tables_to_read, 
             function(table) read_delim(paste0(path_to_data, table, ".txt"),
                                        delim = "|"))

Filter to the desired studies

The data contained information on over 400,000 studies with the earliest entries dating to 1999.

The code below filters to the studies I am most interested in.

Show code
selected_trials <- 
  db$studies %>% 
  filter(overall_status %in% c("Completed", "Terminated"),
         phase %in% c("Phase 1", "Phase 2", "Phase 3"),
         study_type == "Interventional",
         completion_date >= as.Date("2000-01-01"),
         completion_date <= as.Date("2019-12-31")) %>% 
  select(nct_id, phase, number_of_arms, source, enrollment,
         overall_status, study_type, 
         completion_date, completion_date_type,
         start_date, start_date_type,
         brief_title, official_title,
         results_first_posted_date, study_first_posted_date) %>% 
  semi_join({db$interventions %>%
      filter(intervention_type %in% c("Drug", "Biological"))},
      by = "nct_id") %>%
  inner_join({db$conditions %>% 
      tidyr::nest(conds = c(-nct_id))},
      by = "nct_id") %>% 
  semi_join({db$sponsors %>%
      filter(lead_or_collaborator == "lead" & agency_class == "INDUSTRY")},
      by = "nct_id") %>%
  inner_join({db$designs %>%
      filter(primary_purpose %in% c("Treatment", "Prevention")) %>%
      select(nct_id, intervention_model, primary_purpose)},
      by = "nct_id") %>%
  mutate(completion_days = 
           as.numeric(completion_date - start_date),
         completion_months = completion_days / 30.417) %>% 
  filter(completion_days > 0)

The filtered data is a fraction of the original database, but still contains a large number of trials to further investigate.

n_distinct(selected_trials$nct_id)
[1] 31955

Group trials by therapeutic area

I wanted to split the data by therapeutic area (e.g., cancer, infectious diseases). The database has MeSH terms in the browse_conditions table that are populated by the National Library of Medicine using an algorithm (according to the data dictionary). The conditions table has the data as submitted to clinicaltrials.gov (not drawn from a controlled vocabulary).

MeSH is a controlled vocabulary for medical terms created by the US National Library of Medicine. One can use MeSH RDF to query MeSH terms. Initially, I thought this would be an efficient way to categorize trials by therapeutic area, but I did not end up using that data.

After looking through this data for some time, it seems that the MeSH terms in browse_conditions are sometimes overly liberal and add terms that don’t seem applicable. Therefore, I used the conditions table and came up with a list of terms to search for instead of more systematic filtering by MeSH terms.

terms_tas <- data.frame(
  "therapeutic_area" = c("oncology", "cardiovascular", "metabolic"),
  "pattern" = c(
    "carcinoma|glioblastoma|glioma|tumor|cancer|leukemia|
    leukaemia|lymphoma|melanoma|neoplasm|myeloma|malignancy|metastasis",
    "cardio|artery|ventricular|ischemia|angina|
    aortic|arteriosclerosis|aneurysm|myocardial|hypertension|
    stroke|coronary|heart[^b]|cardiovascular|atrial",
    "obesity|diabetes|metabolic|overweight|insulin"
  )
)

Create some plots

Some notes on the plot:

Show code
# make helper functions to calc and plot
convert_duration_text <- function(days) {
  results <- rep(NA_character_, length(days))
  
  results[days < 31] <- {
    day_label <- ifelse(days[days < 31] == 1, "day", "days")
    paste(days[days < 31], day_label)
  }
  
  results[days >= 31 & days < 365] <- {
    months <- round(days[days >= 31 & days < 365] / 30.417, 1)
    month_label <- ifelse(months == 1, "mo", "mos")
    paste(months, month_label)
  }
  
  results[days >= 365] <- {
    years <- floor(days[days >= 365] / 365.25)
    left_days <- days[days >= 365] %% 365.25
    months <- round(left_days / 30.417)
    year_label <- ifelse(years == 1, "yr", "yrs")
    month_label <- ifelse(months == 1, "mo", "mos")
    paste(years, year_label, months, month_label)
  }
  
  return(results)
}

# Plotting function
plot_time_by_ta <- function(data, therapeutic_area) {
  
  ta_pattern <- terms_tas$pattern[terms_tas$therapeutic_area == therapeutic_area]
  
  filt_data <- data %>%
    tidyr::unnest(conds) %>% 
    filter(grepl(ta_pattern, downcase_name)) %>% 
    distinct(nct_id, phase, completion_date, completion_days,
             completion_months, start_date) %>% 
    filter(!is.na(completion_days)) %>% 
    group_by(phase) %>% 
    mutate(groups = paste0(phase, "\nn = ", n())) %>% 
    ungroup()
  
  filt_medians <- filt_data %>% 
    group_by(groups) %>% 
    summarise(medians = median(completion_months),
              peak_at_med = sum(between(completion_months,
                                        medians - 12, medians)),
              median_days = median(completion_days)) %>% 
    # set the position for the labels
    mutate(x1 = medians + 40, 
           x2 = medians, 
           y1 = max(peak_at_med) * 0.85, 
           y2 = max(peak_at_med) * 0.8, 
           median_label = paste0("median:\n",
                                 convert_duration_text(median_days)))
  
  filt_data %>% 
    ggplot(aes(x = completion_months, fill = groups)) + 
    geom_histogram(position = "identity", alpha = 1, binwidth = 12) + 
    geom_vline(aes(xintercept = medians),
               data = filt_medians,
               linetype="dashed", color = "gray40") +
    geom_text(aes(label = median_label,
                  x = x1,
                  y = y1),
              data = filt_medians, 
              size = 3,
              nudge_x = 3, nudge_y = 3,
              hjust = 0, vjust = 0,
              family = "Avenir Next") +
    geom_curve(aes(x = x1, y = y1, xend = x2, yend = y2),
               data = filt_medians,
               color = "gray40",
               curvature = -0.3,
               arrow = arrow(length = unit(0.05, "npc"), type = "closed")) + 
    scale_fill_manual(values = palette_colors) + 
    facet_wrap(~groups, nrow=1) +
    labs(x = "time in months",
         y = "# of trials", 
         title = "How long do clinical trials take?",
         subtitle = paste(therapeutic_area, "therapeutic area")) + 
    theme(legend.position = "none",
          panel.grid.minor = element_blank(),
          text = element_text(size=11),
          strip.text.x = element_text(size = 13))
}
plot_time_by_ta(selected_trials, "oncology")

plot_time_by_ta(selected_trials, "cardiovascular")

Overall, these numbers align with expectations from other studies like Wong et al. They show in Supplemental Table 17 the median duration for different therapeutic areas. My understanding is that this analysis included all trials (including non-industry sponsored).

Supplemental Table 17 from Wong et al., 2019

Do biomarkers improve the probability of success?

Clinicaltrials.gov data contains information on the inclusion and exclusion criteria for each study. However, it is currently challenging to extract information automatically on trial design, such as the type of biomarkers used.

Wong et al., 2019 presents data on the probability of success of clinical trials conditioned on biomarker use. The authors used citeline data. I do not have access to this data, but the authors provided some summary data in the paper that I can re-plot. I used the aggregated summary data from Table 3.

Table 3. POS of drug development programs with and without biomarkers, using data from January 1, 2005, to October 31, 2015, computed using the phase-by-phase method. These results consider only trials that use biomarkers in patient stratification. Since for the majority of trials using biomarkers (92.3%) their status is observed only on or after January 1, 2005, the choice of the time period is to ensure a fair comparison between trials using and not using biomarkers. SE denotes standard error.

Here, I am just plotting the oncology results.

Some notes on the plot:

Load and plot the data

Show code
wong_t3 <- read_csv("data/wongetal_all_table3.csv")

wong_t3 <- wong_t3 %>% 
  filter(category != "All",
         TA == "Oncology") %>% 
  select(-total_phase_transitions) %>%
  mutate(phase = stringr::str_wrap(phase, width = 10),
         phase = forcats::fct_inorder(phase)) %>% 
  group_by(phase) %>% 
  mutate(percent_increase = paste0("+", POS[2] - POS[1], "%")) %>% 
  ungroup()

wong_t3 %>% 
  ggplot(aes(x = category, y = POS, fill = category)) +
  geom_col()+
  geom_errorbar(
    aes(ymin = POS - SE, ymax = POS + SE), 
    width = 0.7,
    color = "gray40") + 
  geom_curve(
    aes(x = 1, y = `POS_No biomarker` + 7,
        xend = 1.5, yend = `POS_With biomarker`),
    data = tidyr::pivot_wider(
      wong_t3, 
      names_from = category, 
      values_from = c("POS", "SE")),
    inherit.aes = F,
    color = "gray40",
    linetype = "solid",
    size = 0.5,
    curvature = -0.2,
    arrow = arrow(length = unit(0.1, "npc"))) +
  geom_text(
    aes(label = percent_increase, y = POS + SE),
    data = filter(wong_t3, category == "No biomarker"),
    family= "Avenir Next", 
    size = 3.5,
    nudge_y = 3 ) +
  scale_fill_manual(values = palette_colors) +
  facet_wrap(~phase, nrow = 1, strip.position = "bottom") + 
  labs(x = NULL,
       y = "POS (% +/- SE)", 
       fill = NULL) +
  theme(legend.position = c(0.15, 0.85), 
        legend.background = element_rect(fill="white", color = "white"),
        axis.text.x = element_blank(),
        strip.text.x = element_text(size = 11))

Some thoughts

sessionInfo

R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] ggplot2_3.3.6 readr_1.4.0   dplyr_1.0.9  

loaded via a namespace (and not attached):
 [1] highr_0.9         pillar_1.8.0      bslib_0.2.5.1    
 [4] compiler_4.0.5    jquerylib_0.1.4   forcats_0.5.1    
 [7] tools_4.0.5       digest_0.6.29     downlit_0.4.0    
[10] gtable_0.3.0      jsonlite_1.8.0    evaluate_0.15    
[13] memoise_2.0.0     lifecycle_1.0.1   tibble_3.1.8     
[16] pkgconfig_2.0.3   rlang_1.0.4       cli_3.3.0        
[19] DBI_1.1.1         rstudioapi_0.13   distill_1.3      
[22] yaml_2.3.5        xfun_0.31         fastmap_1.1.0    
[25] withr_2.5.0       stringr_1.4.0     knitr_1.39       
[28] generics_0.1.3    vctrs_0.4.1       sass_0.4.0       
[31] hms_1.0.0         grid_4.0.5        tidyselect_1.1.2 
[34] glue_1.6.2        R6_2.5.1          fansi_1.0.3      
[37] rmarkdown_2.11    farver_2.1.1      tidyr_1.2.0      
[40] purrr_0.3.4       magrittr_2.0.3    scales_1.2.0     
[43] htmltools_0.5.1.1 ellipsis_0.3.2    assertthat_0.2.1 
[46] colorspace_2.0-3  labeling_0.4.2    utf8_1.2.2       
[49] stringi_1.7.8     munsell_0.5.0     cachem_1.0.5     
[52] crayon_1.5.1     

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Citation

For attribution, please cite this work as

Walsh (2022, Oct. 2). Alice Walsh: Clinical Trial Visualizations. Retrieved from https://awalsh17.github.io/posts/2022-10-02-replotting-clinical-trials/

BibTeX citation

@misc{walsh2022clinical,
  author = {Walsh, Alice},
  title = {Alice Walsh: Clinical Trial Visualizations},
  url = {https://awalsh17.github.io/posts/2022-10-02-replotting-clinical-trials/},
  year = {2022}
}