Alice Walsh: Which pharma companies tweet about data science?

Alice Walsh

I am interested in the rise of data science across all industries and the growing demand for data-intensive work. In particular, I am interested in pharmaceutical development (where I work) for a couple of reasons:

Data-intensive work is not “new” to pharmaceutical development. Biostats teams are essential to designing clinical trials, analyzing them, and reporting the results. Likewise, research teams in preclinical R&D have generated and processed large data sets for many years.
I get a lot of ads, tweets, etc. about the potential of AI/ML/Data for drug discovery and development. I am in the optimistic camp when it comes to how better technology and methods can unlock scientific breakthroughs. But I am also wary of unrealistic expectations and the abuse of “AI” to seem innovative or as a marketing tool.

Here I looked at Twitter data as a publicly available data source to confirm or reject my hypotheses around:

Which companies are promoting AI/ML/Data as part of their marketing
Whether the use of data science terms has increased in the last few years

Methods

I used the great {rtweet} package to collect recent tweets from companies’ official Twitter accounts. It had been several years since I last used this package, but thankfully my old code worked fine. I recommend Will Chase’s post on this topic as an intro; I found it very helpful when I got started.

Most companies have multiple Twitter accounts. I tried to be fair and use each company’s “main” account, but sometimes there was also a “science at xyz” account that seemed interesting, so I looked at those too.

I chose a list of top pharma companies based on sales and found their Twitter handles manually.

I also added a representative public biotech, Recursion, to serve as a comparison for a company that I expected to have a lot of content about data and machine learning.

library(rtweet)  # get_timeline()
library(dplyr)   # %>%, mutate(), case_when(), ...

ph_usernames <- c(
  "janssenglobal",
  "pfizer", 
  "roche",
  "abbvie",
  "novartis", "novartisscience",
  "merck",
  "bmsnews", "scienceatbms",
  "gsk",
  "sanofi", "sanofiscience",
  "astrazeneca",
  "takedapharma",
  "lillypad",
  "recursionpharma")

I retrieved the last 3,200 tweets from each account.

# the Twitter API returns at most the 3,200 most recent tweets per account
ph_tweets <- get_timeline(ph_usernames, n = 3200)

This is a lot of tweets, but some are retweets.

nrow(ph_tweets)

[1] 47977

I combined the tweet data for companies where I queried both the “main” account and the “science” account.

ph_tweets <- ph_tweets %>% 
  mutate(screen_name = case_when(
    screen_name %in% c("bmsnews","ScienceAtBMS") ~ "bmsnews|ScienceAtBMS",
    screen_name %in% c("Novartis","NovartisScience") ~ "Novartis|NovartisScience",
    screen_name %in% c("sanofi", "SanofiScience") ~ "Sanofi|SanofiScience",
    TRUE ~ screen_name
  ))

Who tweeted about data science the most?

I can calculate the number of tweets per company that contain certain words.

count_tweets_by_word <- function(data, search_words) {
  data %>% 
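    # keep original tweets only (drop retweets)
    filter(!is_retweet) %>%
    # strip t.co links and HTML-escaped ampersands before matching
    mutate(text_filt = 
             stringr::str_replace_all(text, "https?://t.co/[A-Za-z\\d]+|&amp;", "")) %>% 
    # flag tweets that match any of the search patterns
    mutate(tweet_pos = stringr::str_detect(text_filt, 
                                           paste(search_words, collapse = "|"))) %>% 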
    group_by(screen_name) %>% 
    summarise(n_pos = sum(tweet_pos),
              n_neg = sum(!tweet_pos), 
              .groups = "drop") %>%
    mutate(perc_pos = scales::percent(n_pos / (n_pos + n_neg))) %>% 
    arrange(desc(n_pos / (n_pos + n_neg)))
}
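
To make the counting logic concrete, here is a minimal base-R sketch of the same four steps on a made-up data frame. The accounts ("acme", "globex") and the tweet text are invented for illustration, and base functions (`gsub`, `grepl`, `tapply`) stand in for the dplyr and stringr calls used in the function above.

```r
# Toy data: invented accounts and tweets, for illustration only
toy <- data.frame(
  screen_name = c("acme", "acme", "acme", "globex", "globex", "globex"),
  text = c("New data from our trial https://t.co/abc123",
           "Meet our team",
           "RT: Data matters",
           "Big Data &amp; AI",
           "Hello world",
           "Good morning"),
  is_retweet = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# 1. drop retweets
orig <- toy[!toy$is_retweet, ]
# 2. strip t.co links and HTML-escaped ampersands
orig$text_filt <- gsub("https?://t.co/[A-Za-z0-9]+|&amp;", "", orig$text)
# 3. flag tweets matching any search pattern (patterns are joined with "|")
pattern <- paste(c("[Dd]ata"), collapse = "|")
orig$tweet_pos <- grepl(pattern, orig$text_filt)
# 4. count matching and non-matching tweets per account
n_pos <- tapply(orig$tweet_pos, orig$screen_name, sum)
n_all <- tapply(orig$tweet_pos, orig$screen_name, length)
result <- data.frame(screen_name = names(n_pos),
                     n_pos = as.vector(n_pos),
                     n_neg = as.vector(n_all - n_pos),
                     share = as.vector(n_pos / n_all),
                     stringsAsFactors = FALSE)
# sort by share of matching tweets, highest first
result <- result[order(-result$share), ]
print(result)
```

Because retweets are dropped first, the denominator is each account's original tweets only, so an account that mostly retweets can show a high percentage from very few tweets.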

Let’s test with “data”.

count_tweets_by_word(ph_tweets, search_words = "[Dd]ata") %>% 
  DT::datatable(rownames = FALSE,
                colnames = c("Twitter Handle", "Tweets w/ words",
                             "Tweets w/o words", "Percent w/ words"))

We can get more specific and count tweets with the key words “data science”, “AI”, and “machine learning”. I went ahead and included “statistics,” but sadly, there were very few tweets containing that word.

search_words <- c("[Dd]ata [Ss]cience",
                  "[Mm]achine [Ll]earning",
                  "[ #]AI[^(DAB)]",
                  "[Ss]tatistics")

count_tweets_by_word(ph_tweets, search_words = search_words) %>% 
  DT::datatable(rownames = FALSE,
                colnames = c("Twitter Handle", "Tweets w/ words",
                             "Tweets w/o words", "Percent w/ words"))

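The “AI” pattern is the least obvious of the four, so here is a small sketch of what it does and does not match. The example strings are invented (two borrow phrases from the tweets quoted in this post), and base `grepl` is used instead of stringr, on the assumption that both treat this simple pattern the same way.

```r
ai_pattern <- "[ #]AI[^(DAB)]"

# Invented example strings, for illustration only
examples <- c(
  hashtag  = "Leading the charge in #AI with a new tool",
  sentence = "combining CRISPR and AI.",
  aids     = "World AIDS Day is tomorrow",
  inside   = "Managing chronic PAIN",
  at_end   = "We are investing in AI"
)

matches <- grepl(ai_pattern, examples)
names(matches) <- names(examples)
print(matches)
```

Two things follow from how the pattern is written. `[^(DAB)]` is a character class, so it rejects any single one of `(`, `D`, `A`, `B`, `)` after “AI” rather than the sequence “DAB”. It also requires some character after “AI”, so a tweet whose text ends in “AI” is not counted.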
Let’s check out some example tweets. Pfizer only had two data science tweets in our search:

We’re leading the charge in #AI with the development of EstimATTR. This tool aims to educate healthcare providers about combinations of cardiac and non-cardiac conditions known to be associated with wild-type ATTR-CM.
— Pfizer Inc. (@pfizer) January 9, 2021

AstraZeneca had 46 tweets. Let’s check out their most recent tweet about machine learning:

Machine learning allows us to identify new targets for novel medicines. Our scientists are applying these methods and accelerating cancer drug discovery by combining CRISPR and AI. #ICML2022 https://t.co/KXemq7ytc4 pic.twitter.com/IwdxSoLa6h
— AstraZeneca (@AstraZeneca) July 17, 2022

Overall, some companies tweet about data science more than others! However, both the number of these tweets and their share of all tweets are tiny.

For AstraZeneca (the top tweeter), what were the top words they used overall?

plot_top_words("astrazeneca")
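
The `plot_top_words()` helper is not shown in this post. As a rough, hypothetical sketch of the word-counting step that such a plot would need, the following uses base R only. The function name `top_words_sketch`, the tiny stop-word list, and the example tweets are all invented here and are not the code behind the plot.

```r
# Hypothetical helper: count the most frequent words in a character vector
top_words_sketch <- function(text, n = 3,
                             stop_words = c("the", "a", "and", "of", "to",
                                            "in", "our", "for")) {
  # strip t.co links and HTML-escaped ampersands, as elsewhere in the post
  text <- gsub("https?://t.co/[A-Za-z0-9]+|&amp;", "", text)
  # lower-case, then split on anything that is not a letter, digit, # or @
  words <- unlist(strsplit(tolower(text), "[^a-z0-9#@]+"))
  # drop empty strings and stop words
  words <- words[nchar(words) > 0 & !(words %in% stop_words)]
  # frequency table, most common first
  counts <- sort(table(words), decreasing = TRUE)
  head(counts, n)
}

# Invented tweets, for illustration only
tweets <- c("Science for patients",
            "Our science and our patients",
            "New science in oncology https://t.co/abc123")
top <- top_words_sketch(tweets, n = 2)
print(top)
```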

When did everyone start tweeting about data science?

I wanted to look at trends over time, but I guessed this would not be feasible because the Twitter API returns only the most recent 3,200 tweets per account.

However, the oldest tweets returned are from 2014, so I decided to do a quick look. This is a bit lazy as the earliest tweet available from each company varies because they tweet at different frequencies: Recursion and Novartis less frequently and Janssen more frequently.
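
One way to see how uneven the coverage is would be to list the earliest available tweet per account. This is a minimal sketch on invented data: the account names and dates are made up, and plain dates are used for brevity where the real `created_at` column is a date-time.

```r
# Toy data: one frequent and one infrequent tweeter (invented)
toy <- data.frame(
  screen_name = c("frequent", "frequent", "frequent", "rare", "rare"),
  created_at = as.Date(c("2021-03-01", "2021-09-15", "2022-06-30",
                         "2014-05-20", "2022-01-10")),
  stringsAsFactors = FALSE
)

# earliest tweet and number of tweets retrieved per account
by_account <- split(toy$created_at, toy$screen_name)
coverage <- data.frame(
  screen_name = names(by_account),
  first_tweet = sapply(by_account, function(x) format(min(x))),
  n_tweets = sapply(by_account, length),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(coverage)
```

With a fixed cap on tweets per account, the frequent tweeter's history starts much later, so early years are represented mostly by accounts that tweet rarely.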

Considering the big pharma companies, the first appearance of the terms “data science”, “machine learning”, “AI”, or “statistics” was in 2017.

The trend for these terms was increasing until 2020. I can speculate that there may have been a lot of COVID-related tweets in 2020 and 2021, but the data science tweets are back more than ever before in 2022!


ph_tweets %>% 
  # filter for our interesting tweets
  filter(screen_name != "RecursionPharma") %>%
  mutate(text_filt = 
           stringr::str_replace_all(text, "https?://t.co/[A-Za-z\\d]+|&amp;", "")) %>% 
  mutate(text_pos = stringr::str_detect(text_filt, 
                                        paste(search_words, collapse = "|"))) %>%
  # aggregate by year
  group_by(text_pos,
           year = lubridate::floor_date(created_at, unit = "year")) %>% 
  summarise(tweets = n(),
            .groups = "drop") %>%
  tidyr::complete(text_pos, year, 
                  fill = list(tweets = 0)) %>%
  group_by(year) %>% 
  summarise(tweets = tweets[text_pos] / sum(tweets),
            .groups = "drop") %>% 
  mutate(year = as.numeric(substr(as.character(year), 1, 4))) %>%
  filter(year > 2016) %>% 
  ggplot(aes(x = factor(year), y = tweets)) + 
  geom_col() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(title = "Prevalence of tweets about data science",
       subtitle = "Recent tweets from 12 pharma companies",
       y = "Percent of all tweets", x = NULL)
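
The yearly percentage computed above is the number of matching tweets divided by all tweets in that year, which is the same as the mean of the `text_pos` flag within each year. A minimal sketch on invented data shows the arithmetic, again in base R rather than dplyr:

```r
# Toy data: one row per tweet, with the year and whether it matched (invented)
toy <- data.frame(
  year = c(2019, 2019, 2019, 2019, 2020, 2020, 2021, 2021, 2021, 2021),
  text_pos = c(TRUE, FALSE, FALSE, FALSE,
               FALSE, FALSE,
               TRUE, TRUE, FALSE, FALSE)
)

# share of matching tweets per year = mean of the logical flag
share <- tapply(toy$text_pos, toy$year, mean)
print(round(share, 2))
```

In the toy data, 2020 has no matching tweets and gets a share of 0. In the pipeline above, `tidyr::complete()` plays the same role by filling in a zero count for year and flag combinations that never occur.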

Finally, I can break the tweets down by company. These are the results for the companies with more than ten tweets matching my key words.

While AstraZeneca was sending a lot of data science flavored tweets in 2019, there aren’t many tweets on the subject for the last 3 years. Meanwhile, Janssen has picked up the torch and is on track to tweet on data science topics more than anyone this year.

These numbers are very small and so you shouldn’t draw any serious conclusions. I am guessing these changes could be explained by a change of staff on the communications team rather than a major company strategy!


top_cos <- count_tweets_by_word(ph_tweets, 
                                search_words = search_words) %>% 
  filter(n_pos > 10) %>% pull(screen_name)

tweet_by_co <- ph_tweets %>% 
  # filter for our interesting tweets
  filter(screen_name %in% top_cos,
         screen_name != "RecursionPharma") %>%
  mutate(text_filt = 
           stringr::str_replace_all(text, "https?://t.co/[A-Za-z\\d]+|&amp;", "")) %>% 
  filter(stringr::str_detect(text_filt, 
                             paste(search_words, collapse = "|"))) %>%
  # aggregate by year & company
  group_by(
           year = lubridate::floor_date(created_at, unit = "year"),
           screen_name) %>% 
  summarise(tweets = n(),
            .groups = "drop") %>%
  mutate(year = as.numeric(substr(as.character(year), 1, 4))) 

tweet_by_co %>% 
  ggplot(aes(x = year, y = tweets, color = screen_name)) + 
  geom_point(show.legend = FALSE) +
  geom_line(aes(group = screen_name), show.legend = FALSE) + 
  # label each line at its 2022 value, just right of the last point
  geom_text(aes(label = screen_name, y = tweets, color = screen_name), 
            inherit.aes = FALSE,
            data = filter(tweet_by_co, year == 2022),
            x = 2022.1, 
            hjust = 0, check_overlap = TRUE,
            position = position_nudge(y = 0.5),
            family = "Avenir",
            show.legend = FALSE) + 
  scale_x_continuous(expand = expansion(add = c(0.1, 3)),
                     breaks = 2015:2022) + 
  scale_y_continuous(breaks = seq(0, 30, 5)) +
  labs(title = "Number of tweets about data science",
       y = "Tweets", x = NULL) +
  theme(panel.grid.minor.x = element_blank(),
        panel.grid.major.x = element_blank())

What’s next?

This project was a fun thing to look at over a long weekend. It confirmed my suspicions that some companies overall tweet more than others, and that some companies are promoting their data science more than others.

I hope to post more on this topic by looking at job descriptions, LinkedIn, and clinical trial data to examine more deeply the impact of data-intensive work on pharma R&D.

This post was updated 2022-10-07 to improve a couple of plots that I didn’t like and improve the regex.

sessionInfo

pander::pander(sessionInfo())

R version 4.0.5 (2021-03-31)

Platform: x86_64-apple-darwin17.0 (64-bit)

locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: tidytext(v.0.3.1), ggplot2(v.3.3.6), dplyr(v.1.0.9) and rtweet(v.0.7.0)

loaded via a namespace (and not attached): tidyselect(v.1.1.2), xfun(v.0.31), bslib(v.0.2.5.1), pander(v.0.6.3), purrr(v.0.3.4), lattice(v.0.20-41), colorspace(v.2.0-3), vctrs(v.0.4.1), generics(v.0.1.3), htmltools(v.0.5.3), SnowballC(v.0.7.0), yaml(v.2.3.5), utf8(v.1.2.2), rlang(v.1.0.4), jquerylib(v.0.1.4), pillar(v.1.8.0), glue(v.1.6.2), withr(v.2.5.0), DBI(v.1.1.1), lifecycle(v.1.0.1), stringr(v.1.4.0), munsell(v.0.5.0), gtable(v.0.3.0), htmlwidgets(v.1.5.3), memoise(v.2.0.0), evaluate(v.0.15), labeling(v.0.4.2), knitr(v.1.39), fastmap(v.1.1.0), crosstalk(v.1.1.1), curl(v.4.3.2), fansi(v.1.0.3), highr(v.0.9), tokenizers(v.0.2.1), Rcpp(v.1.0.9), openssl(v.2.0.2), scales(v.1.2.0), DT(v.0.18), cachem(v.1.0.5), jsonlite(v.1.8.0), farver(v.2.1.1), distill(v.1.3), askpass(v.1.1), digest(v.0.6.29), stringi(v.1.7.8), grid(v.4.0.5), cli(v.3.3.0), tools(v.4.0.5), magrittr(v.2.0.3), sass(v.0.4.1), tibble(v.3.1.8), janeaustenr(v.0.1.5), crayon(v.1.5.1), tidyr(v.1.2.0), pkgconfig(v.2.0.3), downlit(v.0.4.0), ellipsis(v.0.3.2), Matrix(v.1.3-2), lubridate(v.1.8.0), assertthat(v.0.2.1), rmarkdown(v.2.11), httr(v.1.4.3), rstudioapi(v.0.13), R6(v.2.5.1) and compiler(v.4.0.5)
