Mining some twitter data for trends and differences.
I am interested in the rise of data science across all industries and the growing demand for data-intensive work. In particular, I am interested in pharmaceutical development (where I work) for a couple of reasons:
Here I looked at twitter data as a publicly-available data source to confirm or reject my hypotheses around
I used the great {rtweet} package to collect recent tweets from companies' official Twitter accounts. It had been several years since I last used this package, but thankfully my old code still worked. For an introduction, I recommend Will Chase’s post on this topic, which I found very helpful when I got started.
Most companies have multiple Twitter accounts. I tried to be fair and use each company’s “main” account, but some also had a “science at xyz” account that seemed interesting, so I looked at those too.
I chose a list of top pharma companies based on sales and looked up their Twitter handles manually.
I also added a representative public biotech, Recursion, to serve as a comparison for a company that I expected to have a lot of content about data and machine learning.
ph_usernames <- c(
"janssenglobal",
"pfizer",
"roche",
"abbvie",
"novartis", "novartisscience",
"merck",
"bmsnews", "scienceatbms",
"gsk",
"sanofi", "sanofiscience",
"astrazeneca",
"takedapharma",
"lillypad",
"recursionpharma")
I retrieved up to the most recent 3,200 tweets from each account, which is the most the Twitter API returns for a timeline.
ph_tweets <- get_timeline(ph_usernames, n=3200)
This is a lot of tweets, but some are retweets.
nrow(ph_tweets)
[1] 47977
I combined the tweet data for companies where I queried both the “main” account and the “science” account.
ph_tweets <- ph_tweets %>%
mutate(screen_name = case_when(
screen_name %in% c("bmsnews","ScienceAtBMS") ~ "bmsnews|ScienceAtBMS",
screen_name %in% c("Novartis","NovartisScience") ~ "Novartis|NovartisScience",
screen_name %in% c("sanofi", "SanofiScience") ~ "Sanofi|SanofiScience",
TRUE ~ screen_name
))
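Twitter handles are case-insensitive, but `%in%` is not, so the grouping above depends on the exact capitalization the API returns. Below is a minimal sketch of a case-insensitive variant in base R. The helper `combine_accounts()` and the input vector are made up for illustration and are not part of the original analysis.

```r
# Hypothetical helper: map "main" and "science" handles to one label,
# ignoring capitalization. The labels mirror the case_when() call above.
combine_accounts <- function(screen_name) {
  groups <- list(
    "bmsnews|ScienceAtBMS"     = c("bmsnews", "scienceatbms"),
    "Novartis|NovartisScience" = c("novartis", "novartisscience"),
    "Sanofi|SanofiScience"     = c("sanofi", "sanofiscience")
  )
  out <- screen_name
  for (label in names(groups)) {
    # compare lower-cased handles so "BMSNews" and "bmsnews" are treated alike
    out[tolower(screen_name) %in% groups[[label]]] <- label
  }
  out
}

# Made-up input for illustration only
toy_names <- c("BMSNews", "ScienceAtBMS", "pfizer", "SanofiScience")
combine_accounts(toy_names)
```

Handles that do not belong to any group pass through unchanged, which matches the `TRUE ~ screen_name` fallback.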
I can calculate the number of tweets per company that contain certain words.
count_tweets_by_word <- function(data, search_words) {
data %>%
# no retweets
filter(!is_retweet) %>%
# remove urls
mutate(text_filt =
stringr::str_replace_all(text,"https?://t.co/[A-Za-z\\d]+|&", "")) %>%
mutate(tweet_pos = stringr::str_detect(text_filt,
paste(search_words, collapse = "|"))) %>%
group_by(screen_name) %>%
summarise(n_pos = sum(tweet_pos),
n_neg = sum(!tweet_pos),
.groups = "drop") %>%
mutate(perc_pos = scales::percent(n_pos / (n_pos + n_neg))) %>%
arrange(desc(n_pos / (n_pos + n_neg)))
}
Let’s test with “data”.
We can get more specific and count tweets with the key words “data science”, “AI”, and “machine learning”. I went ahead and included “statistics,” but sadly, there were very few tweets containing that word.
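Short keywords such as “AI” can also match inside longer upper-case words. Below is a sketch of one way to guard against that, assuming `search_words` holds exactly the four terms named here. The example tweets are made up, and base R `grepl()` stands in for `stringr::str_detect()` to keep the snippet self-contained.

```r
# Assumed keyword list, taken from the terms named in the text
search_words <- c("data science", "AI", "machine learning", "statistics")

# Wrap the alternatives in word boundaries so that "AI" does not match
# inside longer words such as "AIDS"
pattern <- paste0("\\b(", paste(search_words, collapse = "|"), ")\\b")

# Made-up tweets for illustration only
toy_text <- c(
  "Combining CRISPR and AI to find new targets",
  "New AIDS research published today",
  "Our machine learning team is hiring",
  "Quarterly results are out"
)

# One logical value per tweet: does it contain any keyword as a whole word?
matches <- grepl(pattern, toy_text, perl = TRUE)
matches
```

The match is still case-sensitive, so “ai” inside ordinary lower-case words is not picked up either way.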
Let’s check out some example tweets. Pfizer only had two data science tweets in our search:
We’re leading the charge in #AI with the development of EstimATTR. This tool aims to educate healthcare providers about combinations of cardiac and non-cardiac conditions known to be associated with wild-type ATTR-CM.
— Pfizer Inc. (@pfizer) January 9, 2021
AstraZeneca had 46 tweets. Let’s check out their most recent tweet about machine learning:
Machine learning allows us to identify new targets for novel medicines. Our scientists are applying these methods and accelerating cancer drug discovery by combining CRISPR and AI. #ICML2022 https://t.co/KXemq7ytc4 pic.twitter.com/IwdxSoLa6h
— AstraZeneca (@AstraZeneca) July 17, 2022
Overall, some companies tweet about data science more than others. However, both the absolute number of these tweets and their share of each company's total are tiny.
For AstraZeneca (the top tweeter), what were the top words they used overall?
plot_top_words("astrazeneca")
I wanted to look at trends over time, but I expected this to be infeasible because the Twitter API returns only the most recent 3,200 tweets per account.
However, the oldest tweets returned are from 2014, so I decided to take a quick look. This is a bit lazy, because the earliest available tweet differs by company: they tweet at different frequencies (Recursion and Novartis less often, Janssen more often), so 3,200 tweets reach back different lengths of time.
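One way to make the comparison less lazy would be to restrict every account to the window they all cover. Below is a minimal base R sketch on made-up data. The data frame `toy` and the names `first_tweet`, `common_start`, and `comparable` are illustrative assumptions, not part of the original analysis.

```r
# Made-up timeline data for illustration only
toy <- data.frame(
  screen_name = c("A", "A", "B", "B", "C"),
  created_at  = as.Date(c("2014-03-01", "2021-06-15",
                          "2019-01-10", "2022-08-01",
                          "2016-11-30"))
)

# Earliest available tweet per account, as "YYYY-MM-DD" strings
first_tweet <- vapply(
  split(toy$created_at, toy$screen_name),
  function(d) format(min(d)),
  character(1)
)

# The latest of those start dates is where every account has coverage
common_start <- max(as.Date(first_tweet))

# Keep only tweets inside the shared window
comparable <- toy[toy$created_at >= common_start, ]
comparable
```

The trade-off is that the shared window is set by the most frequent tweeter, so older tweets from quieter accounts are dropped.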
Considering the big pharma companies, the first appearance of the terms “data science”, “machine learning”, “AI”, or “statistics” was in 2017.
Use of these terms increased until 2020. I speculate that there were a lot of COVID-related tweets in 2020 and 2021, but data science tweets are back in 2022, more than ever before!
ph_tweets %>%
# filter for our interesting tweets
filter(screen_name != "RecursionPharma") %>%
mutate(text_filt =
stringr::str_replace_all(text, "https?://t.co/[A-Za-z\\d]+|&", "")) %>%
mutate(text_pos = stringr::str_detect(text_filt,
paste(search_words, collapse = "|"))) %>%
# aggregate by year
group_by(text_pos,
year = lubridate::floor_date(created_at, unit = "year")) %>%
summarise(tweets = n(),
.groups = "drop") %>%
tidyr::complete(text_pos, year,
fill = list(tweets = 0)) %>%
group_by(year) %>%
summarise(tweets = tweets[text_pos] / sum(tweets),
.groups = "drop") %>%
mutate(year = as.numeric(substr(as.character(year), 1, 4))) %>%
filter(year > 2016) %>%
ggplot(aes(x = factor(year), y = tweets)) +
geom_col() +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Prevalence of tweets about data science",
subtitle = "Recent tweets from 12 pharma companies",
y = "Percent of all tweets", x = NULL)
Finally, I can break the tweets down by company. These are the results for the companies with more than ten tweets containing my key words.
While AstraZeneca was sending a lot of data science flavored tweets in 2019, there aren’t many tweets on the subject for the last 3 years. Meanwhile, Janssen has picked up the torch and is on track to tweet on data science topics more than anyone this year.
These numbers are very small and so you shouldn’t draw any serious conclusions. I am guessing these changes could be explained by a change of staff on the communications team rather than a major company strategy!
top_cos <- count_tweets_by_word(ph_tweets,
search_words = search_words) %>%
filter(n_pos > 10) %>% pull(screen_name)
tweet_by_co <- ph_tweets %>%
# filter for our interesting tweets
filter(screen_name %in% top_cos,
screen_name != "RecursionPharma") %>%
mutate(text_filt =
stringr::str_replace_all(text, "https?://t.co/[A-Za-z\\d]+|&", "")) %>%
filter(stringr::str_detect(text_filt,
paste(search_words, collapse = "|"))) %>%
# aggregate by year & company
group_by(
year = lubridate::floor_date(created_at, unit = "year"),
screen_name) %>%
summarise(tweets = n(),
.groups = "drop") %>%
mutate(year = as.numeric(substr(as.character(year), 1, 4)))
tweet_by_co %>%
ggplot(aes(x = year, y = tweets, color = screen_name)) +
geom_point(show.legend = FALSE) +
geom_line(aes(group = screen_name), show.legend = FALSE) +
geom_text(aes(label = screen_name, y = tweets, color = screen_name),
inherit.aes = FALSE,
# year is numeric here, so compare against a number rather than a string
data = filter(tweet_by_co, year == 2022),
x = 2022.1,
hjust = 0, check_overlap = TRUE,
position = position_nudge(y = 0.5),
family = "Avenir",
show.legend = FALSE) +
scale_x_continuous(expand = expansion(add = c(0.1, 3)),
breaks = 2015:2022) +
scale_y_continuous(breaks = seq(0, 30, 5)) +
labs(title = "Number of tweets about data science",
y = "Tweets", x = NULL) +
theme(panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank())
This project was a fun thing to look at over a long weekend. It confirmed my suspicions that some companies overall tweet more than others, and that some companies are promoting their data science more than others.
I hope to post more on this topic by looking at job descriptions, LinkedIn, and clinical trial data to examine more deeply the impact of data-intensive work on pharma R&D.
This post was updated 2022-10-07 to improve a couple of plots that I didn’t like and improve the regex.
pander::pander(sessionInfo())
R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
locale: en_US.UTF-8||en_US.UTF-8||en_US.UTF-8||C||en_US.UTF-8||en_US.UTF-8
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: tidytext(v.0.3.1), ggplot2(v.3.3.6), dplyr(v.1.0.9) and rtweet(v.0.7.0)
loaded via a namespace (and not attached): tidyselect(v.1.1.2), xfun(v.0.31), bslib(v.0.2.5.1), pander(v.0.6.3), purrr(v.0.3.4), lattice(v.0.20-41), colorspace(v.2.0-3), vctrs(v.0.4.1), generics(v.0.1.3), htmltools(v.0.5.3), SnowballC(v.0.7.0), yaml(v.2.3.5), utf8(v.1.2.2), rlang(v.1.0.4), jquerylib(v.0.1.4), pillar(v.1.8.0), glue(v.1.6.2), withr(v.2.5.0), DBI(v.1.1.1), lifecycle(v.1.0.1), stringr(v.1.4.0), munsell(v.0.5.0), gtable(v.0.3.0), htmlwidgets(v.1.5.3), memoise(v.2.0.0), evaluate(v.0.15), labeling(v.0.4.2), knitr(v.1.39), fastmap(v.1.1.0), crosstalk(v.1.1.1), curl(v.4.3.2), fansi(v.1.0.3), highr(v.0.9), tokenizers(v.0.2.1), Rcpp(v.1.0.9), openssl(v.2.0.2), scales(v.1.2.0), DT(v.0.18), cachem(v.1.0.5), jsonlite(v.1.8.0), farver(v.2.1.1), distill(v.1.3), askpass(v.1.1), digest(v.0.6.29), stringi(v.1.7.8), grid(v.4.0.5), cli(v.3.3.0), tools(v.4.0.5), magrittr(v.2.0.3), sass(v.0.4.1), tibble(v.3.1.8), janeaustenr(v.0.1.5), crayon(v.1.5.1), tidyr(v.1.2.0), pkgconfig(v.2.0.3), downlit(v.0.4.0), ellipsis(v.0.3.2), Matrix(v.1.3-2), lubridate(v.1.8.0), assertthat(v.0.2.1), rmarkdown(v.2.11), httr(v.1.4.3), rstudioapi(v.0.13), R6(v.2.5.1) and compiler(v.4.0.5)
For attribution, please cite this work as
Walsh (2022, Sept. 4). Alice Walsh: Which pharma companies tweet about data science?. Retrieved from https://awalsh17.github.io/posts/2022-09-04-what-pharma-companies-tweet-about-data-science/
BibTeX citation
@misc{walsh2022which, author = {Walsh, Alice}, title = {Alice Walsh: Which pharma companies tweet about data science?}, url = {https://awalsh17.github.io/posts/2022-09-04-what-pharma-companies-tweet-about-data-science/}, year = {2022} }