North Central Weed Science Society Meeting Proceedings, an Overview of the XXI Century Using Text Analysis
This text analysis is part of my poster (co-authored with Rodrigo Werle and Sarah Marinho) presented at the 2020 virtual North Central Weed Science Society (NCWSS) annual meeting in December 2020. Here I present the part of that work focused on weed species ranked among the top 100 words from 2001 through 2020 (this version adds the 2020 meeting proceedings to the analysis). I also run the text analysis with less code than I used for the 2020 NCWSS meeting, and that shorter workflow is what I show here. If you are only interested in the final figure, please scroll to the bottom of this page.
First we have to load the packages needed for this analysis. Please run the code below:
library(tidyverse)
library(tidytext)
library(textreadr)
library(pdftools)
library(ggtext)
# if you do not have any of these packages installed, please run install.packages("name_of_the_package")
I have downloaded all NCWSS proceedings and saved them in a folder named “docs” (you can name the folder as you choose). You can find all PDFs in “code” (see below the post title - folder “docs”).
I used the str_c function to build the path to each PDF in the folder “docs”. The pdfs output contains the paths for all 20 NCWSS proceedings.
pdfs <- str_c("docs", "/", list.files("docs", pattern = "\\.pdf$"), # pattern is a regex, not a glob
sep = "")
pdfs
## [1] "docs/nc2001.pdf" "docs/nc2002.pdf" "docs/nc2003.pdf" "docs/nc2004.pdf"
## [5] "docs/nc2005.pdf" "docs/nc2006.pdf" "docs/nc2007.pdf" "docs/nc2008.pdf"
## [9] "docs/nc2009.pdf" "docs/nc2010.pdf" "docs/nc2011.pdf" "docs/nc2012.pdf"
## [13] "docs/nc2013.pdf" "docs/nc2014.pdf" "docs/nc2015.pdf" "docs/nc2016.pdf"
## [17] "docs/nc2017.pdf" "docs/nc2018.pdf" "docs/nc2019.pdf" "docs/nc2020.pdf"
Next, name the PDFs. The code below uses list.files to keep the file names as shown in the output above, without the folder path.
pdf_names <- list.files("docs", pattern = "\\.pdf$") # pattern is a regex, not a glob
pdf_names
## [1] "nc2001.pdf" "nc2002.pdf" "nc2003.pdf" "nc2004.pdf" "nc2005.pdf"
## [6] "nc2006.pdf" "nc2007.pdf" "nc2008.pdf" "nc2009.pdf" "nc2010.pdf"
## [11] "nc2011.pdf" "nc2012.pdf" "nc2013.pdf" "nc2014.pdf" "nc2015.pdf"
## [16] "nc2016.pdf" "nc2017.pdf" "nc2018.pdf" "nc2019.pdf" "nc2020.pdf"
Here is where the magic occurs: I will use the map function from the purrr package (part of the tidyverse core). Using map saves both code and time.
pdfs_text <- map(pdfs, pdftools::pdf_text)
This “magic” is called iteration: instead of running the analysis year by year, we can run it on all years at once. Running pdfs_text alone returns all proceedings organized as a list. I am not printing pdfs_text here because the output is large. However, pdfs_text is not yet tidy enough for the analysis.
The iteration with map should be followed by the tibble function to organize the proceedings by year.
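To see what map returns without touching the PDFs, here is a minimal sketch on made-up input. The vector toy_paths and the helper fake_reader are hypothetical stand-ins for pdfs and pdftools::pdf_text; they are not part of the original analysis.

```r
library(purrr)

# Hypothetical stand-ins: three fake "file paths" and a reader that
# returns one string per "page" (pdf_text also returns one string per page)
toy_paths <- c("a.pdf", "b.pdf", "c.pdf")
fake_reader <- function(path) {
  paste(path, "page", 1:2)
}

# map() applies fake_reader to each element and always returns a list
toy_text <- map(toy_paths, fake_reader)

length(toy_text)  # one list element per input path
toy_text[[1]]     # the "pages" of the first file
```

Because map always returns a list with one element per input, the result lines up with the vector of file names, which is what makes it easy to place in a tibble afterwards.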
pdf <- tibble(document = pdf_names, text = pdfs_text) %>%
mutate(year = 2001:2020) # adding a column for each year
pdf
## # A tibble: 20 x 3
## document text year
## <chr> <list> <int>
## 1 nc2001.pdf <chr [211]> 2001
## 2 nc2002.pdf <chr [211]> 2002
## 3 nc2003.pdf <chr [205]> 2003
## 4 nc2004.pdf <chr [188]> 2004
## 5 nc2005.pdf <chr [229]> 2005
## 6 nc2006.pdf <chr [216]> 2006
## 7 nc2007.pdf <chr [248]> 2007
## 8 nc2008.pdf <chr [215]> 2008
## 9 nc2009.pdf <chr [165]> 2009
## 10 nc2010.pdf <chr [111]> 2010
## 11 nc2011.pdf <chr [174]> 2011
## 12 nc2012.pdf <chr [145]> 2012
## 13 nc2013.pdf <chr [143]> 2013
## 14 nc2014.pdf <chr [107]> 2014
## 15 nc2015.pdf <chr [120]> 2015
## 16 nc2016.pdf <chr [110]> 2016
## 17 nc2017.pdf <chr [136]> 2017
## 18 nc2018.pdf <chr [125]> 2018
## 19 nc2019.pdf <chr [233]> 2019
## 20 nc2020.pdf <chr [174]> 2020
As you can see in the tibble (data frame) above, the text of each year's proceedings is stored in a list-column as a character vector (e.g., <chr [248]>), with one element per PDF page.
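If list-columns are new to you, the following sketch on made-up data shows how one behaves and how tidyr::unnest flattens it into one row per page. The object names toy and toy_long are illustrative only.

```r
library(tibble)
library(tidyr)

# Made-up data: two "documents" with 2 and 3 "pages"
toy <- tibble(
  document = c("doc1.pdf", "doc2.pdf"),
  text = list(c("page one", "page two"),
              c("page one", "page two", "page three"))
)

toy  # text prints as <chr [2]> and <chr [3]>

# unnest() repeats the document name for every page it holds
toy_long <- unnest(toy, text)
toy_long
```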
Now that we have a tidy tibble, we can proceed with the tokenization using the function unnest_tokens.
pdf1 <- pdf %>%
unnest(text) %>% # text is a list-column: one string per page
mutate(text = str_to_lower(text), # making all text lower case
text = str_replace_all(text, "2,4-d",
"twofourd"), # replace every match, not only the first per page
text = str_replace_all(text, "marestail",
"horseweed")) %>% # marestail = horseweed
unnest_tokens(word, text, strip_numeric = TRUE)
pdf1 %>%
slice_head(n = 10)
## # A tibble: 10 x 3
## document year word
## <chr> <int> <chr>
## 1 nc2001.pdf 2001 industry
## 2 nc2001.pdf 2001 donations
## 3 nc2001.pdf 2001 of
## 4 nc2001.pdf 2001 intellectual
## 5 nc2001.pdf 2001 property
## 6 nc2001.pdf 2001 rights
## 7 nc2001.pdf 2001 to
## 8 nc2001.pdf 2001 universities
## 9 nc2001.pdf 2001 thomas
## 10 nc2001.pdf 2001 s
Notice that I used mutate to change 2,4-D to “twofourd”; otherwise tokenization would split it into 2, 4 and D. Because this species has more than one common name, I also replaced marestail with horseweed.
Next we need to remove the “stopwords”. Stopwords are words like “in”, “and”, “at”, “their”, “about”, etc. The get_stopwords function from the tidytext package has five “stopword” sources; I will combine them all and store the result in stopwords. See below:
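One detail worth checking in your own run: stringr's str_replace changes only the first match in each string, while str_replace_all changes every match. The sketch below uses a single made-up sentence (not text from the proceedings) to show the difference and then tokenizes the protected text.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up sentence with two mentions of 2,4-D
toy <- tibble(text = "2,4-D and dicamba controlled marestail; 2,4-D alone did not")

# First match only vs. every match
str_replace(str_to_lower(toy$text), "2,4-d", "twofourd")
str_replace_all(str_to_lower(toy$text), "2,4-d", "twofourd")

toy_tokens <- toy %>%
  mutate(text = str_to_lower(text),
         text = str_replace_all(text, "2,4-d", "twofourd"),
         text = str_replace_all(text, "marestail", "horseweed")) %>%
  unnest_tokens(word, text, strip_numeric = TRUE)

toy_tokens$word  # "twofourd" survives as a single token
```

Since each row here is a whole PDF page, a page that mentions 2,4-D several times needs the `_all` variant for every mention to be counted.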
stopwords <- get_stopwords("en", source = c("smart")) %>%
bind_rows(get_stopwords("en", source = c("marimo"))) %>%
bind_rows(get_stopwords("en", source = c("nltk"))) %>%
bind_rows(get_stopwords("en", source = c("stopwords-iso"))) %>%
bind_rows(get_stopwords("en", source = c("snowball")))
stopwords %>%
slice_head(n=10)
## # A tibble: 10 x 2
## word lexicon
## <chr> <chr>
## 1 a smart
## 2 a's smart
## 3 able smart
## 4 about smart
## 5 above smart
## 6 according smart
## 7 accordingly smart
## 8 across smart
## 9 actually smart
## 10 after smart
Now that I have a tibble called “stopwords”, I will use the anti_join function to remove the stopwords from pdf1.
pdf2 <- pdf1 %>%
anti_join(stopwords, by = "word")
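As a quick illustration of what anti_join does here, the sketch below uses made-up tibbles (toy_words and toy_stop are hypothetical names). It also shows that a stopword listed by more than one lexicon causes no problem: anti_join only checks whether a match exists.

```r
library(dplyr)

toy_words <- tibble(word = c("the", "waterhemp", "and", "glyphosate", "in", "corn"))

# "the" appears twice, as it would when several lexicons are stacked
toy_stop <- tibble(word    = c("the", "and", "in", "the"),
                   lexicon = c("a", "a", "b", "b"))

# Keep only the rows of toy_words with no match in toy_stop
kept <- toy_words %>%
  anti_join(toy_stop, by = "word")

kept$word
```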
Even with all of its sources combined, get_stopwords does not remove every word I need to drop for this analysis. For example, I do not want words like “virtual”, “kansas”, “werle”, “proceedings”, etc. I manually built a custom “stopwords” list for weed science meetings (please check the WSSA text analysis). Here I bring in the list from my previous analysis through the source file “stop_words.R”. You can find “stop_words.R” in the “code” below the post title.
source("stop_words.R")
The list is saved as stop_tibble, which is also used with anti_join. As described above, anti_join removes every “stopword” in stop_tibble from pdf2. Notice that here I also use mutate to bring back 2,4-D.
pdf3 <- pdf2 %>%
anti_join(stop_tibble, by = c("word")) %>% # stop_tibble is in the source code
mutate(word = str_replace(word, "twofourd", "2,4-d")) # bring back 2,4-d
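The contents of “stop_words.R” are not shown in this post. As a rough sketch of the shape it needs, the only requirement implied by the anti_join call is a tibble with a word column. The four words below are just the examples mentioned earlier, so treat this stop_tibble as hypothetical rather than as the real list.

```r
library(dplyr)
library(stringr)

# Hypothetical custom stopword list; the real one lives in stop_words.R
stop_tibble <- tibble(word = c("virtual", "kansas", "werle", "proceedings"))

# Made-up tokens, including the placeholder used for 2,4-D
toy <- tibble(word = c("kansas", "twofourd", "proceedings", "kochia"))

cleaned <- toy %>%
  anti_join(stop_tibble, by = "word") %>%                # drop custom stopwords
  mutate(word = str_replace(word, "twofourd", "2,4-d"))  # restore the real name

cleaned$word
```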
Next I count words by year, arrange the counts in descending order within each year, group by year, rank the words (row_number) and filter the top 100 words per year.
pdf4 <- pdf3 %>%
count(year, word) %>%
arrange(year, -n) %>%
group_by(year) %>%
mutate(rank = row_number()) %>%
filter(rank <= 100)
Now I have the top 100 words for each year of the NCWSS proceedings:
pdf4 %>%
slice_head(n = 10)
## # A tibble: 200 x 4
## # Groups: year [20]
## year word n rank
## <int> <chr> <int> <int>
## 1 2001 control 716 1
## 2 2001 weed 632 2
## 3 2001 glyphosate 475 3
## 4 2001 applied 427 4
## 5 2001 herbicide 369 5
## 6 2001 corn 358 6
## 7 2001 treatments 339 7
## 8 2001 soybean 257 8
## 9 2001 common 240 9
## 10 2001 yield 226 10
## # … with 190 more rows
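The same count, arrange, rank and filter steps can be tried on a tiny made-up data set. This sketch keeps the top 2 words per year instead of the top 100; the counts are invented purely for illustration.

```r
library(dplyr)

# Made-up tokens: one row per word occurrence
toy <- tibble(
  year = c(rep(2001, 6), rep(2020, 6)),
  word = c("weed", "weed", "weed", "corn", "corn", "kochia",
           "kochia", "kochia", "kochia", "weed", "weed", "corn")
)

toy_rank <- toy %>%
  count(year, word) %>%          # occurrences of each word per year
  arrange(year, -n) %>%          # most frequent first within each year
  group_by(year) %>%
  mutate(rank = row_number()) %>% # rank follows the sorted row order
  filter(rank <= 2) %>%
  ungroup()

toy_rank
```

Note that row_number assigns ranks by row position, so words with equal counts still receive different ranks.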
In this analysis I am interested only in weeds present among the top 100 words in 2001 and 2020. Therefore, I use the if_else function to create new columns that highlight the selected weed species. You can select any other words if you want, as I did with herbicides in my poster at the 2020 NCWSS meeting.
pdf5 <- pdf4 %>%
mutate(highlight = if_else(word %in% c("amaranth", "palmer",
"kochia", "horseweed",
"grass", "nightshade",
"waterhemp", "velvetleaf",
"ragweed", "sunflower",
"foxtail"), TRUE, FALSE),
variable_col = if_else(highlight == TRUE, word, "NA"))
pdf5 %>%
slice_head(n = 5)
## # A tibble: 100 x 6
## # Groups: year [20]
## year word n rank highlight variable_col
## <int> <chr> <int> <int> <lgl> <chr>
## 1 2001 control 716 1 FALSE NA
## 2 2001 weed 632 2 FALSE NA
## 3 2001 glyphosate 475 3 FALSE NA
## 4 2001 applied 427 4 FALSE NA
## 5 2001 herbicide 369 5 FALSE NA
## 6 2002 weed 834 1 FALSE NA
## 7 2002 control 658 2 FALSE NA
## 8 2002 glyphosate 624 3 FALSE NA
## 9 2002 corn 445 4 FALSE NA
## 10 2002 applied 330 5 FALSE NA
## # … with 90 more rows
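A point that is easy to miss in the output above: the NA shown in variable_col is the text string "NA", not a missing value. The sketch below, on made-up words, makes that visible; weeds and toy_flag are illustrative names.

```r
library(dplyr)

weeds <- c("kochia", "waterhemp")
toy <- tibble(word = c("control", "kochia", "corn", "waterhemp"))

toy_flag <- toy %>%
  mutate(highlight = word %in% weeds,                     # already TRUE/FALSE
         variable_col = if_else(highlight, word, "NA"))   # "NA" is a string

toy_flag
any(is.na(toy_flag$variable_col))  # no real missing values are created
```

Keeping "NA" as an ordinary string is what allows a later comparison such as variable_col == "NA" to separate highlighted words from the rest.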
Now the tibble is ready, so I will proceed with data visualization. First I will set the font family, colors and theme.
#Set theme
library(extrafont)
extrafont::loadfonts()
font_family <- 'Helvetica'
title_family <- ".New York"
background <- "#1D1D1D"
text_colour <- "white"
axis_colour <- "white"
plot_colour <- "black"
theme_style <- theme(text = element_text(family = font_family),
rect = element_rect(fill = background),
plot.background = element_rect(fill = background, color = NA),
plot.title = element_markdown(family = title_family,
face = 'bold', size = 80, colour = text_colour),
plot.subtitle = element_markdown(family = title_family,
size = 40, colour = text_colour),
plot.caption = element_markdown(family = title_family,
size = 25, colour = text_colour, hjust = 0),
panel.background = element_rect(fill = background, color = NA),
panel.border = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
plot.margin = unit(c(3, 0.5, 0.5, 0.5), "cm"), # top, left, bottom, right
axis.title.y = element_text(face = 'bold', size = 40,
colour = text_colour),
axis.title.x = element_blank(),
axis.text.x.bottom = element_text(size = 45, colour= axis_colour,
vjust = 17),
axis.text.x.top = element_text(size = 45, colour= axis_colour,
vjust = -14),
axis.text.y = element_text(size = 30, colour = text_colour),
axis.ticks = element_blank(),
axis.line = element_blank(),
legend.text = element_text(size = 20, colour= text_colour),
legend.title = element_text(size = 25, colour= text_colour),
legend.position="none")
theme_set(theme_classic() + theme_style)
#Set colour palette
cols <- c("#F2D9F3", "#F2D9F3", "#00E5E5", "#DEB887",
"#FAC8C8", "#39393A", "#FA9664",
"#FF4040", "#48DE7A", "#942DC7",
"#F5F5DC", "#FAFA00")
Then I will plot the data. The idea here is to see the trend in weeds within the top 100 words from 2001 through 2020.
figure <- pdf5 %>%
ggplot(aes(x = year, y = rank, group = word)) +
geom_line(data = pdf5 %>% filter(variable_col == "NA"),
color = "#39393A", size = 4) +
geom_point(data = pdf5 %>% filter(variable_col == "NA"),
color = "#39393A", size = 10) +
geom_line(data = pdf5 %>% filter(variable_col != "NA"),
aes(color = variable_col), size = 4) +
geom_point(data = pdf5 %>% filter(variable_col != "NA"),
aes(color = variable_col), size = 10) +
scale_y_reverse(breaks = 100:1, sec.axis = dup_axis()) +
scale_x_continuous(breaks = seq(2001, 2020, 2), limits= c(1999.8, 2021.2),
expand = c(.05, .05), sec.axis = dup_axis()) +
geom_text(data = pdf5 %>% filter(year == "2001"),
aes(label = word, x = 2000.8, color = variable_col),
hjust = "right",
fontface = "bold",
size = 11) +
geom_text(data = pdf5 %>% filter(year == "2020"),
aes(label = word, x = 2020.2, color = variable_col),
hjust = "left",
fontface = "bold",
size = 11) +
coord_cartesian(ylim = c(101,1)) +
scale_color_manual(values = cols) +
labs(title = "<b style='color:red;'>NCWSS</b> annual meeting
proceedings text analysis from 2001 through 2020",
subtitle = "Figure shows the rank of top 100 words of 2001 (left)
and 2020 (right) <b style='color:red;'>NCWSS</b> annual meeting proceedings.
Common weed species names are highlighted to <br> describe
their change across 20 years.",
y= "Rank",
caption = "Visualization: @maxwelco adapted from @JaredBraggins | Source: NCWSS")
#Export plot
ggsave("top_weeds.png", plot = figure, width = 40, height = 60, dpi = 400, limitsize = FALSE)
Check the figure carefully. What were scientists in the society focused on in 2001? What has changed in 20 years? What hasn’t? Draw your own conclusions.
This figure was adapted from one of Jared Braggins' Tidy Tuesday visualizations.
Click here to learn more about Tidy Text with Julia Silge.
- This post is also available at Open Weed Science
Thanks to Rodrigo Werle for reviewing this post.