Here are my original graphics for this project. Let’s talk about their strengths:
And that’s about it. These graphics get the job done, but they definitely don’t turn any heads (unless your favorite color is orange). I think for now, I’m going to focus on revamping the bar charts for this project. For the updated graphics, I hope to implement these changes.
Let’s get to work!
Once I fired up my old project, the first thing I did was combine my two clean dataframes into one document term matrix. In my original graphs, I was trying to compare the top bigrams (word pairs) in the two sources. To have a clearer comparison, though, I’ll have to stick to single word frequency. Here’s what the first six rows look like.
head(dtm) # A tibble: 6 x 4 document term count document_name <chr> <chr> <dbl> <chr> 1 1 abingdon 1 Arlington Now 2 1 absolut 1 Arlington Now 3 1 accomplish 1 Arlington Now 4 1 accord 1 Arlington Now 5 1 acknowledg 1 Arlington Now 6 1 acquir 1 Arlington Now
A document term matrix is a great way to organize text data, especially if you’re analyzing word frequencies. Because I wanted to visualize a comparison between my two documents, this data structure was a no brainer for me.
The next thing I did was prepare the data for graphing by altering the scales. On our previous bar graphs the scales were different, which doesn’t make for great comparison. Even if I graphed my new combined document term matrix using the count variable, I still wouldn’t get a totally fair comparison since there was more data overall from Nextdoor. To fix this, I created a new variable for the percent term frequency. Whenever you’re comparing data from two sources, it is generally a better idea to scale according to percentages rather than raw count to achieve a fair comparison.
Before I could move on to graphing, I still had one more variable I wanted to create. As I mentioned earlier, one of my goals for the new visuals was to incorporate some kind of inferential analysis into the visual. One thing that struck me when I first completed this analysis was the frequency of terms referring to schools on both platforms. For my new graphs I decided to create a variable to differentiate school terms from the rest.
Finally, it was time to begin graphing! To make the graph less busy, I filtered out terms that were mentioned less than 10 times. Like the previous graphs, I ordered the bars so that the greatest frequency terms were first. Because my data was now in a single document term matrix, comparing the two sources was as easy as setting a ggplot parameter to color the bars according to document name. My next step was so set a parameter for the school variable, which would be shown as a black outline on the colored bars. Lastly, I added back my excellent labels and captions, even adding a note to help make reading the graph a little easier.
Here’s the final code and viz!
dtm %>% group_by(document_name) %>% mutate(perc = count/sum(count)) %>% ungroup() %>% mutate(school = ifelse(term %in% schools, "yes", "no")) %>% filter(count > 10) %>% ggplot(aes(x = reorder(term, count), y = perc*100, fill = document_name, color = school)) + geom_col() + coord_flip() + scale_color_manual(name = "Does the term refer to a school?", values = c("white", "black")) + scale_fill_discrete(name = "Comment source") + labs(title = "How are commenters talking about the No School Moves movement in Arlington?", subtitle = "A comparitive analysis of local user generated content", caption = "Only terms with an overall frequency greater than are 10 shown Data sources: https://www.arlnow.com/2020/01/28/petition-against-school-swap-garners-1700-signatures/;https://www.nextdoor.com") + xlab("Terms") + ylab("Percent term frequency (e.g. 'mckinley' accounts for 2% of all terms in Nextdoor comments)") + theme_classic()
Awesome! I really like the updates I made to this project, especially taking the step to incorporate some inferential analysis into the viz. Of course, there’s always more to be done with a project. In the future I could see myself heading back into Observable to try to add some interactive features to the viz. Or, I could even try a new method of analysis, like text networks or sentiment analysis.
Thanks for sticking with me and I hope you learned something!