UPDATE 3/29/2023: This project has been updated to include tweets all the way up to March 13th, 2023.
This was a project I undertook after graduating in 2021. I downloaded my Twitter data, extracted the tweets, and split them into individual words, then visualized the most commonly tweeted words using wordcloud2. The data originally ran through mid-2021; as of the March 2023 update, it extends through March 13th, 2023.
In the chunk below, the tweets.js file from the Twitter archive is read into R and parsed.
library(readr)    # read_file()
library(jsonlite) # fromJSON()

raw_Tweets <- read_file(file = here::here("tweets.js"))
# Strip the JavaScript assignment so only valid JSON remains, then parse
json <- sub("window.YTD.tweets.part0 = ", "", raw_Tweets)
raw_Tweets <- fromJSON(json)
raw_Tweets <- raw_Tweets$tweet # the tweet objects live under the "tweet" field
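For context, a standard Twitter archive's tweets.js begins with a JavaScript assignment rather than bare JSON, which is why that prefix has to be stripped before fromJSON() will parse it. A throwaway check (assuming the standard archive layout):

# The file should begin with: window.YTD.tweets.part0 = [ ...
substr(read_file(here::here("tweets.js")), 1, 40)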
Using several commands from the stringr package in the tidyverse, the tweet column (full_text) is cleaned: retweets and tweets that start with "RT" or a link are filtered out, and any embedded t.co links (which may point to a quote-tweeted post or a media file) and @-handles are stripped from the text. A quick sanity check on the two replacement patterns follows the chunk.
clean_Tweets <- raw_Tweets %>%
  filter(retweeted == FALSE) %>%
  filter(!str_detect(full_text, "^RT "),
         !str_detect(full_text, "^https")) %>%
  mutate(retweet_count = as.integer(retweet_count),
         full_text = str_replace_all(full_text, "https://t.co/[a-zA-Z0-9]*", ""),
         full_text = str_replace_all(full_text, "@[a-zA-Z0-9_]*", ""))
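Here is how those two patterns behave on a made-up tweet (an invented example, not from the archive; stringr is already attached with the tidyverse):

example <- "@someone check this out https://t.co/AbC123xyz great game"
example <- str_replace_all(example, "https://t.co/[a-zA-Z0-9]*", "")
str_replace_all(example, "@[a-zA-Z0-9_]*", "")
## [1] " check this out  great game"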
Out of curiosity, here are the top five most retweeted tweets on the profile.
clean_Tweets %>%
  select(retweet_count, full_text) %>%
  arrange(desc(retweet_count)) %>%
  head(n = 5)
##   retweet_count full_text
## 1            13 Protect this man at all costs
## 2             9 This pandemic really turning some of y'all into authoritarians
## 3             7 Mad props to Twitter for training their AI so well it can’t identify obvious sarcasm
## 4             4 Best baseball stadium in the MLB and it isn't even close
## 5             3 “let’s be out to Rutgers bars yo”
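An equivalent, slightly more direct way to pull those rows (a stylistic alternative, assuming dplyr 1.0 or later for slice_max()):

clean_Tweets %>%
  select(retweet_count, full_text) %>%
  slice_max(retweet_count, n = 5) # keeps ties, so it can return more than five rows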
Also out of curiosity, here are the platforms I used most to send out tweets. I sure love using my phone!
clean_Tweets %>%
  mutate(`tweet source` = str_extract(source, "Twitter [A-Za-z ]*")) %>%
  filter(!is.na(`tweet source`)) %>%
  ggplot() +
  geom_bar(aes(`tweet source`))
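geom_bar() plots the categories in alphabetical order by default; if you would rather see the bars sorted by frequency, forcats (also part of the tidyverse) can reorder them. An optional tweak, not part of the original chunk:

library(forcats)

clean_Tweets %>%
  mutate(`tweet source` = str_extract(source, "Twitter [A-Za-z ]*")) %>%
  filter(!is.na(`tweet source`)) %>%
  ggplot() +
  geom_bar(aes(fct_infreq(`tweet source`))) + # order bars by frequency
  labs(x = "tweet source")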
In the chunk below I counted each individual word and ordered them by number of occurrences, descending. The first ten rows are printed below.
3/29: I filtered out "gonna", which is slang for "going to" ("going" is a stop word and is already filtered out). "Wordle" has also been filtered out. As much as I enjoyed tweeting out my Wordle scores, those were not tweets I composed myself but rather ones generated by the Wordle share button.
library(tidytext) # unnest_tokens() and the stop_words data frame

tweets <- clean_Tweets %>%
  select(full_text) %>%
  unnest_tokens(words, full_text)

tweets <- tweets %>%
  mutate(words = str_replace_all(words, "’", "'")) %>% # normalize curly apostrophes
  anti_join(stop_words, by = c("words" = "word")) %>%
  filter(!str_detect(words, "^[0-9]+"),
         !words %in% c("wordle", "gonna")) %>% # As of March 2023, "wordle" and "gonna" are filtered out
  count(words) %>%
  arrange(desc(n))
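To make the tokenizing step concrete, here is what unnest_tokens() does to a single made-up sentence (a toy example, not archive data): the text is lowercased, punctuation is stripped, and each word becomes its own row, ready for the stop-word anti_join above.

tibble(full_text = "Best baseball stadium in the MLB") %>%
  unnest_tokens(words, full_text)
## yields one lowercase word per row: best, baseball, stadium, in, the, mlb
## ("in" and "the" would then be removed by the stop-word anti_join)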
tweets %>% head(n = 10)
##       words  n
## 1      time 36
## 2   twitter 28
## 3   rutgers 27
## 4      game 25
## 5    people 25
## 6    giants 23
## 7      hell 22
## 8       day 20
## 9  football 20
## 10      god 20
Finally, the wordcloud itself is generated using the wordcloud2 package.
twitter_wordcloud2 <- wordcloud2(data = tweets)
twitter_wordcloud2
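wordcloud2() returns an htmlwidget, so the result can also be saved as a standalone page (an optional extra step not in the original post; the file name here is made up):

library(htmlwidgets)

# Write the interactive wordcloud to a self-contained HTML file
saveWidget(twitter_wordcloud2, "twitter_wordcloud.html", selfcontained = TRUE)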