UPDATE 3/29/2023: This project has been updated to include tweets all the way up to March 13th, 2023.
This was a project I undertook after graduating in 2021. I downloaded my Twitter data, extracted the tweets, and split them into individual words, then visualized the most commonly tweeted words using wordcloud2. The data originally ran through mid-2021; as of the March 2023 update, it extends through March 13th, 2023.
In the chunk below, the tweets.js file from the Twitter archive is read into R and parsed.
library(readr)    # read_file()
library(jsonlite) # fromJSON()

raw_Tweets <- read_file(file = here::here("tweets.js"))
# Strip the JavaScript assignment so only valid JSON remains, then parse
json <- sub("window.YTD.tweets.part0 = ", "", raw_Tweets)
raw_Tweets <- fromJSON(json)
raw_Tweets <- raw_Tweets$tweet # the tweet objects live under the "tweet" field
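For context, a standard Twitter archive's tweets.js begins with a JavaScript assignment rather than bare JSON, which is why that prefix has to be stripped before fromJSON() will parse it. A throwaway check (assuming the standard archive layout):

# The file should begin with: window.YTD.tweets.part0 = [ ...
substr(read_file(here::here("tweets.js")), 1, 40)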
Using several commands from the stringr package in the tidyverse, the tweet column (full_text) is cleaned: retweets and tweets that start with "RT" or a link are filtered out, and any embedded t.co links (which may point to a quote-tweeted post or a media file) and @-handles are stripped from the text. A quick sanity check on the two replacement patterns follows the chunk.
clean_Tweets <- raw_Tweets %>%
  filter(retweeted == FALSE) %>%
  filter(!str_detect(full_text, "^RT "),
         !str_detect(full_text, "^https")) %>%
  mutate(retweet_count = as.integer(retweet_count),
         full_text = str_replace_all(full_text, "https://t.co/[a-zA-Z0-9]*", ""),
         full_text = str_replace_all(full_text, "@[a-zA-Z0-9_]*", ""))
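Here is how those two patterns behave on a made-up tweet (an invented example, not from the archive; stringr is already attached with the tidyverse):

example <- "@someone check this out https://t.co/AbC123xyz great game"
example <- str_replace_all(example, "https://t.co/[a-zA-Z0-9]*", "")
str_replace_all(example, "@[a-zA-Z0-9_]*", "")
## [1] " check this out  great game"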
Out of curiosity, here are the top five most retweeted tweets on the profile.
clean_Tweets %>%
  select(retweet_count, full_text) %>%
  arrange(desc(retweet_count)) %>%
  head(n = 5)
##   retweet_count full_text
## 1            13 Protect this man at all costs
## 2             9 This pandemic really turning some of y'all into authoritarians
## 3             7 Mad props to Twitter for training their AI so well it can’t identify obvious sarcasm
## 4             4 Best baseball stadium in the MLB and it isn't even close
## 5             3 “let’s be out to Rutgers bars yo”
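An equivalent, slightly more direct way to pull those rows (a stylistic alternative, assuming dplyr 1.0 or later for slice_max()):

clean_Tweets %>%
  select(retweet_count, full_text) %>%
  slice_max(retweet_count, n = 5) # keeps ties, so it can return more than five rows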
Also out of curiosity, here are the platforms I used most to send out tweets. I sure love using my phone!
clean_Tweets %>%
  mutate(`tweet source` = str_extract(source, "Twitter [A-Za-z ]*")) %>%
  filter(!is.na(`tweet source`)) %>%
  ggplot() +
  geom_bar(aes(`tweet source`))
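geom_bar() plots the categories in alphabetical order by default; if you would rather see the bars sorted by frequency, forcats (also part of the tidyverse) can reorder them. An optional tweak, not part of the original chunk:

library(forcats)

clean_Tweets %>%
  mutate(`tweet source` = str_extract(source, "Twitter [A-Za-z ]*")) %>%
  filter(!is.na(`tweet source`)) %>%
  ggplot() +
  geom_bar(aes(fct_infreq(`tweet source`))) + # order bars by frequency
  labs(x = "tweet source")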
In the chunk below I counted each individual word and ordered them by number of occurrences, descending. The first ten rows are printed below.
3/29: I filtered out "gonna", which is slang for "going to" ("going" is a stop word and is already filtered out). "Wordle" has also been filtered out. As much as I enjoyed tweeting out my Wordle scores, those were not tweets I composed myself but rather ones generated by the Wordle share button.
library(tidytext) # unnest_tokens() and the stop_words data frame

tweets <- clean_Tweets %>%
  select(full_text) %>%
  unnest_tokens(words, full_text)

tweets <- tweets %>%
  mutate(words = str_replace_all(words, "’", "'")) %>% # normalize curly apostrophes
  anti_join(stop_words, by = c("words" = "word")) %>%
  filter(!str_detect(words, "^[0-9]+"),
         !words %in% c("wordle", "gonna")) %>% # As of March 2023, "wordle" and "gonna" are filtered out
  count(words) %>%
  arrange(desc(n))
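To make the tokenizing step concrete, here is what unnest_tokens() does to a single made-up sentence (a toy example, not archive data): the text is lowercased, punctuation is stripped, and each word becomes its own row, ready for the stop-word anti_join above.

tibble(full_text = "Best baseball stadium in the MLB") %>%
  unnest_tokens(words, full_text)
## yields one lowercase word per row: best, baseball, stadium, in, the, mlb
## ("in" and "the" would then be removed by the stop-word anti_join)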
tweets %>% head(n = 10)
##       words  n
## 1      time 36
## 2   twitter 28
## 3   rutgers 27
## 4      game 25
## 5    people 25
## 6    giants 23
## 7      hell 22
## 8       day 20
## 9  football 20
## 10      god 20
Finally, the wordcloud itself is generated using the wordcloud2 package.
twitter_wordcloud2 <- wordcloud2(data = tweets)
twitter_wordcloud2
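wordcloud2() returns an htmlwidget, so the result can also be saved as a standalone page (an optional extra step not in the original post; the file name here is made up):

library(htmlwidgets)

# Write the interactive wordcloud to a self-contained HTML file
saveWidget(twitter_wordcloud2, "twitter_wordcloud.html", selfcontained = TRUE)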