(Creation Date of Top 15 Accounts in the IRA Dataset)

A bit over a week ago, Twitter’s new-ish Elections integrity team released two datasets with “all the accounts and related content associated with potential information operations that we have found on our service since 2016.”

In particular, this is what we are talking about:

“Our initial disclosures cover two previously disclosed campaigns, and include information from 3,841 accounts believed to be connected to the Russian Internet Research Agency [also known as IRA], and 770 accounts believed to originate in Iran.” (Twitter’s Election Integrity Team)

For a bit of context on the IRA’s activities and the Russian Influence Operation in general, Mashable offers a nice overview.

The IRA zip alone is 1.24 GB big! Let’s dive in and explore. Before we can start with any analysis, automated or not, we have to inspect and prepare the data - remember: EDA FTW!

Anyways, a dataset of this size is a perfect exercise in data wrangling and exploratory analysis with tools from the galactic tidyverse. So what I’m aiming to highlight with this post is my more or less systematic approach to turning a granular dataset with millions of observations into something more useful (and reliable!) for further in-depth analysis.

1 Data Preparation

You can get the data from here:

library(tidyverse)

First, we need to unpack the .zip file and then read the .csv file into R.

csvfile <- "ira_tweets_csv_hashed.csv"
data_raw <- read_csv(csvfile)

# Just for comparison
data2_raw <- data.table::fread(csvfile,
                          encoding = "UTF-8",
                          # na.strings = ",,",
                          verbose = TRUE)

At first – I (like others) – was getting a CRC error message when unpacking the zip file. The resulting CSV was only 1.17 GB big, so it clearly did not contain all the Tweets from the IRA dataset. After upgrading 7-zip to v18.05, unpacking the zip results in a 5.3 GB big CSV, which is quite a difference I’d say :)

data.table is by light years the fastest method, but read_csv() gives me NA’s without an extra hassle while data.table::fread() seems to be ignoring any patterns for na.strings. Therefore I’ll be working with the data object from read_csv() here. Since read_csv() is rather slow with a CSV this big, you can speed up your exploration if you save the data object as an .rds or .RData file.

data_path <- here::here("data", "IRA_Tweets", "/")
# if (!dir.exists(data_path)) dir.create(data_path)
# saveRDS(data_raw, str_c(data_path, "infoops_data.rds"))
data_raw <- readRDS(str_c(data_path, "infoops_data.rds"))
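
The read-once, cache-as-.rds idea can also be wrapped in a small helper. This is only a sketch under assumptions: `read_cached()` is a hypothetical name (not part of readr), and the demo runs on a tiny temporary CSV instead of the real 5.3 GB file.

```r
library(readr)

# Hypothetical helper: parse the CSV only if no cached .rds exists yet
read_cached <- function(csv_path, rds_path) {
  if (file.exists(rds_path)) {
    return(readRDS(rds_path))   # fast path: reuse the cached object
  }
  data <- read_csv(csv_path)    # slow path: parse the CSV once
  saveRDS(data, rds_path)       # store the parsed object for the next session
  data
}

# Demo with a throwaway file (made-up values)
csv_tmp <- tempfile(fileext = ".csv")
rds_tmp <- tempfile(fileext = ".rds")
writeLines(c("userid,tweet_language", "a1,ru", "b2,en", "c3,"), csv_tmp)

first_read  <- read_cached(csv_tmp, rds_tmp)  # parses the CSV and writes the cache
second_read <- read_cached(csv_tmp, rds_tmp)  # served from the cache
```

The empty field in the last row comes back as NA, which is the read_csv() behaviour relied on above.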

With a dataset this big, skimr::skim() is just perfect (and its output is much more functional in RStudio)!

data_raw %>% skimr::skim_to_wide() %>% knitr::kable("html")
type variable missing complete n min max empty n_unique median mean sd p0 p25 p50 p75 p100 hist
character account_language 0 9041308 9041308 2 5 0 11 NA NA NA NA NA NA NA NA NA
character hashtags 2378775 6662533 9041308 2 156 0 244865 NA NA NA NA NA NA NA NA NA
character in_reply_to_userid 8503421 537887 9041308 2 64 0 82809 NA NA NA NA NA NA NA NA NA
character is_retweet 0 9041308 9041308 4 5 0 2 NA NA NA NA NA NA NA NA NA
character latitude 9036529 4779 9041308 3 19 0 2752 NA NA NA NA NA NA NA NA NA
character longitude 9036529 4779 9041308 3 18 0 2807 NA NA NA NA NA NA NA NA NA
character poll_choices 9040172 1136 9041308 5 80 0 708 NA NA NA NA NA NA NA NA NA
character retweet_userid 5708124 3333184 9041308 2 64 0 204289 NA NA NA NA NA NA NA NA NA
character tweet_client_name 40341 9000967 9041308 3 32 0 333 NA NA NA NA NA NA NA NA NA
character tweet_language 296106 8745202 9041308 2 3 0 58 NA NA NA NA NA NA NA NA NA
character tweet_text 0 9041308 9041308 1 710 0 6598905 NA NA NA NA NA NA NA NA NA
character urls 1640484 7400824 9041308 2 2057 0 2760160 NA NA NA NA NA NA NA NA NA
character user_display_name 0 9041308 9041308 5 64 0 3664 NA NA NA NA NA NA NA NA NA
character user_mentions 5047231 3994077 9041308 3 808 0 570293 NA NA NA NA NA NA NA NA NA
character user_profile_description 1371499 7669809 9041308 1 160 0 2597 NA NA NA NA NA NA NA NA NA
character user_profile_url 7146952 1894356 9041308 22 23 0 200 NA NA NA NA NA NA NA NA NA
character user_reported_location 1526654 7514654 9041308 2 30 0 608 NA NA NA NA NA NA NA NA NA
character user_screen_name 0 9041308 9041308 6 64 0 3667 NA NA NA NA NA NA NA NA NA
character userid 0 9041308 9041308 9 64 0 3667 NA NA NA NA NA NA NA NA NA
Date account_creation_date 0 9041308 9041308 2009-04-24 2018-04-03 NA 653 2014-03-28 NA NA NA NA NA NA NA NA
integer follower_count 0 9041308 9041308 NA NA NA NA NA 8670.2 22146.39 0 346 842 4486 257638 ▇▁▁▁▁▁▁▁
integer following_count 0 9041308 9041308 NA NA NA NA NA 2522.47 5028.83 0 284 618 2014 74664 ▇▁▁▁▁▁▁▁
integer like_count 2673 9038635 9041308 NA NA NA NA NA 4 290.31 0 0 0 0 325826 ▇▁▁▁▁▁▁▁
integer quote_count 2673 9038635 9041308 NA NA NA NA NA 0.2 13.07 0 0 0 0 11633 ▇▁▁▁▁▁▁▁
integer reply_count 2673 9038635 9041308 NA NA NA NA NA 0.28 7.41 0 0 0 0 3249 ▇▁▁▁▁▁▁▁
integer retweet_count 2673 9038635 9041308 NA NA NA NA NA 3.46 140.33 0 0 0 0 123617 ▇▁▁▁▁▁▁▁
numeric in_reply_to_tweetid 8775100 266208 9041308 NA NA NA NA NA 6.1e+17 1.3e+17 0 5.7e+17 6.3e+17 6.6e+17 1e+18 ▁▁▁▁▆▇▁▁
numeric quoted_tweet_tweetid 8853395 187913 9041308 NA NA NA NA NA 8e+17 8.7e+16 1.8e+09 7.7e+17 8.1e+17 8.5e+17 1e+18 ▁▁▁▁▁▂▇▂
numeric retweet_tweetid 5708124 3333184 9041308 NA NA NA NA NA 6.7e+17 1.2e+17 100 5.7e+17 6.5e+17 7.9e+17 1e+18 ▁▁▁▁▇▅▆▁
numeric tweetid 0 9041308 9041308 NA NA NA NA NA 6.4e+17 1.6e+17 1.7e+09 5.3e+17 6.2e+17 7.8e+17 1e+18 ▁▁▁▃▇▅▅▁
POSIXct tweet_time 0 9041308 9041308 2009-05-09 2018-06-21 NA 1788062 2015-07-17 NA NA NA NA NA NA NA NA

We can already make some interesting observations from this summary alone:

  • The IRA dataset consists of 9.041.308 Tweets in 58 languages, from 3.667 unique accounts and 11 account languages. That’s pretty “diverse” but also quite complex.
  • $is_retweet has only 2 unique values, so it’s obviously a Boolean -> mutate()
  • There are 9.041.308 observations for $tweet_text, but only 6.598.905 are unique. This huge delta just screams: spam bots and/or coordinated campaigns!
  • only 266K Tweets are replies -> rather few interactions
  • there are some prominent accounts with up to 257K followers
  • 2.760.160 URLs to explore
  • we can see from $retweet_userid that apparently, 3.333.184 Tweets are just Retweets and not unique/original Tweets.
  • if we were to try to classify accounts by profile description, there’s a corpus of 2.597 unique profile descriptions ($user_profile_description) and 200 unique $user_profile_urls
  • all the $*_tweetid vars were read as numeric, which we’ll also need to change, as IDs are supposed to be unique identifiers and not continuous values -> mutate()
  • the Tweets were posted in the period from 2009-05-09 (!) to 2018-06-21, with the median around 2015-07-17
  • half of the Tweets come from accounts created on or after 2014-03-28 (the median here is computed per Tweet, not per account). Like there was an upcoming election or a referendum or something :)
  • there are 608 unique account locations (shared by an unknown number of those 3.667 accounts at this point), and there are 4.779 geolocated Tweets. That’s not much, but we could try to double-check these locations with the respective $account_language values.

That should give us enough leads for an initial inquiry. Let’s continue with the data preparation and address what we have discovered so far.

1.1 Change Variable Types

convert $is_retweet into a boolean

data_raw$is_retweet <- as.logical(data_raw$is_retweet)

convert $*_tweetid vars into strings

data_raw <- data_raw %>% 
  mutate_at(vars(ends_with("tweetid")),
            as.character)
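
One caveat about converting after the fact: Tweet IDs can have up to 19 digits, which is more than a double can represent exactly, so IDs that were already parsed as numeric may have been rounded before as.character() ever sees them. A minimal sketch of the alternative, assuming you are willing to re-read the CSV: declare the ID column as character via col_types. The two IDs below are made up for illustration.

```r
library(readr)

# Two made-up 19-digit IDs that differ only in the last digit
id_a <- "1050000000000000001"
id_b <- "1050000000000000002"

# As doubles, both IDs collapse onto the same value
same_as_double <- as.numeric(id_a) == as.numeric(id_b)

# Reading the column as character keeps every digit
csv_tmp <- tempfile(fileext = ".csv")
writeLines(c("tweetid", id_a, id_b), csv_tmp)
ids <- read_csv(csv_tmp, col_types = cols(tweetid = col_character()))

same_as_double            # TRUE: the distinction is lost
ids$tweetid[1] == id_a    # TRUE: the text version is intact
```

For the real file, the same col_types argument would need one col_character() entry per *_tweetid column.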

Now we can skim just the $*_tweetid vars and $is_retweet

data_raw %>% 
  select(is_retweet, ends_with("tweetid")) %>% 
  skimr::skim_to_wide() %>%
  knitr::kable("html")
type variable missing complete n min max empty n_unique mean count
character in_reply_to_tweetid 8775100 266208 9041308 1 19 0 236322 NA NA
character quoted_tweet_tweetid 8853395 187913 9041308 10 18 0 144609 NA NA
character retweet_tweetid 5708124 3333184 9041308 3 19 0 1725841 NA NA
character tweetid 0 9041308 9041308 10 19 0 9035946 NA NA
logical is_retweet 0 9041308 9041308 NA NA NA NA 0.37 FAL: 5708124, TRU: 3333184, NA: 0

Ok, now we know that 3.333.184 (~1/3) Tweets are Retweets (of 1.725.841 unique Tweets). Good to know for any Natural Language Processing method which depends on statistics, e.g. Topic Modelling, or when building a corpus for descriptive analyses.

1.2 Remove Duplicates

It’s probably more efficient to remove duplicates as the first step. This reduces the data object we’re working with.

data_unique <- data_raw %>% filter(is_retweet == FALSE)

Now we’re down to 5707611 unique Tweets in 57 Tweet languages by 3548 Users with 11 account languages.

1.3 Reduce/Recode Language Variables

$tweet_language

data_unique %>% 
  group_by(tweet_language) %>% 
  count() %>% 
  filter(n > 1000) %>% 
  arrange(desc(n)) %>%
  knitr::kable(format = "html")
tweet_language n
ru 2839622
en 2178309
NA 285539
und 119098
de 86259
uk 49889
bg 36576
ar 35811
es 8402
in 7543
fr 7501
sr 5604
tl 5176
ht 4944
et 4920
sk 2982
tr 2903
da 2699
ro 2517
it 2493
nl 2292
cy 2214
sl 2123
pt 1857
fi 1414
pl 1325
sv 1191
lt 1190
no 1135
ja 1117

That’s quite a lot, even if we only consider Tweet languages with n > 1000 Tweets.

merge NA and und[defined]

data_unique <- data_unique %>% 
  mutate(tweet_language = if_else(is.na(tweet_language), "und", tweet_language))

recode all langs with n < 5000 as “other”

We need to reduce the scope of Tweet languages for now. 5000 is just under 0.1% of the unique Tweets, so this sounds like a good threshold.

# I'm very sure that there's a more elegant solution for mutating observations row-wise based on grouped counts... However, whatever works, works.

other_langs <- data_unique %>% 
  group_by(tweet_language) %>% 
  count() %>% 
  filter(n < 5000) %>% 
  select(tweet_language)
data_unique <- data_unique %>% 
  mutate(tweet_language = 
           if_else(tweet_language %in% other_langs$tweet_language, "other_44",
                                  tweet_language))
n_distinct(data_unique$tweet_language)
## [1] 12

We’re down to 13 language categories for the Tweets. That’s far better!
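
As for the "more elegant solution" wished for in the code comment: one option is dplyr::add_count(), which attaches the per-group count to every row so the recode can happen in a single pipeline. This is a sketch on toy data, not the code that produced the numbers in this post; `min_n` and `lang_n` are names made up for the example.

```r
library(dplyr)

# Toy data: "ru" and "en" are frequent, "fi" and "ja" are rare
toy <- tibble(tweet_language = c(rep("ru", 5), rep("en", 4), "fi", "ja"))

min_n <- 3  # stand-in for the 5000 threshold used above

toy_recoded <- toy %>%
  add_count(tweet_language, name = "lang_n") %>%  # per-language count on every row
  mutate(tweet_language = if_else(lang_n < min_n, "other", tweet_language)) %>%
  select(-lang_n)                                 # drop the helper column again

count(toy_recoded, tweet_language)
```

This avoids the intermediate lookup table, at the cost of temporarily adding one integer column to the full dataset.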

$account_language

data_unique %>% 
  group_by(account_language) %>% 
  distinct(userid) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  knitr::kable("html")
account_language n
en 2214
ru 982
es 194
de 107
ar 21
fr 11
it 7
en-gb 6
zh-cn 3
uk 2
id 1

So there’s 11 account languages. I’m going to focus on languages with n > 100 accounts, and recode the rest as “other_6” (since we can assume “uk” == “en-gb”, that’s one language less).

other_langs_acc <- data_unique %>% 
  distinct(userid, .keep_all = TRUE) %>% 
  group_by(account_language) %>% 
  count() %>% 
  filter(n < 100) %>% 
  select(account_language)

data_unique <- data_unique %>% 
  mutate(account_language = 
           if_else(account_language %in% other_langs_acc$account_language, "other_6",
                                  account_language))
n_distinct(data_unique$account_language)
## [1] 5

Great! Now we’re down to 5 account languages.

1.4 IRA Dataset Languages: Summary

Let’s see how many different languages we have by now.

unique(c(data_unique$tweet_language, data_unique$account_language))
##  [1] "ru"       "bg"       "en"       "und"      "other_44" "de"      
##  [7] "uk"       "tl"       "es"       "ar"       "fr"       "in"      
## [13] "sr"       "other_6"

Now let’s create a preliminary overview.

data_count_by_tweet_lang <- data_unique %>%
  # filter(is_retweet == TRUE) %>% 
  group_by(tweet_language) %>%
  distinct(tweetid) %>% 
  count() %>%
  rename(Tweets = n)

data_count_by_account_lang <- data_unique %>%
  # filter(is_retweet == TRUE) %>% 
  group_by(account_language) %>%
  distinct(userid) %>% 
  count() %>%
  rename(Accounts = n)

lang_stats <- full_join(data_count_by_tweet_lang, data_count_by_account_lang,
  by = c("tweet_language" = "account_language")) %>%
  rename(Language = tweet_language) %>%
  mutate(
    T.Share = round(Tweets / sum(.$Tweets, na.rm = TRUE) * 100, 2),
    A.Share = round(Accounts / sum(.$Accounts, na.rm = TRUE) * 100, 2)
  ) %>%
  select(Language, Tweets, T.Share, Accounts, A.Share) %>%
  arrange(desc(Accounts))
lang_stats %>%
  knitr::kable("html",
    format.args = list(
      big.mark = ".",
      decimal.mark = ","),
    caption = "#TwitterDump 2018 – Russian InfoOP Dataset: Languages (unique Tweets)"
  )
Table 1.1: #TwitterDump 2018 – Russian InfoOP Dataset: Languages (unique Tweets)
Language Tweets T.Share Accounts A.Share
en 2.178.206 38,16 2.214 62,40
ru 2.839.362 49,75 982 27,68
es 8.402 0,15 194 5,47
de 86.258 1,51 107 3,02
other_6 NA NA 51 1,44
ar 35.811 0,63 NA NA
bg 36.575 0,64 NA NA
fr 7.501 0,13 NA NA
in 7.543 0,13 NA NA
other_44 42.795 0,75 NA NA
sr 5.604 0,10 NA NA
tl 5.176 0,09 NA NA
uk 49.887 0,87 NA NA
und 404.557 7,09 NA NA

That’s already interesting, but let’s not jump to conclusions about who tweeted in what language, yet… This summary alone does not enable us to claim that, for instance, Russian accounts were responsible for all the Russian Tweets, and so on.

What’s really striking is that roughly 50% of the unique Tweets (Retweets excluded) in this 9M-Tweet IRA dataset are in Russian (or at least labelled as such), which does not quite fit the dominant narrative of a solely US-centric information operation. These numbers show that Russia’s activities were concerned with Russian-speaking people as much as with an English-speaking audience (alongside German and Spanish ones).

Check for any remaining language NA’s

data_unique %>% filter(is.na(tweet_language) | is.na(account_language)) %>% 
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1     0

That’s great news so far!

(Optional: recode if(tweet_lang == NA &/| user_lang == NA))

Since we’ve reduced our dataset and already recoded the NAs, this step is not necessary anymore (before that, I worked without is_retweet == FALSE and things looked a bit different). However, I’m just leaving the syntax here, since it might be useful to others (and to myself).

data_recoded <- data_unique %>%
  mutate(tweet_language = if_else(is.na(tweet_language) & is.na(account_language),
                                  "und",
                            if_else(is.na(tweet_language) & !is.na(account_language),
                                    account_language,
                          tweet_language)
                          )
         ) %>% 
  mutate(account_language = if_else(is.na(tweet_language) & is.na(account_language),
                                  "und",
                              if_else(is.na(account_language) & !is.na(tweet_language),
                                      tweet_language,
                            account_language)
                            )
         )
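
The nested if_else() calls can probably be condensed with dplyr::coalesce(), which returns the first non-missing value per row. A sketch on toy data covering the four NA combinations (the values are made up):

```r
library(dplyr)

toy <- tibble(
  tweet_language   = c("ru", NA,   NA, "en"),
  account_language = c("ru", "en", NA, NA)
)

toy_recoded <- toy %>%
  mutate(
    # take the Tweet language, else the account language, else "und"
    tweet_language   = coalesce(tweet_language, account_language, "und"),
    # and the other way round for the account language
    account_language = coalesce(account_language, tweet_language, "und")
  )

toy_recoded
```

Because mutate() evaluates its arguments in order, the second coalesce() already sees the filled-in tweet_language, just like the two chained mutate() calls above.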

This is a good moment in our data preparation cycle to create a hardcopy of our processed data_unique object in a .rds file. This way, we won’t have to redo all the wrangling and recoding we did so far, and can just start with any in-depth analysis by loading the object with data_unique <- readRDS(file). And we can reduce our local in-memory load with rm(data_raw).

# data_path <- here::here("data", "IRA_Tweets", "/")
# if (!dir.exists(data_path)) dir.create(data_path)
saveRDS(data_unique, str_c(data_path, "infoops_data_processed.rds"))
rm(data_raw)
data_unique <- readRDS(str_c(data_path, "infoops_data_processed.rds"))

2 Who tweeted in what language?

Now it’s about time to look into which account language groups tweeted in what languages.

2.1 Create Language-specific Subsets

This is also a good moment to create language-specific or other interesting subsets from our refined dataset.

German subset

german_subset <- data_unique %>% filter(tweet_language == "de" | account_language == "de")

The German subset has 98064 Tweets by 782 users.

Undefined Subset

undefined_subset <- data_unique %>% 
  filter(tweet_language == "und" | account_language == "und" | 
           is.na(tweet_language) | is.na(account_language))

The undefined (“und” & “NA”) subset has 404557 Tweets by 2867 users.

2.2 Summary Plots: Languages & General Activity

data_unique %>%
  group_by(tweet_language) %>%
  summarise(n = n()) %>%
  mutate(
    share = n / sum(n),
    tweet_language = case_when(
      tweet_language == "" ~ "unspec.",
      tweet_language == "und" ~ "undef.",
      TRUE ~ tweet_language
    )
  ) %>%
  arrange(desc(n)) %>%
  ggplot(aes(area = share)) +
  treemapify::geom_treemap(aes(fill = n), alpha = 0.8) +
  treemapify::geom_treemap_text(
    aes(label = paste0(tweet_language, "\n(", round(share * 100, 1), "%)"))
  ) +
  scale_fill_viridis_c(direction = -1, option = "D") +
  labs(
    title = "#TwitterDump 2018 – Russian InfoOP Dataset: Shares of Languages (Tweets)",
    subtitle = str_c("Total of ", 
      n_distinct(data_unique$tweetid), " unique Tweets (no RTs) from ",
      n_distinct(data_unique$userid), " unique Users"
    ),
    caption = str_c("@fubits")
  ) +
  guides(fill = FALSE)

data_unique %>%
  group_by(account_language) %>%
  summarise(n = n()) %>%
  mutate(
    share = n / sum(n),
    account_language = case_when(
      account_language == "" ~ "unspec.",
      account_language == "und" ~ "undef.",
      TRUE ~ account_language
    )
  ) %>%
  arrange(desc(n)) %>%
  ggplot(aes(area = share)) +
  treemapify::geom_treemap(aes(fill = n), alpha = 0.8) +
  treemapify::geom_treemap_text(
    aes(label = paste0(account_language, "\n(", round(share * 100, 1), "%)"))
  ) +
  scale_fill_viridis_c(direction = -1, option = "D") +
  labs(
    title = "#TwitterDump 2018 – Russian InfoOP Dataset: Shares of Languages (Accounts)",
    subtitle = str_c("Total of ", 
      n_distinct(data_unique$tweetid), " unique Tweets (no RTs) from ",
      n_distinct(data_unique$userid), " unique Users"
    ),
    caption = str_c("@fubits")
  ) +
  guides(fill = FALSE)

2.2.1 Consolidating Tweet Languages per Account

data_counts <- data_unique %>%
  group_by(userid) %>%
  mutate(Account_Lang = account_language) %>% 
  summarise(
    Created = unique(account_creation_date),
    Account_Lang = unique(Account_Lang),
    Tweets = n(),
    RT = sum(retweet_count),
    Follower = unique(follower_count),
    Following = unique(following_count),
    Influence = (((Follower + 1) / (Following + 1)) + (Follower + 1)),
    Tweet_Langs = list(tweet_language),
    Tweet_Langs_Counts = list(unlist(Tweet_Langs) %>% fct_count())
  ) %>%
  arrange(desc(Tweets))

Now the $Tweet_Langs var contains a list of all Tweet Languages from every single Tweet posted by an Account. Compare the number of $Tweets with the length of the vector (in the list) further below.

For $Tweet_Langs_Counts, we have utilized the quite elegant forcats::fct_count() which gives us a list of the aggregated language counts.
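
To see what fct_count() does in isolation, here is a tiny example with a made-up vector of Tweet languages for a single account:

```r
library(forcats)

# Made-up Tweet languages of one account
langs <- c("ru", "ru", "uk", "ru", "en", "uk")

# One row per language (f) with its frequency (n)
lang_counts <- fct_count(langs)
lang_counts
```

The result is a two-column tibble (f and n), which is exactly what ends up in each cell of the $Tweet_Langs_Counts list column.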

So the first User in our dataset - who has tweeted a total of 95210 times - has been busy in 6 languages. Impressive language skills :)

data_counts[1,]$Tweet_Langs_Counts[[1]] %>% knitr::kable("html")
f n
bg 536
en 2
ru 93478
sr 26
uk 1093
und 75

And this is what this tibble looks like (without $userid and $Tweet_Langs for better Website readability):

data_counts %>%  
  select(-userid, -Tweet_Langs) %>% 
  head(10) %>% knitr::kable("html", digits = 0)
Created Account_Lang Tweets RT Follower Following Influence Tweet_Langs_Counts
16237 ru 95210 188332 15753 6862 15756 list(f = 1:6, n = c(536, 2, 93478, 26, 1093, 75))
14553 ru 68784 11700 1543 34 1588 list(f = 1:13, n = c(3, 488, 14, 290, 12, 10, 1, 153, 26633, 90, 9, 686, 40395))
16625 en 64495 1898 791 2 1056 list(f = 1:11, n = c(17, 1105, 44410, 766, 753, 2561, 13267, 1, 1474, 6, 135))
16309 en 59397 44258 66980 10500 66987 list(f = 1:8, n = c(64, 58607, 94, 113, 44, 339, 42, 94))
16204 en 53781 29179 23595 13665 23598 list(f = 1:8, n = c(75, 53067, 90, 135, 28, 308, 9, 69))
16227 en 52838 19670 29357 6720 29362 list(f = 1:8, n = c(129, 51643, 148, 147, 68, 463, 92, 148))
17093 ru 50224 1526 423 187 426 list(f = 1:7, n = c(247, 1, 2, 49436, 8, 442, 88))
16431 en 46738 3619 13358 13851 13360 list(f = 1:8, n = c(46, 45943, 149, 115, 48, 293, 43, 101))
16195 en 46572 25119 35988 11010 35992 list(f = 1:8, n = c(72, 45890, 56, 163, 58, 264, 34, 35))
16626 ru 46440 24685 13913 1650 13922 list(f = 1:7, n = c(344, 1, 4, 45398, 19, 560, 114))

I’ll get back to this in a minute. Let’s now visualize the general characteristics of all accounts in the IRA dataset.

2.2.2 General Activity Plots

ggplot(data_counts, aes(x = Follower, y = Following)) +
  geom_point(aes(size = Tweets, color = userid, alpha = Influence)) +
  scale_color_viridis_d(direction = -1) +
  scale_alpha_continuous(range = c(0.3,1),
                         breaks = scales::pretty_breaks(5)) +
  scale_size(range = c(1,5), labels = scales::number_format(big.mark = ".",
                                                    decimal.mark = ",")) +
  scale_x_continuous(breaks = scales::pretty_breaks(6),
                     labels = scales::number_format(big.mark = ".",
                                                    decimal.mark = ",")) +
  scale_y_continuous(breaks = scales::pretty_breaks(6),
                     labels = scales::number_format(big.mark = ".",
                                                    decimal.mark = ",")) +
  coord_fixed() +
  theme_minimal() +
  labs(
    title = "#TwitterDump 2018 – Russian InfoOP Dataset: Account Stats",
    subtitle = str_c("Total of ", 
      n_distinct(data_unique$tweetid), " unique Tweets (no RTs) from ",
      n_distinct(data_unique$userid), " unique Users"
    ),
    caption = str_c("@fubits"),
    #x = "",
    # y = "",
    size = "# of Tweets",
    alpha = "Alpha: # Retweets"
  ) +
    guides(color = FALSE, alpha = FALSE)

From what we can see here, we have quite an amount of “influencers” - accounts with lots of followers and low rates of following others.
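
To get a feeling for the Influence score used for the alpha channel, it helps to plug in a few numbers. The function below just restates the formula from the summarise() call; the first two calls use Follower/Following values from the table above, the last two are hypothetical accounts.

```r
# Same formula as in the summarise() call above
influence <- function(follower, following) {
  (follower + 1) / (following + 1) + (follower + 1)
}

influence(15753, 6862)  # first row of the table above
influence(1543, 34)     # second row of the table above
influence(1000, 0)      # hypothetical account that follows nobody
influence(1000, 1000)   # hypothetical account that follows as many as follow it
```

Since the follower count enters as an additive term, the score is dominated by Follower; the ratio term only adds a noticeable bonus for accounts that follow very few others.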

What if we look at the account languages?

ggplot(data_counts, aes(x = Follower, y = Following)) +
  geom_point(aes(size = Tweets, color = fct_infreq(Account_Lang), alpha = Influence)) +
  scale_color_viridis_d(option = "B", direction = 1) +
  scale_alpha_continuous(range = c(0.3,1),
                         labels = scales::number_format(big.mark = ".",
                                                    decimal.mark = ",")) +
  scale_size(range = c(1,5), labels = scales::number_format(big.mark = ".",
                                                    decimal.mark = ",")) +
  scale_x_continuous(breaks = scales::pretty_breaks(6),
                     labels = scales::number_format(big.mark = ".",
                                                    decimal.mark = ",")) +
  scale_y_continuous(breaks = scales::pretty_breaks(6),
                     labels = scales::number_format(big.mark = ".",
                                                    decimal.mark = ",")) +
  coord_fixed() +
  theme_minimal() +
  labs(
    title = "#TwitterDump 2018 – Russian InfoOP Dataset: Activity by Account Language",
    subtitle = str_c("Total of ", 
      n_distinct(data_unique$tweetid), " unique Tweets (no RTs) from ",
      n_distinct(data_unique$userid), " unique Users"
    ),
    caption = str_c("@fubits"),
    #x = "",
    # y = "",
    size = "# of Tweets",
    color = "Account Language"
  ) +
    guides(alpha = FALSE,
           colour = guide_legend(override.aes = list(size = 5, stroke = 1.5))
           )

And now let’s just look at when the most influential accounts were created.

influencers <- data_counts %>% 
  arrange(desc(Influence)) %>% 
  top_n(15, Influence)

data_counts %>% 
  arrange(desc(Influence)) %>% 
  ggplot(data = ., aes(x = Follower, y = Following)) +
    geom_point(aes(size = Tweets, color = fct_infreq(Account_Lang), alpha = Influence)) +
    ggrepel::geom_label_repel(data = influencers,
                            aes(label = lubridate::year(Created),
                                fill = Account_Lang),
                            alpha = 0.7) +
    scale_color_viridis_d(option = "B", direction = 1) +
    scale_alpha_continuous(range = c(0.3,1),
                           labels = scales::number_format(big.mark = ".",
                                                      decimal.mark = ",")) +
    scale_size(range = c(1,5), labels = scales::number_format(big.mark = ".",
                                                      decimal.mark = ",")) +
    scale_x_continuous(breaks = scales::pretty_breaks(6),
                       labels = scales::number_format(big.mark = ".",
                                                      decimal.mark = ",")) +
    scale_y_continuous(breaks = scales::pretty_breaks(6),
                       labels = scales::number_format(big.mark = ".",
                                                      decimal.mark = ",")) +
    coord_fixed() +
    theme_minimal() +
    labs(
      title = "#TwitterDump 2018 – Russian InfoOP Dataset: Top 15 (Influence) - Account Creation Date",
      subtitle = str_c("Total of ", 
        n_distinct(data_unique$tweetid), " unique Tweets (no RTs) from ",
        n_distinct(data_unique$userid), " unique Users"
      ),
      caption = str_c("@fubits"),
      #x = "",
      # y = "",
      size = "# of Tweets",
      color = "Account Language"
    ) +
      guides(alpha = FALSE,
             colour = FALSE)

Interesting, one of the most influential accounts is neither Russian nor English-speaking!

data_counts %>% 
  arrange(desc(Influence)) %>% 
  select(userid, Account_Lang, Follower, Following) %>% 
  head(15) %>% knitr::kable("html")
userid Account_Lang Follower Following
2527472164 en 257638 544
508761973 en 149672 1024
4224729994 en 147767 74664
449689677 ru 123989 10
2808833544 ru 134805 2796
2648734430 en 106462 386
890237781737435138 ru 51964 0
3676820373 en 85293 316
3729867851 en 84167 143
2665564544 ru 84642 2575
2882331822 en 79152 22607
4272870988 en 72121 42080
2752677905 en 66980 10500
2518710111 other_6 60869 1246
2573726278 en 59724 556

2518710111 is not hashed :) Let’s find out who this is!

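# data_raw was removed with rm() above, so reload it first
data_raw <- readRDS(str_c(data_path, "infoops_data.rds"))
data_raw %>% 
  filter(userid == "2518710111") %>% 
  select(userid, user_display_name, user_screen_name, account_language) %>% 
  head(1) %>% knitr::kable("html")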
userid user_display_name user_screen_name account_language
2518710111 Вестник Новосибирска NovostiNsk en-gb

“Вестник Новосибирска” - Newspaper of Novosibirsk - doesn’t sound too British :) The account has been suspended by Twitter, btw.

What about accounts with the most Tweets?

data_counts %>% 
  ggplot(data = ., aes(x = Follower, y = Following)) +
    geom_point(aes(size = Tweets, color = fct_infreq(Account_Lang), alpha = RT)) +
    ggrepel::geom_label_repel(data = data_counts[1:15,],
                            aes(label = lubridate::year(Created),
                                fill = Account_Lang),
                            alpha = 0.7) +
    scale_color_viridis_d(option = "B", direction = 1) +
    scale_alpha_continuous(range = c(0.3,1),
                           labels = scales::number_format(big.mark = ".",
                                                      decimal.mark = ",")) +
    scale_size(range = c(1,5), labels = scales::number_format(big.mark = ".",
                                                      decimal.mark = ",")) +
    scale_x_continuous(breaks = scales::pretty_breaks(6),
                       labels = scales::number_format(big.mark = ".",
                                                      decimal.mark = ",")) +
    scale_y_continuous(breaks = scales::pretty_breaks(6),
                       labels = scales::number_format(big.mark = ".",
                                                      decimal.mark = ",")) +
    coord_fixed() +
    theme_minimal() +
    labs(
      title = "#TwitterDump 2018 – Russian InfoOP Dataset: Top 15 (# Tweets) - Account Creation Date",
      subtitle = str_c("Total of ", 
        n_distinct(data_unique$tweetid), " unique Tweets (no RTs) from ",
        n_distinct(data_unique$userid), " unique Users"
      ),
      caption = str_c("@fubits"),
      #x = "",
      # y = "",
      size = "# of Tweets",
      color = "Account Language"
    ) +
      guides(alpha = FALSE,
             colour = FALSE)

We can see from both plots that the most influential or active accounts were created in 2014 or later, and that Russian- and English-labelled accounts are rather balanced in terms of maximum Tweet counts. However, English-speaking accounts dominate in terms of reach (Followers vs. Following).

2.3 Languages: Accounts vs Tweets

Now it’s time to have a look at how account languages relate to Tweet languages.

This is what the Top 20 (of 57) language combinations look like:

data_unique %>% 
  group_by(account_language, tweet_language) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  head(20) %>% 
  knitr::kable("html", caption = "Top 20 Language Combinations from the IRA Dataset")
Table 2.1: Top 20 Language Combinations from the IRA Dataset
account_language tweet_language n
ru ru 1985869
en en 1842222
en ru 836074
ru und 310192
es en 275732
en und 80810
de de 78301
other_6 en 51105
en other_44 35542
en ar 33013
ru uk 28038
ru bg 21923
en uk 21520
other_6 ru 17679
en bg 14539
de und 10387
ru en 8500
en es 7259
en de 6637
en in 6407

Ok, so we probably could have expected that Russian and English speaking accounts would mostly tweet in their respective languages. But who would have suspected that 836.074 Russian-language Tweets were posted by English-speaking accounts? Right :)

Now let’s visualize all the 57 Account Language -> Tweet Language combinations. For an easier understanding of these plots, just keep in mind that Tweets are posted by accounts, so the Tweet language (bottom) is our dependent variable here.

data_unique %>% 
  group_by(account_language, tweet_language) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
    ggplot() +
    geom_tile(aes(x = tweet_language,
                  y = account_language,
                  fill = n)) +
    scale_fill_viridis_c(option = "B", direction = -1,
                         breaks = scales::pretty_breaks(4),
                         labels = scales::number_format(big.mark = ".",
                                                        decimal.mark = ",")) + 
    coord_fixed() +
    theme_minimal() +
    labs(
      title = "#TwitterDump 2018 – Russian InfoOP Dataset: Language Matrix",
      subtitle = str_c(
        "Subset of ", n_distinct(data_unique$tweetid),
        " unique Tweets (no RTs) from ",
        n_distinct(data_unique$userid), " unique Users"
      ),
      caption = str_c("@fubits"),
      fill = "Language Combo:\nTotals",
      x = "Tweet Language",
      y = "Account Language"
    )

data_unique %>% 
  group_by(account_language, tweet_language) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
    ggplot() +
    geom_tile(aes(x = tweet_language, 
                  y = account_language,
                  fill = n/sum(n))) +
    scale_fill_viridis_c(option = "B", direction = -1,
                         breaks = scales::pretty_breaks(6),
                         labels = scales::percent_format(accuracy = 1)) + 
    coord_fixed() +
    theme_minimal() +
  # theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
    labs(
      title = "#TwitterDump 2018 – Russian InfoOP Dataset: Language Matrix",
      subtitle = str_c(
        "Subset of ", n_distinct(data_unique$tweetid), 
        " unique Tweets (no RTs) from ",
        n_distinct(data_unique$userid), " unique Users"
      ),
      caption = str_c("@fubits"),
      fill = "Language Combo:\nShare of Total"
    )
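
The second matrix shows each combination’s share of all Tweets. A complementary view, which is not part of the analysis above and only sketched here as an idea, is to normalise within each account language, i.e. to ask what fraction of an account group’s Tweets went into each Tweet language. The toy input is an excerpt of four rows from Table 2.1, so the resulting shares are relative to this excerpt only.

```r
library(dplyr)

# Four rows taken from Table 2.1 (excerpt, not the full matrix)
combos <- tibble(
  account_language = c("ru", "ru", "en", "en"),
  tweet_language   = c("ru", "und", "en", "ru"),
  n                = c(1985869, 310192, 1842222, 836074)
)

shares <- combos %>%
  group_by(account_language) %>%
  mutate(share_within_account = n / sum(n)) %>%  # shares sum to 1 per account language
  ungroup()

shares
```

Mapping share_within_account to fill in geom_tile() would then make rows comparable even when one account language has far more Tweets than another.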

Alright, that’s enough exploration and heavy data wrangling for today. Stay tuned for Part 2: Content Analysis

Here’s just a teaser of what awaits us:

data_unique %>%
  filter(tweet_language == "ru") %>% 
  select(tweet_text) %>% 
  head(10) %>%
  knitr::kable("html")
tweet_text
Серебром отколоколило http://t.co/Jaa4v4IFpM
Предлагаю судить их за поддержку нацизма, т.к. они отказались его осуждать!! #STOPNazi
Двойная утопия, или Нет у Европы Трампа, кроме Путина https://t.co/MbxCpuLdDl
Генпрокуратура назвала взносы на капремонт неконституционными https://t.co/IOiSR4kssB
Кировский пасхальный обед один из самых дешевых в стране https://t.co/11se2vaZJS
В Кирове на два месяца будет ограничено движение по улице Кольцова https://t.co/iRT0jWIS9z
Пятница в Кирове: солнечный чай и любимое занятие https://t.co/jATZrjoAdy
ИнстаКиров: топ прокачанных парней в спортзале http://t.co/q45lGEaHla
Что обсуждают в Кирове: убийство женщины и концерт “Арии” https://t.co/MUhZaTSXAA
В музее Циолковского открылась выставка, посвященная нашему земляку Юрию Тухаринову https://t.co/T3BR53yu7B https://t.co/WDDkBworPE

Or what about “undefined”?

data_unique %>%
  filter(tweet_language == "und") %>% 
  select(tweet_text) %>% 
  head(10) %>% 
  knitr::kable("html")
tweet_text
#волгоград #сталинград https://t.co/RCoCV12FP0
*** http://t.co/qpmvojRvYx
@stratosathens Strat, Every time we are in the state of burning chemical plants! Why is it happened with us?!
http://t.co/RFTj00YJXU
https://t.co/pDSEWw50AH
http://t.co/yYLD9IhvYO
@b059c98801ff33056eb46bee9088256f2b6b85dc8d6926579bf42dbf1b2d9c1a Curiosity «завис» на Марсе из-за сбоя в компьютере
@bee7d3ea2fb7ece347e636f3c21b9340e1e17626bfd7543d1d36b51d00d0a4ce Чехия испытывает дефицит российской нефти
Автомобиль Дворковича попал в ДТП
@050f6f287ff564463cb290ec9933a045e07d0eea41352e3ed893cd570e97adbb “Доктор Питер”: Петербургские инвесторы могут построить в Белоруссии фармацевтический завод #СПб #spb

Btw, have you already discovered the US-Republican-sponsored (sic!) RussiaTweets.com Project?