[R] Election Hacking: Exploring 10 Million Tweets from the Russian Internet Research Agency Dataset, Pt. 2 - Corpus & DFM
Abstract / TL;DR
This is the second part of my "Internet Research Agency" series. We'll encounter in-memory computing problems, address them with tweaks on Windows and Linux, and finally succeed at building a 5.7M x 1.5M sparse Document-Feature Matrix.
Now that we are done with pre-processing the Internet Research Agency dataset, we can unleash the full R powers and dive into automated content analysis. For a primer, see the following excellent intros and posts re: core NLP concepts, R packages, and tools:
- Emmanuel Ameisen: How to solve 90% of NLP problems: a step-by-step guide (this is a very beginner-friendly overview of key NLP concepts. Not R related, but a great TL;DR primer.)
- Julia Silge: The game is afoot! Topic modeling of Sherlock Holmes stories
- Julia Silge and David Robinson: Text Mining with R - A Tidy Approach
- Cornelius Puschmann: Automatisierte Inhaltsanalyse mit R (in German)
- Kenneth Benoit et al.: quanteda: An R package for the quantitative analysis of textual data
- Benoit et al.: Quanteda vignette: Getting Started with Quanteda
- Nicolas Merz: Using the Manifesto Corpus with quanteda
- Bonus: Carsten Schwemmer (who introduced me to the Tidyverse and automated content analysis with R, btw!) has built a Shiny app which enables you to explore Structural Topic Models with a browser-based GUI: stminsights - A Shiny Application for Structural Topic Models
Just a word of caution: Topic Modelling is a statistical method, and doing it the way I’m doing it here - on the full 5.7M-Tweet corpus, or even on a subset of 2M Tweets without any frequency-based feature reduction (i.e. trimming with quanteda::dfm_trim()) - will take some time. A lot of time, actually. I have a decent gamer’s laptop with 16GB RAM, a quad-core Intel i7-7500 CPU and an SSD (SATA, though) - and computing the topic model based on the English subset (2M Tweets, with auto-induced K = 73 topics, based on 15K features -> cf. the upcoming 3rd part of this series) took a whole day (22.5 hours). Don’t try this at home, or at least not without some free AWS credits :)
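For reference, frequency-based feature reduction with quanteda::dfm_trim() looks roughly like this - a minimal toy sketch with made-up sentences and purely illustrative thresholds (for the IRA dfm, the thresholds would of course be much higher):
library(quanteda)
toy_dfm <- dfm(tokens(c("the quick brown fox", "the lazy dog", "the fox again")))
# keep only features that occur at least 2 times overall AND in at least 2 documents
dfm_trim(toy_dfm, min_termfreq = 2, min_docfreq = 2)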
1 Packages & Data
Let’s get started. Here are some tweaks/tricks/adjustments I discovered while writing this post (i.e. in this 8-year-old Stack Overflow thread) in order to increase computing performance ;) Namely (see the short sketch after this list):
- memory.limit(size = NA) sets the maximum memory limit to whatever is possible on your system (it’s a bit more complicated on Windows)
- You can enforce R’s garbage collection at any time with gc() (see the vignette for the params); this is esp. helpful after rm(huge_object)
- Save any large output as an .rds or .RData (if you can afford the disk space) to your hard drive, clear your environment and start the next step with as low a RAM footprint as possible
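Here’s the sketch mentioned above - file and object names are placeholders, and the rnorm() call just stands in for some expensive computation:
memory.limit(size = NA)                 # Windows only (a no-op on recent R versions)
big_result <- rnorm(1e6)                # placeholder for an expensive computation step
saveRDS(big_result, "big_result.rds")   # persist the result to disk ...
rm(big_result)                          # ... drop the object from the environment ...
gc()                                    # ... and trigger garbage collection right away
big_result <- readRDS("big_result.rds") # reload it later, ideally in a fresh session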
My key learning of this post, however: If you’re on Windows, set up a Linux distro to dual-boot. R is so much more performant on Ubuntu et al. (also with regards to Unicode, path length, memory management and much more). You’ll start the environment with at least 1GB more of free RAM; you’ll have full CLI control over the swap drive/file; and then some more.
Let’s start. Here are the packages we’ll need:
library(tidyverse)
library(quanteda)
quanteda_options("threads" = 4) # 4 is the limit on my Laptop
print(str_c("Threads for Quanteda: ", quanteda_options("threads")))
## [1] "Threads for Quanteda: 4"
library(stm)
library(lubridate)
Since we’ve already pre-processed the Internet Research Agency dataset, we can just load the exported data object and go on with corpus-building.
data_path <- here::here("data", "IRA_Tweets", "/")
data_unique <- readRDS(str_c(data_path, "infoops_data_processed.rds"))
# devtools::install_github("brooke-watson/BRRR")
BRRR::skrrrahh("flava") # play sound when ready
2 International Corpus: An Exercise in “Too Big for In-Memory”
2.1 Data Exploration: Time-Series Plots
As I didn’t do this in my last post, let’s have a quick look at the time-wise distribution of Tweets in total, and grouped by Tweet and Account languages. This way, we might see what time periods and what languages might be worth special attention.
First, we’ll reduce the observations to language groups and further aggregate on a weekly basis.
IRA_TS <- data_unique %>%
mutate(tweet_time = as_date(tweet_time),
week = lubridate::floor_date(tweet_time, "week")) %>%
group_by(week, account_language, tweet_language) %>%
count() %>%
arrange(desc(n))
IRA_TS %>%
head() %>%
knitr::kable("html")
week | account_language | tweet_language | n |
---|---|---|---|
2014-07-13 | ru | ru | 89198 |
2015-03-15 | en | en | 71421 |
2017-08-13 | es | en | 71037 |
2017-08-06 | es | en | 55256 |
2014-06-22 | ru | ru | 49209 |
2014-08-10 | ru | ru | 46044 |
Now we’ll plot time series grouped by account languages.
IRA_TS %>%
ggplot() +
geom_line(aes(x = week, y = n, color = fct_infreq(account_language)),
alpha = 0.5) +
scale_color_brewer(palette = "Paired", direction = -1) +
scale_x_date(limits = c(as_date("2009-01-01"), NA),
breaks = scales::date_breaks(width = "1 year"),
labels = scales::date_format(format = "%Y"),
minor_breaks = "6 months") +
scale_y_continuous(breaks = scales::pretty_breaks(9),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
theme_minimal() +
labs(
title = "Russian Internet Agency Dataset: Weekly Activity per Account Language",
subtitle = str_c(
"Subset of ", n_distinct(data_unique$tweetid),
" unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users"
),
caption = "@fubits",
color = "Account Language",
x = "", y = "Tweets" ) +
guides(colour = guide_legend(override.aes = list(size = 5, stroke = 1.5)))
Well, those are some interesting peaks - first “ru”, then “en”, then a little phase with “ru” followed by “es” (as in Español)! We can neglect everything before 2012 for now, so the following plots should become a bit more detailed.
Let’s see what we can observe by plotting the Tweet Languages and inspect the top peaks’ dates. To further reduce noise, we’ll filter out languages with less than 500 Tweets per week from 2012 on (as we have aggregated the Tweets on a weekly basis, in fact, only week-wise values will be filtered out).
IRA_TS %>%
filter(week > "2011-12-25" & n > 500) %>%
ggplot() +
geom_line(aes(x = week, y = n, color = fct_infreq(tweet_language)),
alpha = 0.5) +
scale_color_brewer(palette = "Paired", direction = -1) +
scale_x_date(limits = c(as_date("2011-11-01"), NA),
breaks = scales::date_breaks(width = "1 year"),
labels = scales::date_format(format = "%Y"),
minor_breaks = "6 months") +
scale_y_continuous(breaks = scales::pretty_breaks(9),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
theme_minimal() +
labs(
title = "Russian Internet Agency Dataset: Weekly Activity per Tweet Language",
subtitle = str_c(
"Subset of ", n_distinct(data_unique$tweetid),
" unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users"
),
caption = "@fubits",
color = "Tweet Language",
x = "",
y = "Tweets"
) +
guides(colour = guide_legend(override.aes = list(size = 5, stroke = 1.5)))
Those patterns are really telling! And we can add one more helper: text labels (the Mondays of the respective weeks) for those top peaks (global maxima), colored by the respective Account languages. Plus, the real activity starts to gain momentum in 2014, so we should neglect everything before that for now.
IRA_TS %>%
filter(week > "2013-12-23" & n > 500) %>%
ggplot() +
geom_line(aes(x = week, y = n, color = fct_infreq(tweet_language)),
alpha = 0.5) +
geom_point(data = IRA_TS[1:10,], aes(x = week, y = n,
color = fct_infreq(tweet_language)), size = 2) +
scale_color_brewer(palette = "Paired", direction = -1, aesthetics = c("color", "fill")) +
ggrepel::geom_label_repel(data = IRA_TS[1:10,],
aes(x = week, y = n,
label = str_c(month(week, label = T, abbr = T), "-", day(week)),
fill = account_language),
size = 3) +
scale_x_date(limits = c(as_date("2013-10-01"), NA),
breaks = scales::date_breaks(width = "1 year"),
labels = scales::date_format(format = "%Y"),
minor_breaks = "6 months") +
scale_y_continuous(breaks = scales::pretty_breaks(9),
labels = scales::number_format(big.mark = ".",
decimal.mark = ",")) +
theme_minimal() +
labs(
title = "Russian Internet Agency Dataset: Weekly Activity per Language (Top 10 Peaks)",
subtitle = str_c(
"Subset of ", n_distinct(data_unique$tweetid),
" unique Tweets (no RTs) from ",
n_distinct(data_unique$userid), " unique Users", "; Label: week's Monday; Label fill: Account language"
),
caption = "@fubits",
color = "Language",
x = "",
y = "Tweets"
) +
guides(color = guide_legend(override.aes = list(size = 5, stroke = 0)),
fill = FALSE)
Technically, we could re-do this for all peaks of any particular Tweet or Account language and find out more about the particular events (e.g. German-language Tweets peaking around the Federal Elections in autumn 2017 - barely visible with this color palette) with the help of Wikipedia’s super-handy current events portal.
For instance, we can observe Russian-language Tweets starting to peak at the beginning of 2014:
- post-Maidan, the Ukraine-Russia conflict just entered a full-fledged armed conflict period, complemented by Russia’s Annexation of Crimea et al.
- on 17 July Malaysia Airlines Flight MH17 was shot down over Ukraine (cf. Bellingcat’s OSINT work).
As for the English-language Tweets (attributed to the Russian “Internet Research Agency” by Twitter’s Election Integrity team in this dataset), we see them gain momentum starting in mid-2014, shortly after the downing of MH17…
Beyond that, in 2015:
- the Greeks elected a new Parliament on 4 January
- the Charlie Hebdo attack took place on 7 January,
- the Ferguson unrest had another bloody month in March
- well… and on 16 June 2015, Trump announced his presidential campaign.
Post-election, English-language Tweets calmed down and didn’t peak again before August 2017:
- on 3 August 2017, “Special counsel Robert Mueller has impaneled a grand jury in Washington, D.C. to investigate allegations of Russia’s interference in the 2016 elections”1
- (and Neymar was bought by PSG for €222 million on the same day, which might explain why Spanish accounts were most active here)
- on 12 August 2017, “fights break out between white supremacists and counterprotesters in Charlottesville, Virginia, over the removal of the Robert Edward Lee Sculpture.”2.
- on 13 August 2017 the Barcelona attacks happened
As we can see, the activities really peak around certain events, either concerning US-American left-right clashes and other intra-state affairs, or topics with Russian involvement. What’s striking, however, is that the days around the infamous Brexit referendum (23/24 June 2016) are among the time periods with the overall lowest activity.
Just for comparison (of plain numbers with the usefulness of exploration by data visualization), here’s a table of the data we just visualized. You can easily identify the dates of the events from the numbers alone. But in order to recognize patterns, esp. language patterns, a visual representation is really helpful. However, without this table I probably would have missed the Spanish-language accounts peaking with English-language Tweets in two subsequent weeks…
IRA_TS %>%
group_by(week, account_language, tweet_language) %>%
summarise(n = sum(n)) %>%
arrange(desc(n)) %>%
head(20) %>%
knitr::kable("html")
week | account_language | tweet_language | n |
---|---|---|---|
2014-07-13 | ru | ru | 89198 |
2015-03-15 | en | en | 71421 |
2017-08-13 | es | en | 71037 |
2017-08-06 | es | en | 55256 |
2014-06-22 | ru | ru | 49209 |
2014-08-10 | ru | ru | 46044 |
2015-01-04 | en | en | 45167 |
2014-10-05 | ru | ru | 44925 |
2014-07-20 | en | ru | 41167 |
2014-07-13 | en | ru | 40059 |
2017-07-30 | es | en | 35927 |
2014-06-29 | ru | ru | 33153 |
2014-08-17 | ru | ru | 33068 |
2014-07-27 | en | ru | 31603 |
2014-06-15 | ru | ru | 31592 |
2015-01-18 | en | en | 30791 |
2014-10-19 | ru | ru | 30689 |
2015-01-11 | en | en | 30341 |
2014-08-24 | en | en | 29403 |
2014-08-17 | en | en | 29123 |
Alright. Point made. Let’s continue with preparing our automated content analysis.
rm(IRA_TS) # free up some RAM first
2.2 Building a Corpus with Quanteda
We’re going to build a corpus from our dataset, turn the corpus into a sparse Document-Feature Matrix (DFM), reduce unnecessary features (words/tokens) from the DFM and then use the DFM for Structural Topic Modelling.
Now let’s try to do something really naive: we’ll build a corpus from all Tweets - in all languages. This will take a minute or two (actually: each step took me hours…) with corpus() and dfm(), but this way we can explore a lot of presumed topics in various languages early on, and test where the limits of in-memory computing with text as data might be with R (“Big Data calling”)…
Corpus from all Tweets
system.time({ # just measuring run-time
IRA_corpus <- data_unique %>%
corpus(text_field = "tweet_text",
docid_field = "tweetid",
docvars = data_unique, # probably obsolete when docid_field is set
compress = FALSE # compression is better for in-memory, but will slow down processing
)
})
BRRR::skrrrahh("flava")
2.2.1 Memory Management
That’s my fairly refined approach (from this post) to working in-memory with as low a RAM footprint as possible:
- save the data object to the hard drive with saveRDS(obj)
- clear your R environment with rm() and gc() (gc := garbage collection)
- restart R with .rs.restartR() or rstudioapi::restartSession() (not sure yet where they really differ)
- load the data object from the hard drive with obj <- readRDS()
- go on with the next task
save IRA_corpus to HDD
saveRDS(IRA_corpus, str_c(data_path, "IRA_corpus_naive.rds"))
# saveRDS(IRA_corpus, str_c(data_path, "IRA_corpus_naive_compressed.rds"))
clear environment and restart R
rm(data_unique)
rm(IRA_corpus)
gc(full = TRUE, verbose = TRUE)
# .rs.restartR()
rstudioapi::restartSession()
Linux: down to 1.4 GiB
load IRA_corpus from HDD
data_path <- here::here("data", "IRA_Tweets", "/")
IRA_corpus <- readRDS(str_c(data_path, "IRA_corpus_naive.rds"))
# IRA_corpus <- readRDS(stringr::str_c(data_path, "IRA_corpus_naive_compressed.rds"))
BRRR::skrrrahh("flava")
2.3 Corpus Exploration: Keywords-in-Context
Now we can explore some specific keywords and their full-text context, e.g. those which we can already assume should take up a prominent spot in the corpus.
In particular, I’m interested in the following terms:
- компромат | kompromat (just for you, @RidT)
- soros
- fake news: phrase(“fake news”)
- pee tape: phrase(“pee tape”)
- msm
- hillary | clinton
- nazi
- antifa
- zion
- civil war
- deep state
test kwic() functionality
IRA_corpus %>%
corpus_subset(tweet_language == "ru" | tweet_language == "en") %>%
corpus_sample(1000000) %>% # always test with smaller sample
kwic(c("компромат","kompromat"), window = 10, case_insensitive = TRUE) %>%
# vs. kwic(x, phrase("term1 term2"))
as_tibble() %>% # needed for kwic()
select(pre:post) %>%
head(10) %>%
knitr::kable("html")
pre | keyword | post |
---|---|---|
Trump wants this fountain to be in charge of Russia’s | Kompromat | on him . https : / / t.co / H8P7CMMxzA |
We’ll put the keywords into a (kwic()-friendly) vector and then do a single run with all keywords instead of repeating this n times. This will take time and might not work properly without enough RAM.
kwic_keywords <- c(
c("компромат","kompromat"),
"soros",
phrase("fake news"),
phrase("pee tape"),
"msm",
c("hillary", "clinton"),
"nazi",
"antifa",
"zion",
phrase("civil war"),
phrase("deep state")
)
enforce Garbage Collection to free up RAM
gc(full = TRUE, verbose = TRUE)
now search for Keywords-in-Context from the kwic_keywords vector
library(magrittr)
system.time({
kwic_results <- IRA_corpus %>%
# corpus_sample(1000000) %>% # test with a small-scale sample first
quanteda::kwic(kwic_keywords, window = 10, case_insensitive = TRUE)
})
saveRDS(kwic_results, str_c(data_path, "kwic_corpus_naive.rds"))
# this genius thing plays Flavor Flav when the chunk is done
BRRR::skrrrahh("flava")
I was getting this error repeatedly on Windows:
Error: cannot allocate vector of size 64.0 Mb
Timing stopped at: 1599 562.6 2294
This was the moment where I arrived at my individual “too big for in-memory” point of failure - even with 16GB RAM. After trying out some memory.size() tricks on Windows, among other things, I decided to change my OS (for R, at least) to Ubuntu (MATE 18.10) and set up dual-boot. My first motivation was to reduce some memory load (i.e. Skype, iTunes, Office et al. helpers all running in the background). Eventually, I had to extend the “in-memory” capacity by creating a dedicated 64GB swap drive on a USB 3.0 thumb drive. As soon as I get hold of a new, bigger, and faster SSD, I’ll have a look at the performance of an SSD-based swap drive.
Visualization
kwic_results <- readRDS(str_c(data_path, "kwic_corpus_naive.rds"))
kwic_results %>%
mutate(keyword = tolower(keyword)) %>%
group_by(keyword) %>%
count() %>%
arrange(desc(n)) %>%
knitr::kable("html")
keyword | n |
---|---|
clinton | 18216 |
hillary | 17353 |
antifa | 6771 |
fake news | 2897 |
soros | 2673 |
msm | 1567 |
nazi | 1273 |
deep state | 852 |
civil war | 741 |
компромат | 118 |
zion | 67 |
pee tape | 7 |
kompromat | 1 |
This is not as many results as I would have expected. However, this might have been caused by my not explicitly defining the search pattern type as “glob” or “regex”. I’ll re-do this post with better equipment soon-ish.
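One hedged option (not re-run here, same quanteda call style as above): kwic() takes a valuetype argument, so the patterns can be matched explicitly as globs or regular expressions, which would also catch hashtags and other tokens that merely contain a keyword.
# glob matching: "*soros*" also matches tokens such as "#soros"
kwic_glob <- kwic(IRA_corpus, "*soros*",
                  window = 10, valuetype = "glob", case_insensitive = TRUE)
# regex matching for finer control over the pattern
kwic_regex <- kwic(IRA_corpus, "soros",
                   window = 10, valuetype = "regex", case_insensitive = TRUE)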
kwic_results %>%
filter(keyword == "soros") %>%
select(pre, keyword, post) %>%
head(10) %>%
knitr::kable("html")
pre | keyword | post |
---|---|---|
Ah_peaceseeker3 AmericanVoterUS KevinDaGunny FoxNews realDonaldTrump Right ? It’s coming from | soros | European ba … https : / / t.co / vdqVmwNR7v |
MittRomney I’ve learned they’re both antifa’s . I’m positive . | soros | put out ads 7 d … https : / / |
RT apostlelaurinda : howjohns123 CGHallenbeck RealJamesWoods all funded by | soros | and saudi prince $ $ $ . . . birds |
TaylorLorenz Another | soros | troll ? Or truly compassionate . These days none of |
them on the list too . They’re the colluders with | soros | . |
Soros / Killary gonna take place of ANTIFA ANOTHER 501C | soros | / K … https : / / t.co / U7QhxwnPyN |
You guys are pushing the antifa agenda . You know | soros | paid off Gov , Mayor & amp ; Police Chief |
glad you agree . We’re now watching phase 2 of | soros | plan . Destroy US infrastructur … https : / / |
RT MikeObrigewitch : TeamTrumpAZ No I think obama and | soros | should be removed |
the entire movement . These kids are being controlled by | soros | via the M … https : / / t.co / |
Now it’s really easy to visualize the distribution of our selected keywords over time and by language and weekly frequency:
kwic_results %>%
left_join(data_unique, by = c("docname" = "tweetid")) %>%
select(keyword, tweet_time, account_language, tweet_language) %>%
mutate(keyword = tolower(keyword), date = floor_date(tweet_time, "week")) %>%
group_by(keyword, date, account_language) %>%
count() %>%
ggplot() +
geom_point(aes(x = date, y = fct_rev(keyword),
color = account_language,
size = n), alpha = 0.5) +
scale_color_brewer(palette = "Paired", direction = 1) +
labs(title = "Russian Internet Agency Dataset: Selected Keywords - Weekly Activity per Account Language",
subtitle = str_c("Based on n = ", nrow(kwic_results), " hits from the full IRA dataset"),
caption = "@fubits",
x = "Weekly Activity",
y = "",
color = "Account\nLanguage") +
theme_minimal() +
guides(color = guide_legend(override.aes = list(size = 5,
stroke = 0,
alpha = 1)))
The Spanish-language dominance in 2017 looks really weird.
kwic_results %>%
left_join(data_unique, by = c("docname" = "tweetid")) %>%
filter(account_language == "es") %>%
select(account_language, tweet_language, pre, keyword, post, tweet_time) %>%
head(20) %>%
knitr::kable("html")
account_language | tweet_language | pre | keyword | post | tweet_time |
---|---|---|---|---|---|
es | en | SHOCKING : Look What Crooked Favor James Comey Did For | Hillary | Clinton https : / / t.co / weUNOc0m98 | 2017-09-01 16:55:00 |
es | en | : Look What Crooked Favor James Comey Did For Hillary | Clinton | https : / / t.co / weUNOc0m98 | 2017-09-01 16:55:00 |
es | en | RT Cernovich : | ANTIFA | , a left wing group , would punch solo #UnitetheRight | 2017-08-12 19:24:00 |
es | en | #ade BOMBSHELL STUDY : Voter Fraud May Have Resulted in | Hillary | Winning New Hampshire https : / / t.co / a2KZEMQz1B | 2017-07-31 22:31:00 |
es | en | #darccy Trump Supporters REACT to Mitt Romney PRAISING | ANTIFA | ! https : / / t.co / QGyEnhJWfx #darcy https | 2017-08-17 00:47:00 |
es | en | #mar RT comermd : | Antifa | Democrats have successfully aligned with Nazi beliefs . Tomecide is | 2017-08-16 17:52:00 |
es | en | #mar RT comermd : Antifa Democrats have successfully aligned with | Nazi | beliefs . Tomecide is common before destroying a civilization ! | 2017-08-16 17:52:00 |
es | en | #emel BREAKING : Masked | Antifa | Thugs Pick a FIGHT With Chicago Police https : / | 2017-08-16 15:49:00 |
es | en | RT stranahan : | Soros | funded spokesman runs from Beth Chapman https : / / | 2017-08-01 03:01:00 |
es | en | #IStandWithHannity #IStandWithEricBolling Free speech & amp ; truth ! NEVER | Soros | facists seanhannity eri … https : / / t.co / | 2017-08-08 14:29:00 |
es | en | : What the Dem #Fake Media Is Hiding About the | Clinton | Email Probe https : / / t.co / JviO83XBFA #just | 2017-08-10 11:11:00 |
es | en | Sorry | Hillary | , Former President Carter Didn’t Vote For You https : | 2017-05-09 12:46:00 |
es | en | VIDEO : BUTSTED ! Bill | Clinton | Caught Ogling Ivanka Trump While Hillary Looks on in Disgust | 2017-01-21 01:53:00 |
es | en | : BUTSTED ! Bill Clinton Caught Ogling Ivanka Trump While | Hillary | Looks on in Disgust https : / / t.co / | 2017-01-21 01:53:00 |
es | en | Trump to Release Unheard | Hillary | Tape … Stunning , Unprecedented https : / / t.co | 2017-05-15 18:40:00 |
es | en | BREAKING : Man Planning to KILL Trump was | Hillary | Clinton’s Close Friend ! His Plan is SICK … https | 2017-01-22 17:34:00 |
es | en | Peter Schweizer : | Clinton | Foundation Probe Must Continue - You Can’t Say ’ She | 2016-11-13 01:58:00 |
es | en | Look What | Antifa | THUGS Did to This Black Student in Charlottesville https : | 2017-08-17 16:48:00 |
es | en | #adsnn Violent | Antifa | THUG Arrested for Assault in Charlottesville https : / / | 2017-08-14 01:11:00 |
es | en | There is no one in the | MSM | who will tell the truth - period . Enlist in | 2017-08-18 01:17:00 |
Hm, does not look too Spanish to me. That’s another topic for further exploration and investigation.
#TODO
2.4 Building a 5.7M x 1.5M Sparse Document-Feature Matrix
From a tactical perspective with regards to large corpora and in-memory processing, this particular Stack Overflow post introduced an interesting but actually misleading approach for creating a huge dfm: 000andy8484 - Create dfm step by step with quanteda.
After I had repeatedly failed to build a dfm from the untouched corpus (on Windows & Linux), this SO post showed how to reduce the number of features in the corpus by tokenizing the texts and removing unwanted tokens before creating the dfm. This way I really gained some momentum for dfm() (by reducing dimensions), but we also lose the analytic methods / functionality of a corpus object (e.g. kwic()) as we destroy the basic corpus structure (full Tweets in our case) by replacing the texts with tokens. This experiment was more of a performance evaluation than a recommended practice (for now, at least). I ended up with something like a 60M x 1.5M DFM, which obviously shows that I had lost the document-based corpus structure.
Having said that, here’s our regular strategy:
- create untouched corpus (we did this above, already)
- ~~remove obsolete tokens from corpus by directly tokenizing the texts~~
- ~~remove stopwords from corpus~~
- create dfm from ~~reduced~~ untouched corpus
- reduce dfm by removing unwanted features (stopwords, URLs, 2-char tokens and so on)
- build dfm subsets for specific analyses
- build topic models with stm() based on variations of the dfm
Naive Corpus
# data_path <- here::here("data", "IRA_Tweets", "/")
IRA_corpus <- readRDS(stringr::str_c(data_path, "IRA_corpus_naive.rds"))
2.4.1 Stopwords
First of all, we will need to create a list of very frequent tokens/words which we can simply filter out of our corpus. This is my current best practice with regards to stopwords: I’m using three different sources to compile a super-list of stopwords for the languages concerned. (As for “amp”, this is an artifact I encountered in my previous posts: it’s just a left-over of &amp;, which is the HTML entity for string-parsing &, and is created when we use remove_punct = TRUE. The same is true for &quot; and probably some other unparsed/escaped entities in the IRA dataset.)
# repo: https://github.com/stopwords-iso
en_stopwords <- readr::read_lines("https://raw.githubusercontent.com/stopwords-iso/stopwords-en/master/stopwords-en.txt")
ru_stopwords <- readr::read_lines("https://raw.githubusercontent.com/stopwords-iso/stopwords-ru/master/stopwords-ru.txt")
de_stopwords <- readr::read_lines("https://raw.githubusercontent.com/stopwords-iso/stopwords-de/master/stopwords-de.txt")
custom_stopwords_en <- unique(c(en_stopwords,
tidytext::stop_words$word,
quanteda::stopwords("english"))) # only keep uniques
custom_stopwords_ru <- unique(c(ru_stopwords,
quanteda::stopwords("russian"))) # only keep uniques
custom_stopwords_de <- unique(c(de_stopwords,
quanteda::stopwords("german"))) # only keep uniques
stopwords_extra <- c("amp", "http", "https", "quot")
custom_stopwords_en <- c(custom_stopwords_en, stopwords_extra)
custom_stopwords_ru <- c(custom_stopwords_ru, stopwords_extra)
custom_stopwords_de <- c(custom_stopwords_de, stopwords_extra)
rm(en_stopwords) # free up RAM. We'll need it soon-ish
rm(ru_stopwords)
rm(de_stopwords)
2.4.2 Deprecated Approach
Caution: This is the taboo-approach, which I tried out and would not recommend re-doing. It performs. But the analytical value/utility of the output is rather limited!
This task alone would quickly take up more than 20GB of “RAM” on my Windows laptop and eventually crash, so be warned again: proceed with caution!
On Ubuntu Linux we’ll get along with 9GB-11GB RAM for the same task (corpus building, tokenizing and reducing) without even touching the swap file. The difference is insane… However, this still would not work out for creating a dfm from the untouched, all-Tweets corpus. If you want to replicate this, I suggest at least 16GB RAM + 20GB swap memory, or 32GB RAM and a regular-sized swap file/drive. SSD performance might be better. Or the cloud, if you’re willing (and able) to pay 2-3€/hour for the larger instances…
library(quanteda)
quanteda::quanteda_options("threads" = 4)
system.time({
texts(IRA_corpus) <- quanteda::tokens(IRA_corpus,
remove_twitter = TRUE,
remove_url = TRUE,
remove_punct = TRUE,
remove_numbers = TRUE,
# ngrams = 1:2
)
})
BRRR::skrrrahh("flava")
system.time({
texts(IRA_corpus) <- tokens_remove(texts(IRA_corpus),
c(custom_stopwords_en,custom_stopwords_ru))
})
BRRR::skrrrahh("flava")
saveRDS(IRA_corpus, str_c(data_path, "IRA_corpus_naive_clean.rds"))
BRRR::skrrrahh("flava")
# **Reset session**
rm(data_unique)
rm(IRA_corpus)
rm(custom_stopwords_en)
rm(custom_stopwords_ru)
gc(full = TRUE, verbose = TRUE)
# .rs.restartR()
rstudioapi::restartSession()
data_path <- here::here("data", "IRA_Tweets", "/")
IRA_corpus <- readRDS(str_c(data_path, "IRA_corpus_naive_clean.rds"))
BRRR::skrrrahh("flava")
2.4.3 Regular Approach
Corpus
We already have a plain, untouched corpus object from further above, so we can skip this step.
IRA_corpus <- data_unique %>%
corpus(text_field = "tweet_text",
docid_field = "tweetid",
docvars = data_unique, # probably obsolete when docid_field is set
compress = FALSE # compression is better for in-memory, but'll slow down processing
)
Instead, we’ll load the clean, untouched corpus from our HDD and proceed with dfm()
data_path <- here::here("data", "IRA_Tweets", "/")
IRA_corpus <- readRDS(str_c(data_path, "IRA_corpus_naive.rds"))
BRRR::skrrrahh("flava")
Sparse Document-Feature Matrix with quanteda::dfm()
Now we’ll turn the international corpus into a sparse Document-Feature Matrix / Document-Term-Matrix with quanteda::dfm()
and include n-grams of 1 and 2 words (bigrams).
My approach for smaller tasks still is to clear the environment after every big step with rm() on the objects and enforcing gc(), then .rs.restartR(), and then readRDS() the last processed object…
On Windows, this task would slow down first, and then crash, no matter what. Eventually, I was able to get this done by:
- installing Linux/Ubuntu MATE 18.10
- setting up a USB 3.0 thumb drive as a 64GB swap drive (my current SSD is too small for Windows and Ubuntu + swap partition)
- calling gcinfo(TRUE) to make the garbage collection (gc()) verbose, so that I was able to monitor the memory states
- and finally, starting with a clean session (see the steps described just above)
dfm() pt.1
gcinfo(TRUE) # sets gc() to verbose = TRUE, even when gc() is called with verbose = FALSE
system.time({
IRA_dfm <- dfm(
# corpus_sample(IRA_corpus, 1000000), #always test with small n first
IRA_corpus, # full corpus
tolower = TRUE,
verbose = TRUE,
min_nchar = 3,
remove = c(custom_stopwords_en, custom_stopwords_ru, custom_stopwords_de),
remove_twitter = FALSE,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_url = TRUE)
})
BRRR::skrrrahh("flava")
> Argument min_nchar not used
We’ll keep that in mind…
... created a 5,708,124 x 1,534,498 sparse dfm
Success! Thanks Linux! But don’t forget to save this dfm object!!!
dfm() pt.2
Now we’ll make sure that there are no left-over 1- and 2-character tokens (which are called features from now on) and then save the dfm to our HDD.
rm(IRA_corpus)
IRA_dfm <- dfm_select(IRA_dfm, min_nchar = 3) # double-checking
saveRDS(IRA_dfm, str_c(data_path, "IRA_dfm_naive_clean.rds"))
BRRR::skrrrahh("flava")
gcinfo(FALSE) # turn off verbosing for garbage collection
IRA_dfm <- readRDS(str_c(data_path, "IRA_dfm_naive_clean.rds"))
Applying min_nchar = 3 twice actually really helped. Now we are down to a 5,708,124 x 1,530,058 sparse DFM (~4.4K fewer features now).
summary(IRA_dfm)
## Length Class Mode
## 8733760791192 dfm S4
Holy sh*t. That’s the power of R + Quanteda + Linux, I guess. So my “little” proof-of-concept experiment worked out. You can build a large dfm from a large corpus without resorting to cloud computing / AWS et al.
Now we can explore a bit more and then proceed over to building a topic model.
2.5 International Tweets: Top-Features
Let’s have a quick look at the most frequent features from all ~5.7M Tweets (in all Tweet languages):
topfeatures(IRA_dfm, 40) %>%
knitr::kable("html")
x | |
---|---|
#news | 236677 |
россии | 115430 |
сша | 111463 |
#sports | 96852 |
trump | 93964 |
украина | 87824 |
#politics | 74959 |
украины | 67625 |
police | 58584 |
новости | 57005 |
украине | 56198 |
#спб | 52898 |
#local | 52026 |
people | 49801 |
путин | 47521 |
мнение | 45908 |
workout | 43647 |
love | 41963 |
obama | 39125 |
#новости | 38359 |
области | 37143 |
life | 35277 |
time | 34470 |
из-за | 33244 |
video | 32206 |
day | 30315 |
#world | 29351 |
фото | 27870 |
президент | 27017 |
breaking | 26119 |
#business | 25763 |
woman | 25379 |
president | 25149 |
москве | 24527 |
порошенко | 24517 |
u.s | 23593 |
сирии | 23327 |
black | 22892 |
shooting | 22701 |
house | 22352 |
Funny! Russian terms dominate the international corpus aka Twitter’s “Internet Research Agency” dataset :) What’s also noteworthy is the top-ranked #news hashtag, probably pointing at the proliferation of the “fake news” term.
Now let’s create a subset of our dfm without #hashtags and @account handles for a purer content analysis. It’s a bit redundant, but due to the (in-memory) size of the full-languages dfm (2.8GB with hashtags/accounts; 2.7GB w/o) this is necessary if we want to work with and without Twitter artifacts.
IRA_dfm_no_twitter <- dfm_select(IRA_dfm, pattern = list('#*',"@*"),
selection = "remove")
saveRDS(IRA_dfm_no_twitter, str_c(data_path, "IRA_dfm_naive_clean_no_twitter.rds"))
We’ve further reduced the number of features from 1,530,058 to 965,278 by removing Twitter @handles and #hashtags.
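(If you want to double-check such counts yourself, quanteda’s nfeat() returns the number of features of a dfm - a quick sanity check, assuming both objects are still in memory:)
nfeat(IRA_dfm)            # features including #hashtags and @handles
nfeat(IRA_dfm_no_twitter) # features after dfm_select(..., selection = "remove")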
This is another good moment to restart our session with a lower in-memory footprint.
Reset session
rm(IRA_dfm)
rm(IRA_dfm_no_twitter)
gc(full = TRUE, verbose = TRUE)
# .rs.restartR()
rstudioapi::restartSession()
re-load packages and dfm
library(tidyverse)
library(quanteda)
quanteda_options("threads" = 4) # 4 is the limit on my Laptop
print(str_c("Threads for Quanteda: ", quanteda_options("threads")))
## [1] "Threads for Quanteda: 4"
library(stm)
data_path <- here::here("data", "IRA_Tweets", "/")
Linux: down to 1.3GB RAM load
data_path <- here::here("data", "IRA_Tweets", "/")
IRA_dfm <- readRDS(str_c(data_path, "IRA_dfm_naive_clean_no_twitter.rds"))
Now we can have another look at the top features from our dfm, this time without hashtags.
While the default scheme = "count" would give us the absolute/total feature frequencies (i.e. 3x “news” in a single Tweet counts as 3x “news” in absolute terms), scheme = "docfreq" returns per-document frequencies (the number of Tweets containing the feature), which is more precise with regards to Tweets.
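Here’s a tiny, made-up toy example (not from the dataset) to illustrate the difference between the two schemes:
toy_dfm <- dfm(tokens(c(t1 = "news news news trump",
                        t2 = "news moscow")))
topfeatures(toy_dfm, scheme = "count")   # "news" = 4: total occurrences across both documents
topfeatures(toy_dfm, scheme = "docfreq") # "news" = 2: number of documents (Tweets) containing it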
topfeatures(IRA_dfm, 40, scheme = "docfreq") %>%
knitr::kable("html")
x | |
---|---|
россии | 114300 |
сша | 109309 |
trump | 91500 |
украина | 86958 |
украины | 66896 |
police | 57166 |
новости | 56866 |
украине | 55187 |
people | 47189 |
путин | 45959 |
мнение | 45762 |
workout | 43296 |
obama | 38371 |
love | 37431 |
области | 36847 |
life | 33475 |
time | 33433 |
из-за | 33112 |
video | 31964 |
day | 29196 |
фото | 27675 |
президент | 26703 |
breaking | 26070 |
woman | 24963 |
president | 24644 |
москве | 24376 |
порошенко | 24092 |
u.s | 23235 |
сирии | 23062 |
shooting | 22473 |
house | 21988 |
рублей | 21941 |
killed | 21906 |
black | 21719 |
петербурге | 20946 |
политика | 20738 |
white | 20385 |
интересно | 20228 |
видео | 19800 |
дтп | 19681 |
Wow. Russian features concerning Ukraine, Russia, and USA still dominate the top features in terms of frequency!
We should also have a look at these frequencies grouped by Tweet languages:
topfeatures(IRA_dfm, 20, scheme = "docfreq",
groups = "tweet_language") %>%
knitr::kable("html")
(Table: top 20 features per Tweet language, one column per language.)
For orientation: this is the order of the languages in this table (can’t get them to render as col.names = with knitr::kable() yet): "ru", "bg", "en", "und", "other_44", "de", "uk", "tl", "es", "ar", "fr", "in", "sr"
Mind that I only removed stopwords for Tweets labelled as English, Russian, and German (by Twitter). As we’ve explored in the previous post, und (= undefined) mostly contains Tweets in Russian. What’s interesting is that Bulgarian as well as Ukrainian both predominantly contain Russian features, with Russian orthography, which should not be the case here.
Let’s have a look at how these features might be connected to each other:
topfeats <- names(topfeatures(IRA_dfm, 100))
textplot_network(dfm_select(IRA_dfm, topfeats), min_freq = 0.2,
edge_alpha = 0.7, edge_size = 5, edge_color = "grey",
vertex_size = 1)
The two worlds collide on the topic of USA …
With the mightiness of Quanteda, we could do a lot of further exploratory analysis and so on… However, it’s really about time to focus on a smaller and language-specific subset, as the following task (to be addressed in part #3 of this series) will be even more intense in terms of computing power and memory usage.
However, here’s a preview from Topic Modelling on the English-language subset (~2M Tweets):