Last week, I mined almost 5K Tweets from the annual meetings of five German academic societies. Now it’s time to dive into their contents with Kenneth Benoit’s powerful quanteda package. Come for the corpus approaches to text as data, stay for the Turkish Plot-Twist…

library(tidyverse)
library(here)
library(rtweet) # just in case we want to look up something on Twitter
library(quanteda)
quanteda_options("threads" = 4)
# quanteda_options("threads")

1 Import Tweets from .rds

Please see Twitter pt. 1 for Twitter Mining with rtweet and the details of the data handling approach.

Prepare here()-path to the .rds data

data_path <- here("data", "ConferenceTweets", "/") # trailing slash, so str_c() below yields full paths

Bulk-read the .rds files

dvpw_collection <- dir(path = data_path, pattern = "dvpw_") %>% 
  str_c(data_path, .) %>% 
  map_dfr(readRDS)
dvpw_collection <- dvpw_collection %>% distinct(status_id, .keep_all = TRUE) %>%
  filter(created_at < "2018-09-30") %>%
  mutate(Discipline = "PolSci") %>% 
  arrange(created_at)

dgs_collection <- dir(path = data_path, pattern = "dgs_") %>% 
  str_c(data_path, .) %>% 
  map_dfr(readRDS)
dgs_collection <- dgs_collection %>% distinct(status_id, .keep_all = TRUE) %>%
  filter(created_at < "2018-09-30") %>%
  mutate(Discipline = "Sociology") %>% 
  arrange(created_at)

hist_collection <- dir(path = data_path, pattern = "hist_") %>% 
  str_c(data_path, .) %>% 
  map_dfr(readRDS)
hist_collection <- hist_collection %>% distinct(status_id, .keep_all = TRUE) %>%
  filter(created_at < "2018-09-30") %>%
  mutate(Discipline = "History") %>% 
  arrange(created_at)

inf_collection <- dir(path = data_path, pattern = "inf_") %>% 
  str_c(data_path, .) %>% 
  map_dfr(readRDS)
inf_collection <- inf_collection %>% distinct(status_id, .keep_all = TRUE) %>%
  filter(created_at < "2018-09-30") %>%
  mutate(Discipline = "CS") %>% 
  arrange(created_at)
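
Since the four blocks above differ only in file pattern and discipline label, a more compact alternative would be one helper mapped over a lookup (a sketch, untested here; the helper name read_collection is hypothetical, and I’ll keep the explicit blocks so each *_collection stays addressable by name):

# hypothetical refactor: one helper instead of four near-identical blocks
read_collection <- function(pattern, discipline) {
  dir(path = data_path, pattern = pattern) %>%
    str_c(data_path, .) %>%
    map_dfr(readRDS) %>%
    distinct(status_id, .keep_all = TRUE) %>%
    filter(created_at < "2018-09-30") %>%
    mutate(Discipline = discipline) %>%
    arrange(created_at)
}
collections <- map2(c("dvpw_", "dgs_", "hist_", "inf_"),
                    c("PolSci", "Sociology", "History", "CS"),
                    read_collection)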

1.1 Subset “Interdisciplinary” Tweets

Something I didn’t account for last time was the possibility that some Twitter users might have been mentioning or monitoring multiple conferences, especially given how closely Political Science, Sociology, and History are interrelated.

Let’s single them out and assign a “Mixed” label.

joint_collection <- bind_rows(dvpw_collection, dgs_collection,
                              hist_collection, inf_collection)

# build the set of distinct Tweets
joint_distinct <- joint_collection %>% 
  distinct(status_id, .keep_all = TRUE)

# subset the Tweets that appear in more than one collection
mixed_collection <- subset(joint_collection,
                      duplicated(joint_collection$status_id)) %>% 
  distinct(status_id, .keep_all = TRUE) # one row per duplicated status_id

mixed_collection$Discipline <- "Mixed"

Only 42 Tweets? Out of a sample of almost 5K? Twitter Silos, anyone?

1.2 Inspect “Interdisciplinary” Tweets

mixed_collection %>% 
  arrange(created_at) %>% 
  select(text) %>% head(10) %>% 
  knitr::kable(format = "html", digits = 2)
text
Gut, dass das Team des @fgf_nrw etwas größer ist: Wir sind diese Woche vertreten: #dgs18, #dvpw18 und ab Mittwoch bei der KEG in Wien https://t.co/jiE6SyfKHe
Hier noch eine wichtige Ansage: #dvpw18 muss unbedingt vor #dgs18 trenden
@daniellambach Das wollen die von den Naturwissenschaften doch nur, dass wir Sozial- und Geisteswissenschaftler*innen uns gegenseitig bekriegen. 😉 #dvpw18 #histag18 #dgs18
Leider findet die #dvpw18 gleichzeitig mit dem #dgs2018 statt. Wäre sehr gern auch bei Euch, @mlewandowsky, @wahlforschung und @thothiel :/
Paar Minuten im 1. Vortrag #infdh2018 reichen schon, um mal ganz deutlich zu sehen um wie viele Größenordnungen die digitalen Geisteswissenschaften den “digitalen” Sozialwissenschaften handwerklich und institutionell voraus sind. Grüße an #dvpw18 und #dgs18 https://t.co/TZU8E9PecT
Meine Wunsch an #ScienceTwitter jetzt wo #dvpw18 &amp; #HisTag18 &amp; #dgs18 parallel laufen: Kann jemand analysieren, wie viel Schnittmenge es zwischen den Programmen gibt?
@DrMichaelHein @dvpwkongress #dgs18 #dvpw18 #histag18 ich fände es ja schöner auf spannende Themen gemeinsam zu blicken, die wir gesellschaftlich gemeinsam diskutieren könnten/sollten, und nicht Trennungslinien und Wettbewerb aufzumachen…
Kleine Pause gefällig? Unter den Hashtags #ddss18, #dgs18 und #HisTag18 kann man sehr interessanten Veranstaltungen aus der Ferne beiwohnen. #gswtud https://t.co/VlfzaYpjrb
Wie wäre es mit einem gemeinsamen Soziohistostaatsrechtspolitolog*innentag 2021? @dvpwkongress @dvpw @DGSoziologie @historikertag @vdstrl #dvpw18 #histag18 #dgs18 #vdstrl18 https://t.co/EgZYmaikqY
während die lieben kolleg(inn)en bei #dvpw18, #dgs18 &amp; #HisTag18 um die besten tweets auf #twitter ringen, schauen wir von #GSA2018 aus pittsburgh zu. good luck, friends! 😇

2 Create Corpus

For further, “corpus-based” analysis (and beyond), we’ll use the quanteda package.

By the way, I aim to redo as much of this series as I can with Julia Silge’s tidytext package soon.

2.1 Build Individual Corpora

As we have already singled out the “interdisciplinary” Tweets, we’ll just anti_join() every other tibble with the mixed Tweets.

dvpw_corpus <- dvpw_collection %>%
  anti_join(mixed_collection, by = "status_id") %>% 
  corpus(docid_field = "status_id")
docvars(dvpw_corpus, "Discipline") <- "PolSci"

dgs_corpus <- dgs_collection %>% 
  anti_join(mixed_collection, by = "status_id") %>% 
  corpus(docid_field = "status_id")
docvars(dgs_corpus, "Discipline") <- "Sociology"

hist_corpus <- hist_collection %>% 
  anti_join(mixed_collection, by = "status_id") %>% 
  corpus(docid_field = "status_id")
docvars(hist_corpus, "Discipline") <- "History"

inf_corpus <- inf_collection %>% 
  anti_join(mixed_collection, by = "status_id") %>% 
  corpus(docid_field = "status_id")
docvars(inf_corpus, "Discipline") <- "CS"

mixed_corpus <- mixed_collection  %>% 
  corpus(docid_field = "status_id")
docvars(mixed_corpus, "Discipline") <- "Mixed"

2.2 Create Joint Corpus

That’s even easier, thanks to quanteda’s + operator for corpora.

joint_corpus <- dvpw_corpus + 
                dgs_corpus + 
                hist_corpus + 
                inf_corpus + 
                mixed_corpus

3 Create DFM

For most really useful approaches to “text as data” we’ll need a sparse document-feature matrix (dfm). Building one with quanteda is straightforward, but there are a few less obvious pitfalls along the way.

3.1 Naive Attempt

joint_dfm <- dfm(joint_corpus,
                # groups = "Discipline",
                 remove_punct = TRUE, 
                 remove_url = TRUE, # it's a mess, without
                 tolower = TRUE,
                 verbose = FALSE) #for website readability
topfeatures(joint_dfm, 20) %>% 
  knitr::kable(format = "html", digits = 2)
x
der 2001
#dvpw18 1681
und 1661
die 1517
in 1214
#dgs2018 1174
#histag18 1037
auf 765
von 675
für 624
zu 601
#dgs18 584
das 583
mit 566
ist 520
the 515
den 480
des 452
im 437
dem 432

We get nfeat(joint_dfm) = 20832 features, but as the topfeatures() output shows, the top features are mostly (and unsurprisingly) very common German words, also known as stopwords in NLP.

3.2 2nd Attempt: Remove stopwords("german")

joint_dfm <- dfm(joint_corpus,
                 # groups = "Discipline",
                 remove = stopwords("german"),
                 remove_punct = TRUE,
                 remove_url = TRUE,
                 tolower = TRUE,
                 verbose = FALSE)
topfeatures(joint_dfm, 20) %>% 
  knitr::kable(format = "html", digits = 2)
x
#dvpw18 1681
#dgs2018 1174
#histag18 1037
#dgs18 584
the 515
of 357
and 337
#informatik2018 321
to 314
amp 224
a 223
heute 200
beim 184
on 175
panel 172
dass 168
@dvpwkongress 166
for 162
is 150
demokratie 145

Apart from English tokens (to, on, is, for, of, a), common German words such as “beim” or “dass” are still included. The latter is rather weird…

3.2.1 Inspect quanteda's built-in Stop Words

Since we’ve seen that “dass” is still included in our corpus, let’s look, for example, at all quanteda::stopwords("german") entries starting with “da”:

stopwords("german") %>%
  as_tibble() %>%
  filter(str_detect(value, pattern = "^da.*")) %>% 
  knitr::kable(format = "html", digits = 2)
value
da
damit
dann
das
daß
dasselbe
dazu

Ok, “beim” is missing, and “daß” (the pre-1996-reform spelling) instead of “dass” suggests that quanteda’s German stopword list might need an update… :)

Also, note how stopwords("german") consists of only 231 tokens. Just for comparison, tidytext::stop_words has a total of 1149 stopwords for English. So we will probably have to include custom stopword lists repeatedly.

Of course, GitHub has you covered! Gene Diaz is maintaining a super-exhaustive list of stopwords for multiple languages: github.com/stopwords-iso

We’ll use the text file with the German stopwords: stopwords-de.txt

ger_stopwords <- read_lines("https://raw.githubusercontent.com/stopwords-iso/stopwords-de/master/stopwords-de.txt")
# saveRDS(ger_stopwords, "ger_stopwords.rds")
length(ger_stopwords) #> 621 stopwords
c(ger_stopwords, stopwords("german")) %>% length() #> 852
c(ger_stopwords, stopwords("german")) %>% unique() %>% length() #> 621
# -> still 621 unique entries, so quanteda's 231 German stopwords are a subset of the ISO list

3.2.2 Include Custom Stopwords and Remove English Stopwords

c("dass", "beim") %in% ger_stopwords #> [1] TRUE TRUE
# custom_stopwords <- c("dass", "beim")
custom_stopwords <- setdiff(ger_stopwords, stopwords("german")) # keep only the ISO entries not already in quanteda's list
joint_dfm <- dfm(joint_corpus,
                 # groups = "Discipline",
                 remove = c(stopwords("german"),
                            stopwords("english"), # ONE: English stopwords, too
                            custom_stopwords,     # TWO: our ISO-based custom list
                            min_nchar = 2),       # THREE: caveat! inside remove = c(...)
                                                  # this only appends the literal token "2";
                                                  # proper length filtering would need
                                                  # dfm_select(min_nchar = 2)
                 remove_punct = TRUE,
                 remove_url = TRUE,
                 tolower = TRUE,
                 verbose = FALSE)
topfeatures(joint_dfm, 20) %>% 
  knitr::kable(format = "html", digits = 2)
x
#dvpw18 1681
#dgs2018 1174
#histag18 1037
#dgs18 584
#informatik2018 321
amp 224
panel 172
@dvpwkongress 166
demokratie 145
@osymbaskanligi 123
dgs 121
innen 114
#dgs 113
@dvpw 112
#dvpw2018 111
@historikertag 109
bu 106
de 97
@dgsoziologie 95
94

Great. But … what is “amp”???

3.2.3 Inspect “Keyword in Context” kwic() for “amp”

kwic(joint_corpus, "amp", window = 3)  %>%
  # vs. kwic(x, phrase("term1 term2"))
  as_tibble() %>% # needed for kwic()
  select(pre:post) %>%
  head(10) %>% 
  knitr::kable(format = "html", digits = 2)
pre keyword post
@ifp_tuebingen ) & amp ; I organise
@goetheuni ) & amp ; Jonas Wolff
pu­blic goods & amp ; to pre­vent
die Vorträge & amp ; Panelleitungen von
von #InIIS & amp ; @BIGSSS_Bremen Kolleg
Viel Erfolg & amp ; Spaß den
" #PolComm & amp ; Digital Complexity
, @SFB1342 & amp ; von überall
bestimmt spannende & amp ; inspirierende Tage
nach #Adorno & amp ; Co gehen

“amp” comes from “&amp;”, the HTML entity for & / ampersand (the “&” and “;” get stripped when we tokenize with remove_punct = TRUE, so only “amp” remains. Cool.)
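
Alternatively, we could decode the HTML entities upstream, before (re)building the corpus (a sketch, not applied here; Twitter escapes &, < and > in returned Tweet text, so these three cover the usual cases):

# hypothetical upstream fix: decode the common HTML entities in the raw text
joint_collection <- joint_collection %>%
  mutate(text = str_replace_all(text,
                                c("&amp;" = "&", "&lt;" = "<", "&gt;" = ">")))

For now, we’ll stick with the stopword route.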

Remove “amp”

custom_stopwords <- c(custom_stopwords, "amp")
# append, so that custom_stopwords accumulates across our iterations
joint_dfm <- dfm(joint_corpus,
                 # groups = "Discipline",
                 remove = c(stopwords("german"),
                            stopwords("english"),
                            custom_stopwords,
                            min_nchar = 2),
                 remove_punct = TRUE,
                 remove_url = TRUE,
                 tolower = TRUE,
                 verbose = FALSE)
topfeatures(joint_dfm, 20) %>% 
  knitr::kable(format = "html", digits = 2)
x
#dvpw18 1681
#dgs2018 1174
#histag18 1037
#dgs18 584
#informatik2018 321
panel 172
@dvpwkongress 166
demokratie 145
@osymbaskanligi 123
dgs 121
innen 114
#dgs 113
@dvpw 112
#dvpw2018 111
@historikertag 109
bu 106
de 97
@dgsoziologie 95
94
thema 93

That’s better. But what’s up with “innen” and “bu”?

3.2.4 Inspect the Tokens "innen" and "bu" with kwic()

kwic(joint_corpus, "innen", window = 3)  %>% 
  as_tibble() %>% # needed for kwic()
  select(pre:post) %>%
  head(10) %>% 
  knitr::kable(format = "html", digits = 2)
pre keyword post
@BIGSSS_Bremen Kolleg * innen . Da ist
| ler * innen vor Ort #powi
#powi Kolleg * innen von @UniBremen ,
. Kolleg / innen der @unihh im
Kolleg / innen vom Institut für
die Kolleg / innen mit dem Peer
1000 Expert * innen sucht , die
loyale Bürger / innen oder auch bürgerliche
mit Rassist * innen , Antisemit *
, Antisemit * innen , und Nazis

Ok, so "innen", preceded by "*" or "/", is part of gender-inclusive plural forms of German nouns such as Kolleg*innen (colleagues), Bürger/innen (citizens), or Rassist*innen (racists). We might want a robust solution here, and it should not be stemming. Maybe ngrams = 2 or some clever regex that glues the "*innen" / "/innen" suffix back onto its stem might help (see the sketch below). But for now, we’ll just add "innen" as a stopword.
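
For illustration, that regex route could look like this (a hypothetical sketch, not applied here):

# glue gendered plurals like "Kolleg*innen" / "Kolleg/innen" back into one token
rejoin_gendered <- function(text) {
  str_replace_all(text, "(\\w+)\\s*[*/]\\s*innen\\b", "\\1innen")
}
rejoin_gendered("Grüße an alle Kolleg*innen und Bürger/innen!")
#> [1] "Grüße an alle Kolleginnen und Bürgerinnen!"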

custom_stopwords <- c(custom_stopwords, "innen")
joint_dfm <- dfm(joint_corpus,
                 # groups = "Discipline",
                 remove = c(stopwords("german"),
                            stopwords("english"),
                            custom_stopwords,
                            min_nchar = 2),
                 remove_punct = TRUE,
                 remove_url = TRUE,
                 tolower = TRUE,
                 verbose = FALSE)
topfeatures(joint_dfm, 20) %>% 
  knitr::kable(format = "html", digits = 2)
x
#dvpw18 1681
#dgs2018 1174
#histag18 1037
#dgs18 584
#informatik2018 321
panel 172
@dvpwkongress 166
demokratie 145
@osymbaskanligi 123
dgs 121
#dgs 113
@dvpw 112
#dvpw2018 111
@historikertag 109
bu 106
de 97
@dgsoziologie 95
94
thema 93
bir 89

4 The Turkish Plot-Twist

4.1 One last Mystery remains…

So what about "bu"?

(Spoiler: @osymbaskanligi seems to be an official Turkish government account… That already points at something bigger.)

kwic(joint_corpus, "bu", window = 3) %>% 
  as_tibble() %>% # needed for kwic()
  select(pre:post) %>%
  head(10) %>% 
  knitr::kable(format = "html", digits = 2)
pre keyword post
#dgs2018 keşke bu gece açıklansa 🙄
#dgs2018 bu gecenin en büyük
otomatik olarak yapıyorsa bu bekleyiş neden ?
sonuçlarını açıklamak neden bu kadar uzun sürüyor
Bu yaşta bu kadar
Bu yaşta bu kadar sıkıntı yeter
@OSYMbaskanligi ceza mı bu ne şimdi #dgs2018
, DGS neden bu kadar itibarsızlaştırılıyor ?
halde bekliyorum ve bu beni rahatsiz ediyo
Bu #dgs2018 yüzünden internette

Turkish? Huh? Seems like the Sociologists’ hashtags (#dgs18, #dgs2018) were heavily used by the Turkish Twitter community, too. In Turkey, DGS is the Dikey Geçiş Sınavı, a university transfer exam administered by ÖSYM (hence @OSYMbaskanligi above), and judging from the Tweets, its results were due right around the conference.

dgs_collection %>% filter(lang == "tr") %>% count() %>%
  knitr::kable("html")
n
957

Oopsie… From the Sociology Corpus, 957 Tweets out of 1674 are labelled as Turkish…

WHAT ELSE DID I MISS?

dgs_collection %>% group_by(lang) %>% count() %>% arrange(desc(n)) %>% 
  knitr::kable(format = "html", digits = 2)
lang n
tr 957
de 570
en 72
und 68
in 4
es 1
fr 1
tl 1

Yikes! So my results from the previous post were totally biased in favour of the Sociology Conference. And what language is "und", and what about the joint collection?

joint_collection %>% group_by(lang) %>% count() %>% 
  filter(n > 2) %>% arrange(desc(n)) %>% 
  knitr::kable(format = "html", digits = 2)
lang n
de 3156
tr 957
en 540
und 166
nl 20
sv 6
es 5
in 5
fr 4
no 3

lang == "tr" and lang == "und" definitely need some closer inspection.

But before we do that, let’s subset only the Tweets from the conferences’ week. Maybe the Turkish and the German Sociology uses of #dgs2018 did not overlap temporally…

Fortunately, rtweet has a really convenient time-series plotting function…

dgs_collection %>% filter(lang == "tr" | lang == "de") %>% 
  group_by(lang) %>% 
  rtweet::ts_plot() +
    theme_minimal()

… unfortunately, I didn’t use it last time…

Key Learning: I could have avoided stepping into this trap if I had plotted the distribution of Tweets over time… Exploratory Data Analysis FTW.

This is what I would’ve seen, had I done it right:

joint_collection %>%
  group_by(Discipline) %>% 
  rtweet::ts_plot() +
    theme_minimal()

See that left peak around the 20th? Arghhhh!

Time to move on. As we have already set an upper time limit above (filter(created_at < "2018-09-30")), we now only have to include a lower time limit:

joint_collection_week <- joint_collection %>% 
  filter(created_at > "2018-09-23")

joint_collection_week %>%
  group_by(lang) %>% 
  count() %>%
  filter(n > 2) %>% 
  arrange(desc(n)) %>% 
  knitr::kable(format = "html", digits = 2)
lang n
de 3068
en 522
und 120
tr 92
nl 19
sv 6
es 4
fr 4
no 3

Ok, that’s already better, but there are still 92 Turkish Tweets in the sample, and 120 Tweets for lang == "und".

It turns out, however, that lang == "und" (un)fortunately just means language undefined (cf. Twitter).
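
(A quick qualitative look at those works just like the one for the Turkish Tweets below; a sketch:)

joint_collection_week %>%
  filter(lang == "und") %>% # Tweets Twitter could not classify
  select(text) %>%
  head()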

Let’s have a look at the remaining Turkish Tweets and then rebuild our corpora and the document-feature matrix.

joint_collection_week %>%
  filter(lang == "tr") %>%
  select(text) %>% 
  head() %>% # for website readability
  knitr::kable(format = "html", digits = 2) 
text
Çok şükür Allahım… #dgs2018 https://t.co/DAXRFoawzq
Mimar sinan güzel sanatlarlı olduk be 😂 #Dgstercih #dgs2018
#dgs2018 derslerden muaf olmak için en çok 5 yıl önce ilgili programdan mezun olmak gerekir şartı var mı arkadaşlar????? Bilen birisi bilgilendirsin lütfen.
#dgs2018 AÖF lisans dersleri DGS muafiyetinde kullanıyor mu??
Bilgi Üniversitesi’nde 2000li arkadaşlarımla hazırlık okuyacağım. Yaşlı olmak kontenjanından dışlanmam umarım🙄 #dgs2018 @BiLGiOfficial
4 puan ile iktisat kaybettim sizce ek tercihlerde yazsam çıkar mı?#dgs2018 #DGSekyerleştirme #dgs #DGSE #dgsmovie3 #dgs

So all of these Tweets are exclusively in Turkish, and it is more than appropriate to exclude them from our analysis here AND in my preceding blog post. (Update incoming!)

But who’s going to tell the German Sociologists that my report from last week had to be corrected and that they didn’t perform that well, actually …!?

Key learning: Never rely on blind/numeric analyses only. Always do some qualitative exploration and check for plausibility - even when it’s “only” for a blog!!!

4.2 Removing Turkish accounts

First, let’s get a list of all the users who were part of the Turkish #dgs2018 sample (and since I’ve updated the previous post, I have collected some Turkish hashtags for lang == "und", so we’ll just re-use this filter here).

tr_user <- dgs_collection %>%
  filter(str_detect(text,"yks2018|yksdil|dgsankara|cumhuriyetüniversitesi|DanceKafe") | lang == "tr") %>% 
  select(user_id) %>% 
  distinct()
tr_user %>% count() %>% 
  knitr::kable(format = "html", digits = 2)
n
529

Now, we’ll anti_join() the dgs_collection with this list:

dgs_collection %>% anti_join(tr_user, by = "user_id") %>% count() %>%
  knitr::kable("html")
n
686

Let’s compare with the simpler filter(lang != "tr") approach:

dgs_collection %>% 
  filter(lang != "tr") %>% 
  count() %>% 
  knitr::kable(format = "html", digits = 2)
n
717

That’s a difference of another ~30 Tweets. Nice. We’ll use that in a minute.

5 Rebuild the Corpus without Tweets with lang=="tr"

There are two ways to do that: one is lazy, and one is more replication-friendly. I’ll mention the lazy one, but will continue with the more robust approach.

The Lazy Way

(Two lazy ways, actually)

As we know that the Turkish Tweets are exclusively in the Sociology Corpus, we could just rebuild the dgs_corpus and then rebuild the joint_corpus with:

joint_corpus <- dvpw_corpus + 
                dgs_corpus + 
                hist_corpus + 
                inf_corpus + 
                mixed_corpus

An even easier way would be to filter out Turkish Tweets from the already existing joint_corpus with quanteda::corpus_subset():

joint_corpus %>% corpus_subset(lang != "tr") # docvars can be referenced directly inside corpus_subset()

However, if you have been jumping back and forth within this post (or .Rmd notebook), then you (= me) might have lost track of the various manipulations and state differences (e.g. think of custom_stopwords). Plus, the dgs_corpus would still need attention, and as we’ve filtered out a lot of Tweets by setting a lower time limit, the common time period for the corpora would differ, too. BAD!

So let’s rebuild the corpora from scratch.

5.1 Re-Build individual Corpora from Scratch

We’ll just add filter(created_at > "2018-09-23"), and for the sake of robustness we’ll filter all collections for lang != "tr" and anti_join() the dgs_collection with tr_user.

As I want both discipline-specific corpora (including the “multidisciplinary” Tweets) and a redundancy-free joint corpus, I’ll build the corpora in two steps.

dvpw_corpus <- dvpw_collection %>% 
  filter(created_at > "2018-09-23" & lang != "tr") %>% # combined filter()
  # anti_join(mixed_collection, by = "status_id") %>% 
  corpus(docid_field = "status_id", text_field = "text")
  ## text_field = "text" is the default for data frames anyway; set explicitly for clarity
docvars(dvpw_corpus, "Discipline") <- "PolSci"

dgs_corpus <- dgs_collection %>% 
  filter(created_at > "2018-09-23" & lang != "tr") %>%
  # anti_join(mixed_collection, by = "status_id") %>%
  anti_join(tr_user, by = "user_id") %>% 
  corpus(docid_field = "status_id", text_field = "text")
docvars(dgs_corpus, "Discipline") <- "Sociology"

hist_corpus <- hist_collection %>% 
  filter(created_at > "2018-09-23" & lang != "tr") %>%
  # anti_join(mixed_collection, by = "status_id") %>% 
  corpus(docid_field = "status_id", text_field = "text")
docvars(hist_corpus, "Discipline") <- "History"

inf_corpus <- inf_collection %>% 
  filter(created_at > "2018-09-23" & lang != "tr") %>%
  # anti_join(mixed_collection, by = "status_id") %>% 
  corpus(docid_field = "status_id", text_field = "text")
docvars(inf_corpus, "Discipline") <- "CS"

mixed_corpus <- mixed_collection %>% 
  filter(created_at > "2018-09-23" & lang != "tr") %>%
  corpus(docid_field = "status_id", text_field = "text") # 42 docs!
docvars(mixed_corpus, "Discipline") <- "Mixed"

Of course, eventually, we should update and rm() all the obsolete *_collection objects or simply consolidate all the “valid” Tweets in a .rds file. However, I’m not so much into editing raw-ish / original data, so doing that is up to you.
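
If you do want to consolidate, a sketch could look like this (file name hypothetical):

# persist the cleaned Tweets, then drop the raw objects from the session
joint_collection_week %>%
  anti_join(tr_user, by = "user_id") %>%
  saveRDS(here("data", "ConferenceTweets", "joint_collection_clean.rds"))
rm(dvpw_collection, dgs_collection, hist_collection, inf_collection)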

5.2 Create Joint Corpus

joint_corpus <- dvpw_corpus + 
                dgs_corpus + 
                hist_corpus + 
                inf_corpus # 3,748 docs

joint_corpus <- joint_corpus %>% corpus_subset(!(docnames(joint_corpus) %in% mixed_collection$status_id)) # 3,706 docs

joint_corpus <- joint_corpus + mixed_corpus # 3,748 docs

6 3rd Attempt @ dfm() Building

joint_dfm <- dfm(joint_corpus,
                # groups = "Discipline",
                 remove = c(stopwords("german"),
                            stopwords("english"),
                            custom_stopwords,
                            min_nchar = 3), # same caveat as in 3.2.2: only removes the literal "3"
                 remove_punct = TRUE,
                 remove_url = TRUE,
                # remove_numbers = TRUE,
                 tolower = TRUE,
                 verbose = TRUE)
## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 3,748 documents, 16,371 features
##    ... removed 608 features
##    ... created a 3,748 x 15,763 sparse dfm
##    ... complete. 
## Elapsed time: 0.58 seconds.

We’re down to 15763 features from 3748 documents (Tweets), and this pass removed 608 features thanks to the iteratively refined remove list.

And as we might want a grouped dfm for group comparisons (dfm_group() doesn’t seem to work for me here, see below), we’ll create a custom grouped one via the groups argument, too.

## not working:
# dfm_group(joint_dfm, groups = "Discipline")
# OR
# dfm_group(joint_dfm, groups = c("Discipline"))
# OR
# dfm_group(joint_dfm, groups = docvars(joint_dfm, "Discipline"))
#> Error in qatd_cpp_is_grouped_numeric(as.numeric(x), group) : 
#>  (list) object cannot be coerced to type 'double'

joint_dfm_grouped <- dfm(joint_corpus,
                 groups = "Discipline",
                 remove = c(stopwords("german"),
                            stopwords("english"),
                            custom_stopwords,
                            min_nchar = 3),
                 remove_punct = TRUE,
                 remove_url = TRUE,
                # remove_numbers = TRUE,
                 tolower = TRUE,
                 verbose = FALSE) #for website readability

7 Some quick Analyses

7.1 Top Hashtags

hashtags_dfm <- dfm_select(joint_dfm, pattern = "#*", selection = "keep")
topfeatures(hashtags_dfm, 20, scheme = "count") %>% 
  knitr::kable(format = "html", digits = 2)
x
#dvpw18 1694
#histag18 1043
#dgs18 622
#informatik2018 313
#dgs2018 136
#dvpw2018 113
#historikertag2018 84
#münster 53
#frankfurt 46
#demokratie 45
#berlin 45
#digitalisierung 43
#steinmeier 41
#departmentstruktur 41
#soziologie 38
#zetsche 36
#dvpw 35
#jena 34
#bremen 33
#histocamptag 33

7.2 Top Hashtags per Discipline

topfeatures(hashtags_dfm, 20,
            groups = "Discipline",
            scheme = "count") %>% 
  map(knitr::kable, "html")
$PolSci
x
#dvpw18 1604
#dvpw2018 108
#demokratie 43
#steinmeier 41
#departmentstruktur 41
#dvpw 35
#powi 26
#bundespräsident 26
#politikwissenschaft 25
#democracy 23
#takeover 17
#powilehre 15
#frankfurt 14
#energiewende 12
#servicetweet 9
#populism 9
#twittertakeover 9
#janemansbridge 9
#keynote 9
#polisci 7
$Sociology
x
#dgs18 576
#dgs2018 131
#soziologie 38
#dvpw18 29
#sozkon18 26
#göttingen 16
#histag18 12
#dgs 11
#dgskongress 10
#integration 9
#soziologiekongress 9
#mixedmethods 8
#digitalisierung 7
#flucht 7
#migration 6
#forschungsethik 6
#sozkon2018 5
#sfb1265 4
#luhmann 4
#simmel 4
$History
x
#histag18 1004
#historikertag2018 81
#münster 51
#histocamptag 33
#archivtag 26
#dvpw18 21
#historikertag 19
#doktorandenforum 15
#dgs18 14
#twitterstorians 13
#forschungsdaten 11
#geschichte 11
#digitalhumanities 11
#auxhist 11
#digigw 10
#gtshistag18 9
#openaccess 8
#digitalisierung 8
#muenster 8
#histag2018 8
$CS
x
#informatik2018 307
#berlin 42
#zetsche 33
#frankfurt 32
#bremen 32
#jena 32
#dortmund 32
#kontrolle 32
#kanzlerin 32
#hessen 32
#nrwtag2018 32
#bayern 32
#niedersachsen 32
#brandenburg 32
#saarland 32
#baden 32
#württemberg 32
#koeln 32
#muenchen 32
#dresden 32
$Mixed
x
#dvpw18 38
#dgs18 30
#histag18 25
#dgs2018 3
#infdh2018 3
#dvpw2018 2
#daimler 2
#informatik2018 2
#archivtag 2
#gfm2018 2
#historikertag2018 2
#kauder 2
#fcbfca 2
#powi 1
#gew 1
#göttingen 1
#türkei 1
#twitter 1
#hambibleibt 1
#hambacherforst 1

The Sociology corpus still has some Turkish features. There are ways to address this, but I have to postpone that for another post. (cf. this approach)

Most popular hashtags by shared docfreq, i.e. the number of discipline groups (out of 5 in the grouped dfm) in which a hashtag occurs:

dfm_select(joint_dfm_grouped, pattern = "#*") %>% 
  topfeatures(30, scheme = "docfreq") %>% 
  knitr::kable(format = "html", digits = 2)
x
#dvpw18 5
#histag18 5
#informatik2018 5
#powi 4
#dvpw2018 4
#göttingen 4
#archivtag 4
#twitter 4
#digitalisierung 4
#afd 4
#sozialwissenschaften 4
#dgs18 4
#infdh2018 4
#gfm2018 4
#servicetweet 3
#gew 3
#türkei 3
#daimler 3
#berlin 3
#hambibleibt 3
#hambacherforst 3
#nachhaltigkeit 3
#openaccess 3
#migration 3
#darkphoenix 3
#facebookdown 3
#innsbrucktirol2018 3
#bigdata 3
#spd 3
#ff 3

7.3 Top Features

Without Hashtags and Twitter @-handles

features <- dfm_select(joint_dfm, pattern = list('#*',"@*"),
                       selection = "remove",
                       min_nchar = 3)
topfeatures(features, 40) %>% 
  knitr::kable("html")
x
panel 167
demokratie 142
thema 93
democracy 86
sektion 76
münster 73
diskussion 69
kongress 67
frankfurt 65
danke 64
vortrag 63
spannende 60
digitale 59
soziologie 59
frage 58
findet 55
grenzen 55
gesellschaft 55
political 54
spricht 54
prof 54
historiker 54
stand 53
2018 51
forschung 51
fragen 51
freuen 50
steinmeier 50
deutschen 49
politikwissenschaft 46
geht’s 45
arbeit 45
german 44
new 44
digitalen 43
historikertag 43
rede 42
geschichte 42
democratic 41
eigentlich 40

7.4 Top Features per Discipline

features <- dfm_select(joint_dfm, pattern = list('#*',"@*"),
                       selection = "remove",
                       min_nchar = 3)
topfeatures(features, 20, groups = "Discipline") %>% 
  map(knitr::kable, "html")
$PolSci
x
demokratie 130
panel 105
democracy 84
frankfurt 60
political 51
grenzen 49
steinmeier 47
politikwissenschaft 45
democratic 41
kongress 38
rede 36
frage 35
representation 34
diskussion 34
thema 31
politik 29
podiumsdiskussion 29
bundespräsident 28
need 28
dvpw 27
$Sociology
x
soziologie 55
vortrag 32
kongress 26
göttingen 23
panel 17
spannende 17
gesellschaft 15
gruppe 15
session 14
raum 14
plenum 14
sektion 13
danke 12
freuen 12
forschung 12
thema 12
diskussion 12
sozialen 12
vogel 12
woche 11
$History
x
münster 67
sektion 56
historiker 49
historikertag 41
digitale 40
geschichte 36
panel 33
geschichtswissenschaft 30
stand 29
poster 29
thema 28
van 25
danke 22
forschung 21
geht’s 21
digitalen 21
history 21
gespaltene 21
freuen 20
findet 19
$CS
x
thema 21
prof 16
zukunft 15
digitale 14
digitalisierung 13
informatik 13
panel 12
stefan 12
data 12
berlin 11
diskussion 11
ullrich 11
science 10
keynote 10
digital 9
amt 9
ranga 9
erklärt 8
arbeit 8
gesellschaft 8
$Mixed
x
woche 3
findet 3
tweets 3
minuten 3
gibts 3
digitalen 3
frankfurt 2
kolleg 2
sehen 2
kongresswoche 2
good 2
laufen 2
gemeinsam 2
interessant 2
begriff 2
deutlich 2
stärker 2
derzeit 2
politikwissenschaftler 2
digitalisierung 2

There are still a couple of features which would qualify for removal (such as “unsere”, “mal”, “schon”), but we will take care of that with the super-useful quanteda::dfm_trim(<term_freq>) threshold-based feature selection.
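
A sketch of that route, assuming this quanteda version’s dfm_trim() argument names (min_count was later renamed to min_termfreq):

# keep only features occurring at least 5 times overall and in at least 2 Tweets
joint_dfm_trimmed <- dfm_trim(joint_dfm, min_count = 5, min_docfreq = 2)
nfeat(joint_dfm_trimmed)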

Plus, removing or parsing hashtags and applying dfm(stem = TRUE) would further improve descriptive accuracy.
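
For example (a sketch; dfm_wordstem() wraps the Snowball stemmer):

# drop hashtags and @-handles, then stem the remaining (mostly German) features
joint_dfm_stemmed <- dfm_select(joint_dfm, pattern = c("#*", "@*"),
                                selection = "remove") %>%
  dfm_wordstem(language = "german")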

7.5 Network of feature co-occurrences

topfeats <- names(topfeatures(joint_dfm, 60))
textplot_network(dfm_select(joint_dfm, topfeats), min_freq = 0.8)
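
Strictly speaking, co-occurrence statistics live in a feature co-occurrence matrix; an equivalent route via fcm() could look like this (a sketch, assuming this quanteda version’s corpus method for fcm()):

# co-occurrence of the top features within Tweets, as an fcm
joint_fcm <- fcm(joint_corpus, context = "document")
textplot_network(fcm_select(joint_fcm, pattern = topfeats), min_freq = 0.8)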

7.6 Grouped wordcloud of features

“wordcloud, where the feature labels are plotted with their sizes proportional to their numerical values in the dfm” (Vignette for quanteda::textplot_wordcloud)

textplot_wordcloud(joint_dfm_grouped,
                   comparison = TRUE,
                   min_size = 0.5,
                   max_size = 3,
                   max_words = 60)

8 What’s next?

Topic Modelling!

Since I had to spend quite some time on data processing, cleaning, and eventually building a usable dfm, the actual analysis of the contents remains a #TODO
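
For a taste of where this is headed, the dfm can feed a topic model directly (a sketch, assuming the topicmodels package; k = 10 is arbitrary):

library(topicmodels)
dtm <- convert(joint_dfm, to = "topicmodels") # quanteda dfm -> DocumentTermMatrix
lda <- LDA(dtm, k = 10, control = list(seed = 42))
terms(lda, 10) # top 10 terms per topic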

However, I learned a lot doing this today, and I hope that it became obvious that dealing with text as data is not to be underestimated.