[R] German Academic Twitter, Pt. 2: From Data to Corpus with a Turkish Twist
Abstract / TL;DR
We'll learn about constructing a document-feature matrix (dfm), applying exhaustive stopword lists, and the pitfalls of working with text as data. Most importantly, we'll acknowledge and learn from mistakes...
Last week, I mined almost 5K Tweets from the annual meetings of five German academic societies. Now it's about time that we dive into the contents with Kenneth Benoit's powerful quanteda package. Come for the corpus approaches to text as data, stay for the Turkish Plot-Twist…
library(tidyverse)
library(here)
library(rtweet) # just in case we want to look up something on Twitter
library(quanteda)
quanteda_options("threads" = 4)
# quanteda_options("threads")
1 Import Tweets from .rds
Please see Twitter pt. 1 for Twitter Mining with rtweet
and the details of the data handling approach.
Prepare the here()-path to the .rds data
data_path <- here("data", "ConferenceTweets", "/")
Bulk-read the .rds files
dvpw_collection <- dir(path = data_path, pattern = "dvpw_") %>%
str_c(data_path, .) %>%
map_dfr(readRDS)
dvpw_collection <- dvpw_collection %>% distinct(status_id, .keep_all = TRUE) %>%
filter(created_at < "2018-09-30") %>%
mutate(Discipline = "PolSci") %>%
arrange(created_at)
dgs_collection <- dir(path = data_path, pattern = "dgs_") %>%
str_c(data_path, .) %>%
map_dfr(readRDS)
dgs_collection <- dgs_collection %>% distinct(status_id, .keep_all = TRUE) %>%
filter(created_at < "2018-09-30") %>%
mutate(Discipline = "Sociology") %>%
arrange(created_at)
hist_collection <- dir(path = data_path, pattern = "hist_") %>%
str_c(data_path, .) %>%
map_dfr(readRDS)
hist_collection <- hist_collection %>% distinct(status_id, .keep_all = TRUE) %>%
filter(created_at < "2018-09-30") %>%
mutate(Discipline = "History") %>%
arrange(created_at)
inf_collection <- dir(path = data_path, pattern = "inf_") %>%
str_c(data_path, .) %>%
map_dfr(readRDS)
inf_collection <- inf_collection %>% distinct(status_id, .keep_all = TRUE) %>%
filter(created_at < "2018-09-30") %>%
mutate(Discipline = "CS") %>%
arrange(created_at)
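By the way, the four blocks above only differ in the file-name pattern and the discipline label, so they could be wrapped into a small helper. A minimal sketch, assuming the same data_path and column layout as above (read_collection is a made-up name):
# Sketch: one reader for all four disciplines (hypothetical helper)
read_collection <- function(pattern, discipline) {
  dir(path = data_path, pattern = pattern) %>%
    str_c(data_path, .) %>%
    map_dfr(readRDS) %>%
    distinct(status_id, .keep_all = TRUE) %>%  # drop duplicate status_ids
    filter(created_at < "2018-09-30") %>%      # same upper time limit as above
    mutate(Discipline = discipline) %>%
    arrange(created_at)
}
# e.g. dvpw_collection <- read_collection("dvpw_", "PolSci")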
1.1 Subset “Interdisciplinary” Tweets
Something I didn't account for last time was the possibility that some Twitter users might have been mentioning / monitoring multiple conferences, especially given the interrelation between Political Science, Sociology, and History.
Let’s single them out and assign a “Mixed” label.
joint_collection <- bind_rows(dvpw_collection, dgs_collection,
hist_collection, inf_collection)
# build set of distinct
joint_distinct <- joint_collection %>%
distinct(status_id, .keep_all = TRUE)
#subset duplicated
mixed_collection <- subset(joint_collection,
duplicated(joint_collection$status_id)) %>%
distinct(status_id, .keep_all = TRUE) # find duplicates
mixed_collection$Discipline <- "Mixed"
Only 42 Tweets? Out of a sample of almost 5K? Twitter Silos, anyone?
1.2 Inspect “Interdisciplinary” Tweets
mixed_collection %>%
arrange(created_at) %>%
select(text) %>% head(10) %>%
knitr::kable(format = "html", digits = 2)
text |
---|
Gut, dass das Team des @fgf_nrw etwas größer ist: Wir sind diese Woche vertreten: #dgs18, #dvpw18 und ab Mittwoch bei der KEG in Wien https://t.co/jiE6SyfKHe |
Hier noch eine wichtige Ansage: #dvpw18 muss unbedingt vor #dgs18 trenden |
@daniellambach Das wollen die von den Naturwissenschaften doch nur, dass wir Sozial- und Geisteswissenschaftler*innen uns gegenseitig bekriegen. 😉 #dvpw18 #histag18 #dgs18 |
Leider findet die #dvpw18 gleichzeitig mit dem #dgs2018 statt. Wäre sehr gern auch bei Euch, @mlewandowsky, @wahlforschung und @thothiel :/ |
Paar Minuten im 1. Vortrag #infdh2018 reichen schon, um mal ganz deutlich zu sehen um wie viele Größenordnungen die digitalen Geisteswissenschaften den “digitalen” Sozialwissenschaften handwerklich und institutionell voraus sind. Grüße an #dvpw18 und #dgs18 https://t.co/TZU8E9PecT |
Meine Wunsch an #ScienceTwitter jetzt wo #dvpw18 & #HisTag18 & #dgs18 parallel laufen: Kann jemand analysieren, wie viel Schnittmenge es zwischen den Programmen gibt? |
@DrMichaelHein @dvpwkongress #dgs18 #dvpw18 #histag18 ich fände es ja schöner auf spannende Themen gemeinsam zu blicken, die wir gesellschaftlich gemeinsam diskutieren könnten/sollten, und nicht Trennungslinien und Wettbewerb aufzumachen… |
Kleine Pause gefällig? Unter den Hashtags #ddss18, #dgs18 und #HisTag18 kann man sehr interessanten Veranstaltungen aus der Ferne beiwohnen. #gswtud https://t.co/VlfzaYpjrb |
Wie wäre es mit einem gemeinsamen Soziohistostaatsrechtspolitolog*innentag 2021? @dvpwkongress @dvpw @DGSoziologie @historikertag @vdstrl #dvpw18 #histag18 #dgs18 #vdstrl18 https://t.co/EgZYmaikqY |
während die lieben kolleg(inn)en bei #dvpw18, #dgs18 & #HisTag18 um die besten tweets auf #twitter ringen, schauen wir von #GSA2018 aus pittsburgh zu. good luck, friends! 😇 |
2 Create Corpus
For further, "corpus-based" analysis (and beyond) we'll use the quanteda package.
I aim to re-do as much as I can from this series with Julia Silge's tidytext package soon, btw.
2.1 Build Individual Corpora
As we have already singled out the "interdisciplinary" Tweets, we'll just anti_join() every other tibble with the mixed Tweets.
dvpw_corpus <- dvpw_collection %>%
anti_join(mixed_collection, by = "status_id") %>%
corpus(docid_field = "status_id")
docvars(dvpw_corpus, "Discipline") <- "PolSci"
dgs_corpus <- dgs_collection %>%
anti_join(mixed_collection, by = "status_id") %>%
corpus(docid_field = "status_id")
docvars(dgs_corpus, "Discipline") <- "Sociology"
hist_corpus <- hist_collection %>%
anti_join(mixed_collection, by = "status_id") %>%
corpus(docid_field = "status_id")
docvars(hist_corpus, "Discipline") <- "History"
inf_corpus <- inf_collection %>%
anti_join(mixed_collection, by = "status_id") %>%
corpus(docid_field = "status_id")
docvars(inf_corpus, "Discipline") <- "CS"
mixed_corpus <- mixed_collection %>%
corpus(docid_field = "status_id")
docvars(mixed_corpus, "Discipline") <- "Mixed"
2.2 Create Joint Corpus
That’s even easier thanks to Quanteda.
joint_corpus <- dvpw_corpus +
dgs_corpus +
hist_corpus +
inf_corpus +
mixed_corpus
3 Create DFM
For most really useful approaches to "text as data" we'll need a sparse document-feature matrix (dfm). Doing this with quanteda is straightforward, but there are a few less obvious pitfalls.
3.1 Naive
joint_dfm <- dfm(joint_corpus,
# groups = "Discipline",
remove_punct = TRUE,
remove_url = TRUE, # it's a mess, without
tolower = TRUE,
verbose = FALSE) #for website readability
topfeatures(joint_dfm, 20) %>%
knitr::kable(format = "html", digits = 2)
x | |
---|---|
der | 2001 |
#dvpw18 | 1681 |
und | 1661 |
die | 1517 |
in | 1214 |
#dgs2018 | 1174 |
#histag18 | 1037 |
auf | 765 |
von | 675 |
für | 624 |
zu | 601 |
#dgs18 | 584 |
das | 583 |
mit | 566 |
ist | 520 |
the | 515 |
den | 480 |
des | 452 |
im | 437 |
dem | 432 |
We get nfeat(joint_dfm) = 20832 features, but as we can see from the topfeatures() output, the top features are mostly (and unsurprisingly) very common German words, also known as stopwords in NLP.
3.2 2nd Attempt: Remove stopwords("german")
joint_dfm <- dfm(joint_corpus,
# groups = "Discipline",
remove = stopwords("german"),
remove_punct = TRUE,
remove_url = TRUE,
tolower = TRUE,
verbose = FALSE)
topfeatures(joint_dfm, 20) %>%
knitr::kable(format = "html", digits = 2)
x | |
---|---|
#dvpw18 | 1681 |
#dgs2018 | 1174 |
#histag18 | 1037 |
#dgs18 | 584 |
the | 515 |
of | 357 |
and | 337 |
#informatik2018 | 321 |
to | 314 |
amp | 224 |
a | 223 |
heute | 200 |
beim | 184 |
on | 175 |
panel | 172 |
dass | 168 |
@dvpwkongress | 166 |
for | 162 |
is | 150 |
demokratie | 145 |
Apart from English tokens (to, on, is, for, of, a), common German words such as “beim” or “dass” are still included. The latter is rather weird…
3.2.1 Inspect quanteda's built-in Stop Words
Since we've seen that "dass" is still included in our corpus, let's look, for example, at all quanteda::stopwords("german") entries starting with "da":
stopwords("german") %>%
as_tibble() %>%
filter(str_detect(value, pattern = "^da.*")) %>%
knitr::kable(format = "html", digits = 2)
value |
---|
da |
damit |
dann |
das |
daß |
dasselbe |
dazu |
Ok, "beim" is missing, and "daß" instead of "dass" suggests that quanteda's German stopword list might need an update… :)
Also, note how stopwords("german")
consists of 231 tokens. Just for comparison, tidytext::stop_words
has a total of 1149 stopwords for English. So we probably will have to include custom stopword lists repeatedly.
Of course, GitHub has you covered! Gene Diaz is maintaining a super-exhaustive list of stopwords for multiple languages: github.com/stopwords-iso
We’ll use the text file with the German stopwords: stopwords-de.txt
ger_stopwords <- read_lines("https://raw.githubusercontent.com/stopwords-iso/stopwords-de/master/stopwords-de.txt")
# saveRDS(ger_stopwords, "ger_stopwords.rds")
length(ger_stopwords) #> 621 stopwords
c(ger_stopwords, stopwords("german")) %>% length() #> 852
c(ger_stopwords, stopwords("german")) %>% unique() %>% length() #> 621
3.2.2 Include Custom Stopwords and Remove English Stopwords
c("dass", "beim") %in% ger_stopwords #> [1] TRUE TRUE
## [1] TRUE TRUE
# custom_stopwords <- c("dass", "beim")
custom_stopwords <- setdiff(ger_stopwords, stopwords("german")) # only keep left set
joint_dfm <- dfm(joint_corpus,
# groups = "Discipline",
remove = c(stopwords("german"),
stopwords("english"), # ONE
custom_stopwords, # TWO
min_nchar = 2), # THREE (nb: inside c() this only adds "2" to the removal
                # patterns; length filtering belongs in dfm_select())
remove_punct = TRUE,
remove_url = TRUE,
tolower = TRUE,
verbose = FALSE)
topfeatures(joint_dfm, 20) %>%
knitr::kable(format = "html", digits = 2)
x | |
---|---|
#dvpw18 | 1681 |
#dgs2018 | 1174 |
#histag18 | 1037 |
#dgs18 | 584 |
#informatik2018 | 321 |
amp | 224 |
panel | 172 |
@dvpwkongress | 166 |
demokratie | 145 |
@osymbaskanligi | 123 |
dgs | 121 |
innen | 114 |
#dgs | 113 |
@dvpw | 112 |
#dvpw2018 | 111 |
@historikertag | 109 |
bu | 106 |
de | 97 |
@dgsoziologie | 95 |
mı | 94 |
Great. But … what is “amp”???
3.2.3 Inspect "Keyword in Context" kwic() for "amp"
kwic(joint_corpus, "amp", window = 3) %>%
# vs. kwic(x, phrase("term1 term2"))
as_tibble() %>% # needed for kwic()
select(pre:post) %>%
head(10) %>%
knitr::kable(format = "html", digits = 2)
pre | keyword | post |
---|---|---|
@ifp_tuebingen ) & | amp | ; I organise |
@goetheuni ) & | amp | ; Jonas Wolff |
public goods & | amp | ; to prevent |
die Vorträge & | amp | ; Panelleitungen von |
von #InIIS & | amp | ; @BIGSSS_Bremen Kolleg |
Viel Erfolg & | amp | ; Spaß den |
" #PolComm & | amp | ; Digital Complexity |
, @SFB1342 & | amp | ; von überall |
bestimmt spannende & | amp | ; inspirierende Tage |
nach #Adorno & | amp | ; Co gehen |
"amp" is what is left of "&amp;", the HTML entity for the ampersand ("&"). The "&" and ";" are stripped when we tokenise with remove_punct = TRUE, so only "amp" remains. Cool.
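Alternatively (just a sketch, not the route taken here), one could decode the HTML entities in the raw tweet text before building the corpus:
# Sketch: decode common HTML entities in the raw text
# (joint_collection_clean is a hypothetical, illustrative name)
joint_collection_clean <- joint_collection %>%
  mutate(text = str_replace_all(text,
                                c("&amp;" = "&", "&lt;" = "<", "&gt;" = ">")))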
Remove “amp”
custom_stopwords <- c(custom_stopwords, "amp")
# custom_stopwords now also covers "amp" (and, a bit further down, "innen")
joint_dfm <- dfm(joint_corpus,
# groups = "Discipline",
remove = c(stopwords("german"),
stopwords("english"),
custom_stopwords,
min_nchar = 2),
remove_punct = TRUE,
remove_url = TRUE,
tolower = TRUE,
verbose = FALSE)
topfeatures(joint_dfm, 20) %>%
knitr::kable(format = "html", digits = 2)
x | |
---|---|
#dvpw18 | 1681 |
#dgs2018 | 1174 |
#histag18 | 1037 |
#dgs18 | 584 |
#informatik2018 | 321 |
panel | 172 |
@dvpwkongress | 166 |
demokratie | 145 |
@osymbaskanligi | 123 |
dgs | 121 |
innen | 114 |
#dgs | 113 |
@dvpw | 112 |
#dvpw2018 | 111 |
@historikertag | 109 |
bu | 106 |
de | 97 |
@dgsoziologie | 95 |
mı | 94 |
thema | 93 |
That’s better. But what’s up with “innen” and “bu”?
3.2.4 Inspect the Tokens "innen" and "bu" with kwic()
kwic(joint_corpus, "innen", window = 3) %>%
as_tibble() %>% # needed for kwic()
select(pre:post) %>%
head(10) %>%
knitr::kable(format = "html", digits = 2)
pre | keyword | post |
---|---|---|
@BIGSSS_Bremen Kolleg * | innen | . Da ist |
| ler * | innen | vor Ort #powi |
#powi Kolleg * | innen | von @UniBremen , |
. Kolleg / | innen | der @unihh im |
Kolleg / | innen | vom Institut für |
die Kolleg / | innen | mit dem Peer |
1000 Expert * | innen | sucht , die |
loyale Bürger / | innen | oder auch bürgerliche |
mit Rassist * | innen | , Antisemit * |
, Antisemit * | innen | , und Nazis |
Ok, so "*innen" / "/innen" is part of the gendered plural forms of German terms such as Kolleg*innen (colleagues), Bürger/innen (citizens), or Rassist*innen (racists). We might want to think of a robust solution here, which should not be stemming. Maybe ngrams = 2 or some clever regex that re-attaches "*" or "/" plus "innen" to the preceding word might help. But for now we'll just add "innen" as a stopword (a sketch of the compounding alternative follows right below).
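Just for reference, a minimal, untested sketch of such a compounding approach with quanteda's tokens_compound(); it would build the dfm from tokens instead of calling dfm(joint_corpus, ...) directly:
# Sketch: after remove_punct = TRUE, "kolleg" and "innen" end up as adjacent
# tokens, so a glob pattern can glue them back together into one feature
joint_tokens <- tokens(joint_corpus,
                       remove_punct = TRUE,
                       remove_url = TRUE) %>%
  tokens_tolower() %>%
  tokens_compound(pattern = phrase("* innen"), concatenator = "")
# joint_dfm <- dfm(joint_tokens, remove = c(stopwords("german"), custom_stopwords))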
custom_stopwords <- c(custom_stopwords, "innen")
joint_dfm <- dfm(joint_corpus,
# groups = "Discipline",
remove = c(stopwords("german"),
stopwords("english"),
custom_stopwords,
min_nchar = 2),
remove_punct = TRUE,
remove_url = TRUE,
tolower = TRUE,
verbose = FALSE)
topfeatures(joint_dfm, 20) %>%
knitr::kable(format = "html", digits = 2)
x | |
---|---|
#dvpw18 | 1681 |
#dgs2018 | 1174 |
#histag18 | 1037 |
#dgs18 | 584 |
#informatik2018 | 321 |
panel | 172 |
@dvpwkongress | 166 |
demokratie | 145 |
@osymbaskanligi | 123 |
dgs | 121 |
#dgs | 113 |
@dvpw | 112 |
#dvpw2018 | 111 |
@historikertag | 109 |
bu | 106 |
de | 97 |
@dgsoziologie | 95 |
mı | 94 |
thema | 93 |
bir | 89 |
4 The Turkish Plot-Twist
4.1 One last Mystery remains…
So what about "bu"?
(Spoiler: @osymbaskanligi seems to be an official governmental Turkish account… That already points at something bigger.)
kwic(joint_corpus, "bu", window = 3) %>%
as_tibble() %>% # needed for kwic()
select(pre:post) %>%
head(10) %>%
knitr::kable(format = "html", digits = 2)
pre | keyword | post |
---|---|---|
#dgs2018 keşke | bu | gece açıklansa 🙄 |
#dgs2018 | bu | gecenin en büyük |
otomatik olarak yapıyorsa | bu | bekleyiş neden ? |
sonuçlarını açıklamak neden | bu | kadar uzun sürüyor |
| Bu | yaşta bu kadar |
Bu yaşta | bu | kadar sıkıntı yeter |
@OSYMbaskanligi ceza mı | bu | ne şimdi #dgs2018 |
, DGS neden | bu | kadar itibarsızlaştırılıyor ? |
halde bekliyorum ve | bu | beni rahatsiz ediyo |
| Bu | #dgs2018 yüzünden internette |
Turkish? Huh? Seems like the Sociologists' hashtag (#dgs18, #dgs2018) was heavily used by the Turkish Twitter community, too.
dgs_collection %>% filter(lang == "tr") %>% count() %>%
knitr::kable("html")
n |
---|
957 |
Oopsie… From the Sociology Corpus, 957 Tweets out of 1674 are labelled as Turkish…
WHAT ELSE DID I MISS?
dgs_collection %>% group_by(lang) %>% count() %>% arrange(desc(n)) %>%
knitr::kable(format = "html", digits = 2)
lang | n |
---|---|
tr | 957 |
de | 570 |
en | 72 |
und | 68 |
in | 4 |
es | 1 |
fr | 1 |
tl | 1 |
Yikes! So my results from the previous post were totally biased in favour of the Sociology conference. And what language is "und", and what about the joint collection?
joint_collection %>% group_by(lang) %>% count() %>%
filter(n > 2) %>% arrange(desc(n)) %>%
knitr::kable(format = "html", digits = 2)
lang | n |
---|---|
de | 3156 |
tr | 957 |
en | 540 |
und | 166 |
nl | 20 |
sv | 6 |
es | 5 |
in | 5 |
fr | 4 |
no | 3 |
lang == "tr" and lang == "und" definitely need some closer inspection.
But before we do that, let's subset only the Tweets from the conferences' week. Maybe the Turkish and the German Sociology #dgs2018 did not overlap temporally…
Fortunately, rtweet has a really convenient time-series plotting function…
dgs_collection %>% filter(lang == "tr" | lang == "de") %>%
group_by(lang) %>%
rtweet::ts_plot() +
theme_minimal()
… unfortunately, I didn't use it last time…
Key Learning: I could have avoided stepping into this trap if I had plotted the distribution of Tweets over time… Exploratory Data Analysis FTW.
This is what I would've seen, had I done it right:
joint_collection %>%
group_by(Discipline) %>%
rtweet::ts_plot() +
theme_minimal()
See that left peak around the 20th? Arghhhh!
Time to move on. As we have already set an upper time limit above (filter(created_at < "2018-09-30")), we now only have to include a lower time limit:
joint_collection_week <- joint_collection %>%
filter(created_at > "2018-09-23")
joint_collection_week %>%
group_by(lang) %>%
count() %>%
filter(n > 2) %>%
arrange(desc(n)) %>%
knitr::kable(format = "html", digits = 2)
lang | n |
---|---|
de | 3068 |
en | 522 |
und | 120 |
tr | 92 |
nl | 19 |
sv | 6 |
es | 4 |
fr | 4 |
no | 3 |
Ok, that's already better, but there are still 92 Turkish Tweets in the sample, and 120 Tweets with lang == "und".
It turns out, however, that lang == "und" (un)fortunately means language undefined according to Twitter…
Let’s have a look at the remaining Turkish Tweets and then rebuild our corpora and the document-feature matrix.
joint_collection_week %>%
filter(lang == "tr") %>%
select(text) %>%
head() %>% # for website readability
knitr::kable(format = "html", digits = 2)
text |
---|
Çok şükür Allahım… #dgs2018 https://t.co/DAXRFoawzq |
Mimar sinan güzel sanatlarlı olduk be 😂 #Dgstercih #dgs2018 |
#dgs2018 derslerden muaf olmak için en çok 5 yıl önce ilgili programdan mezun olmak gerekir şartı var mı arkadaşlar????? Bilen birisi bilgilendirsin lütfen. |
#dgs2018 AÖF lisans dersleri DGS muafiyetinde kullanıyor mu?? |
Bilgi Üniversitesi’nde 2000li arkadaşlarımla hazırlık okuyacağım. Yaşlı olmak kontenjanından dışlanmam umarım🙄 #dgs2018 @BiLGiOfficial |
4 puan ile iktisat kaybettim sizce ek tercihlerde yazsam çıkar mı?#dgs2018 #DGSekyerleştirme #dgs #DGSE #dgsmovie3 #dgs |
So all these Tweets are exclusively in Turkish, and it is more than appropriate to exclude them from our analysis here AND in my preceding blog post. (Update incoming!)
But who’s going to tell the German Sociologists that my report from last week had to be corrected and that they didn’t perform that well, actually …!?
Key learning: Never rely on blind/numeric analyses only. Always do some qualitative exploration and check for plausibility - even when it's "only" for a blog!!!
4.2 Removing Turkish accounts
First, let's get a list of all the users who were part of the Turkish #dgs2018 sample (and since I've updated the previous post, I have collected some Turkish hashtags for lang == "und", so we'll just re-use this filter here).
tr_user <- dgs_collection %>%
filter(str_detect(text,"yks2018|yksdil|dgsankara|cumhuriyetüniversitesi|DanceKafe") | lang == "tr") %>%
select(user_id) %>%
distinct()
tr_user %>% count() %>%
knitr::kable(format = "html", digits = 2)
n |
---|
529 |
Now, we'll anti_join() the dgs_collection with this list:
dgs_collection %>% anti_join(tr_user, by = "user_id") %>% count() %>%
knitr::kable("html")
n |
---|
686 |
Let's compare with the simpler filter(lang != "tr") approach:
dgs_collection %>%
filter(lang != "tr") %>%
count() %>%
knitr::kable(format = "html", digits = 2)
n |
---|
717 |
That’s a difference of another ~30 Tweets. Nice. We’ll use that in a minute.
5 Rebuild the Corpus without Tweets with lang=="tr"
There are two ways to do that. One is lazy and one is more replication-friendly. I'll mention the lazy one, but will continue with the more robust approach.
The Lazy Way
(Two lazy ways, actually)
As we know that the Turkish Tweets are exclusively in the Sociology Corpus, we could just rebuild the dgs_corpus
and then rebuild the joint_corpus
with:
joint_corpus <- dvpw_corpus +
dgs_corpus +
hist_corpus +
inf_corpus +
mixed_corpus
An even easier way would be to filter out Turkish Tweets from the already existing joint_corpus
with quanteda::corpus_subset()
:
joint_corpus %>% corpus_subset(lang != "tr") # lang is carried over as a docvar
However, if you have been jumping back and forth within this post (or .Rmd Notebook), then you (= me) might have lost track of the various manipulations and state differences (e.g. think of custom_stopwords). Plus, the dgs_corpus would still need attention, too, and as we've filtered out a lot of Tweets by setting a lower time limit, the common time period for the corpora would differ, too. BAD!
5.1 Re-Build individual Corpora from Scratch
We'll just add filter(created_at > "2018-09-23"), and for the sake of robustness we'll filter all corpora for lang != "tr" and anti_join() the dgs_collection with tr_user.
As I want both discipline-specific corpora including the "multidisciplinary" Tweets and a redundancy-free joint corpus, I'll build the corpora in two steps.
dvpw_corpus <- dvpw_collection %>%
filter(created_at > "2018-09-23" & lang != "tr") %>% # combined filter()
# anti_join(mixed_collection, by = "status_id") %>%
corpus(docid_field = "status_id", text_field = "text")
## not totally sure about the effect of setting text_field = "text"
docvars(dvpw_corpus, "Discipline") <- "PolSci"
dgs_corpus <- dgs_collection %>%
filter(created_at > "2018-09-23" & lang != "tr") %>%
# anti_join(mixed_collection, by = "status_id") %>%
anti_join(tr_user, by = "user_id") %>%
corpus(docid_field = "status_id", text_field = "text")
docvars(dgs_corpus, "Discipline") <- "Sociology"
hist_corpus <- hist_collection %>%
filter(created_at > "2018-09-23" & lang != "tr") %>%
# anti_join(mixed_collection, by = "status_id") %>%
corpus(docid_field = "status_id", text_field = "text")
docvars(hist_corpus, "Discipline") <- "History"
inf_corpus <- inf_collection %>%
filter(created_at > "2018-09-23" & lang != "tr") %>%
# anti_join(mixed_collection, by = "status_id") %>%
corpus(docid_field = "status_id", text_field = "text")
docvars(inf_corpus, "Discipline") <- "CS"
mixed_corpus <- mixed_collection %>%
filter(created_at > "2018-09-23" & lang != "tr") %>%
corpus(docid_field = "status_id", text_field = "text") # 42 docs!
docvars(mixed_corpus, "Discipline") <- "Mixed"
Of course, eventually, we should update and rm() all the obsolete *_collection objects or simply consolidate all the "valid" Tweets in an .rds file (a minimal sketch follows below). However, I'm not so much into editing raw-ish / original data, so doing that is up to you.
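If you do want to consolidate, it could look roughly like this (the file name is hypothetical):
# Sketch: save the cleaned, de-duplicated tweets for later posts
joint_collection_week %>%
  anti_join(tr_user, by = "user_id") %>%
  saveRDS(here("data", "ConferenceTweets", "joint_collection_clean.rds"))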
5.2 Create Joint Corpus
joint_corpus <- dvpw_corpus +
dgs_corpus +
hist_corpus +
inf_corpus # 3,748 docs
joint_corpus <- joint_corpus %>% corpus_subset(!(docnames(joint_corpus) %in% mixed_collection$status_id)) # 3,706 docs
joint_corpus <- joint_corpus + mixed_corpus #3,748 docs
6 3rd Attempt @ dfm()-Building
joint_dfm <- dfm(joint_corpus,
# groups = "Discipline",
remove = c(stopwords("german"),
stopwords("english"),
custom_stopwords,
min_nchar = 3),
remove_punct = TRUE,
remove_url = TRUE,
# remove_numbers = TRUE,
tolower = TRUE,
verbose = TRUE)
## Creating a dfm from a corpus input...
## ... lowercasing
## ... found 3,748 documents, 16,371 features
## ... removed 608 features
## ... created a 3,748 x 15,763 sparse dfm
## ... complete.
## Elapsed time: 0.58 seconds.
We’re down to 15763 features from 3748 documents (Tweets) and were able to get rid of 608 features with the iteratively refined approach.
And as we might want to use a grouped dfm for group comparisons (and dfm_group
doesn’t seem to work for me here), we’ll create a custom grouped one, too.
## not working:
# dfm_group(joint_dfm, groups = "Discipline")
# OR
# dfm_group(joint_dfm, groups = c("Discipline"))
# OR
# dfm_group(joint_dfm, groups = docvars(joint_dfm, "Discipline"))
#> Error in qatd_cpp_is_grouped_numeric(as.numeric(x), group) :
#> (list) object cannot be coerced to type 'double'
joint_dfm_grouped <- dfm(joint_corpus,
groups = "Discipline",
remove = c(stopwords("german"),
stopwords("english"),
custom_stopwords,
min_nchar = 3),
remove_punct = TRUE,
remove_url = TRUE,
# remove_numbers = TRUE,
tolower = TRUE,
verbose = FALSE) #for website readability
7 Some quick Analyses
7.3 Top Features
Without Hashtags and Twitter @Nicks
# {r results="asis"}
features <- dfm_select(joint_dfm, pattern = list('#*',"@*"),
selection = "remove",
min_nchar = 3)
topfeatures(features, 40) %>%
knitr::kable("html")
x | |
---|---|
panel | 167 |
demokratie | 142 |
thema | 93 |
democracy | 86 |
sektion | 76 |
münster | 73 |
diskussion | 69 |
kongress | 67 |
frankfurt | 65 |
danke | 64 |
vortrag | 63 |
spannende | 60 |
digitale | 59 |
soziologie | 59 |
frage | 58 |
findet | 55 |
grenzen | 55 |
gesellschaft | 55 |
political | 54 |
spricht | 54 |
prof | 54 |
historiker | 54 |
stand | 53 |
2018 | 51 |
forschung | 51 |
fragen | 51 |
freuen | 50 |
steinmeier | 50 |
deutschen | 49 |
politikwissenschaft | 46 |
geht’s | 45 |
arbeit | 45 |
german | 44 |
new | 44 |
digitalen | 43 |
historikertag | 43 |
rede | 42 |
geschichte | 42 |
democratic | 41 |
eigentlich | 40 |
7.4 Top Features per Discipline
# {r results="asis"}
features <- dfm_select(joint_dfm, pattern = list('#*',"@*"),
selection = "remove",
min_nchar = 3)
topfeatures(features, 20, groups = "Discipline") %>%
map(knitr::kable, "html")
$PolSci
x | |
---|---|
demokratie | 130 |
panel | 105 |
democracy | 84 |
frankfurt | 60 |
political | 51 |
grenzen | 49 |
steinmeier | 47 |
politikwissenschaft | 45 |
democratic | 41 |
kongress | 38 |
rede | 36 |
frage | 35 |
representation | 34 |
diskussion | 34 |
thema | 31 |
politik | 29 |
podiumsdiskussion | 29 |
bundespräsident | 28 |
need | 28 |
dvpw | 27 |
$Sociology
x | |
---|---|
soziologie | 55 |
vortrag | 32 |
kongress | 26 |
göttingen | 23 |
panel | 17 |
spannende | 17 |
gesellschaft | 15 |
gruppe | 15 |
session | 14 |
raum | 14 |
plenum | 14 |
sektion | 13 |
danke | 12 |
freuen | 12 |
forschung | 12 |
thema | 12 |
diskussion | 12 |
sozialen | 12 |
vogel | 12 |
woche | 11 |
$History
x | |
---|---|
münster | 67 |
sektion | 56 |
historiker | 49 |
historikertag | 41 |
digitale | 40 |
geschichte | 36 |
panel | 33 |
geschichtswissenschaft | 30 |
stand | 29 |
poster | 29 |
thema | 28 |
van | 25 |
danke | 22 |
forschung | 21 |
geht’s | 21 |
digitalen | 21 |
history | 21 |
gespaltene | 21 |
freuen | 20 |
findet | 19 |
$CS
x | |
---|---|
thema | 21 |
prof | 16 |
zukunft | 15 |
digitale | 14 |
digitalisierung | 13 |
informatik | 13 |
panel | 12 |
stefan | 12 |
data | 12 |
berlin | 11 |
diskussion | 11 |
ullrich | 11 |
science | 10 |
keynote | 10 |
digital | 9 |
amt | 9 |
ranga | 9 |
erklärt | 8 |
arbeit | 8 |
gesellschaft | 8 |
$Mixed
x | |
---|---|
woche | 3 |
findet | 3 |
tweets | 3 |
minuten | 3 |
gibts | 3 |
digitalen | 3 |
frankfurt | 2 |
kolleg | 2 |
sehen | 2 |
kongresswoche | 2 |
good | 2 |
laufen | 2 |
gemeinsam | 2 |
interessant | 2 |
begriff | 2 |
deutlich | 2 |
stärker | 2 |
derzeit | 2 |
politikwissenschaftler | 2 |
digitalisierung | 2 |
There are still a couple of features that would qualify for removal (such as "unsere", "mal", "schon"), but we will take care of that with the super-useful quanteda::dfm_trim() threshold-based (term-frequency) feature selection; a quick sketch follows below. Plus, removing or parsing hashtags and applying dfm(stem = TRUE) would largely increase descriptive accuracy.
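Such a trimming step could look like this (note: the argument names depend on your quanteda version; older releases use min_count instead of min_termfreq):
# Sketch: drop rare features via frequency thresholds
joint_dfm_trimmed <- dfm_trim(joint_dfm,
                              min_termfreq = 5, # feature occurs at least 5 times...
                              min_docfreq = 2)  # ...and in at least 2 documents
nfeat(joint_dfm_trimmed)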
7.5 Network of feature co-occurrences
topfeats <- names(topfeatures(joint_dfm, 60))
textplot_network(dfm_select(joint_dfm, topfeats), min_freq = 0.8)
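For comparison, a sketch of the fcm-based variant (feature co-occurrences counted within tweets), re-using the topfeats from above:
# Sketch: build a feature co-occurrence matrix first, then plot the same top features
joint_fcm <- fcm(joint_dfm)
textplot_network(fcm_select(joint_fcm, pattern = topfeats), min_freq = 0.8)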
7.6 Grouped wordcloud of features
“wordcloud, where the feature labels are plotted with their sizes proportional to their numerical values in the dfm” (Vignette for
quanteda::textplot_wordcloud
)
textplot_wordcloud(joint_dfm_grouped,
comparison = TRUE,
min_size = 0.5,
max_size = 3,
max_words = 60)
8 What’s next?
Topic Modelling!
Now that I had to spend quite an amount of time on data processing, cleaning, and eventually building a usable dfm, the actual analysis of the contents remains a #TODO.
However, I learned a lot doing this today, and I hope it became obvious that dealing with text as data is not to be underestimated.
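As a small teaser of where this is heading, a minimal sketch (assuming the topicmodels package is installed) of how the dfm we just built plugs into a topic model:
# Sketch: convert the quanteda dfm and fit a small LDA topic model
library(topicmodels)
joint_dtm <- convert(joint_dfm, to = "topicmodels")
joint_lda <- LDA(joint_dtm, k = 10, control = list(seed = 42))
terms(joint_lda, 10) # top 10 terms per topic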