This blog post is a version of a working paper by Stijn Peeters (1), Marc Tuters (1), Tom Willaert (2), Paul Van Eecke (2), Jeroen Van Soest (2) and Katrien Beuls (2). Affiliations: (1) OILab, University of Amsterdam; (2) Artificial Intelligence Lab, Vrije Universiteit Brussel.
Over the last decade, a diverse and increasingly influential far-right online media sphere has emerged. This has raised concerns that parts of this sphere may function as incubators for radicalization. In our research, we found that this indeed seems to be the case. Racist or otherwise extreme vernacular of the vitriolic “politically incorrect” sub-forum of 4chan was later used in the comment section of Breitbart News, a right-wing news website that was found in earlier research to have played a significant ‘agenda-setting’ role in the 2016 U.S. presidential elections. This suggests that this extreme discourse is indeed becoming more mainstream. In this blog post, we briefly present our initial findings and the underlying method. This research is work-in-progress!
Content warning: Since this text concerns the spread of extreme terms, it displays instances of racist, homophobic, anti-Semitic, and other forms of hate speech.
Our research interests can be summarized in the following questions:
- Can we find evidence of 4chan speech that later became part of the language on Breitbart?
- How can we characterize the words used on 4chan as compared to those used on Breitbart?
- What kind of speech is later adopted on Breitbart, and is it cause for concern?
We collected two datasets from two separate platforms – 4chan’s ‘politically incorrect’ /pol/ forum and comments on Breitbart News – covering 21 months centered around the 2016 U.S. elections. After filtering, we compared the monthly occurrence of individual words between the two datasets. We find that there are far more terms that appear only in the language of 4chan/pol/ than in the language of Breitbart comments. Additionally, of the terms that over time are prominent first in one dataset and later in both, those that first appear on 4chan are often highly political, and can furthermore be characterised as racist, homophobic, or otherwise extreme. These extreme terms are then later observed on Breitbart. Though a direct causal relationship is difficult to ascertain, this suggests that 4chan, an active but non-mainstream niche site, has an outsized impact that reaches beyond its own confines. As such, studies of the spread of extreme discourse and misinformation should consider the platform as a potential originator or incubator, where such discourse appears before it occurs in more mainstream discussion spaces.
Context & implications
This research adds to a growing body of work (e.g. Beran 2019) that questions the view of 4chan as a ‘fringe’ platform. Despite its ‘fringeness’, the platform’s discourse seems to resonate beyond its own borders. Since much of the discourse on 4chan/pol/ can be characterized as conspiratorial, racist or otherwise extreme (see also Tuters & Hagen 2019), its later occurrence on other platforms is of great interest to those studying the dissemination of extremism and misinformation. Indeed, a key implication of our work is that 4chan/pol/ might give an early impression of problematic discourse that may come to be used by a wider audience later on. As such, continued observation of the language disseminated through these fringe platforms – for which we offer one methodological blueprint – can benefit journalists, researchers and policy makers seeking to signal the emergence of new extreme discourses such as those that have more recently gained prominence in the 2020 U.S. election campaign.
More fundamentally, our findings speak to the much-debated relationship between the ‘bottom’ of the Internet – consisting of niche, often politically extreme sites – and mainstream sites. The observation and study of this ‘bottom’ has acquired urgency as ideas and vernacular originating in these parts have been implicated in a number of far-right terrorist attacks in the United States, Canada, and New Zealand. Furthermore, sites like 4chan serve as incubators for various successful conspiracy theories – e.g. Pizzagate (Tuters et al. 2018) and QAnon (Hagen et al. 2019).
There is therefore legitimate concern that the subculture associated with these sites could insinuate an extreme and conspiratorial discourse into the American political debate, and indications of this are apparent in the 2020 election campaign. QAnon, particularly, has become influential (see Stanley-Becker 2020), and the polarised language we study reflects the polarised nature of the recent American political discourse. Our findings suggest that four years earlier, in 2016, this political vocabulary associated with extreme right-wing politics consistently appears on 4chan first, and then on the more mainstream Breitbart News later (see Figure 1).
The metaphor of the Internet as having a ‘bottom’ implies the existence of further ‘layers’. At the top we find big media conglomerates, often rooted in ‘legacy media’. Here one would find the major newspapers such as the New York Times, cable broadcasters such as CNN, and newer online-first outlets like Vox. As one moves ‘down’, platforms grow more obscure, with a smaller reach and less clear editorial or content policies, if any. At the bottom one finds ‘fringe’ sites, obscurely subcultural; this ‘deep vernacular Web’ (De Zeeuw & Tuters 2020) can appear culturally baffling as well as offensive to the uninitiated. Sites in this stratum usually have a relatively small number of visitors compared to mainstream sites. 4chan is particularly relevant here, as a fringe platform that has nevertheless been scrutinised for its production of internet memes (Bernstein et al. 2011), unique subcultural practices (Nissenbaum and Shifman 2017) as well as language innovation (Tuters & Hagen 2019), despite its relatively small though not inconsequential audience.
Our findings are based on two datasets from a period centered around the 2016 U.S. elections, one from 4chan/pol/ and one from the comment section of Breitbart News, a conservative, right-wing American news website especially popular during this period, as a staunch supporter of eventual winner Donald Trump. Although it has been described as “factually dubious” (Guess et al. 2018), the site occupied a crucial place in the political media ecosystem at the time. Benkler et al. (2018) offered an in-depth study of Breitbart’s ‘agenda-setting’ role in that election, which was however limited to its editorial content. We instead study the comment sections of Breitbart’s articles that routinely receive thousands of comments, many only tangentially related to the article’s subject. These appear to be moderated loosely, if at all: a 2017 report cites Disqus, which provides the technology on which Breitbart’s comment section runs, promising that Breitbart “[wants] to work with us to figure out ways to minimize [hate speech]” (Captain, 2017); notably, this issue arose after the period from which we collected data. Appearing to function like a largely unmoderated discussion forum, the comment threads can thus serve to study the political views and discourse of a highly active element of far-right politics that has moved increasingly to the center of the American Republican party.
Our findings indicate that antagonistic, problematic words first observed on 4chan later entered the discourse in the comment section of Breitbart News, a much more mainstream platform with important connections to the current US presidential administration, as its editor-in-chief later joined Trump as his principal political strategist (Green 2017). While earlier research has investigated crossing over of particular ideas (e.g. the aforementioned conspiracy theories), our study provides descriptive evidence that suggests that this crossing-over also occurs on a language level, and is not only bound to specific theories or ideas. This further supports previous observations about the sustained connection between 4chan/pol/ and Breitbart’s comment section.
One possible explanation for this is that some 4chan posters also frequent Breitbart’s comment section. It would not be surprising if they used the language they were familiar with there, which would explain why it occurs in both spaces. Tracing whether actors move between these platforms is difficult, however, because 4chan is designed as an anonymous platform (Knuttila 2011). Moreover, 4chan posters are notoriously derisive of ‘mainstream media’ and typically dismiss Breitbart, too, as insufficiently extreme. Although Breitbart has been described as having an “extreme right-wing bias” (Mediabiasfactcheck 2020), it is seen as a place for ‘normies’: in the vernacular, those who are not up to date with online subcultures and/or who adhere to common social norms (De Zeeuw et al. 2020). Nevertheless, if some 4chan posters do comment on Breitbart News, this would be a way for not only this vernacular, but also the extreme political positions it implicitly and explicitly refers to, to spread to a new audience.
Our data is visualized in Figure 2. The datasets are split into terms – each word, after filtering out hyperlinks and punctuation, is a term. For each term we can then determine in which of the datasets it occurs on a per-month basis. This allows for the following observations:
1. The language of 4chan/pol/ contains more unique terms than that of Breitbart’s comments.
Our analysis shows that there are more terms unique to the /pol/ dataset than there are terms unique to the Breitbart dataset. Of all 66,859 terms, 25,511 (38.2%) were classified as occurring in the 4chan/pol/ dataset only, while 3,712 (5.6%) were classified as occurring only in the Breitbart dataset. 4chan has been described as a source of subcultural and linguistic innovation (Nissenbaum & Shifman 2017). Our finding supports this description empirically, at least concerning the unique use of language on the forum. Both datasets are in English and concern political discussion focused primarily on the United States context. Thus, while some variation may be expected, one would in principle expect the language used in both to be similar; this is only partially the case.
4chan’s vernacular has been referred to as ‘chanspeak’: “peculiar in-group misspellings” characterized by “shortening, simplifying and cutting down words” (Fiorentini 2013; Herring 2012). While this is perhaps true for the broader 4chan vernacular, the /pol/ slang we found is not adequately captured by this description. This can be attributed to the rapid linguistic innovation on this forum (Tuters & Hagen 2019). The terms we find are more adequately described as ‘phrasal memes’, highly self-referential ‘remixes’ of words – e.g. ‘cuckerberg’ (a combination of ‘cuck’ and ‘Zuckerberg’). While a proper linguistic analysis of this vernacular is outside the scope of this article, the dataset on offer here could in the case of /pol/ serve as a starting point for such a study.
2. A substantial number of terms are first observed only in the language on 4chan but later also on Breitbart.
Terms that occur in one dataset first and later in the other dataset or in both datasets can be observed in both ‘directions’; some occur first in the Breitbart dataset while others occur first on 4chan/pol/. In total, 1,803 terms (2.7%) follow such a pattern. Of these, 702 (38.9%) occur on 4chan first, while 1,101 (61.1%) occur on Breitbart first. This seems counterintuitive; it would imply that most of these terms are first anchored in the language of Breitbart and only later in that of 4chan, which is difficult to reconcile with 4chan’s reputation as a more innovative linguistic space as established in Finding 1. On the other hand, Breitbart is a far larger and arguably more influential platform, and agenda-setting power may be attributed to it in that capacity. From this perspective, the fact that a substantial number of terms occur on 4chan first at all is significant.
A closer look at these terms reveals that they can be divided into two broad categories – ‘named entities’ and ‘neologisms’. Named entities are all terms that are proper names, for example countries and people. The other category, ‘neologisms’, comprises words that are neither common English nor otherwise ‘normal’ discourse. In practice, these terms are mostly various slurs and memetic phrases that are associated with 4chan discourse.
The named entities cannot reasonably be assumed to originate on either platform. Instead, the likely explanation for the occurrence of these is that they refer to people, places or organizations that were discussed because they were relevant to a current event or news item. This indicates that e.g. Breitbart discussed these topics before 4chan did so, which is interesting insofar as it provides insight into the type of topics discussed by both forums and how rapidly they enter the discourse. The ‘neologisms’, on the other hand, are likely to originate in the vernacular of online platforms; indeed, some (including the examples in Figure 1) can be associated with discourse known to have originated or been popularized on 4chan (see Papasavva et al. 2020). As such, the fact that they appear on 4chan and later on Breitbart suggests that there is some sort of connection between the two, either directly or via an intermediary platform.
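The shares reported in Findings 1 and 2 can be re-derived from the raw term counts given above. The snippet below is only an arithmetic check; the variable names and the `pct` helper are ours and are not part of the original analysis code.

```python
# Term counts as reported in Findings 1 and 2.
total_terms = 66_859
pol_only = 25_511
breitbart_only = 3_712
crossing = 1_803          # terms first seen in one dataset, later in the other or both
pol_first = 702
breitbart_first = 1_101


def pct(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


shares = {
    "4chan/pol/ only (of all terms)": pct(pol_only, total_terms),
    "Breitbart only (of all terms)": pct(breitbart_only, total_terms),
    "crossing over (of all terms)": pct(crossing, total_terms),
    "4chan first (of crossing terms)": pct(pol_first, crossing),
    "Breitbart first (of crossing terms)": pct(breitbart_first, crossing),
}

for label, value in shares.items():
    print(f"{label}: {value}%")
```

The two ‘directions’ add up to the 1,803 crossing terms, so the 38.9% and 61.1% shares are relative to that subset rather than to the full vocabulary.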
3. Many terms that appear first on 4chan/pol/ and later also on Breitbart reflect an extreme far-right politics.
Of these terms that can be assumed to originate on 4chan/pol/, most are implicitly or explicitly related to far-right and conspiratorial theories or ideas. This is not surprising, since 4chan/pol/ itself has been described as a ‘kind of petri dish for concocting extreme and extremely virulent forms of right-wing populist antagonism’ (Tuters & Hagen 2019). Of the words that appear first on 4chan (see also Figure 1) several are emblematic of an extreme political discourse, such as ‘cuckerberg’ (a jab at Facebook owner Mark Zuckerberg combined with the sexual slur ‘cuckold’; another variation found was ‘cucktard’); ‘blacking’ (a racist term referring to sexual relations between black men and white women; other racist terms we found were ‘muzzies’, ‘germanistan’ and ‘britainistan’); words like ‘xhe’ (used mockingly to insult transgender people) or ‘phaggot’ (a homophobic slur); and various slurs aimed at liberal US voters like ‘shitlib’ and ‘berniebot’.
While 4chan/pol/ is well-known as a far-right discussion space (Hine et al. 2017; Ludemann 2018), our data and analyses show that this discourse is not confined to this ‘fringe’ platform but, after initial usage there, also appears on more mainstream platforms. More specifically, the various racist or otherwise extreme slurs and ‘phrasal memes’ that are developed and incubated on 4chan/pol/ in some cases see uptake in the comments on Breitbart News. As most of this language is unambiguous, and hard to mistake for anything other than derogatory, this raises concerns that not only the language but also the extreme political discourse associated with it is shared across sites.
Research questions and hypothesis
Our goal is to answer the questions of (1) whether we can empirically identify terms that are first prominent in the language on 4chan/pol/ and later also in the language of the Breitbart comment sections, (2) how to characterize the language used on 4chan/pol/ compared to that of Breitbart’s comment sections, and (3) whether the nature of these identified terms is cause for concern. In this context, we start from the hypothesis that we can observe this pattern for terms associated with far-right thought, and that this constitutes a mainstreaming of ‘extreme’ discourse. In order to evaluate this hypothesis and answer the research questions, we combined quantitative and qualitative methods.
Concerning the first question, we investigated a corpus of posts and comments from both 4chan and Breitbart, using basic Natural Language Processing (see e.g. Willaert et al. 2020 more generally on the NLP-based processing of Web datasets) to extract relevant language and study its occurrence patterns.
The dataset under scrutiny comprises a corpus of comments from Breitbart news articles (28,684,909 comments from 2,779 articles), and a corpus of posts from 4chan/pol/ (71,732,937 comments in 40,933 threads). Both corpora comprise posts from July 1, 2015 onwards, up to and including March 31, 2017, i.e. the period between the announcement of Donald Trump’s candidacy for the US presidential elections (June 2015) and his first months in office.
The 4chan data was sourced from 4plebs.org, a third-party 4chan archive that claims to “archive all threads and images from specific imageboards on 4chan” since 2013. This data is collected continuously (i.e. as it is posted on 4chan) and data ‘dumps’ are published regularly on The Internet Archive, containing all posts made on a number of 4chan’s boards, including /pol/. 4plebs’ data has been used as a ‘complete’ data set in other research on 4chan (see e.g. Tuters & Hagen 2019; Voué et al. 2020). Notably, posts are included even if they are later deleted from the site (all posts eventually disappear from 4chan, as threads are deleted after a period of inactivity). This data was loaded into 4CAT, a forum analysis toolkit (Peeters & Hagen 2018) to filter for posts relevant to the analysis period.
Breitbart data was collected between 2 and 9 September 2019 using a custom scraper written in Python which first crawled breitbart.com for internal links to create an index of all articles posted on the site, and then collected all comments for all articles posted between July 1, 2015 and March 31, 2017, using the Disqus API. Comments outside of this date range were ignored. The resulting dataset reflects the state of the comment section as it was at the moment of scraping. There is a possibility that some comments were removed between the moment of posting and the moment of scraping, up to 4 years later; however, as mentioned earlier, Breitbart’s moderation policy seems to have been lax during the period we study, and it is unlikely that later policies were enacted retroactively. We therefore assume that the data is a reasonably accurate reflection of what the comment threads would have looked like closer to the date the comments were posted.
Analysis of over-time language occurrence patterns
The scraped comments and posts were first cleaned by removing punctuation, URLs, HTML tags (in Breitbart comments), and comment referral numbers (in /pol/ comments). These texts were then tokenized (split into individual words) and grouped into separate token sets per month. For both corpora, terms that occurred fewer than 200 times were removed for each month. Next, to reduce the influence of terms that occurred very rarely relative to the full data for a month, and were therefore of limited indicative value, terms were removed per month if their relative frequency, i.e. their absolute frequency divided by the total number of terms in that month, was less than 0.00001%.
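To make the cleaning and tokenization step concrete, the following is a minimal sketch of how it could be implemented. The regular expressions, the order of operations and the function name `clean_and_tokenize` are our assumptions for illustration; the exact rules used in the study are not reproduced here.

```python
import html
import re

# Assumed patterns; the study's exact cleaning rules may differ.
HTML_TAG_RE = re.compile(r"<[^>]+>")            # e.g. <p>, <br>, <a href="...">
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # hyperlinks
POST_REF_RE = re.compile(r">>\d+")              # 4chan references such as >>280207128
PUNCT_RE = re.compile(r"[^\w\s]")               # anything that is not a word char or whitespace


def clean_and_tokenize(text: str) -> list[str]:
    """Strip markup, URLs, post references and punctuation; return lower-cased tokens."""
    text = HTML_TAG_RE.sub(" ", text)
    text = html.unescape(text)        # turn entities such as &gt; back into characters
    text = URL_RE.sub(" ", text)
    text = POST_REF_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text.lower())
    return text.split()               # whitespace tokenization


tokens = clean_and_tokenize(
    "<p>Check &gt;&gt;280207128 and https://example.com/a?b=1, Hello!</p>"
)
print(tokens)
```

Note that replacing punctuation with a space splits contractions (for example “don't” becomes two tokens); a real pipeline would have to decide how to treat such cases.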
After this final filtering by the terms’ relative frequency, we classified each term into one of four classification bins, indicating whether for a given month the term (1) occurred neither on /pol/ nor on Breitbart, (2) occurred on both /pol/ and Breitbart, (3) occurred exclusively on /pol/, or (4) occurred exclusively on Breitbart. For each term, this resulted in a sequence of classification bins, which can be visualized using color coding (see Figures 1 and 2). For a more detailed discussion and evaluation of our filtering and categorisation method, please refer to the Methodological Appendix.
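The four-way monthly classification can be sketched as follows. This is an illustrative reading of the description above, not the original code: the function names and the representation of each dataset as one set of retained terms per month are assumptions, while the letter codes N, O, P and B follow the abbreviations used in the appendix.

```python
def classify_month(on_pol: bool, on_breitbart: bool) -> str:
    """Return the bin for one term in one month."""
    if on_pol and on_breitbart:
        return "O"   # overlap: both /pol/ and Breitbart
    if on_pol:
        return "P"   # exclusively /pol/
    if on_breitbart:
        return "B"   # exclusively Breitbart
    return "N"       # neither platform


def bin_sequence(term: str,
                 pol_by_month: list[set[str]],
                 breitbart_by_month: list[set[str]]) -> str:
    """Build the month-by-month bin sequence for a term.

    Each list holds, per month, the set of terms that survived filtering
    in that dataset (hypothetical input format).
    """
    return "".join(
        classify_month(term in pol, term in bb)
        for pol, bb in zip(pol_by_month, breitbart_by_month)
    )


# Toy example with three months of made-up data.
pol = [{"alpha"}, {"alpha"}, set()]
breitbart = [set(), {"alpha"}, {"alpha"}]
print(bin_sequence("alpha", pol, breitbart))
```

Encoding each month as a single letter keeps the sequences compact and makes later pattern matching a matter of simple string handling.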
The initial quantitative approach yielded a subset of terms that warranted further scrutiny; we were particularly interested in those terms that were first observed as prominent on 4chan/pol/ and were later also observed on Breitbart. Our approach here, concerning the second and third research questions, was to first remove any obvious named entities (people, countries, institutions) from the list, as well as common English words. The remaining tokens could then be analyzed in more detail via a closer reading, in which the context and occurrence of the token on 4chan/pol/ as well as on other platforms is studied via manual querying of 4plebs (a searchable archive of /pol/) and 4CAT (a modular web platform scraping tool). Here we retained words with a clear political (sub)text, similar to those shown in Figure 1.
References
Benkler, Y., Faris, R., & Roberts, H. (2018). Network propaganda: Manipulation, disinformation, and radicalization in American politics. Oxford University Press.
Beran, D. (2019). It Came From Something Awful: How a Toxic Troll Army Accidentally Memed Donald Trump Into Office. New York: All Points Books.
Bernstein, M. S., Monroy-Hernández, A., Harry, D., André, P., Panovich, K., & Vargas, G. (2011, July 5). 4chan and /b/: An Analysis of Anonymity and Ephemerality in a Large Online Community. Fifth International AAAI Conference on Weblogs and Social Media. https://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2873
Captain, S. (2017). Disqus Grapples With Hosting Toxic Comments On Breitbart And Extreme-Right Sites. Fast Company. https://www.fastcompany.com/3068698/disqus-grapples-with-hosting-toxic-comments-on-breitbart-and-extreme-right-sites
De Zeeuw, D., & Tuters, M. (2020). The Internet Is Serious Business: On the Deep Vernacular Web Imaginary. Cultural Politics, 16(2).
Fiorentini, I. (2013). ‘ZOMG! Dis Is a New Language’: The Case of Lolspeak. Newcastle Working Papers in Linguistics, 13(1), 90–108.
Green, J. (2017). Devil’s Bargain: Steve Bannon, Donald Trump, and the Storming of the Presidency. New York: Penguin Books.
Guess, A. M., Nyhan, B., & Reifler, J. (2018). Selective Exposure to Misinformation: Evidence from the consumption of fake news during the 2016 U.S. presidential campaign [Working Paper]. http://www.ask-force.org/web/Fundamentalists/Guess-Selective-Exposure-to-Misinformation-Evidence-Presidential-Campaign-2018.pdf
De Zeeuw, D., Hagen, S., Peeters, S., & Jokubauskaitė, E. (2020). Tracing Normiefication: A Cross-Platform Analysis of the QAnon Conspiracy Theory. First Monday. https://firstmonday.org/ojs/index.php/fm/article/view/10643/9998.
Herring, S. C. (2012, April 3). Special internet language varieties: Culture, creativity, & language change. LETiSS-Languages go web: Standard and nonstandard languages on the Internet, Pavia.
Hine, G. E., Onaolapo, J., De Cristofaro, E., Kourtellis, N., Leontiadis, I., Samaras, R., Stringhini, G., & Blackburn, J. (2016). Kek, Cucks, and God Emperor Trump: A Measurement Study of 4chan’s Politically Incorrect Forum and its Effects on the Web. ArXiv Preprint ArXiv:1610.03452.
Jokubauskaitė, E., & Peeters, S. (2020). Generally Curious: Thematically Distinct Datasets of General Threads on 4chan/pol/. Proceedings of the Fourteenth International AAAI Conference on Web and Social Media. https://www.aaai.org/ojs/index.php/ICWSM/article/view/7351.
Knuttila, L. (2011). User unknown: 4chan, anonymity and contingency. First Monday, 16(10). https://doi.org/10.5210/fm.v16i10.3665
Ludemann, D. (2018). /pol/emics: Ambiguity, scales, and digital discourse on 4chan. Discourse, Context & Media, 24, 92–98. https://doi.org/10.1016/j.dcm.2018.01.010
Mediabiasfactcheck contributors. (2020, May 23). Breitbart. Mediabiasfactcheck. https://mediabiasfactcheck.com/breitbart/
Nissenbaum, A., & Shifman, L. (2017). Internet Memes as Contested Cultural Capital: The Case of 4chan’s /b/ Board. New Media & Society, 19(4), 483–501. https://doi.org/10.1177/1461444815609313
Papasavva, A., Zannettou, S., De Cristofaro, E., Stringhini, G., & Blackburn, J. (2020). Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board. Proceedings of the International AAAI Conference on Web and Social Media, 14(1), 885-894.
Peeters, S., & Hagen, S. (2018). 4CAT Capture and Analysis Toolkit. Computer Software, v1.0. https://github.com/digitalmethodsinitiative/4cat.
Stanley-Becker, I. (2020). How the Trump campaign came to court QAnon, the online conspiracy movement identified by the FBI as a violent threat. Washington Post, 2 August 2020.
Tuters, M., Jokubauskaitė, E., & Bach, D. (2018). Post-truth protest: how 4chan cooked up the Pizzagate Bullshit. M/c Journal, 21(3).
Tuters, M., & Hagen, S. (2019). (((They))) rule: Memetic antagonism and nebulous othering on 4chan. New Media & Society, 1461444819888746. https://doi.org/10.1177/1461444819888746
Venturini, T. & Latour, B. (2009). The Social Fabric: Digital footprints and quali-quantitative methods. Proceedings of future en seine, 87-103.
Voué, P., De Smedt, T., & De Pauw, G. (2020). 4chan & 8chan embeddings. arXiv preprint arXiv:2005.06946. https://arxiv.org/abs/2005.06946
Willaert, T., Van Eecke, P., Beuls, K., & Steels, L. (2020). Building social media observatories for monitoring online opinion dynamics. Social Media + Society. https://doi.org/10.1177/2056305119898778
Methodological Appendix
Pipeline for generating occurrence sequences of terms based on relative frequencies
We generated occurrence sequences for words on 4chan/pol/ and Breitbart according to a seven-step process:
First, we cleaned the posts or comments in each dataset by performing case-folding and removing HTML tags, URLs and punctuation. In the case of the 4chan posts, we also removed references to other posts (e.g. ‘>>280207128’).
Secondly, each comment or post was split into individual words (tokens).
Thirdly, we counted the number of occurrences of each word for each dataset for each month.
Fourthly, from the outputs of step three, we discarded the words that had an overall frequency of less than 200 in a dataset, as this concerned mostly noise such as typos.
In a fifth step, we calculated the relative monthly frequency of each word for both datasets, i.e. the number of times a word occurs divided by the total number of words in the dataset, for a given month. For words that had a relative frequency of less than 0.00001%, we considered the relative frequency to be 0.
Next, for each month and each word, we compared the relative frequency of the word in both datasets. We thus determined whether for a given month the word occurs exclusively on 4chan/pol/, exclusively on Breitbart, on both Breitbart and /pol/, or on neither platform. This yields a sequence of monthly classifications for each word.
Finally, these granular sequences were classified according to their overall patterns (see Fig 1 and Fig 2 below). Our analysis focused specifically on those words that occur first in the /pol/ dataset and later on Breitbart (including those terms that later occur on both platforms). The patterns were classified according to the descriptions in Appendix Table 1 and Appendix Table 2.
Both thresholds (200 occurrences, or 0.00001% relative frequency) were decided inductively, to eliminate a baseline of nonsensical or meaningless words.
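Steps three to five of the pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the function name, the input format (one list of tokens per month) and the choice to compute monthly totals over all tokens, including those later discarded, are ours. The 200-occurrence threshold is applied to the overall frequency within a dataset, following the wording of step four.

```python
from collections import Counter

MIN_COUNT = 200                 # minimum overall occurrences within one dataset
MIN_REL_FREQ = 0.00001 / 100    # 0.00001% expressed as a fraction


def monthly_presence(monthly_tokens: list[list[str]],
                     min_count: int = MIN_COUNT,
                     min_rel_freq: float = MIN_REL_FREQ) -> list[set[str]]:
    """Return, per month, the set of words counted as 'occurring' in a dataset."""
    # Step 3: count occurrences of each word per month.
    monthly_counts = [Counter(tokens) for tokens in monthly_tokens]

    # Step 4: discard words with a low overall frequency in the dataset.
    overall = sum(monthly_counts, Counter())
    kept = {word for word, n in overall.items() if n >= min_count}

    # Step 5: relative monthly frequency; values below the threshold count as 0.
    presence = []
    for counts in monthly_counts:
        total = sum(counts.values())
        presence.append(
            {w for w in kept if total and counts[w] / total >= min_rel_freq}
        )
    return presence


# Toy data: 'a' passes the overall threshold, 'b' does not, 'c' only occurs in month 3.
toy = [["a"] * 150 + ["b"] * 10, ["a"] * 100 + ["b"] * 5, ["c"] * 300]
print(monthly_presence(toy))
```

Running this function once per dataset yields the two per-month term sets that the subsequent comparison step needs.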
| Abbreviation | Meaning |
|---|---|
| O | ‘Overlap’; Breitbart + 4chan/pol/ |
Appendix Table 1: abbreviations used to describe word occurrence sequences.
| Pattern | Description |
|---|---|
| B_cst | Word occurrence sequence starts with Breitbart, followed by any number of occurrences in Breitbart or None. |
| O_cst | Word occurrence sequence starts with Overlap, followed by any number of occurrences in Overlap or None. |
| P_cst | Word occurrence sequence starts with 4chan/pol/, followed by any number of occurrences in 4chan/pol/ or None. |
| N_B | The word occurrence sequence’s first ‘switch’ consists of a switch from None to Breitbart, followed by any number of occurrences in Breitbart or None. |
| N_O | The word occurrence sequence’s first ‘switch’ consists of a switch from None to Overlap, followed by any number of occurrences in Overlap or None. |
| N_P | The word occurrence sequence’s first ‘switch’ consists of a switch from None to 4chan/pol/, followed by any number of occurrences in 4chan/pol/ or None. |
| B_O | The word occurrence sequence’s first switch is from Breitbart to Overlap. |
| P_O | The word occurrence sequence’s first switch is from 4chan/pol/ to Overlap. |
| P_B | The word occurrence sequence’s first switch is from 4chan/pol/ to Breitbart. |
| B_O_P | The word occurrence sequence’s first switch is from Breitbart to Overlap, the last switch is from Overlap to 4chan/pol/. |
| P_O_B | The word occurrence sequence’s first switch is from 4chan/pol/ to Overlap, the last switch is from Overlap to Breitbart. |
| other | The word occurrence sequence does not match any of the patterns described above. In practice, this mostly concerns words that are predominantly and initially in the Overlap category, but occur in another category for a limited number of months. |
Appendix Table 2: Descriptions of word occurrence sequence patterns used in classification.
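The pattern descriptions in Appendix Table 2 leave some room for interpretation, for instance in how months classified as None are treated between two switches. The sketch below shows one possible reading, in which each monthly sequence is a string over the letters N, O, P and B and None months are ignored when looking for switches between B, O and P. Both that choice and the function name `classify_pattern` are assumptions made for illustration.

```python
from itertools import groupby


def classify_pattern(sequence: str) -> str:
    """Map a monthly bin sequence (e.g. 'NPPOO') to a pattern label from Table 2."""
    kinds = set(sequence) - {"N"}
    if not kinds:
        return "other"                     # never observed on either platform

    if len(kinds) == 1:
        kind = next(iter(kinds))
        # Only one platform state ever occurs: constant or preceded by None.
        return f"{kind}_cst" if sequence[0] == kind else f"N_{kind}"

    # Assumption: drop None months, then collapse runs, so that consecutive
    # entries in `states` are the switches between B, O and P.
    states = [k for k, _ in groupby(c for c in sequence if c != "N")]
    first = (states[0], states[1])
    last = (states[-2], states[-1])

    if first == ("B", "O") and last == ("O", "P"):
        return "B_O_P"
    if first == ("P", "O") and last == ("O", "B"):
        return "P_O_B"
    if first in {("B", "O"), ("P", "O"), ("P", "B")}:
        return "_".join(first)
    return "other"


for seq in ["BBNB", "NNPP", "PPOO", "PNOOB", "OOPO"]:
    print(seq, classify_pattern(seq))
```

The more specific three-state patterns are checked before the two-state ones, since a sequence such as P, O, B would otherwise already match P_O.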
Corroboration of results by comparing with absolute frequencies
In order to corroborate our results, we repeated the procedure described above, but used only the absolute monthly frequencies of terms in the final classification step, i.e. skipping step 5. We then compared the distributions of the non-constant patterns for both approaches (see Appendix Figure 1 and Appendix Figure 2 below). From this comparison, it follows that the retrieved patterns follow a similar distribution. A more in-depth assessment of the specific examples discussed in the article confirmed that these were classified in the same categories for both approaches.
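A comparison of this kind could be set up roughly as follows. The sketch assumes that each approach yields a mapping from term to pattern label and that ‘non-constant’ means every pattern except those ending in `_cst`; the function names and the toy data are hypothetical and do not reproduce the study’s results.

```python
from collections import Counter


def pattern_shares(patterns: dict[str, str]) -> dict[str, float]:
    """Share of each non-constant pattern among all non-constant terms."""
    counts = Counter(p for p in patterns.values() if not p.endswith("_cst"))
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()} if total else {}


def same_classification(a: dict[str, str], b: dict[str, str],
                        terms: list[str]) -> bool:
    """Check whether selected terms received the same label in both runs."""
    return all(a.get(t) == b.get(t) for t in terms)


# Made-up labels for four placeholder terms under both approaches.
relative = {"t1": "P_O", "t2": "B_cst", "t3": "B_O", "t4": "P_O"}
absolute = {"t1": "P_O", "t2": "B_cst", "t3": "B_O", "t4": "N_O"}

print(pattern_shares(relative))
print(pattern_shares(absolute))
print(same_classification(relative, absolute, ["t1", "t3"]))
```

Comparing shares rather than raw counts keeps the two runs comparable even if the number of retained terms differs between them.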
Additionally, there were relatively few unexpected words (e.g. nonsense or stopwords). As such, in this particular case the additional threshold of a minimum relative frequency seems to have had relatively limited impact on the results, and a higher threshold was not required to acquire a useful dataset. However, in larger or otherwise different datasets it remains a useful tweakable safeguard against ‘noise’ via low-frequency terms.