[publication] Generally Curious: Thematically Distinct Datasets of General Threads on 4chan/pol/

Emilija Jokubauskaite and Stijn Peeters presented a dataset paper at the (virtual) AAAI ICWSM 2020 conference. They extracted different “general threads” from 4chan’s /pol/ board as a way to identify the various issue publics within that space.

Read the paper here.


Over the second half of the 2010s, the /pol/ (‘politically incorrect’) forum on the 4chan imageboard has emerged as a space within which various extreme political ideologies are discussed and cultivated, occasionally informing off-site acts of political extremism. While previous research has often studied this space as a unified whole, it is relevant to demarcate the different publics within 4chan’s /pol/ board more specifically, rather than studying it as an ‘amorphous blob’. This paper focuses specifically on ‘generals’ – recurring threads with a specific thematic focus, identified by a particular vernacular phrase or tag. By identifying them, it is possible to partition the board’s archive into multiple distinct datasets, each comprising discussions about a particular topic, such as Donald Trump, the Syria war, or British politics. We provide a dataset containing 58,841 opening posts and 13,697,738 replies to them, divided over 329 thematically distinct general thread collections. In this paper we outline our data collection and query protocol, the structure of the data and its rationale, as well as a number of suggested research uses for this new data.
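To make the idea of partitioning an archive by general tag more concrete, here is a minimal Python sketch. It is not the query protocol from the paper, which is not reproduced in this post. The slash-delimited tag pattern, the `subject`/`body`/`id` field names, and the sample posts are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumption: a general is marked by a short slash-delimited tag in the
# opening post, e.g. "/abc/". The paper's actual queries may differ.
TAG_PATTERN = re.compile(r"/([a-z0-9]{2,10})/", re.IGNORECASE)

# The board's own name looks like a tag but does not identify a general.
EXCLUDED_TAGS = {"pol"}


def general_tag(opening_post):
    """Return the first candidate tag in the subject or body, lowercased, or None."""
    text = f"{opening_post.get('subject', '')} {opening_post.get('body', '')}"
    for match in TAG_PATTERN.finditer(text):
        tag = match.group(1).lower()
        if tag not in EXCLUDED_TAGS:
            return tag
    return None


def partition_by_general(opening_posts):
    """Group opening-post ids by tag; posts without a tag are left out."""
    collections = defaultdict(list)
    for post in opening_posts:
        tag = general_tag(post)
        if tag is not None:
            collections[tag].append(post["id"])
    return dict(collections)


# Hypothetical sample data, not taken from the dataset.
sample_posts = [
    {"id": 1, "subject": "/ptg/ - President Trump General", "body": "first edition"},
    {"id": 2, "subject": "/sg/ Syria General", "body": "latest maps"},
    {"id": 3, "subject": "", "body": "an unrelated thread on /pol/"},
    {"id": 4, "subject": "/PTG/ - President Trump General", "body": "next edition"},
]

collections = partition_by_general(sample_posts)
```

Each key in `collections` then stands for one thematically distinct collection of opening posts. A real pipeline would also need to attach replies to their opening posts and handle generals identified by a phrase rather than a tag, which this sketch does not attempt.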

