An overview of 4chan/b/ archives: What is left of the Internet’s cesspool?

This blog post outlines data archives of 4chan/b/. Once a major cultural force, much of the /b/ “Random” board’s history is now lost. While some of /b/’s tumultuous past might be best forgotten, knowing what primary data exists is not only relevant for historical purposes, but also for understanding the current-day Web. I therefore outline five significant /b/ archives, after which I discuss several other noteworthy ones. I found the archival scene of /b/ to be highly volatile, with some available text data from 2006-2008, scraps of HTML captures from 2003 and 2009-2012, and more consistent archiving since 2014.

On historical research into 4chan/b/

Once a cultural nexus, now a porn-ridden afterthought, 4chan/b/ was at the forefront of online culture in the 2000s. By 2021, the imageboard had become mostly known for its far-right /pol/ “Politically Incorrect” board, rife with white supremacist activity and conspiracy theories. Before this, however, /b/ “Random” was 4chan’s centre of gravity, a space said to have helped birth “meme culture” as we now know it (boyd, 2017) and described as “the most influential cultural force most people didn’t realize they were actually quite familiar with” (Phillips, 2018, p. 15).

However, because 4chan is ephemeral – the creation of a new comment thread always comes at the expense of another one – much of /b/’s history is now lost. Interestingly, it is this ephemerality that helped make /b/ a significant cultural force to begin with. As 4chan’s founder Christopher Poole noted, the website’s lack of infrastructural memory made it a powerful “meme-factory” since it encouraged constant reposting, and with it, the repetition of memorable images and phrases that could “swim upstream through the waterfall of content” (2015). Internet memes that “survived” 4chan/b/’s self-deletion often diffused and “normiefied” (De Zeeuw et al., 2020) into well-known pieces of Web folklore, like LOLCats and Rickrolling. To preserve this selection mechanism, as well as the anything-goes, “publicly private” (Lange, 2007) sense of community, Poole fervently opposed permanent archiving (2015). In other words, a lack of archived data shaped 4chan’s culture and infused it with relevance, but also impeded historical research into the space.

Transience was thus tied to /b/’s very culture. In this text I nonetheless ask: what is left of /b/’s past? To the best of my knowledge, only one scholarly project engaged in systematic archival of 4chan/b/ (Bernstein et al., 2011). Beyond this, great secondary sources are available, like ethnographic accounts by Coleman (2014) and Phillips (2015). However, academic texts dealing directly with archiving /b/ are absent entirely. Considering the type of content that appeared on /b/, this absence is not surprising. Because of its anonymity and ephemerality, /b/ was rife with gore, suicide encouragements, “creepshots”, and, occasionally, child pornography. This naturally discouraged archival of /b/, since storing its data formed a legal liability and made archivers complicit in keeping alive content perhaps best forgotten. Several /b/-archivers therefore chose to withhold archived content or stop archiving altogether (discussed below). The Internet Archive’s Wayback Machine, one of the most important tools for conducting Web histories, disallows archiving 4chan pages entirely.

While any research into /b/’s past should take these ethical concerns to heart, digging up its past data can in some cases still be worthwhile. Not only did 4chan/b/’s disproportionate cultural influence in the early 21st Century imbue it with historical significance, revisiting the space can also illuminate a great deal about the present-day Web. For instance, it is still foggy what role /b/ played in the supposed “reactionary turn” of chan culture in the mid-2010s (Tuters and Hagen, 2020, p. 2223) and how it could form a “subcultural stock” for far-right extremists (Phillips, 2018, p. 37). Such historical questions can only partially be answered through secondary sources, especially because a current-day perspective can shine a radically different light on primary data. As Phillips and Milner (2021) argue, early “Internet culture”, of which much gravitated around /b/, was often “fetishized” while staying blind to its dark sides. For retrospective revisits, however, the first step is knowing what primary data is left of /b/.

Historically significant 4chan/b/ archives

Generally speaking, the earlier the period, the rarer /b/ archives become. This can be attributed to the aforementioned appearance of highly problematic content, which required active moderation and a policy for takedown requests. Beyond this, the high volume of /b/ posts also drove up the (at the time very high) server costs. That said, some consistent early /b/-archives did exist, notably chanarchive.org and fgts.jp. However, their content is not available anymore. In dramatic fashion, failure to remove problematic content caused fgts.jp to close, with a hard drive crash subsequently wiping all its data, while chanarchive’s administrator mysteriously went missing, after which the site also suffered irrevocable hardware problems. Hence, early /b/ archives mostly consist of scraps of data from personal collections. After 2014, /b/-archival became more consistent, largely because of lower server costs and the adoption of the imageboard archival framework Fuuka, improved versions of which (either Asagi or FoolFuuka) are used by popular current-day archives like archive.moe, desuarchive, 4plebs, and Fireden.

Name | Type | Timeframe | Threads | Posts | Retrieved from
--- | --- | --- | --- | --- | ---
Penfifteen archive | HTML and images | First half of 2004 | ~76 | | https://archive.org/details/studionyami-com_penfifteen-2012-03-05
Jason Scott archive (4Chan Threads Text Archive Ten Billion 2009) | Text in HTML and a corrupt MySQL dump | 2006-2008 | ~5 million | | https://archive.org/details/4chan_threads_archive_10_billion
4GBs of 4chan Threads (2009-2012) | HTML and images | 2009 – 2012 | ~180 | | https://archive.org/details/4chan-threads-2009-2012
4archive | Text in SQLite and images uploaded to imgur | 14 Jan 2014 – 5 May 2015 | 21,910 | ~10 million | https://github.com/4archive/4archive
archived.moe | CSV | 25 Oct 2016 – 9 Aug 2019 | ~3.8 million | ~111 million | https://archive.org/details/archivedmoe_db_201908

Table 1: Overview of the historically significant /b/ archives.

Penfifteen archive

Timeframe: First half of 2004, parts of 2009.
Type of data: HTML captures.
Amount: 76 threads from 2004, a few from 2009.
Retrieved from: https://archive.org/details/studionyami-com_penfifteen-2012-03-05
Table 2: Characteristics of the Penfifteen 4chan/b/ archive.

The earliest collection of /b/-posts I could find dates back to only a few months after 4chan’s launch in October 2003. The Penfifteen archive includes just a handful of threads, but what is there does provide an interesting window into /b/’s early days. The overall discourse of the posts feels genuinely “random”, with topics ranging from anime, cars, and yard sales to weapons, computers, and furries. Although some extreme content is present, the discussion also seems more light-hearted compared to activity from later archives. Interestingly, in 2004, Internet memes were already seen as central to 4chan’s culture, with one thread asking for popular “4chan catchphrases”, answered with both long-forgotten phrases (e.g. “THRUST VECTORING OWNS THE SKY”) and variations of LOLSpeak (e.g. “o rly”) that later exploded in popularity through commercial sites like cheezburger.com. While politics does not seem to be at the forefront of the conversation, some political content can be found, like anti-war sentiments and threads with (seemingly) ironic Hitler jokes.

Figure 1: Screenshots from the Penfifteen archive, an early collection of /b/ threads.

Jason Scott archive: “4Chan Threads Text Archive Ten Billion 2009”

Timeframe: Early April 2006 – Late November 2008
Type of data: Text in HTML and a SQL dump (FRM, MYI, MYD, and OPT files).
Amount: ~5 million extractable threads.
Retrieved from: https://archive.org/details/4chan_threads_archive_10_billion
Table 3: Characteristics of the Jason Scott 4chan/b/ archive.

The first large-scale data dump comes from Jason Scott, a well-known archiver and founder of Archive Team. The archive’s origin is not specified, but since one of the folders reads “chanarchive” and the archive’s description on archive.org lists it was “contributed [by] an anonymous donor”, it could be a copy of chanarchive.org’s database, the main /b/ archiver at the time. Scott got hold of the data in 2009 but decided not to release it, stating: “After a good amount of discussion with various mentors and friends, I’ve determined the negative results would outweigh the positive.” He eventually published the data in 2018, revealing that it includes text data (so, sensibly, no images) from most of 4chan’s boards at the time, among which /b/.

Unfortunately, I encountered multiple technical issues with the upload. In his original post on the archive, Scott notes: “I have [the archive] in XML, HTML and MYSQL formats, all of which show different parts of the data.” While separate formats would just be a practical inconvenience, all of them also have different technical problems. The XML files do not seem to be uploaded at all on the archive.org page. The MySQL files (db.opt, posts.myd, threads.myi, and threads.myd) seem corrupt; for example, trying to recover the database with MySQL’s REPAIR TABLE command results in the error “Found wrong packed record” after a few million rows. The compression for the HTML data also seems to have gone wrong; on Windows, unpacking the archive using 7zip results in an error, while doing so through the command prompt results in a “damaged tar file” error after 5 million files.
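A damaged tar archive can often still be read up to the point where it breaks. The following is a minimal Python sketch of that idea, not the procedure actually used for this post. The function names `salvage_tar` and `build_truncated_demo` are hypothetical, and the demo works on a small in-memory archive rather than the real upload.

```python
import io
import tarfile


def salvage_tar(fileobj):
    """Read regular files from a tar stream until the first read error.

    Returns (recovered, error): a dict of member name -> bytes, and the
    exception that stopped the read (None if no exception was raised).
    """
    recovered = {}
    error = None
    try:
        # "r|*" reads the archive sequentially and auto-detects compression.
        with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
            for member in archive:
                if member.isfile():
                    recovered[member.name] = archive.extractfile(member).read()
    except (tarfile.TarError, EOFError, OSError) as exc:
        error = exc
    return recovered, error


def build_truncated_demo():
    """Build a small in-memory tar and cut it off inside the third file."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w",
                      format=tarfile.USTAR_FORMAT) as archive:
        for name in ("1.html", "2.html", "3.html"):
            payload = b"x" * 1000
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    # Each member takes 512 header bytes plus 1024 padded data bytes,
    # so byte 3884 falls inside the data of the third file.
    return io.BytesIO(buffer.getvalue()[:3884])


if __name__ == "__main__":
    files, problem = salvage_tar(build_truncated_demo())
    print(sorted(files), "| stopped by:", problem)
```

Everything read before the damaged point is kept, which is roughly the situation described above: a large number of files come out intact before the extraction fails.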

However, I managed to extract ~5 million posts from the HTML files. Unfortunately, these lack metadata – the only information available is the post’s text and a field on whether it included an image. While better than nothing, a simple timestamp would have been very useful; it is currently unclear when exactly a post was made. When I contacted Scott to ask if other formats were available, he responded: “What is there is what is there”.

I converted the posts into an 8GB .csv file (available on request). The extractable threads seem to date from early April 2006 to around the end of November 2008. While timestamps are absent, the dates can roughly be interpolated from other data. The names of the HTML files correspond to the original post numbers on /b/, starting at number 6574643. Since /b/’s six millionth post was made on 23 March 2006, the seven millionth on 20 April 2006, and an early thread links to a news article from 8 April 2006, the archive likely starts in early April 2006. The last post number is 100868695, just after the 100M GET on 30 November 2008.
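The interpolation described above can be sketched in a few lines of Python. This is an illustrative approximation under two assumptions that are not in the source data: posting activity is treated as constant between two known milestones, and each milestone is placed at midnight of its date. The name `estimate_date` and the `MILESTONES` list are hypothetical.

```python
from datetime import datetime

# Known (post number, date) milestones mentioned in the text.
# Placing them at midnight is an assumption made for this sketch.
MILESTONES = [
    (6_000_000, datetime(2006, 3, 23)),
    (7_000_000, datetime(2006, 4, 20)),
    (100_000_000, datetime(2008, 11, 30)),
]


def estimate_date(post_id, milestones=MILESTONES):
    """Linearly interpolate a post date between the two surrounding milestones."""
    points = sorted(milestones)
    for (id_a, date_a), (id_b, date_b) in zip(points, points[1:]):
        if id_a <= post_id <= id_b:
            fraction = (post_id - id_a) / (id_b - id_a)
            return date_a + fraction * (date_b - date_a)
    raise ValueError("post_id lies outside the known milestones")


if __name__ == "__main__":
    # First post number in the archive.
    print(estimate_date(6574643).date())
```

With these milestones, post 6574643 is placed on 8 April 2006, in line with the early-April estimate. Estimates between the seven millionth and the hundred millionth post are far coarser, since no intermediate milestones are listed here; adding more known GET dates to the list would tighten them.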

There are some notable gaps in the data. As Figure 2 below shows, sizeable gaps between encountered thread IDs occur between IDs 34756073-34916035, 76027146-76396029, and 86369406-86704472, along with several others (see [1] for a complete list).

Figure 2: Gaps in the extracted threads from the Jason Scott archive (Y axis shows the amount of threads encountered with a specific ID range).
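Gaps like these can be found by sorting the encountered IDs and comparing neighbours. The snippet below is a minimal sketch of that approach, assuming the IDs are available as a list of integers; the function name `find_gaps` is hypothetical, and the 100,000 threshold follows the list in [1].

```python
def find_gaps(post_ids, min_gap=100_000):
    """Return (previous_id, next_id) pairs whose difference is at least min_gap."""
    ordered = sorted(set(post_ids))
    return [
        (low, high)
        for low, high in zip(ordered, ordered[1:])
        if high - low >= min_gap
    ]


if __name__ == "__main__":
    # Toy IDs for illustration only, not taken from the archive.
    sample_ids = [6574643, 6574700, 6800000, 6800010]
    print(find_gaps(sample_ids))
```

Note that a gap in IDs does not necessarily mean missing data of the same size, since post numbers on /b/ are shared between threads and replies.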

This timeframe includes several key events from /b/’s past, including infamous trolling campaigns like Battletoads prank calls and the Habbo Hotel raids in early July 2006 (see also Phillips, 2015). A cursory exploration quickly indicates many 4chan users (or: “anons”) at the time frowned upon discussing politics, but simultaneously also eagerly posted racist, misogynist, and anti-Semitic content. As the post on the bottom of Figure 3 also makes clear, this supports the argument that around this time, blatant racism could proliferate under the guise of “ironic” and “edgy” Internet culture (Phillips and Milner, 2021). In sum, the Jason Scott archive’s technical problems and lack of metadata are a shame, but it forms a uniquely large reference point of 4chan/b/ in its heyday – and thereby a valuable source for conducting Internet histories.

Figure 3: Examples of HTML files from Jason Scott’s archive, showing various forms of posting including extremely offensive and reactionary ones. Sensibly, the archive does not include images.

2009-2012 archive: “4GBs of 4chan Threads (2009-2012)”

Timeframe: 2009-2012.
Type of data: HTML captures.
Amount: ~180 /b/ threads.
Retrieved from: https://archive.org/details/4chan-threads-2009-2012
Table 4: Characteristics of the 2009-2012 4chan/b/ archive.

The next noteworthy /b/-archive stems from 2009, uploaded by the founder of Bibliotheca Anonoma, a wiki on imageboard histories. While its name – “4GBs of 4chan Threads (2009-2012)” – implies a large repository, it only contains the HTML files from around 180 4chan threads, most from /b/. Like the Penfifteen archive, what is there does provide an interesting glimpse into the history of 4chan. Some “can’t be unseen” gore posts appear, as well as vile forms of trolling, like tricking anons into wounding themselves or harassing organisers of an anti-bullying initiative. The posts also contain vernacular practices and memes that have since entered the historical lexicon of chan culture, like Girugamesh, Boxxy, Epic Fail Guy (seemingly the most popular /b/-meme around 2011), Rage comics, “demotivational posters”, various image macros, and an overuse of Caps Lock.

Apart from historically distinct objects and practices, there are also clear parallels to the reactionary and far-right activity that 4chan would come to be known for in the late 2010s. As also seen in Figure 4, these parallels include, among others, the following:

  • While most content is apolitical, the posts seem more activist or vitriolic than the Penfifteen ones. Several posts express activism surrounding SOPA/PIPA, bills seen as infringing on free speech online, and Anonymous, the loose hacker collective that rose to prominence during the period the archive covers (Coleman, 2014). Beyond this, there are some instances of posts with a libertarian bent, like a picture of someone trolling Obama into posing next to the sentence “Vote Ron Paul”. With the current-day /pol/ also featuring a libertarian contingent, the anons who transitioned from /b/ to /pol/ perhaps did not have to take a huge ideological leap.
  • One thread opens with an announcement of a mall shooting. Such terrorist threats would later also semi-frequently appear on /r9k/ and /pol/, but the archive is a reminder that this was already a longstanding practice on /b/.
  • Misogynist content is frequent. In different threads, anons lament how girls refuse to have sex with them and criticise their “emotional” reasoning. Several threads are dedicated to listing the irrationality of “women logic” or hatred towards popular girls. This enmity, often materialising as misogyny, would on 4chan later crystallise with the “incels” of /r9k/.
  • A few posts express disdain towards Reddit – a platform antagonism still very common on the current-day 4chan and /pol/.
  • The Pepe the Frog and Feels Guy memes made appearances on /b/ in 2012, long before they became mascots for /pol/ in the latter part of the 2010s.

While this archive’s data is thus limited, the saved threads do provide an interesting glimpse into the state of 4chan/b/ during an era described by Phillips (2015) as the “Golden Age of Trolling”.

Figure 4: Screenshots of /b/ posts from the 2009-2012 archive.

4archive

Timeframe: January 2014 – May 2015
Type of data: SQLite database
Amount: 9,754,504 posts.
Retrieved from: https://github.com/4archive/4archive
Table 4: Characteristics of the 4archive 4chan/b/ archive.

The most complete /b/-dataset after the Jason Scott upload is a data dump from 4archive. This website let users submit specific threads, which would then be archived. It started in January 2014 but was shut down in May 2015 because the owner stated they “felt a bit of a disgust with what was being archived” and “didn’t feel good about hosting archived content that actually upset people”. It was later reinstated and is still operational, hosted by the same administrator as archived.moe (discussed below).

Although the available data dump is a full database copy (in SQLite), it is very partial because 4archive only stored user-submitted threads. That said, it still contains almost 10 million /b/ posts, “user-sampled” on relevance. Its completeness cannot be verified since the total number of posts on /b/ from that timeframe is unknown. However, it is roughly 10% of the number of monthly posts from the “complete” archived.moe dump a year later. Despite its incompleteness, the 4archive dump is quite valuable because few other archives cover the same timeframe with this amount of data. The posts’ images are not included but were uploaded to imgur.com and can still be retrieved, if really necessary. The threads can also be opened on 4archive.org, which mimics their original appearance.

Figure 5: Screenshot of 4archive via the Wayback Machine.

Digging through the 4archive dataset is beyond the scope of this text, but in my own research I have found it of great value. It features data from a moment (2014-2015) on the cusp of many politicising events in Western politics and online culture, including Gamergate, Brexit, and Trump’s presidency. With respect to 4chan specifically, the archive moreover marks the time when /b/ started being overtaken by /pol/ as the most active and infamous board.

4ch.be / archived.moe

Timeframe: 24 October 2015 to 9 August 2019
Type of data: CSV
Amount: ~111m posts
Retrieved from: https://archive.org/details/archivedmoe_db_201908
Table 5: Characteristics of the 4ch.be and archived.moe 4chan/b/ archives.

As server costs dropped significantly in the 2010s, complete archives of /b/ became feasible. The FoolFuuka archive 4ch.be had been documenting the entirety of /b/ from 24 October 2015 onwards, but ceased operations on 27 February 2016 (like 4archive, stopping only a few months after its start). The administrator of a new archiver, archived.moe, then imported 4ch.be database snapshots and started capturing /b/ on 28 May 2016, continuing to this day. Since there is a gap between the stop of 4ch.be and the start of archived.moe, data is missing for March and April 2016. The most recent archived.moe dump I could find was made in August 2019. Archived.moe’s administrator is also behind other /b/ archives, including the current 4archive, RandomArchive.com, and TheBArchive.com.

Figure 6: Volume of posts for 4archive, 4ch.be, and archived.moe.

Considering 4chan/b/ lost much of its cultural significance in the latter part of the 2010s, what is gained in completeness here is lost in historical relevance. That said, large datasets like the archived.moe dumps can be useful for data analysis, for instance by tracing language propagation from politically radical spaces like /pol/ to less ideologically coherent ones like /b/, or vice versa.

Other noteworthy archives

4chanarchive / chanarchive

Starting in 2007, 4chanarchive was the primary archive of 4chan/b/, allowing users to submit threads to store. In March 2012 it was shut down “due to server performance and copyright issues”, but the owners of Encyclopedia Dramatica, a wiki dedicated to trolling and chan culture, reinvigorated it under the new name chanarchive. However, not long after, the site’s new owner disappeared without a trace – along with the server access keys. As a result, the website and its 500GBs of 4chan data went offline. A data dump is thus not available. However, while actual 4chan/b/ captures are not allowed on the Wayback Machine, some intermittent captures of 4chanarchive and chanarchive are available – a rare case of an archive archiving an archive.

Fgts.jp

Fgts.jp was an Asagi archive housing many different boards, including /b/. Interestingly, it only permitted access to its /b/ data after authentication because of the child pornography it could store (Figure 7). While it apparently was a complete archive, the owner experienced a hard drive crash in April 2015, which caused all the data to be lost and led the owner to shut down the website.

Figure 7: Internet Archive screenshot of the /b/ landing page.

Fybertech archive

This archive claims to have “one of the largest collections of 2008-2014 4chan threads, outside of the fallen Chanarchive and the modern 4archive”. According to Archive Team, it “archives any thread referenced in its IRC channel, apparently”. However, when I explored it, the available data dump contained just six /b/ threads, along with over a thousand threads from other boards like /co/, /vg/, and /v/.

4chan-threads-2015-10

This somewhat mysterious archive was uploaded by Bibliotheca Anonoma but has no description. In my case, the 20GBs of data took almost two weeks to download. The data is compressed as a .tar.gz file. When extracting, I ran into several errors, suggesting the file is corrupted. (On Windows, extracting the file through 7zip results in an “unexpected end of data” error, while using bash commands returns the message “Damaged .tar archive”.) I managed to extract two threads before encountering problems, which reveal that the data are 4chan page captures (the HTML files including the CSS, JavaScript, and media objects) from October 2015. Much of the archive is thus seemingly inaccessible. Its text data is moreover likely present in the 4ch.be archive, which covers the same timeframe.

MIT archive

For a 2011 paper on 4chan/b/ (Bernstein et al., 2011), MIT researchers collected 5,576,096 posts in 482,559 threads dating from 19 July to 2 August 2010. Unfortunately, when I contacted the authors in November 2020, those who responded did not have access to the data anymore.

Reddit r/4chan

With users accruing screenshots of 4chan threads since 2008, the subreddit r/4chan effectively functions as a valuable archive of “best of” 4chan content, including many posts from /b/. The Reddit discussions on these screenshots are moreover a useful secondary source because they provide meta-commentary on the historical state of 4chan.

Conclusion

In sum, the archiving scene of the chaotic /b/ proved to be a Wild West in itself. Attempts to establish permanence to the transient /b/ proved ephemeral, complicated by ethical quandaries, hardware crashes, missing administrators, data gaps, and technical issues. However, with archives from 2006-2008, 2014 onwards, and earlier scraps from smaller archives, reconstructing parts of 4chan/b/’s history through primary data is possible, especially in combination with secondary sources.

To repeat, in many cases, it may be best to refrain from resurfacing 4chan/b/’s forgotten content. As I witnessed myself, browsing through archives also means encountering depictions of gore and highly offensive material. Nonetheless, the /b/ archives can be valuable in studying the Web of the past – and what elements have remained.

Acknowledgements

For this text, the 4chan archival page by Archive Team, which provides further and more extensive information on archives for other 4chan boards, was a major help. Other consulted resources include the Wayback Machine and Bibliotheca Anonoma, a wiki on imageboard history whose maintainers are responsible for many of the archives discussed above.

Notes

[1]

From | To
--- | ---
6574643 | 100868695
24708500 | 24859880
25003355 | 25186294
34756073 | 34916035
35330966 | 35927498
41194464 | 41378074
49078450 | 49194981
62674305 | 63001428
70565517 | 70763555
75731764 | 76000346
76027146 | 76396029
86369406 | 86704472
88430622 | 88587748
88883545 | 89032138
99831314 | 100868695

Table 6: Gaps of at least 100,000 integers between post IDs in the Jason Scott data.

Literature

Michael Bernstein, Andrés Monroy-Hernández, Drew Harry, Paul André, Katrina Panovich, and Greg Vargas, 2011. “4chan and /b/: An Analysis of Anonymity and Ephemerality in a Large Online Community.” Proceedings of the International AAAI Conference on Web and Social Media, volume 5, number 1 at https://ojs.aaai.org/index.php/ICWSM/article/view/14134.

danah boyd, 2017. “Hacking the Attention Economy.” Data & Society: Points (5 January), at https://points.datasociety.net/hacking-the-attention-economy-9fa1daca7a37, accessed 20 May 2021.

Niels Brügger, Gerard Goggin, Ian Milligan, and Valérie Schafer, 2019. Internet Histories. New York: Routledge.

Gabriella Coleman, 2014. Hacker, Hoaxer, Whistleblower, Spy: The Many Faces of Anonymous. London: Verso Books.

Brett J. Fujioka and Julia R. DeCook, 2021. “Digital cynical romanticism: Japan’s 2channel and the precursors to online extremist cultures.” Internet Histories, at https://doi.org/10.1080/24701475.2021.1919966.

Patricia G. Lange, 2007. “Publicly Private and Privately Public: Social Networking on YouTube.” Journal of Computer-Mediated Communication, volume 13, number 1, pp. 361–80.

Whitney Phillips, 2015. This Is Why We Can’t Have Nice Things: Mapping the Relationship between Online Trolling and Mainstream Culture. Cambridge, Mass.: MIT Press.

Whitney Phillips, 2018. “The Oxygen of Amplification: Better Practices for Reporting on Extremists, Antagonists, and Manipulators Online.” Data & Society (22 May), at https://datasociety.net/wp-content/uploads/2018/05/1_PART_1_Oxygen_of_Amplification_DS.pdf, accessed 20 May 2020.

Whitney Phillips, Gabriella Coleman, and Jessica Beyer, 2017. “Trolling Scholars Debunk the Idea That the Alt-Right’s Shitposters Have Magic Powers.” Vice Motherboard (22 March), at https://www.vice.com/en/article/pgw88g/trolling-scholars-debunk-the-idea-that-the-alt-rights-shitposters-have-magic-powers-motherboard, accessed 20 May 2021.

Whitney Phillips and Ryan Milner, 2021. You Are Here: A Field Guide for Navigating Polarized Speech, Conspiracy Theories, and Our Polluted Media Landscape. Cambridge, Mass.: MIT Press.

Christopher Poole, 2015. “moot’s final 4chan Q&A.” YouTube (23 January), at https://www.youtube.com/watch?v=XYUKJBZuUig, accessed 20 May 2021.

Marc Tuters and Sal Hagen, 2020. “(((They))) Rule: Memetic Antagonism and Nebulous Othering on 4chan.” New Media & Society, volume 22, number 12, pp. 2218–2237.

Daniël de Zeeuw, Sal Hagen, Stijn Peeters, and Emilija Jokubauskaitė, 2020. “Tracing Normiefication: A Cross-Platform Analysis of the QAnon Conspiracy Theory.” First Monday, volume 25, number 11 (2 November), at https://dx.doi.org/10.5210/fm.v25i11.10643.
