San, Toeknee. "Recursive linking in the archive" http://metatalk.metafilter.com/20364/ February 17, 2011 6:22 PM

How big is the entire Metafilter folder / database, if it were an archive quality file? Is there an archive, or plans for long term preservation, or is Archive.org the plan? Yes, I know that it's hard to archive a living thing, but I'm wondering what might be happening to ensure preservation.

Totally goofy followup, inspired by the solar flares. Is there a print version of any of it? Should there be?
posted by Toekneesan to MetaFilter-Related at 6:22 PM (42 comments total) 2 users marked this as a favorite

Damn you, screamingnotlaughing. Now my title is broken.
posted by Toekneesan at 6:25 PM on February 17, 2011


We have redundant backups both onsite and offsite. Solar flares be damned.
posted by mathowie (staff) at 6:27 PM on February 17, 2011 [5 favorites]


Don't anger the sun god, mathowie!
posted by jessamyn (staff) at 6:29 PM on February 17, 2011 [11 favorites]


We should send a nuclear warhead into the Sun. It's been showing off for far too long!
posted by mathowie (staff) at 6:36 PM on February 17, 2011 [7 favorites]


Yeah, the "it not exploding into non-existence" plan is regular and redundant backups. As far as a print version goes, the corpus work I've been doing based on 1999-2010 comment activity comes out to about 457 million words, so, well, let's just say that'd put your parents' Britannica set to shame in terms of shelf space if such a thing existed.
posted by cortex (staff) at 6:37 PM on February 17, 2011 [1 favorite]


My Britannica brings all the nerds to the yard....
posted by jessamyn (staff) at 6:38 PM on February 17, 2011 [24 favorites]


See that solar flare? Mathowie don't give a shit.
posted by Brandon Blatcher at 6:40 PM on February 17, 2011 [7 favorites]


I just checked a few random days in web.archive.org (Way Back Machine) to make sure it's archiving for fun... and found mathowie tacitly endorsing high schoolers getting drunk on high quality beer: thread 203 (tagline it archived: "it's your web, log it").
posted by skynxnex at 6:40 PM on February 17, 2011


Jesus, it's a mod convention.
posted by Brandon Blatcher at 6:41 PM on February 17, 2011


Well when cortex is one of the mods, it's a Jesus convention.
posted by hincandenza at 6:50 PM on February 17, 2011


As of May 09 the database was 12GB including indices so I would imagine it's probably at least around 15 GB or so now.
posted by Rhomboid at 6:53 PM on February 17, 2011


I don't think the question is about backups. It's about archives, which is totally different. Like, 100 years from now will someone be able to look at what Life On The Early Internet was like by viewing MetaFilter threads?

Backups are something the owner can restore. Archives are something anyone can view even after the owner stops existing.
posted by DU at 6:54 PM on February 17, 2011


Hello future visitors! Sorry about the lack of an ozone layer, we thought the scientists were kidding!
posted by Brandon Blatcher at 6:58 PM on February 17, 2011 [1 favorite]


Well, if Twitter gets a place in Library of Congress...

I mean, you know they won't store 4chan, because that's barely legal as it is. And the huge gap on the shelf will offer plenty of space for MeFi.

Actually, now I'm curious. What's the rate the internet "expands" at, in terms of publicly-accessible data on the web, vs. the rate that non-volatile storage lowers in price per bit? Like, I presume it gets more and more expensive to archive the internet each year, but could that change so that at some strange distant era, it's cheaper to store the internet each year?

Granted, I know this is hard to predict, and there are probably spikes as new specifications emerge that enable new media. For example, flash video playback and broadband penetration made it possible for Youtube to exist. I bet if I were to track 1995-1999 expansion, it'd look a lot different than 2000-2005 expansion of the internet.
posted by mccarty.tim at 7:02 PM on February 17, 2011
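
Editor's note: the question above can be framed as a simple compounding model. If the amount of public data grows by a fraction g each year and the price per bit falls by a fraction d each year, the cost of storing one full snapshot changes by a factor of (1 + g)(1 - d) per year, so archiving gets cheaper year over year whenever that factor drops below 1. The sketch below assumes constant rates, which the comment itself notes is unrealistic; the function name and the sample rates are hypothetical, not measurements.

```python
def yearly_cost_ratio(growth_rate: float, price_decline_rate: float) -> float:
    """Ratio of next year's full-snapshot storage cost to this year's.

    growth_rate: fractional yearly growth in stored data (0.40 means +40%).
    price_decline_rate: fractional yearly drop in price per bit (0.30 means -30%).
    """
    return (1.0 + growth_rate) * (1.0 - price_decline_rate)


# Hypothetical rates chosen only to illustrate both outcomes.
for growth, decline in [(0.40, 0.30), (0.50, 0.20)]:
    ratio = yearly_cost_ratio(growth, decline)
    trend = "cheaper" if ratio < 1.0 else "more expensive"
    print(f"growth {growth:.0%}, price drop {decline:.0%}: {trend} each year")
```

Under this model the tipping point is d > g / (1 + g): with 40% yearly growth, prices would need to fall by a bit under 29% a year for the total bill to shrink.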


I wonder this about a lot of websites. They just vanish, *poof* and we're lucky if there's a google cache. It seems wrong somehow, and Internet Archive doesn't/can't capture it all.

Either way, these are questions I'd love to see a digital archivist take a stab at. If only there were a place on the net where tons of MLIS students/grads hung out...
posted by lesli212 at 7:08 PM on February 17, 2011


the corpus work I've been doing based on 1999-2010 comment activity comes out to about 457 million words --- Is that from all of us, or just you?
posted by crunchland at 7:16 PM on February 17, 2011 [3 favorites]


So I did a few quick calculations. For reference, I work for a fifty-year-old publisher that has published about 2,000 books, or about half a million pages. 500 words per page is a useful average, so in 50 years of book publishing, we've output a bit less than half of what this site has done in a little over a decade. As for what that would look like, the shelves to hold that many books would take the entire wall space of two medium-sized living rooms.
posted by Toekneesan at 7:29 PM on February 17, 2011 [1 favorite]
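
Editor's note: the arithmetic in the comment above can be checked with the round figures quoted in the thread (half a million pages, 500 words per page, and cortex's 457 million words). Taken literally, those round numbers put the publisher at roughly 55% of the site's word count, so the "bit less than half" presumably rests on less rounded figures than the ones stated. The constant names are made up for illustration.

```python
# Round figures quoted in the thread; the real totals are "about" these values.
PAGES_PUBLISHED = 500_000      # "about half a million pages"
WORDS_PER_PAGE = 500           # "a useful average"
SITE_WORDS = 457_000_000       # 1999-2010 comment corpus, per cortex

publisher_words = PAGES_PUBLISHED * WORDS_PER_PAGE
share = publisher_words / SITE_WORDS

print(f"{publisher_words:,} words, {share:.0%} of the site corpus")
```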


Solar flares be damned.

I have taken the precaution of slathering my monitor with SPF 100 Sunblock. Everything's a little blurry but at least I won't get screen burn!
posted by amyms at 7:54 PM on February 17, 2011


I've secretly spent the last 8 years of my life carefully etching MetaFilter into stone tablets in Hittite Cuneiform complete with an intentional Rosetta stone as a key and translation guide, but then I hit thread 9622 and that infamous mushroom thread.

Do you have any idea how hard it is to etch an animated GIF of a pissing elephant into a stone tablet?
posted by loquacious at 7:59 PM on February 17, 2011 [4 favorites]


It sounds like this could be a job for Archive Team. They tackled the archiving of GeoCities before it went down, whose corpus was both a lot larger (640+ GB) and a lot less organized than Mefi is.
posted by Rhaomi at 8:06 PM on February 17, 2011 [1 favorite]


mccarty.tim: " I mean, you know they won't store 4chan, because that's barely legal as it is. And the huge gap on the shelf will offer plenty of space for MeFi."

4chan doesn't even store 4chan.
posted by zarq at 8:15 PM on February 17, 2011


4chanarchive does back up popular/historic threads though, and is searchable. And it's for all the boards, not just /b/, so it's actually useful on occasion.
posted by Rhaomi at 8:17 PM on February 17, 2011 [2 favorites]


Wait, is no one else printing out every update of each new day??
With each new membership the five dollars is used to print out and mail one page of comments, that one page fits together with the page of the user sequentially after you (in this story they don't trust the already signed up users, who mostly just want to keep the lawns watered these days. Mostly), then the user must go out to the world and find the subsequent user, without using any words, only pictures, or gestures, and glue your pages together, each, continuing this process like a katamari thing, until there is a mass of humans and printouts and glue.
-on preview, whew, loquacious has it covered, besides, tablets last longer than paper and glue.

If someone started to use something like the Memento Project to create a Memento-filter; a place without the boundaries of time... well, that would be incredimazing... but also, many down line things likely never got archived, and the "simple" architecture of Memento seems to be mostly reliant on Archive. It would still be a good tool for finding backups of where many links went to (like getting the CNN page, as it was at the time a particular post was made).

Still would be cool to implement it... I have zero knowledge on how the uri/reference thing works in reality... all I know is that I installed a Firefox plugin, and can choose a date or time, and access a site as it was at that point in time (or the closest available backup), and it is really neat to me. Also it allowed me to update like 50 dead links in about 5 minutes. [did anyone else try out installing the plugin?]

The coolest part of memento is the way many sites are able to have layered support for it... like, I go to a site as it was in June 06, and then click a link that existed there, rather than going nowhere, it parses the link, and somehow brings me to the downstream page, as it was in June 06 (or nearest backup point).
posted by infinite intimation at 8:20 PM on February 17, 2011
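
Editor's note: on how the "uri/reference thing" works, Memento is built on ordinary HTTP. A client asks a "TimeGate" for a page and states the moment it wants in an `Accept-Datetime` request header, and the TimeGate redirects to the archived copy closest to that moment. The sketch below only builds such a request and sends nothing over the network. The TimeGate address is a hypothetical placeholder and `memento_request` is an illustrative helper, not part of any plugin or library.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical TimeGate base URL; a real one would come from an archive's documentation.
TIMEGATE_BASE = "http://timegate.example/timegate/"


def memento_request(target_url: str, when: datetime) -> tuple[str, dict[str, str]]:
    """Return the (url, headers) pair for a datetime-negotiated lookup."""
    # Accept-Datetime uses the same HTTP date format as other HTTP headers, always in GMT.
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_BASE + target_url, {"Accept-Datetime": stamp}


url, headers = memento_request(
    "http://www.metafilter.com/",
    datetime(2006, 6, 15, tzinfo=timezone.utc),
)
print(url)
print(headers["Accept-Datetime"])
```

Following links "as they were in June 06" then amounts to repeating the same lookup, with the same date, for each link on the archived page.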


Is that from all of us, or just you?

Oh, I've only typed about 1.7 million words here myself. I know it feels like more sometimes.
posted by cortex (staff) at 8:27 PM on February 17, 2011


There was a small community that was lobbying to get funding for a PET-Scanner, and the advocates of this purchase were on the radio explaining the importance of wide-access to this type of technology, hearing the intro-bumper, describing a "small-town defending their need for a PET-Scanner, and how vital it would be to their health".

For a few awkward seconds I could not decouple why they would possibly want to put people in their pet scanners or how... Then I parsed it. I'm still unsure if I felt good, or bad.
posted by infinite intimation at 8:34 PM on February 17, 2011


Oh, I've only typed about 1.7 million words here myself. I know it feels like more sometimes.

Dare I ask? I dare. Any chance it's a trivial string to run to get my total word count?
posted by loquacious at 8:57 PM on February 17, 2011


894,974 words, loq.
posted by cortex (staff) at 9:13 PM on February 17, 2011


Remember that you can export your own comments. You have the power!
posted by grouse at 10:01 PM on February 17, 2011


894,974 words, loq.

Damn, I figured I'd be a millionaire by now.

*successfully resists copypasting the digits of pi. again.*
posted by loquacious at 10:35 PM on February 17, 2011


This is deathless prose?
posted by Cranberry at 12:22 AM on February 18, 2011


a million monkeys...
posted by quonsar II: smock fishpants and the temple of foon at 4:18 AM on February 18, 2011


... let's just say that'd put your parents' Britannica set to shame in terms of shelf space if such a thing existed.

I have *got* to lose some weight.
posted by Meta Filter at 5:30 AM on February 18, 2011


*successfully resists copypasting the digits of pi. again.*

That'd probably be counted as 1 word, wouldn't it? Or maybe 2 because of the decimal point.
posted by FishBike at 5:33 AM on February 18, 2011


As of right now, I have it all committed to memory.
posted by Sailormom at 6:18 AM on February 18, 2011


894,974 words, loq

Does that filter out URLs and hypertext coding?

If not, the number's probably overinflated, no?
posted by zarq at 6:18 AM on February 18, 2011


I'm at work and we just got a new toner cartridge so I'm going to print MetaFilter right now. Nobody post anything for a few minutes, ok? I'll let you know when it's done.
posted by dirtdirt at 6:39 AM on February 18, 2011


DU makes a good point distinguishing between backups and archives. Long term preservation might be complicated a little bit by copyright issues. Because there has been no explicit copyright policy except the "All posts are © their original authors" in the footer, even 100 years from now most of the comments and posts will probably still be under copyright. So my totally non-lawyer guess would be that "officially" re-publishing the site elsewhere would involve some sort of risk of griefer legal action.

Unofficially, with a little planning or a little luck MeFi will live forever in the lawless libertine heart of the internet where it belongs.
posted by XMLicious at 6:47 AM on February 18, 2011


I wonder this about a lot of websites. They just vanish, *poof* and we're lucky if there's a google cache. It seems wrong somehow, and Internet Archive doesn't/can't capture it all.

Yeah, every now and then at conferences I hear museum and archive people getting concerned about how so much on the 'net just goes poof. Not only is it sometimes not archived, even if it's backed up or mirrored online somewhere, it's not indexed like a true archive (though it's at least more searchable than a non-digitized archive). Finding aids, and so on. Seems like a worthy concern - at least for sites with worthy content.

Personally it seems to me that MeFi will be an unusually rich and valuable document for future social historians. There's just tons and tons and tons of data here about the experience of daily life, ideas in debate, tastes and fashions, trends and issues. For many major events it has the potential to provide a more cohesive, chronological and contiguous narrative of unfolding news events than you will find anywhere else in a single source.
posted by Miko at 7:05 AM on February 18, 2011 [2 favorites]


Does that filter out URLs and hypertext coding?

It does. For the work I've been doing, I'm stripping markup entirely to get at the bare text of comments. However, it does not attempt to remove quoted text, so there's some amount of inflation going on. On the other hand, it does not include post text from any subsite, or comment text from Projects or IRL, so it's under the mark in that respect.

As far as long-term archiving, it's something I certainly feel like should happen on principle, yeah; if we get to a point where the future viability of mefi as a going concern is actually in question, one of my priorities would be to get the public-facing content of the site (i.e. comment and post text to go along with what's already in the infodump) exported from the DB into something more universally readable. That'd probably just mean a large collection of structured plaintext flat files.

Fortunately we're not in a position right now where anything other than a coordinated nuclear strike could suddenly make that impossible, and I will be honest with you: at that point I think we may have bigger problems.
posted by cortex (staff) at 7:36 AM on February 18, 2011 [1 favorite]
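
Editor's note: a minimal sketch of the kind of counting described above, where markup is stripped so that tags and link URLs do not inflate the total. This is not cortex's actual code, which the thread does not show; it assumes comments are stored as HTML and that a "word" is any whitespace-separated token. Like the approach described, it makes no attempt to remove quoted text.

```python
from html.parser import HTMLParser

# Tags treated as word boundaries, so "one<br>two" counts as two words.
BREAK_TAGS = {"br", "p", "blockquote", "li"}


class TextOnly(HTMLParser):
    """Collects text nodes only; tags and attributes such as href are dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BREAK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def count_words(comment_html: str) -> int:
    """Count whitespace-separated tokens in the bare text of one comment."""
    parser = TextOnly()
    parser.feed(comment_html)
    parser.close()  # flush any text still buffered by the parser
    return len("".join(parser.parts).split())


sample = 'See <a href="http://example.com/a/b">this thread</a>.<br>Quoted <i>text</i> still counts.'
print(count_words(sample))
```

With this definition a long run of digits with no spaces counts as a single word, decimal point included.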


Fortunately we're not in a position right now where anything other than a coordinated nuclear strike could suddenly make that impossible, and I will be honest with you: at that point I think we may have bigger problems.
posted by cortex


FWIW, I feel the opposite. I've often thought about what would happen should humanity choose to reset civilization via catastrophic nuclear war. If I was ever on a scrappy team of scientific minds trying to figure out ways to mitigate the damage, the preservation of information would be of paramount importance to me. I guess I'd like to think that we could not go the way of Alexandria or Carthage (if my history is accurate) and preserve the accumulated knowledge of humankind so that our underground-dwelling, mutant-flipper survivors could get a jump start on breaking back out of the Middle Ages.

Should a coordinated nuclear strike occur, saving MetaFilter for the future of humanity would be very important indeed. To me.
posted by lazaruslong at 12:03 PM on February 19, 2011


After the coordinated nuclear strike I am most certainly gonna need to know how to hide a body or two.
posted by Mick at 12:04 PM on February 19, 2011 [2 favorites]


cortex: " It does. For the work I've been doing, I'm stripping markup entirely to get at the bare text of comments. However, it does not attempt to remove quoted text, so there's some amount of inflation going on. On the other hand, it does not include post text from any subsite, or comment text from Projects or IRL, so it's under the mark in that respect."

Interesting! Thanks for explaining.
posted by zarq at 12:05 PM on February 22, 2011

