The Metafilter Infodump is back, and moderately better than ever! Statistics nerds rejoice!
posted by cortex to MetaFilter-Related at 1:55 PM (502 comments total)
44 users marked this as a favorite
For those who don't know what I'm talking about, the Infodump is a collection of files generated from the Metafilter database, containing a wealth of vital stats about posts, comments, tags, favorites, and so on.
Brief history: after being launched in January 2008, the Infodump spent a happy year being nerdly and occasionally updated before we took it down (along with some other things on the stuff.mf subdomain) as a safety measure after the site got hacked in January of this year.
It is not a full-text dump of the site (that'd be gigantic, and is not something we're necessarily comfortable with regardless), but it does contain most of the quantitative information available about activity on the site, enough for folks to have done a variety of analyses with it.
For those of you who are already familiar with it, note that there are a few neat additions (and some important format-change caveats if you have existing automated tasks that import older versions of these files):
There are four new "Post Titles" files, listing thread id and title text for Mefi, Ask, Meta and Music.
There is now a tag data file for Metatalk, as we've added tags there since the Infodump originally launched last year.
The comment data files now include two new columns: (1) favorite count, to save folks from having to cross-reference against the fave data file for a simple count check, and (2) best answer boolean so that folks who want to look into Best Answer-related stuff or create ad hoc search tools for same can now do so. Note that this BA column is present in all comment data files for format consistency but has no meaningful content except in the askme data.
The post data files also have a new column, for category code. There's meaningful data for Ask, Meta, and Music; Mefi has no category data, but includes a dummy column just for, again, format consistency.
There is now an all-in-one zip file available for folks who intend to download most/all of the files regardless, to reduce the amount of file/transfer wrangling required.
I've done a bunch of updating and expanding of the wiki's Infodump page, to note current format stuff. Big, big thanks to Pronoiac for doing a fantastic job of creating that page in the first place.
Invisible but nice: the process for generating the Infodump is now significantly streamlined, which makes regenerating it by hand very easy, but beyond that pb is planning to set it up as an automated task so it'll likely just run weekly in the middle of the night without any human intervention at all. And then: Skynet.
The Infodump page is now automatically updated when stats are regenerated, and includes a timestamp and up-to-date file-size info for the downloadable zips. The page is also less ugly than it used to be.
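For anyone wanting to poke at the new comment-file columns, here is a rough Python sketch of tallying favorites and best answers per thread. It is only an illustration: the tab delimiter, the header row, the column names (`postid`, `faves`, `best_answer`) and the "1"/"0" encoding of the boolean are all assumptions, and the sample rows are made up. Check the wiki's Infodump page for the actual format before relying on any of it.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; the real header may differ.
POST_COL = "postid"
FAVES_COL = "faves"
BEST_COL = "best_answer"


def summarize_comments(handle, delimiter="\t"):
    """Return (favorites per post, best-answer count per post) from a comment data file."""
    faves_per_post = Counter()
    best_per_post = Counter()
    reader = csv.DictReader(handle, delimiter=delimiter)
    for row in reader:
        post = row[POST_COL]
        # The favorite count column saves a cross-reference against the fave file.
        faves_per_post[post] += int(row[FAVES_COL])
        # Assumes the best answer boolean is stored as "1" or "0".
        if row[BEST_COL] == "1":
            best_per_post[post] += 1
    return faves_per_post, best_per_post


# Made-up sample rows, only to show the assumed shape of the input.
SAMPLE = (
    "commentid\tpostid\tfaves\tbest_answer\n"
    "1\t100\t3\t0\n"
    "2\t100\t0\t1\n"
    "3\t101\t5\t0\n"
)

if __name__ == "__main__":
    faves, best = summarize_comments(io.StringIO(SAMPLE))
    print(dict(faves))
    print(dict(best))
```

To run it against a downloaded file instead of the sample, pass an open file handle in place of the `io.StringIO` object. Since the best answer column carries no meaningful content outside the askme data, the second tally is only worth looking at for Ask files.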
So! Get your nerd on. If folks have specific ideas/requests for additions, let me know; there're some things we just aren't going to do (full text dumps, flag data), but there may be other things worth adding going forward. I'm considering doing some sort of word-frequency and collocation tables, for example, so that folks could dig into some of that in the absence of an entire corpus.
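To give a sense of what word-frequency and collocation tables could look like, here is a minimal Python sketch. It is not how any eventual Infodump tables would be generated; the tokenizer, the function names, and the choice to count collocations as raw adjacent word pairs (rather than any statistical association measure) are assumptions made purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_tables(documents):
    """Return (word counts, adjacent-pair counts) over an iterable of strings."""
    words = Counter()
    pairs = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        words.update(tokens)
        # Pair each token with the one that follows it in the same document,
        # so pairs never span two separate documents.
        pairs.update(zip(tokens, tokens[1:]))
    return words, pairs


if __name__ == "__main__":
    # Made-up input strings standing in for comment text.
    docs = ["Statistics nerds rejoice", "nerds rejoice again"]
    words, pairs = frequency_tables(docs)
    print(words.most_common(3))
    print(pairs.most_common(3))
```

Tables like these only expose aggregate counts, which is the point: they let folks look at vocabulary patterns without the full text being published.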
If you've done work with the Infodump (or related mefi datawankery) that isn't currently noted on the MetaAnalysis wiki page, please mention it here or in mefimail or just add it to the wiki yourself if you're into that sort of thing.