The perfect gift for the computational linguist in your life: Metafilter Frequency Tables! Finally you can know definitively how many times words such as "metafilter", "fucknozzle", or "mctootypoots" have been used on the site. (A: 127,484 times, 28 times, and once.)
If you're unfamiliar with frequency tables, the idea is pretty simple: take a pile of text, parse it into individual words, count up how often each of those words occurs, and write the counts out into a big table. Frequency tables are a nice way to examine language usage in a specific context, and to compare language usage between different contexts (through differences in time or differences in place, for example).
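For the curious, here's a minimal sketch of that parse-and-count idea in Python. It is not the script that produced these tables; the tokenizing rule (lowercase everything, treat runs of letters with optional internal apostrophes as words) and the name `frequency_table` are assumptions made just for illustration.

```python
import re
from collections import Counter

# Assumed tokenizer: a "word" is a run of letters, optionally with
# internal apostrophes (so "isn't" stays one word).
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def frequency_table(text):
    """Return a Counter mapping each lowercased word to how often it occurs."""
    return Counter(WORD_RE.findall(text.lower()))

if __name__ == "__main__":
    sample = "Metafilter: it's the best of the web. The best!"
    table = frequency_table(sample)
    # Write the table out, most frequent words first.
    for word, count in table.most_common():
        print(f"{count}\t{word}")
```

Comparing two contexts is then a matter of building one table per pile of text and looking at how a given word's count (or its share of the total) differs between them.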
These tables look at the comments of various subsites (Mefi, Askme, Metatalk, and Music, just like the Infodump data) across various slices of time. So you can look at the biggest of big pictures by examining the file in the Complete section for all subsites, which contains frequency data for over 457 million words of comment text over eleven and a half years; you can also examine just the shift in vocabulary on the blue throughout the second week of September, 2001 by looking at the 2001 mefi files from the Daily section.
If you want to play with this but are intimidated by the nerdery, the easiest way to poke around is to grab one of the Complete files and just use Find in it with your favorite text editor. For anyone with a unix/linux/osx box, you can use the grep command to produce some filtered files really easily, by doing the following in the directory where you've downloaded a file:
grep butts allsites--1999-01-01--2011-01-01.txt | less
Which will give you a list of only those words that contain the string "butts" somewhere inside, fed to you a page at a time.
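If you don't have grep handy, roughly the same filtering can be sketched in Python. This assumes only that the table is a plain text file read line by line (the UTF-8 encoding is a guess, with undecodable bytes replaced); `matching_lines` and `filter_file` are made-up names for this example.

```python
def matching_lines(lines, needle):
    """Return the lines that contain needle anywhere (case-sensitive, like plain grep)."""
    return [line for line in lines if needle in line]

def filter_file(path, needle):
    """Read a table file and keep only the lines containing needle."""
    # Encoding is an assumption; errors="replace" avoids crashing on odd bytes.
    with open(path, encoding="utf-8", errors="replace") as table:
        return matching_lines(table, needle)
```

Calling `filter_file("allsites--1999-01-01--2011-01-01.txt", "butts")` and printing each returned line should give about the same list as the grep command above, minus the paging.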
If you just want to look through a few fun pre-computed lists produced that way, you're in luck: metafilter, mefi, filter, tater, -ism, -ist, fuck, shit, GRAR, and hurf and/or durf.
I've been working on this off and on for the last few weeks and feel that it's in pretty good shape; that said, any feedback on utility or presentation or elaborations of this is totally welcome. I've documented the format for the files and the basics of my methodology on a new wiki page for the project.
In general I'd like to expand the contents of this Metafilter Corpus project beyond just these frequency tables; we have a lot of numbers available already via the Infodump, but there's very little actual language data available there, and these tables are a first step toward examining the actual words that make this place what it is. I intend to explore doing ngram stuff (two- or three- or five-word sequences with frequency information), but I'd be interested in other specific ideas folks have as well.
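To make the ngram idea concrete, here's a rough Python sketch of counting n-word sequences. It's a guess at one way this could work, not a description of anything already built; the tokenizer and the name `ngram_table` are assumptions, and it naively lets sequences run across sentence and comment boundaries.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase runs of letters, with optional internal apostrophes.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def ngram_table(text, n):
    """Return a Counter mapping each n-word sequence to how often it occurs."""
    words = WORD_RE.findall(text.lower())
    # Zip the word list against copies of itself shifted by 1..n-1 positions,
    # which yields every run of n consecutive words.
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)
```

With `n=2` this counts word pairs, with `n=3` triples, and so on; the resulting tables grow much faster than single-word ones, since most longer sequences show up only once.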
posted by cortex to MetaFilter-Related at 8:51 AM (59 comments total)
posted by box at 8:56 AM on January 4, 2011