Introducing the Metafilter Frequency Tables January 4, 2011 8:51 AM

The perfect gift for the computational linguist in your life: Metafilter Frequency Tables! Finally you can know definitively how many times words such as "metafilter", "fucknozzle", or "mctootypoots" have been used on the site. (A: 127,484 times, 28 times, and once.)

If you're unfamiliar with frequency tables, the idea is pretty simple: take a pile of text, parse it into individual words, count up how often each of those words occurs, and write the counts out into a big table. Frequency tables are a nice way to examine language usage in a specific context, and to compare language usage between different contexts (across time or between places, for example).
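In very rough terms, and purely as an illustration rather than the actual script behind these tables, the core of it is a one-pass word count, something like:

#!/usr/bin/perl
# toy word-frequency counter: read text on stdin, print "count<TAB>word"
# lines sorted by frequency. Illustrative only; the real tokenizing rules
# are messier (see the wiki page).
use strict;
use warnings;

my %count;
while (my $line = <STDIN>) {
    $line = lc $line;    # the real tables lowercase everything too
    $count{$_}++ for $line =~ /[a-z']+/g;
}
print "$count{$_}\t$_\n" for sort { $count{$b} <=> $count{$a} } keys %count;

Pointed at a file of comment text, that spits out a crude version of the same kind of table.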

These tables look at the comments of various subsites (Mefi, Askme, Metatalk, and Music, just like the Infodump data) across various slices of time. So you can look at the biggest of big pictures by examining the file in the Complete section for all subsites, which contains frequency data for over 457 million words of comment text over eleven and a half years; you can also examine just the shift in vocabulary on the blue throughout the second week of September, 2001 by looking at the 2001 mefi files from the Daily section.

If you want to play with this but are intimidated by the nerdery, the easiest way to poke around is to grab one of the Complete files, open it in your favorite text editor, and just use Find. If you've got a unix/linux/osx box, you can use the grep command to produce filtered files really easily by doing the following in the directory where you've downloaded a file:
grep butts allsites--1999-01-01--2011-01-01.txt | less
That will give you a list of just the words that contain the string "butts" somewhere in them, fed to you a page at a time.

If you just want to look through a few fun pre-computed lists produced that way, you're in luck: metafilter, mefi, filter, tater, -ism, -ist, fuck, shit, GRAR, and hurf and/or durf.

I've been working on this off and on for the last few weeks and feel that it's in pretty good shape; that said, any feedback on utility or presentation or elaborations of this is totally welcome. I've documented the format for the files and the basics of my methodology on a new wiki page for the project.

In general I'd like to grow this Metafilter Corpus project beyond just these frequency tables; we have a lot of numbers available already via the Infodump, but there's very little actual language data there, and these tables are a first step toward examining the actual words that make this place what it is. I intend to explore doing ngram stuff (two- or three- or five-word sequences with frequency information), but I'd be interested in other specific ideas folks have as well.
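(The ngram version of the word-count sketch above is only a small step up in complexity; again purely as an illustration, a bigram count would look roughly like:)

#!/usr/bin/perl
# toy bigram counter: count adjacent word pairs instead of single words.
# Illustrative only, same caveats as the word-count sketch above; note
# that pairs don't span line breaks here.
use strict;
use warnings;

my %count;
while (my $line = <STDIN>) {
    my @words = (lc $line) =~ /[a-z']+/g;
    $count{"$words[$_] $words[$_ + 1]"}++ for 0 .. $#words - 1;
}
print "$count{$_}\t$_\n" for sort { $count{$b} <=> $count{$a} } keys %count;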
posted by cortex (staff) to MetaFilter-Related at 8:51 AM (56 comments total) 35 users marked this as a favorite

Sounds like an interesting project, mctootypoots.
posted by box at 8:56 AM on January 4, 2011 [2 favorites]


Sounds interesting. Are you using any third party concordance or word form software other than the grep command?
posted by lampshade at 9:02 AM on January 4, 2011


You said "grep butts" LOL.
posted by rtha at 9:02 AM on January 4, 2011 [10 favorites]


The word list reads like some kind of Beckett-inspired free-form verse:

us lot take back
find many
may same
did never
can't
new years off

its anything day
every always
isn't read
used
made

and, my favorite:

government information
white books
local comments
head needs video

posted by googly at 9:05 AM on January 4, 2011 [7 favorites]


Are you using any third party concordance or word form software other than the grep command?

Not as yet, no. I'm really still a total newbie with this stuff; all my work so far is filtering with perl against the db, and in the meantime I've just been reading, reading, reading lately.
posted by cortex (staff) at 9:05 AM on January 4, 2011


Great stuff! Got insomnia again?
posted by Melismata at 9:06 AM on January 4, 2011


You said "grep butts" LOL.

He likes to grep butts and he cannot lie.
posted by Horace Rumpole at 9:06 AM on January 4, 2011 [1 favorite]


"You need a hobby" I said
"Minecraft is a hobby" he said
"Maybe a non-gaming hobby" I said
"Is computational linguistics a game?" he said
posted by jessamyn (staff) at 9:17 AM on January 4, 2011 [52 favorites]


Maybe cortex needs some more MeTa flamewars to fill up his spare time.

Don't you think so, Burhanistan, you fucking asshole?
posted by shakespeherian at 9:37 AM on January 4, 2011 [3 favorites]


It seems to me that you could get some interesting results by comparing this to other standard frequency tables such as these.

I glanced at the most frequent nouns in written english. They are 1. time 2. year 3. people...
The 2010 list for metafilter not only has them in the order people, time, year, but the word "people" appears about three times as often as in their corpus of written english. You could say this is because metafilter comments are more informal, more like spoken english.

In their spoken english corpus, "people" is indeed the most common noun, but its PPM count of 2063 is still well below the Mefi 2010 PPM count of 3625 (roughly 1.75 times as frequent here). So it's something else. I'd guess the likely reason is the popularity of specific expressions, or the fact that this site is more focused on rhetoric than on communication.
posted by vacapinta at 9:38 AM on January 4, 2011 [2 favorites]


I was going to ask about how feasible it'd be to provide this per-user but then I came up with two solid reasons not to do it. It'd make authorship attribution pretty simple, which would A) unmask sockpuppets & B) unmask anonymous askme's. So, long story short - never mind!
posted by scalefree at 9:43 AM on January 4, 2011


Is this with the intention of creating MetaFilter spam bots?
posted by iotic at 9:48 AM on January 4, 2011 [1 favorite]


I'm looking forward to the next podcast when Matt and Jess inevitably tease Cortex about this latest bit of nerdery.

Though that in no way means I am not in awe of this nerdery.
posted by bondcliff at 9:48 AM on January 4, 2011


It seems to me that you could get some interesting results by comparing this to other standard frequency tables such as these.

Absolutely, yeah. One thing that I've already briefly hacked about with is trying to extract mefi-specific jargon by doing a comparison against the BNC frequency table, though that does reveal one major difficulty: the BNC source text is going on twenty years old, which means a lot of stuff under-represented or just plain absent in its files isn't so much mefi jargon as it is modern/techy/webby jargon. We talk about DVDs and iphones and https here a fair amount, but that's more a product of the general domain of technology and web discussion than anything to do with metafilter (whereas "mefi" or "threadshitting" or "fishpants" are more legitimately local terms).

So an alternative baseline corpus of more modern texts would be useful for some sorts of comparison. That said, the general frequency stuff captured by the BNC would be really useful for profiling mefi activity, as you say. A lot of exciting possibilities there.
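(In sketch form, that comparison is just a ratio of the per-million rates in the two tables; assuming both have been massaged down to plain whitespace-separated count/ppm/word lines, and with placeholder filenames and an arbitrary cutoff, it's roughly:)

#!/usr/bin/perl
# rough sketch of the over-representation comparison: read two frequency
# tables as word => ppm, then print words whose mefi rate is far above the
# baseline rate. The filenames, smoothing value, and threshold are all
# placeholders, and the assumed "count ppm word" line format is a guess.
use strict;
use warnings;

sub read_ppm {
    my ($file) = @_;
    my %ppm;
    open my $fh, '<', $file or die "can't open $file: $!";
    while (<$fh>) {
        my ($count, $rate, $word) = split;
        $ppm{$word} = $rate if defined $word;
    }
    close $fh;
    return \%ppm;
}

my $mefi     = read_ppm('mefi.txt');
my $baseline = read_ppm('baseline.txt');

for my $word (sort keys %$mefi) {
    my $base  = $baseline->{$word} || 0.01;   # smooth words the baseline lacks
    my $ratio = $mefi->{$word} / $base;
    print "$ratio\t$word\n" if $ratio > 50;
}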

Inter-subsite comparisons would be really interesting too: how do the blue, the green, and the grey differ?
posted by cortex (staff) at 9:53 AM on January 4, 2011


Oooh! I can see it now - ContentBot!

1. Score the word list based on a frequency reduction over time to determine an interest that has faded from Mefi's frontal cortex (as in the brain... not that cortex).
2. Google search the key words.
3. Build a post out of website context and expansion of the substring terms that were originally keyed on.
4. ????
5. Profit!


6. Watch ContentBot promptly get banned...

Well... I'll download the table now, but when my current work project is up in February, I'm seeing some nice SAS work coming out of this.

Or more interesting... I could just make CalloutBot which syntactically mis-identifies offensive terminology and hangs out on MetaTalk...

Or PonyBot....

Or RelationshipBot - which pre-emptively special snowflakes a topic which has not been on the green this week...
posted by Nanukthedog at 10:30 AM on January 4, 2011 [1 favorite]


My 2011 to-do list:

1. Build a time machine.
2. Take this data back to 2009.
3. Write a better dissertation.

Seriously, cortex, this is awesome. Long live datawankery!
posted by lewistate at 10:41 AM on January 4, 2011 [1 favorite]


kingarthurflour
kingartdurflour
posted by xorry at 10:47 AM on January 4, 2011 [2 favorites]


In their spoken english corpus, "people" is indeed the the most common noun but the PPM count of 2063 is still well below the Mefi 2010 PPM count of 3625. So, it is something else. I expect the popularity of specific expressions or the fact that this site is more focused on rhetoric than communication to be the likely reason.

Ok, so no disrespect, but as a communication scholar I just can't let this go...rhetoric is communication, just one type of communication (depending on which rhetoric scholars you are talking to). I'm sure you were using these terms to try to get at some distinction between the discourse here on MeFi and the general spoken discourse represented in the corpus, but I'm not sure what exactly you were going for. Certainly a lot of MeFi discourse tends toward the persuasive, but I'm not sure that necessarily accounts for the frequency of the word "people."

I think one hypothesis might be that what MeFi is for is discussing links, and quite often those links are about people or things people are doing, so that might account for the increased frequency of "people" in the MeFi corpus.
posted by DiscourseMarker at 10:52 AM on January 4, 2011


cortex you might be interested in LIWC (pronounced "luke" for some reason); my psychology colleagues use it a lot, and I'm actually using it for some MetaTalk data right now. LIWC gives you percentages of different word categories that a text falls into, biased, of course, towards categories of things that cognitive psychologists are interested in.
posted by DiscourseMarker at 10:57 AM on January 4, 2011


My 2011 to-do list:

1. Build a time machine.
2. Take this data back to 2009.
3. Write a better dissertation.


The best dissertation is a finished dissertation. Think of this as tenure-fodder.
posted by DiscourseMarker at 10:59 AM on January 4, 2011 [2 favorites]


I would like to see this compared against FOX NEWS frequency tables.
posted by blue_beetle at 11:16 AM on January 4, 2011


rhetoric is communication, just one type of communication

Oh, you want to open THAT can of worms, do you? Clearly, you haven't been sufficiently brainwashed by the all-communication-is-rhetoric crowd.
posted by lewistate at 11:17 AM on January 4, 2011


Oh, you want to open THAT can of worms, do you? Clearly, you haven't been sufficiently brainwashed by the all-communication-is-rhetoric crowd.

Hah--no! But I did recently witness two of my colleagues get into a minor argument about the definition of rhetoric.

This is why I'm a social scientist!
posted by DiscourseMarker at 11:29 AM on January 4, 2011


This is so totally awesome, thanks cortex!
posted by iamkimiam at 11:49 AM on January 4, 2011


I would like to see this compared against FOX NEWS frequency tables.

Oh that's easy, here's the whole table:

39422145        8945.25684534    rabble
posted by Rhomboid at 12:08 PM on January 4, 2011 [5 favorites]


(Er, I guess technically PPM should be a million. Joke fail.)
posted by Rhomboid at 12:09 PM on January 4, 2011


Showoff.
posted by zennie at 1:14 PM on January 4, 2011


LIWC (pronounced "luke" for some reason)

"Liw" would be pronounced "lu" or "loo" similar to how the Chinese last name "Liu" is. Actually, the vowel would be closer to "ew" ("ewww, that's gross!") than "oo" ("ooh, shiny!"), but when you add in the "k" sound from the C, you get "lewk," which quickly corrupts into "luke."
posted by explosion at 2:50 PM on January 4, 2011 [1 favorite]


The greater the number of words that were used here only by me in the last 10 years, the greater will be my sense of self-esteem.
posted by stavrosthewonderchicken at 3:25 PM on January 4, 2011 [1 favorite]


It's Anything Day, everybody!!

It's A Beautiful Day!
posted by hippybear at 4:08 PM on January 4, 2011


And the lower your comprehensibility, Stavros.
posted by Fraxas at 4:32 PM on January 4, 2011


[NOT LEGOMENIST]
posted by cortex (staff) at 4:43 PM on January 4, 2011 [1 favorite]


So an alternative baseline corpus of more modern texts would be useful for some sorts of comparison.

Cortex, the Google n-gram data should at least let you do a direct comparison between metafilter and the public web (as of January 2006, which is when Google collected their data).

Google seems like a good source of inspiration all around, actually.

For example, the Google n-gram viewer might be able to show trends over 500 years of publishing, but metafilter has been around for over 10 years, which is slightly less than forever in internet time. Imagine being able to chart the frequency of "hope me", "i for one welcome", "suck.com" or "twitter" on metafilter between 1999 and now.

You might also consider using the same tokenizing rules as Google, with the idea that it might make comparisons with their data more meaningful. Or using the same n-gram file format that they use.
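(Charting that kind of trend off the per-year tables would only take a few lines of script; a rough sketch, with made-up filenames and assuming whitespace-separated count/ppm/word lines:)

#!/usr/bin/perl
# rough sketch: pull one word's ppm out of a set of per-year tables and
# print a year/ppm series suitable for graphing. The "mefi-YEAR.txt"
# filenames are invented; the real per-year files are named differently.
use strict;
use warnings;

my $target = shift || 'twitter';
for my $year (1999 .. 2010) {
    open my $fh, '<', "mefi-$year.txt" or next;   # skip missing years
    my $ppm = 0;
    while (<$fh>) {
        my ($count, $rate, $word) = split;
        if (defined($word) && $word eq $target) { $ppm = $rate; last }
    }
    close $fh;
    print "$year\t$ppm\n";
}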
posted by jjwiseman at 5:30 PM on January 4, 2011 [1 favorite]


I am tempted to change my username to "Fucknozzle McTootypoots" now, but y'all have just figured out how to pronounce "Sidhedevil", so.

It was either that or Vituperative Contumelious
posted by Sidhedevil at 7:14 PM on January 4, 2011 [6 favorites]


Amazing diligence Cortex. Duly impressed. Would be nice to see this compilation in the same format as this or this. (preemptive forgiveness if these are from the blue, I get my real life and Meta life mixed up.)
posted by ~Sushma~ at 8:00 PM on January 4, 2011 [1 favorite]


scalefree writes "I was going to ask about how feasible it'd be to provide this per-user but then I came up with two solid reasons not to do it. It'd make authorship attribution pretty simple, which would A) unmask sockpuppets & B) unmask anonymous askme's. So, long story short - never mind!"

It would be pretty awesome to get it for yourself though.
posted by Mitheral at 9:01 PM on January 4, 2011


I glanced at the most frequent nouns in written english. They are 1. time 2. year 3. people..

That's why Time's Person of the Year is so beloved by all.
posted by twoleftfeet at 9:14 PM on January 4, 2011 [4 favorites]


It would be pretty awesome to get it for yourself though.

I'd be happy to make a cleaned up version of the perl code available at some point for anyone who wanted to throw their exported comments at it. I could probably also compute some one-off files for inquiring folks, as far as that goes.
posted by cortex (staff) at 9:18 PM on January 4, 2011


You'd think that, but actually, a lot of people seem to want to punch me.
posted by box at 9:19 PM on January 4, 2011


For example, here's my frequency table, across the 1.7M words I've written on the site over the last ten years or so.
posted by cortex (staff) at 9:30 PM on January 4, 2011 [1 favorite]


Corpus of Contemporary American English will avoid the "iPhone problem" of using the BNC as a baseline. And you can get frequency tables for it.

The other iPhone problem, you say? No, it does nothing to prevent your coworkers from dropping off poofarts at your desk. That one, you're on your own.
posted by eritain at 10:50 PM on January 4, 2011


And now that you have your frequency table Cortex, you can automate your responses to people and just focus on the deleting and the banhammer!
posted by Phantomx at 7:35 AM on January 5, 2011


Cortex, the Google n-gram data should at least let you do a direct comparison between metafilter and the public web (as of January 2006, which is when Google collected their data).

I was under the impression that the available ngram datasets were only for the Books stuff. Not that that'd be terrible, since recent books/magazines/journals will have much of the up-to-date jargon in 'em as well; I just want to be sure I understand what exactly is and isn't available.

You might also consider using the same tokenizing rules as google, with the idea that it might make comparisons with their data more meaningful. Or using the same n-gram file format that they use

Yeah, it might be worth cloning their file format, or at least coming close, once I start looking seriously at the ngram stuff. As far as their tokenizing process goes, where is that laid out in detail? I haven't gone looking very hard.

Corpus of Contemporary American English will avoid the "iPhone problem" of using the BNC as a baseline.

Yeah, it seems like a good possibility. I'm going to need to get a couple hundred dollars more serious about its usefulness to me before that's a viable option, though.
posted by cortex (staff) at 8:00 AM on January 5, 2011


Odd little idea, probably, but is there any way to do this on a realtime, thread-by-thread basis?
Many threads here seem to have particular brands or ideas that get repeated throughout the thread.
So for a person just visiting the thread, a quick summary of the most popular ideas or brands would be perfect.

Maybe in a sidebar on threads with more than 25 replies, for example.
posted by radsqd at 9:23 AM on January 5, 2011


I don't see that becoming an actual feature any time soon, but as an analysis idea it might be interesting. I did something similar a couple years back with the Word Clouds stuff, essentially looking for over-represented terms for any given thread, and I might give that another shot with this more complete mefi baseline to work against.
posted by cortex (staff) at 9:41 AM on January 5, 2011


It was either that or Vituperative Contumelious

Aw, sidhe-it.
posted by kittyprecious at 1:08 PM on January 5, 2011 [2 favorites]


It seems to me that you could get some interesting results by comparing this to other standard frequency tables such as these.

yes! And then we could argue about how use of 'mctootypoots' wasn't particularly widespread amongst mefites, how it doesn't reflect our values, and... accuse okcupid of poor science?

(sorry, couldn't resist: i actually think it's a fine idea)

Is there any way for people who aren't cortex to get access to the db to do their own experiments? How much data is involved there?
posted by nml at 2:14 PM on January 5, 2011


Yep, between the Infodump and these tables that's the data currently publicly available. If you have a specific sort of thing you'd like to look at that's in the database, you can totally chatter about it in here or send me mail, all reasonable requests will be considered.
posted by cortex (staff) at 2:23 PM on January 5, 2011


I've been working on a project that involves a lot of similar stuff. You might try filtering the corpus against a dictionary (this would remove most proper nouns), or at least tag them. Another really interesting slant is to use a lemmatizer to reduce words to their root forms, collapsing lots of word variations into a single representative. I've been using MorphAdorner with some success for this, but its built-in irregular-words list is dated.
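(As a sketch of the dictionary-filter part, assuming a unix word list at /usr/share/dict/words and the same whitespace-separated count/ppm/word lines, something like this would keep only dictionary words:)

#!/usr/bin/perl
# rough sketch: pass frequency-table lines through only if the word appears
# in the system word list. The word-list path and the assumed "count ppm
# word" line format may not match your setup.
use strict;
use warnings;

my %dict;
open my $words, '<', '/usr/share/dict/words' or die "no word list: $!";
while (<$words>) {
    chomp;
    $dict{lc $_} = 1;
}
close $words;

while (<STDIN>) {
    my @fields = split;
    my $word = $fields[2];          # third column is the word itself
    print if defined($word) && $dict{$word};
}

Roughly: perl dictfilter.pl < allsites--1999-01-01--2011-01-01.txt > filtered.txt.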
posted by liquid54 at 8:47 PM on January 5, 2011 [1 favorite]


How do you handle case? Lower-case everything? (e.g. "May" <> "may" unless "may" occurs at the start of a sentence)
posted by alasdair at 3:46 AM on January 6, 2011


Yeah, this simply lcs everything.
posted by cortex (staff) at 7:16 AM on January 6, 2011


I can claim the singular-until-now mctootypoots reference.
posted by mbd1mbd1 at 11:38 AM on January 6, 2011


Ha!
posted by cortex (staff) at 11:56 AM on January 6, 2011


It's nice to see 'fuckwits' being used much more often than 'fuckwit', since 'fuckwit' implies a direct insult to a single entity while 'fuckwits' implies general levels of assholeness.
posted by el io at 6:06 PM on January 7, 2011


oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges oranges

oranges
posted by obiwanwasabi at 9:13 PM on January 9, 2011


jjwiseman: am concerned that the google n-gram viewer doesn't include the phrase 'cheesy peas' anywhere in its corpus.

I am only slightly drunk.
posted by knoxg at 2:49 PM on January 21, 2011

