Metafilter Frequency Tables updated, now with 636 million words! January 14, 2013 11:13 AM

It's been over a year since we first made the Metafilter Frequency Tables available, and now they're updated with word frequency information for all of 2011 and 2012 as well, bringing the total number of words up to six hundred and thirty-six million. Gosh! (Is a word we've collectively used 5,707 times since 1999!)

For some good background on what's in these tables and what a body might do with them, check out the original announcement post and the wiki page about 'em.

But the very short version is that these tables represent the total count of occurrences of any given word in Metafilter comments (specifically from the Metafilter, Ask Metafilter, Metatalk, and Music subsites, just like we cover in the Infodump), as well as the relative frequency with which each word appears, expressed as "parts per million" (or PPM): the number of times that word would appear if the total word count of the source text were exactly one million words.
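
(If you want the arithmetic spelled out, here's a quick back-of-the-envelope sketch in Python using the numbers above; nothing fancy.)

# PPM = (raw count / total words in the slice of corpus) * 1,000,000
total_words = 636_000_000   # whole-corpus word count from this post
gosh_count = 5_707          # times "gosh" has turned up since 1999

ppm = gosh_count / total_words * 1_000_000
print(f"gosh: {ppm:.2f} PPM")   # roughly 8.97 occurrences per million words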

If you feel like just doing some basic searching with a text editor's "find" function, the easiest thing is to download one of the "complete" files and play around with that. If you're feeling like doing some more ambitious datawankery, there are a lot of potentially interesting things to do with comparisons between different subsites, or between the same subsite over time, or comparing the Mefi corpus to other more general linguistic corpora that exist on the internet.
posted by cortex (staff) to MetaFilter-Related at 11:13 AM (98 comments total) 9 users marked this as a favorite

Can someone compare this to more standard English corpuses and tell us how we're unusual?

What rare words do we use a lot, what common words do we eschew, kind of thing.

EDIT: oh I see that was one of the homework problems, up on the main post. Well, do it, someone!
posted by grobstein at 11:27 AM on January 14, 2013


Geez, we sure can talk, can't we?

I mean, it's almost like once we start talking, we can't stop. Or maybe some of us are used to getting paid by the word, so they become more and more verbose, causing people to just skim past every comment they make.

Balderdash, I say! We are in control of our own destinies. There is no way that we can't lick this! We can and must conserve words!

several hundred thousand words later

And that is how we bring about a reboot of Quantum Leap.

Thank you for your time.
posted by inturnaround at 11:32 AM on January 14, 2013 [3 favorites]


Well I'm certainly grintisticated by the emplondibility on show here.
posted by Jofus at 11:35 AM on January 14, 2013 [2 favorites]


We can and must conserve words!

I've got an entire shelf in the basement of mason jars with fucks in syrup. And the barrel of shits is pickling nicely. I'm not sure about the smoked assholes, however.
posted by griphus at 11:36 AM on January 14, 2013 [9 favorites]


ptarmigan ptarmigan billabong skookum skookum hella hella
posted by "Elbows" O'Donoghue at 11:36 AM on January 14, 2013 [1 favorite]


My own contribution to the specific character of the MeFi corpus is that I say "yay" far more than is either normal or prudent. (But not, I might add, more than is warranted.) Yay!
posted by ocherdraco at 11:36 AM on January 14, 2013 [1 favorite]


/looks up "fuckknuckle"
posted by facetious at 11:36 AM on January 14, 2013


Q: Will IRL eventually get added to the Infodump, as the Fieri thread (etc.) has not-insignificant conversation?
posted by shakespeherian at 11:37 AM on January 14, 2013


Also hands up if you pronounce "hapax legomenon" in the same cadence as "koosalagoopagoop."
posted by griphus at 11:39 AM on January 14, 2013


Can someone compare this to more standard English corpuses and tell us how we're unusual?

A mental evaluation would probably do a better job of that.

Though it would probably drive the psychologist insane.
posted by Brandon Blatcher at 11:40 AM on January 14, 2013


Q: Will IRL eventually get added to the Infodump, as the Fieri thread (etc.) has not-insignificant conversation?

Maybe. Same answer for Projects and Jobs. The primary issue with those three is that they have more significant variation in form and content than do the big three subsites and Music, and so for simplicity's sake I've always kept it to those four. (Music is the odd man out even there, since it's so much lower-traffic than the rest. But it makes for a decent control corpus for sanity-checking stuff.)
posted by cortex (staff) at 11:41 AM on January 14, 2013


Looking at the data for the 308 million words on the blue since the beginning of time erm MetaFilter. Pulling out two entries:

27,896 cat
18,499 romney

This is most satisfying.
posted by Wordshore at 11:43 AM on January 14, 2013 [7 favorites]


And before anyone says it's an unfair comparison, remember that Romney has been running for president since the beginning of time/metafilter.
posted by Wordshore at 11:44 AM on January 14, 2013 [2 favorites]


What's the significant variation for IRL, other than the specificity of audience for a given thread? It used to be in MetaTalk, after all.
posted by shakespeherian at 11:44 AM on January 14, 2013


Yeah, but it's Romney's dog that got everyone up in arms.
posted by k5.user at 11:45 AM on January 14, 2013


IRL's main distinguishing feature on this front is that it didn't exist when the Infodump got put together.
posted by cortex (staff) at 11:47 AM on January 14, 2013


Romney announced he is running for cat next year.
posted by Mister_A at 11:49 AM on January 14, 2013 [9 favorites]


And he will lose to Bo.

And I will laugh and laugh and laugh.
posted by MCMikeNamara at 11:52 AM on January 14, 2013 [3 favorites]


Can we have the scripts that were used to make these(?)? Would it be abusive to run our own word counts on different questions (e.g. user frequency tables)?
posted by grobstein at 11:54 AM on January 14, 2013


inturnaround: We can and must conserve words!

"Eschew surplusage." -- Mark Twain
posted by Greg_Ace at 11:56 AM on January 14, 2013


I like very much that once you get out of the top 20 words -- your incidences of the and and and suchlike -- it sounds a lot like the dialogue of very inarticulate teenagers: "It's about, like, all so just what..." is words #28 through 34 for January 2008, for example.
posted by ricochet biscuit at 12:10 PM on January 14, 2013


27,896 cat
18,499 romney


About time the animals hosed down Romney.
posted by arcticseal at 12:10 PM on January 14, 2013


Here's some examples of how year-by-year dataviz of this stuff could be fun to play with:

- The rise of Hamburger, which shows the enormous spike in the use of that particular word when it accidentally took on jargonful meaning.

- Mod usernames vs "mods", suggesting the utility of referring to plural mods generically rather than specifically by name. (I kept the mod name list short here only to keep the graph a little clearer than if there were more lines on there.)

- Link aggregators over time, as a sort of long view of the accelerated history that is the internet.

The code generating this is pretty hacky and not ready for public consumption at all, but the basic idea—calculate values for the same word(s) from multiple subsites or slices of time—would be easy enough for a dedicated datawanker to play with.
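
If you want to roll your own version of that, here's a rough sketch of the per-word, per-year lookup part (not the actual graphing code; it assumes the tab-separated count / PPM / word layout of the downloadable tables and just globs for the yearly filenames, which vary a bit by vintage):

import csv, glob

def ppm_by_year(word, years):
    # Return {year: PPM} for one word by scanning each yearly table.
    series = {}
    for y in years:
        # Filenames differ ("freqtable--..." vs "allsites--..."), so match on the dates.
        matches = glob.glob(f"*--{y}-01-01--{y + 1}-01-01.txt")
        if not matches:
            continue
        ppm = 0.0
        with open(matches[0], encoding="utf-8") as f:
            for row in csv.reader(f, delimiter="\t"):
                if len(row) == 3 and row[2] == word:
                    ppm = float(row[1])
                    break
        series[y] = ppm
    return series

print(ppm_by_year("hamburger", range(2005, 2013)))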
posted by cortex (staff) at 12:16 PM on January 14, 2013 [6 favorites]


Can we have the scripts that were used to make these(?)? Would it be abusive to run our own word counts on different questions (e.g. user frequency tables)?

I can take a look at my perl code and see if it's presentable enough to make public. Playing with this vs. your personal word frequency stuff would be totally fine, yes.
posted by cortex (staff) at 12:17 PM on January 14, 2013


- The rise of Hamburger, which shows the enormous spike in the use of that particular word when it accidentally took on jargonful meaning.

I'm actually more curious about that severe drop in AskMe around 2003-2004.
posted by griphus at 12:18 PM on January 14, 2013 [2 favorites]


I'm actually more curious about that severe drop in AskMe around 2003-2004.

There was relatively little data in the 2003 askme bucket, since the site only came into being in December of that year. So any datapoints there are likely to be more erratic than stuff later on; many words otherwise represented in small numbers most years might be totally absent that year, and any words actually present would be over-represented.
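
Quick illustration of that over-representation effect, with made-up numbers (a single occurrence in a tiny bucket gets a huge PPM):

# Hypothetical bucket sizes, for illustration only.
for total_words in (50_000, 5_000_000):
    print(total_words, "words:", 1 / total_words * 1_000_000, "PPM for one occurrence")
# 50000 words: 20.0 PPM for one occurrence
# 5000000 words: 0.2 PPM for one occurrence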
posted by cortex (staff) at 12:22 PM on January 14, 2013 [1 favorite]


Ah, that makes a lot more sense than my Christmasburger Fad of '03 theory.
posted by griphus at 12:30 PM on January 14, 2013 [3 favorites]


Apologies, I haven't had the chance to look at this. Is the data anonymized? Or are the words tagged with your username?
posted by Afroblanco at 12:38 PM on January 14, 2013


This is all necessarily anonymous; they're aggregate counts for the entire site or specific subsites, not user-specific counts. For that sort of thing you'd have to contact me directly and ask really nicely or wait around for the odd Metatalk thread where that's already the going thing.
posted by cortex (staff) at 12:42 PM on January 14, 2013


Got it. Thanks!
posted by Afroblanco at 12:43 PM on January 14, 2013


I see we reached peak cortex in 2007.
posted by justsomebodythatyouusedtoknow at 12:44 PM on January 14, 2013 [6 favorites]


People just couldn't shut up about that guy.
posted by cortex (staff) at 12:45 PM on January 14, 2013 [1 favorite]


The parabola created by Slashdot and reddit is so on the money it isn't even funny.

I look forward to any and all visual representations or other neat things people find. (Some might call this "me being lazy"; I call it "encouragement of others.")
posted by MCMikeNamara at 1:02 PM on January 14, 2013 [1 favorite]


How much of that is from those weirdo ax-grinding Meta threads where that one guy had decided that cortex was personally the death of Metafilter?
posted by shakespeherian at 1:03 PM on January 14, 2013


No man can kill MetaFilter!
posted by Mister_A at 1:25 PM on January 14, 2013


of woman born, 'til Birnam wood do come to Dunsinane...
posted by beryllium at 1:29 PM on January 14, 2013 [6 favorites]


This is the sort of stuff that makes me feel all squiply about Metafilter.
posted by fantabulous timewaster at 3:03 PM on January 14, 2013 [1 favorite]


This is great! Downloaded and uploaded and am happily swimming in the data.

Just one thing I'm struggling with here ... is there anybody reading this who knows how to write an awk (or other text tool) script that will stitch files together? It would be heavenly if I could select the first 10K (matching) rows of each of the subsite yearly files into one table that would have the following columns:

WORD, PPM_1999, COUNT_1999, PPM_2000, COUNT_2000, PPM_2001, COUNT_2001, ...

The point is to stitch these files together *before* uploading them into the database (otherwise I could just do this in SQL). Having one neat file/table for each subsite to upload would be fabulous, and would make SQL queries a snap...otherwise one would need to upload 14 files into the database (one for each year), which would be a pain to do 4 times.*

I think this could be useful to anybody who wants to easily look at the trend of a particular word over time, as this information would be contained in a single row (rather than spread over 14 different text files).

My efforts in coming up with such a script are hopeless thus far and I look to you wise data-crunchers to save me from drowning in data.

*4 times=once for each subsite in the corpus. However, for my MetaFilter research, I have over 1,000 sets of individual corpus files (from MeFites who took the 2010 and 2012 surveys and gave consent to have their personal word frequency table generated and shared with me...thanks a million words!), broken down by year...that's about 14,000 text files (one for each year, per person). It'd sure be nice to get them back down to 1,000 files. :)
posted by iamkimiam at 3:20 PM on January 14, 2013


Also, ginormous THANK YOU to cortex for all these corpus files! It's such a cool, fun thing and I totes 'preciate the work you've put into this. It's amazeballs.

(I'm also angling to get my personal corpus as whackadoo as poss here, word by zany word)
posted by iamkimiam at 3:25 PM on January 14, 2013 [2 favorites]


Fun activity: pick a word—e.g., turd—then go down into the 1's and find the amusing variations of the word.
posted by fleacircus at 3:35 PM on January 14, 2013


iamkimiam, I don't quite understand what you want (I haven't looked at the structure of the files), but it sounds like uploading a file is a pain for you. Does this mean you do each file by hand? If that's the case I would think about automating the process and then you could use a bash script like:
for file in *.data
do
    upload "$file"   # stand-in for whatever command loads one file into your database
done
Assuming you mark each row with the year it came from (which isn't really good database practice, but what the heck), you can make the table you want with SQL, which like you say is designed for this kind of stuff.

If the year isn't there in the data, you could preprocess each file with sed or awk and add it to each row. I think it's something like
s/$/,2001/g
to add it to the end of each row.
posted by benito.strauss at 3:45 PM on January 14, 2013 [1 favorite]


The data in all the corpus files look like so:

16479887 44413.7395233388 the
10296832 27750.2396930015 to
9509465 25628.2644120258 a

...where the first column is the raw count, the second is the PPM (parts per million) and the third is the word. Each text file represents a different year in the corpus. I basically want to join 14 text files together so you have one text file that has 29 columns (1 for the word, 14 for the raw counts of that word for each year, 14 for the PPM...hell, I don't even need the PPM columns—that is a simple calculation of the count divided by the total words in the corpus for that year, which is listed at the top of the text file).

Anyways, I think you're right...I'm just going to have to find a way to batch load these files into the database and create the tables I want from within there. Although, I'm starting to wonder if R could do this with its data slicing and concatenating capabilities...
posted by iamkimiam at 4:10 PM on January 14, 2013


You can tell Music is a nurturing and positive environment for us musician types because once you get past the top 20 or so conjunctions and prepositions, the words that jump out are 'song' (obviously) followed by 'great', 'good' and 'thanks'.
posted by TwoWordReview at 4:31 PM on January 14, 2013


Somebody ought to make a great meta-song.
posted by iamkimiam at 5:20 PM on January 14, 2013


yet-another:
yet-another-2d-scroller
yet-another-attempt
yet-another-bad-news-post
yet-another-bush-is-the-devil
yet-another-californian
yet-another-crappy-term-that-techies-hate
yet-another-embarrassing-old-tool-rock-move
yet-another-emergency
yet-another-example
yet-another-family-tragedy
yet-another-first-person-shooter
yet-another-flamewar
yet-another-fpp
yet-another-government-bureaucracy
yet-another-head-to-head
yet-another-industry-flack
yet-another-link
yet-another-low-rung-job
yet-another-m-16-variant
yet-another-mormon-racism
yet-another-movie
yet-another-naomi-huff
yet-another-networking
yet-another-newsfi
yet-another-online-photo-album
yet-another-overproduced
yet-another-pair-of-shoes
yet-another-perpetual-motion
yet-another-raised-catholic-gone-atheist
yet-another-related-posts-plugin
yet-another-request-for-sf-info
yet-another-russian-oligarch
yet-another-sign
yet-another-trademark-infringement
yet-another-ugly-old-freeway
yet-another-version-of
yet-another-way-to-get-you-killed
yet-another-wedding-music
yet-another-whiny-callout
yet-another-window-manager
yet-another-word-cloud
yet-another-xenu-joke
yet-another-zed-word
posted by unliteral at 5:28 PM on January 14, 2013 [3 favorites]


I can't use my real computer right now, and my iPad is pleading with me not to make it try to grapple with these files, so pretty please could someone extract the "ass*" compounds for our enjoyment? I am presuming "asshole" outdistances "asshat", with "assmunch" bringing up the rear, and "asscandle", "asswaffle" and others having to relinquish their deposits.
posted by Sidhedevil at 5:35 PM on January 14, 2013


I was hoping to use this somehow to see all of the "metafilter:" posts. But I just could not find an easy way to group them all.

Heck of a job! Thanks for sharing!
posted by TangerineGurl at 5:57 PM on January 14, 2013


Typos and url's being what they are, I guess I shouldn't be surprised that almost 60% of the 1.4m different words that have appeared on Mefi proper are hapax legomena, but I am kind of surprised anyway.

Or maybe I should be surprised that almost 600,000 different words have been used more than once. Some of the two-bangers are pretty funny, though. (jackassitude and teapocalypse, I'm looking at you...)

By the way, for those who, like me, suck at SQL and have no coding or scripting ability but still like to dick around with this kind of stuff, if you throw out the unique words and work only with those that have appeared more than once, these tables are totally manipulable in Excel.
posted by dersins at 6:10 PM on January 14, 2013


Also, thanks cortex!
posted by dersins at 6:11 PM on January 14, 2013


I see we reached peak cortex in 2007.

They have had to begin fracking as the major fields became depleted.
posted by y2karl at 6:31 PM on January 14, 2013 [1 favorite]


Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
posted by OmieWise at 6:39 PM on January 14, 2013


Sidhedevil - Top 20:
63091 ass
30850 asshole
17079 assholes
2190 half-assed
2029 asshat
1551 kick-ass
1174 asshats
940 bad-ass
886 big-ass
503 asshattery
502 assed
485 crazy-ass
406 ass-kicking
397 lame-ass
395 cheap-ass
380 smart-ass
334 assload
318 half-ass
300 weird-ass
278 dumb-ass

Honourable mentions:
2 james-river-traders-wearing-calvin-klein-aftershave-smelling-goofy-ass
1 ass-manuel-antonio-gonorrhea-noriega-moreno
1 dyed-in-their-own-ass-smelling-wool-hoody
posted by unliteral at 7:00 PM on January 14, 2013 [2 favorites]


Is

ACETYL­SERYL­TYROSYL­SERYL­ISO­LEUCYL­THREONYL­SERYL­PROLYL­SERYL­GLUTAMINYL­PHENYL­ALANYL­VALYL­PHENYL­ALANYL­LEUCYL­SERYL­SERYL­VALYL­TRYPTOPHYL­ALANYL­ASPARTYL­PROLYL­ISOLEUCYL­GLUTAMYL­LEUCYL­LEUCYL­ASPARAGINYL­VALYL­CYSTEINYL­THREONYL­SERYL­SERYL­LEUCYL­GLYCYL­ASPARAGINYL­GLUTAMINYL­PHENYL­ALANYL­GLUTAMINYL­THREONYL­GLUTAMINYL­GLUTAMINYL­ALANYL­ARGINYL­THREONYL­THREONYL­GLUTAMINYL­VALYL­GLUTAMINYL­GLUTAMINYL­PHENYL­ALANYL­SERYL­GLUTAMINYL­VALYL­TRYPTOPHYL­LYSYL­PROLYL­PHENYL­ALANYL­PROLYL­GLUTAMINYL­SERYL­THREONYL­VALYL­ARGINYL­PHENYL­ALANYL­PROLYL­GLYCYL­ASPARTYL­VALYL­TYROSYL­LYSYL­VALYL­TYROSYL­ARGINYL­TYROSYL­ASPARAGINYL­ALANYL­VALYL­LEUCYL­ASPARTYL­PROLYL­LEUCYL­ISOLEUCYL­THREONYL­ALANYL­LEUCYL­LEUCYL­GLYCYL­THREONYL­PHENYL­ALANYL­ASPARTYL­THREONYL­ARGINYL­ASPARAGINYL­ARGINYL­ISOLEUCYL­ISOLEUCYL­GLUTAMYL­VALYL­GLUTAMYL­ASPARAGINYL­GLUTAMINYL­GLUTAMINYL­SERYL­PROLYL­THREONYL­THREONYL­ALANYL­GLUTAMYL­THREONYL­LEUCYL­ASPARTYL­ALANYL­THREONYL­ARGINYL­ARGINYL­VALYL­ASPARTYL­ASPARTYL­ALANYL­THREONYL­VALYL­ALANYL­ISOLEUCYL­ARGINYL­SERYL­ALANYL­ASPARAGINYL­ISOLEUCYL­ASPARAGINYL­LEUCYL­VALYL­ASPARAGINYL­GLUTAMYL­LEUCYL­VALYL­ARGINYL­GLYCYL­THREONYL­GLYCYL­LEUCYL­TYROSYL­ASPARAGINYL­GLUTAMINYL­ASPARAGINYL­THREONYL­PHENYL­ALANYL­GLUTAMYL­SERYL­METHIONYL­SERYL­GLYCYL­LEUCYL­VALYL­TRYPTOPHYL­THREONYL­SERYL­ALANYL­PROLYL­ALANYL­SERINE

on the list?
posted by double block and bleed at 7:05 PM on January 14, 2013


It will be next year.
posted by beryllium at 7:11 PM on January 14, 2013 [5 favorites]


is there anybody reading this who knows how to write an awk (or other text tool) script that will stitch files together?

Try this Python script. Run it in the same directory as the yearly .txt files and it should produce yearly-combined.txt. BTW cortex, there's a slight filename inconsistency in that the pre-2010 files are named like freqtable--2006-01-01--2007-01-01.txt and the newer ones are named like allsites--2012-01-01--2013-01-01.txt but it's no big deal. The script tries both forms. I've tested it with Python 2.6 and 2.7, but it will not work with Python 3 as the csv module no longer wants the file opened in binary mode, but instead text mode with newline=''. You can probably just change the 'rb' and 'wb' arguments to newline='' and that should fix it, but there might be other incompatibilities.
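
For the curious, here's a stripped-down sketch of the same general idea (this is not the linked script, just an illustration of the join; it assumes the two filename patterns above and the tab-separated count / PPM / word columns, Python 3):

import csv, glob
from collections import defaultdict

years = range(1999, 2013)
counts = defaultdict(dict)          # word -> {year: raw count}

for y in years:
    matches = glob.glob(f"freqtable--{y}-01-01--*.txt") or \
              glob.glob(f"allsites--{y}-01-01--*.txt")
    if not matches:
        continue
    with open(matches[0], encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) == 3 and row[0].isdigit():   # skip the header lines
                counts[row[2]][y] = int(row[0])

with open("yearly-combined.txt", "w", encoding="utf-8", newline="") as out:
    w = csv.writer(out, delimiter="\t")
    w.writerow(["word"] + [f"count_{y}" for y in years])
    for word in sorted(counts):
        w.writerow([word] + [counts[word].get(y, 0) for y in years])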
posted by Rhomboid at 7:17 PM on January 14, 2013 [1 favorite]


I'm looking forward to seeing how that thread where we were replacing "heart" with "butt" in song and movie titles affected word usage, too.
posted by misha at 7:51 PM on January 14, 2013 [1 favorite]


Typos and url's being what they are, I guess I shouldn't be surprised that almost 60% of the 1.4m different words that have appeared on Mefi proper are hapax legomena, but I am kind of surprised anyway.

This is actually pretty typical of frequency distributions across pretty much any large-ish corpus of natural language. Give or take a few percentage points, you'll see about half of the words in a given frequency table be one-hit wonders (or hapax legomena), on account of the way language use shakes out in practice. It's a neat phenomenon.
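
If you want to check that against any of the downloadable tables yourself, a quick sketch (assuming the tab-separated count / PPM / word layout):

def hapax_fraction(path):
    # Fraction of distinct words whose raw count is exactly 1.
    total = hapax = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) == 3 and fields[0].isdigit():
                total += 1
                hapax += fields[0] == "1"
    return hapax / total if total else 0.0

print(hapax_fraction("freqtable--2006-01-01--2007-01-01.txt"))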
posted by cortex (staff) at 7:51 PM on January 14, 2013 [1 favorite]


Now I wish I had learned to read.
posted by trip and a half at 8:16 PM on January 14, 2013


How many times was 'fulvous' used, excepting this one?
posted by Mister_A at 8:23 PM on January 14, 2013


This is actually pretty typical of frequency distributions across pretty much any large-ish corpus of natural language.

One person often identified with describing this phenomenon is the American linguist George Zipf, after whom Zipf's Law was named. Following these threads can take you to all sorts of interesting places ... Previously, but maybe a Zipf post is in order.
posted by carter at 9:12 PM on January 14, 2013 [1 favorite]


So.. any hope for bigrams?
posted by Going To Maine at 11:12 PM on January 14, 2013


In 2007, the word "viking" was used 743 times, more than it had been during the entire preceding history of metafilter.
posted by RobotHero at 11:14 PM on January 14, 2013 [5 favorites]


Interesting stuff, but I can't hear the name George Zipf without picturing Leslie Nielsen giving an inspirational speech.
posted by TwoWordReview at 11:54 PM on January 14, 2013


- The rise of Hamburger, which shows the enormous spike in the use of that particular word when it accidentally took on jargonful meaning.

I'm actually more curious about that severe drop in AskMe around 2003-2004.


Should I eat this hamburger?
0 answers
posted by mannequito at 3:05 AM on January 15, 2013


Here are some notes on longest words (used more than once)

The longest word used more than once is:

88888888888888oo8oooooooooocoococccococccccccccccccccccccccccccococcocccccoccococcoccccoooccoocoococcocoo8888888888888oooooooooooocoooooocccccccoccccccccccccccccccccccccccccccccccccoccccoccccccococoococcccccoc888888888888888oo8oooooooooooocococccccccocccccccccccccccccccccccccccccccccccccccoccccccccococcoccccccccc888888888888888oo8ooooooooooocoooccccccccccccccccccoccccccccccccccocccccccccoccccccccococccccooccccccccccc8888888888888888o8ooooooooooocccccocccccocccccccccccccccccccccccccccccccccccccccccocococcccccoccccccccccco8888888888888o8o8ooooooooocooocccoocccccccccccccccccccccccocccccccccocccccccccccccccccccccoccoccccccccccc888888888888888oooooooooooocoococococcccccccccccccccccccoccccccccocoocccococccccccccccccccccccocccccocccccc888888888888888o8ooooooooooocoooccccoccccccccccccccccccccccoccccccccccccccccccccccccccoccccccccccccccccccccc888888888888888o8o8ooooooooooooccooocccccccccccccccoccccccoccocccococccoccoccccccccccccoccccccccccccccccccccc888888888888888888oooooooooooocooocoocccocccccccccoccccccooccccoccccococooccccccccccccccccccccccccccccccccccccc8888888888888888oo8oooooooooooooocccccoccccccccccccccccccccoccccocococccccooocccccccccccccccccccccccccccccccccc888888888888888o8oooo8ooooooooooocccoocccccccccccccccccccccccccccccoccccccccccccccccccccccccccccccccccccccccccc88888888888888o8oooooooooooooooocoooccccccccccccccccccccccccocccoocccocccccccccccccccccccccccccccccccccccccccccc88888888888888oooooooooooooooooococcccccccccccccccccccccccccccocoooccccccccccccccccccccccccccccccccccccccccccccc8888888888888oo8ooooooooooooocoooocccccccocccccocccccoooooocccccccoococccocccccccccccccccccccccccccccccccccccccc888888888888ooooooooooooooooooocccccoccccccoc

I don't know why that would be.
The longest word used more than 10 times is:

4567893acdefghijstuvwxyzcdefghijstuvwxyz838485868788898a92939495969798999aa2a3a4a5a6a7a8a9aab2b3b4b5b6b7b8b9bac2c3c4c5c6c7c8c9cad2d3d4d5d6d7d8d9dae1e2e3e4e5e6e7e8e9eaf1f2f3f4f5f6f7f8f9faffc4001f0100030101010101010101010000000000000102030405060708090a0bffc400b511000201020404030407050404000102w00010203110405

The longest word (used more than once) which is an english word (sort of) is:

mushrooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooom

The longest word (used more than once) which could actually be a word (in german I think) is:

wolfeschlegelsteinhausenbergerdorffvoralternwarengewissenhaftschaferswessenschafewarenwohlgepflegeundsorgfaltigkeitbeschutzenvonangreifendurchihrraubgierigfeindewelchevoralternzwolftausendjahresvorandieerscheinenwanderersteerdemenschderraumschiffgebrauchlichtalsseinursprungvonkraftgestartseinlangefahrthinzwischensternartigraumaufdersuchenachdiesternwelchegehabtbewohnbarplanetenkreisedrehensichundwohinderneurassevonverstandigmenschlichkeitkonntefortplanzenundsicherfreuenanlebenslanglichfreudeundruhemitnichteinfurchtvorangreifenvonandererintelligentgeschopfsvonhinzwischensternartigraum

Longest string of all zzzzzzzzzzz used more than once:
72 chars

Popularity distribution of Z length:
2 z's - 467
3 z's - 168
4 z's - 95
5 z's - 65
6 z's - 46
7 z's - 37
8 z's - 40

From then on, general decline, with a couple of peaks at 18 and 20 Z's and a peak at 61 Z's
posted by Just this guy, y'know at 4:17 AM on January 15, 2013 [3 favorites]


Oh, Mister_A fulvous was apparently used 3 times.

Here Referring to a cloud of dust stirred up by hamsters.

Here referring to an orangutan.

and Here specifically listing it as an obscure word.
posted by Just this guy, y'know at 4:32 AM on January 15, 2013


Niveous has not been used before, and I was kicking myself just yesterday because it's a word I just learned, it means "snowy" basically, and I could have used it in a post I made last week. I might have been able to whittle those four comments down to three!
posted by OmieWise at 5:20 AM on January 15, 2013


- The rise of Hamburger, which shows the enormous spike in the use of that particular word when it accidentally took on jargonful meaning.

Ich bin ein Hamburger.
posted by ersatz at 6:00 AM on January 15, 2013


Never used before you say.....

Be right back.
posted by Just this guy, y'know at 6:03 AM on January 15, 2013 [1 favorite]


The longest word used more than once is: [removed for conciseness] I don't know why that would be.

These have got to be because of users quoting the prior comment, right? Used more than once in different threads would be the better metric?
posted by nobody at 6:08 AM on January 15, 2013


Ahah!

There are loads of instances of nonsense words in the longest word lengths, but specifically lots of chains of c's and 0's and 8's.

I have found it!

posted by Just this guy, y'know at 6:35 AM on January 15, 2013 [2 favorites]


A few years ago, I ran across an obsolete word randomly in the OED (benefit of still being a college student) that meant "lowly and despicable coward". I thought it was something like "Calthwain". I spent some time last night looking for it again in the OED, but now I can't find it at all. Do they ever remove words from the OED? Did I just imaginate the whole thing?

It also occurred to me that my big word above is going to cause VARCHAR length problems for people in the future trying to import the next batch of data into a database.

Sorry about that.
posted by double block and bleed at 6:54 AM on January 15, 2013


The longest word used more than once is:

One thing to keep in mind for weird nonsense words that should be hapax legomena but are instead (what the internet suggests just now are called) dis or tris or tetrakis legomena: the table generation script makes no effort to distinguish quoted material in any comment from original material.

So someone can type apparent nonsense, and then someone else can quote that nonsense and reply to it, and the nonsense will be counted twice.

Automatically identifying quoted material is an interesting problem but also a hard one, and while I might get at it some day for now it's a big complication with unpredictable results, so these tables take the naive approach and just count everything that's not html attribute text.

So.. any hope for bigrams?

hope is the thing with feathers

I have code that will generate n-grams for arbitrary n and I hope to do something useful with it. On the short term, it's a logistical challenge because even bigram tables are hugely larger in size than the current tables; my perl script literally crashes in the current (rather patchy) dev environment before it finishes compiling a single bigram table for a significant chunk of the db.

I have successfully calculated bigrams and trigrams for small subsections, though; here for example is a (somewhat out of date) 3-gram frequency table for my metatalk comments, if you want something to poke at.

I'll revisit the general bigram/trigram problem at some point; our dev environment has gotten a little more resourceful since 2011 apparently, since I was able to run the largest of the 1-gram tables the other day successfully whereas it was a hacky stitch-together problem last time.
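
In the meantime, for anyone who wants to experiment at small scale, here's a toy sketch of the counting step (this is not the perl that builds the real tables, just the general idea; the memory blowup mentioned above is this dict getting enormous):

import re
from collections import Counter

def bigram_counts(texts):
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z0-9'-]+", text.lower())
        counts.update(zip(words, words[1:]))   # adjacent word pairs
    return counts

print(bigram_counts(["hope is the thing with feathers",
                     "hope is the thing"]).most_common(3))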
posted by cortex (staff) at 8:40 AM on January 15, 2013


Mister_A: "No man can kill MetaFilter!"

"Do not pursue Cortex! He will not return to these lands. Far off yet is his doom, and not by the hand of man shall he fall."
posted by Chrysostom at 9:05 AM on January 15, 2013 [1 favorite]


Some of the two-bangers are pretty funny, though. (jackassitude and teapocalypse, I'm looking at you...)

Probably you shouldn't neglect the effect of quoting in replies here.

On non-preview, I skimmed right past cortex's most recent comment.
posted by fantabulous timewaster at 9:11 AM on January 15, 2013


I thought it was something like "Calthwain".

Could it be caitiff?
posted by Lorin at 9:12 AM on January 15, 2013 [2 favorites]


(One fun implication of n-gram tables is that you can turn those very directly into Markov chains, which are in a reductive sense nothing but a frequency table plus a little bit of structural whammy to put it into an efficient lookup table. So you could take my 3-gram file there and build a simple cortex bot really easily. Back when we had Markovfilter, it was doing essentially just that on the fly for arbitrary users. Some day maybe we'll do that again!)
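
A toy version of that structural whammy, for anyone playing along at home (it assumes you've already parsed a 3-gram table into count-plus-phrase pairs; the actual file's columns may differ):

import random
from collections import defaultdict

def build_chain(trigrams):
    # Map each two-word prefix to the words that can follow it, with counts as weights.
    chain = defaultdict(list)
    for count, gram in trigrams:
        w1, w2, w3 = gram.split()
        chain[(w1, w2)].append((w3, count))
    return chain

def babble(chain, length=12):
    # Walk the chain from a random starting prefix, sampling by frequency.
    state = random.choice(list(chain))
    out = list(state)
    for _ in range(length):
        options = chain.get(state)
        if not options:
            break
        words, weights = zip(*options)
        nxt = random.choices(words, weights=weights)[0]
        out.append(nxt)
        state = (state[1], nxt)
    return " ".join(out)

example = [(5, "i am not"), (3, "am not sure"), (2, "not sure about")]
print(babble(build_chain(example)))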
posted by cortex (staff) at 10:25 AM on January 15, 2013 [2 favorites]


> Some day maybe we'll do that again!

DO IT NOW
posted by languagehat at 11:00 AM on January 15, 2013 [4 favorites]


Enough of your rhodomontade, you jackanapes!
posted by y2karl at 1:20 PM on January 15, 2013


Only if it intersperses Emily Dickinson quotations !
posted by ersatz at 1:23 PM on January 15, 2013


Lorin: "I thought it was something like "Calthwain".

Could it be caitiff?
"

Yes, that's it. I wonder how I managed to butcher it so badly in my memory?
posted by double block and bleed at 3:09 PM on January 15, 2013


cortex, is any portion of the comment corpus available? I don't see it above, and it'd be fun to stress-test some code I've got lying around: a multiprocessing-based distributed n-gram generator, written in Python, that I originally wrote for seeding password-cracking algorithms and that might be good for running on EC2.
posted by TheNewWazoo at 4:35 PM on January 15, 2013


There's no intermediary comment-text corpus available, no; these tables are all generated directly from queries of the comment tables in the database itself. If you want to drop me a line directly about any mefi-specific one-off content dump or something for testing purposes, we can hash out possibilities though.
posted by cortex (staff) at 4:47 PM on January 15, 2013


To celebrate this, I asked pentametron to make

pentafilter
and
ask pentafilter

(OK, I didn't actually use the frequency tables for this.)
posted by moonmilk at 5:27 PM on January 15, 2013 [4 favorites]


Great felching Moses, you've fed pentametron a corpus of mefi/askme post titles? WORLDS COLLIDING.
posted by cortex (staff) at 5:30 PM on January 15, 2013 [1 favorite]


Rhomboid, that's great, thanks!!! I'll have some time to play around with it later today and let y'all know how it works. I've been learning Python through Codecademy anyway, and this gives me the perfect excuse to mess around with some meaningful code (that's some academic dirty talk right there, woo).

"It also occurred to me that my big word above is going to cause VARCHAR length problems for people in the future trying to import the next batch of data into a database."

Not if you use NVARCHAR or NVARCHAR2 and set the char limit to 2000. This may be an Oracle thing only though, I'm not sure.
posted by iamkimiam at 10:45 PM on January 15, 2013



21038 33.06 beans
17626 27.7 plate
4995 7.85 ianal
3741 5.88 ianad
3058 4.81 dtmfa
2840 4.46 viking
2683 4.22 snowflake
1434 2.25 hurf
1358 2.13 durf

posted by RobotHero at 11:48 AM on January 16, 2013 [1 favorite]


How do I find these 76 posts that used "hurf" without a corresponding "durf?"
posted by RobotHero at 10:50 PM on January 16, 2013


Site search: hurf -durf.
posted by cortex (staff) at 11:13 PM on January 16, 2013


Are you staff peoples aware of any usage of the infodump for general data analysis research or other such purposes, and if so, have you gotten feedback on how the infodump rates compared to other datasets they have available?
posted by Anything at 5:46 AM on January 17, 2013


I don't know of any specific on-going use of it, but it's been referenced (and/or I've heard from folks wanting to talk about it) in one-off research contexts a few times. iamkimiam is using the corpus stuff directly in her current work, and a couple other folks at least have used it in previous academic work. I gave a little talk about it when Kim and lewistate and DiscourseMarker invited me along to an internet research conference thing last year.

Mostly I am just not aware of other sites doing similar things. It seems like it's mostly just been (a) an available API for (sometimes pretty limited) direct querying/fetching or (b) nothing at all. It'd be neat to see it be more of a done thing; I think the huge amount of aggregate data lurking in the skeleton of large community activity is really interesting and would be really neat to be able to compare across more than one site's body of statistics.
posted by cortex (staff) at 8:28 AM on January 17, 2013


Site search: hurf -durf.

Alright, it's mostly people saying "hurf durfing" or "hurf durfery" so it looks like I don't have to type "durf" 76 times to close out all the hurfs.
posted by RobotHero at 9:06 AM on January 17, 2013


Between this and this I'm hearing REM singing "Everybody Hurfs".
posted by cortex (staff) at 9:14 AM on January 17, 2013


iamkimiam, got any anecdotal data on the pronunciation of hurf durf?
posted by ersatz at 9:19 AM on January 17, 2013


cortex: Site search: hurf -durf.

Oh, hey, another search results highlighting bug: the word "not" is highlighted there.
posted by stebulus at 12:06 PM on January 17, 2013


To celebrate this, I asked pentametron to make

pentafilter
and
ask pentafilter


Wait, the guy who made pentametron is MeFi's Own? I salute you, sir.
posted by escabeche at 3:58 PM on January 29, 2013


I'm taking a data analysis course, so I'm practicing my R on this.

I used this to merge all the yearly files into one:

folderPrefix = "allsites";
dataPrefix = folderPrefix;
firstYear = 1999L;
lastYear = 2012L;
skipLines = 3L;

secondYear = firstYear+1L;
thirdYear = secondYear+1L;


# Read each yearly table into its own data frame (allsites1999, allsites2000, ...).
for (y in firstYear:lastYear) {
    folderpath = paste0(".\\", folderPrefix, "--", y, "-01-01--", y+1, "-01-01.txt");
    filename = list.files(folderpath);
    filepath = paste0(folderpath, "\\", filename);
    columnNames = c(paste0("count",y), paste0("PPM",y), "word");
    columnClasses = c("integer", "numeric", "character");

    assign(paste0(dataPrefix,y), read.table(filepath, sep="\t", quote="\"", skip=skipLines, col.names=columnNames, colClasses=columnClasses));
}

# Merge the yearly frames on the word column, keeping words that appear in any year.
assign(paste0(dataPrefix,"Merged"), merge(get(paste0(dataPrefix,firstYear)), get(paste0(dataPrefix,secondYear)), by="word", all=TRUE));

for (y in thirdYear:lastYear) {
    assign(paste0(dataPrefix,"Merged"), merge(get(paste0(dataPrefix,"Merged")), get(paste0(dataPrefix,y)), by="word", all=TRUE));
}

write.table(get(paste0(dataPrefix,"Merged")), paste0(folderPrefix,"Merged.csv"), quote=FALSE, sep=",", na="0", row.names=FALSE);


Then I can read it back in:



csv5rows = read.csv(paste0(folderPrefix,"Merged.csv"), quote="\"", nrows = 5)
columnClasses = sapply(csv5rows, class)
assign(paste0(dataPrefix, "Merged"), read.csv(paste0(folderPrefix, "Merged.csv"), quote="\"", colClasses=columnClasses))


And make some lists of the columns I want to check:

ppmList = c()
for (y in 1999L:2012L) {
    ppmList = append(ppmList, paste0("PPM",y));
}

countList = c()
for (y in 1999L:2012L) {
    countList = append(countList, paste0("count",y));
}


(For a sanity check, every PPM column sums to one million, so I'm pretty sure I didn't leave anything out.)

Once that's done, I can pull individual words easily enough:


allsitesMerged[allsitesMerged$word=="viking",countList]


0 13 59 30 29 52 118 128 743 317 375 347 314 315


Later in the class, I think we learn to plot the pretty pictures.
posted by RobotHero at 7:58 PM on January 31, 2013


For posterity, the code here will generate n-gram frequency tables for 1- through 6-grams, not just the unigrams in the files above: https://github.com/wiseman/mefingram
posted by jjwiseman at 2:08 PM on February 4, 2013

