Metafilter n-gram viewer February 1, 2013 3:42 PM

The recent meta post about weird tags got me playing again with the infodump, and I was finally motivated to create something I've wanted for a while: A clone of the Google Books N-Gram viewer, but for Metafilter: http://mefingram.appspot.com/.

Look at trends in metafilter cliches. See which presidential candidates we like to discuss the most, and which vice presidential candidates. How has our interest in the giants of the internet changed over the years?

What social networks have been giving people trouble for the past 10 years? Which browsers? When did we first start wondering if it was safe to eat something?

The most common 6-grams in askme titles give you some idea of the canonical askme questions:

what is the best way to
what do i need to know
should i stay or should i
how do i get rid of
what is the name of this
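For anyone curious how a list like this can be produced, here is a minimal sketch in Python. It is not the mefingram code: the function names, the whitespace tokenizer, and the sample titles are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def most_common_ngrams(titles, n, k):
    """Count n-grams across all titles and return the k most frequent."""
    counts = Counter()
    for title in titles:
        # Assumption: lowercase and split on whitespace. The real viewer
        # may tokenize differently (for example, around punctuation).
        counts.update(ngrams(title.lower().split(), n))
    return counts.most_common(k)


# Hypothetical sample titles, not taken from the infodump.
sample_titles = [
    "What is the best way to learn Python?",
    "What is the best way to move a piano?",
    "How do I get rid of ants?",
]
top = most_common_ngrams(sample_titles, 6, 1)
```

Here `top` holds the single most frequent 6-gram and its count. Titles shorter than n words simply contribute nothing.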

The source code for the viewer is available on GitHub at https://github.com/wiseman/mefingram.
posted by jjwiseman to MetaFilter-Related at 3:42 PM (38 comments total) 27 users marked this as a favorite

This is really neat, thanks.
posted by atrazine at 3:49 PM on February 1, 2013


NEAT!!!!!

(Sorry, but that was my actual out loud reaction so I had to share it.)

snowflake, special, special snowflake
posted by MCMikeNamara at 3:51 PM on February 1, 2013




Did somebody say 'neat' yet?
posted by box at 4:03 PM on February 1, 2013


neat, neato, neat-o
posted by MCMikeNamara at 4:07 PM on February 1, 2013


Awesome work, jjwiseman. Interesting but I guess not really surprising on reflection that there's enough title data alone to do something interesting with.

If you're interested in playing around with a muuuuuch larger dataset, I could look into creating some one-off n-gram files for comment content for you to try incorporating into this.
posted by cortex (staff) at 4:08 PM on February 1, 2013 [2 favorites]


hmm...seems punctuation throws it off. "google+" appears to swallow results for "google"
posted by juv3nal at 4:10 PM on February 1, 2013


Features I'd like to add:
  • Views of the most common/least common n-grams.
  • "Auto-suggest": type some words, then see the most likely/least likely next words.
I'd also love to do this for all the text on metafilter: post bodies and comments. But that amount of data starts to become difficult to deal with, and would also best be done with cooperation from the mods. On preview: yess.
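The "auto-suggest" idea above could be prototyped roughly as follows. This is only a sketch under assumptions: a fixed two-word context, whitespace tokenization, and hypothetical names (`build_next_word_table`, `suggest`) that do not come from the mefingram source.

```python
from collections import Counter, defaultdict


def build_next_word_table(titles, context_size=2):
    """Map each run of context_size words to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for title in titles:
        tokens = title.lower().split()
        for i in range(len(tokens) - context_size):
            context = tuple(tokens[i:i + context_size])
            table[context][tokens[i + context_size]] += 1
    return table


def suggest(table, typed, context_size=2, k=3):
    """Return up to k most likely next words after the last context_size typed words."""
    context = tuple(typed.lower().split()[-context_size:])
    # .get avoids adding empty entries to the defaultdict for unseen contexts.
    followers = table.get(context, Counter())
    return [word for word, _ in followers.most_common(k)]


# Hypothetical titles for illustration only.
table = build_next_word_table([
    "should i stay or should i go",
    "should i stay home",
    "should i quit",
])
suggestions = suggest(table, "should i")
```

The least likely next words would come from the other end of the same counts. On a full comment corpus the table would get large, which is part of why that amount of data is harder to deal with.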
posted by jjwiseman at 4:10 PM on February 1, 2013


juv3nal: Yes, see "How does the n-gram viewer handle punctuation?". Ideally it would be nice to be able to choose whether punctuation is significant (imagine being able to search for "metafilter:").
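As a rough illustration of what "choosing whether punctuation is significant" might mean, here is a sketch of a tokenizer with a switch. The function name, the flag, and the exact rules are assumptions; the viewer's actual handling is described in its FAQ, not here.

```python
import re


def tokenize(text, keep_punctuation=False):
    """Split text into lowercase tokens.

    With keep_punctuation=False, punctuation is discarded, so "google+"
    and "google" produce the same token. With keep_punctuation=True,
    each whitespace-separated chunk is kept whole, so they stay distinct.
    """
    text = text.lower()
    if keep_punctuation:
        return text.split()
    # Keep runs of word characters (letters, digits, underscore) and
    # apostrophes; everything else acts as a separator.
    return re.findall(r"[\w']+", text)
```

With the flag off, `tokenize("Google+ vs. Google")` gives `['google', 'vs', 'google']`; with it on, a search for "metafilter:" could match only titles where the colon is present.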
posted by jjwiseman at 4:12 PM on February 1, 2013


oh my bad. I'd actually read that and somehow glossed over it, I guess because the chart shows the extra line/legend item.
posted by juv3nal at 4:14 PM on February 1, 2013


I have no idea what I am doing, but I like this. I just put "beer, job" in it for Ask and it is sad to see that job has out-performed beer lately. From 2010 - 2012 job has an upward sloping graph while beer is downward sloping. What's up with that folks? Maturity sucks.
posted by JohnnyGunn at 4:23 PM on February 1, 2013




juv3nal, I consider the fact that it didn't correctly give a count for competing n-grams a bug, so thanks for finding that.

BTW, while processing I ran into the following issues with the data in the infodump:

Encoding issues--titles that aren't UTF-8:

2013-02-01 13:55:56,040:INFO: Joining post data for askme...
2013-02-01 13:55:57,125:WARNING: Skipped 375 posts due to UTF8 errors.
2013-02-01 13:56:06,825:INFO: Joining post data for mefi...
2013-02-01 13:56:07,372:WARNING: Skipped 50 posts due to UTF8 errors.
2013-02-01 13:56:12,568:INFO: Joining post data for meta...
2013-02-01 13:56:12,657:WARNING: Skipped 19 posts due to UTF8 errors.
2013-02-01 13:56:13,554:INFO: Joining post data for music...
2013-02-01 13:56:13,583:WARNING: Skipped 30 posts due to UTF8 errors.
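The skip-and-count behavior in the log above could look roughly like this. It is a sketch, not the actual processing script: `decode_titles` and the sample bytes are hypothetical.

```python
def decode_titles(raw_titles):
    """Decode byte strings as UTF-8, skipping any that are not valid.

    Returns a (decoded_titles, skipped_count) pair.
    """
    decoded = []
    skipped = 0
    for raw in raw_titles:
        try:
            decoded.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            skipped += 1
    return decoded, skipped


# Hypothetical input: the second title is Latin-1 encoded, so it is not
# valid UTF-8 and gets skipped.
raw = ["caf\u00e9".encode("utf-8"), "caf\u00e9".encode("latin-1")]
titles, skipped = decode_titles(raw)
```

An alternative to skipping is `raw.decode("utf-8", errors="replace")`, which keeps the post but substitutes a replacement character for the bad bytes.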


postdata_mefi.txt has a bad record for post 113202 due to an embedded newline:

113202 129814 2012-02-24 20:48:48.907 0 3 0 1 This is maybe too random
to achieve traction. -- <a href="http://www.metafilter.com/user/292" id="sig">jessamyn</a>
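One way to work around a record like this before parsing is to glue continuation lines back onto the previous record. The sketch below assumes the file is tab-separated and that every real record starts with a numeric post id followed by a tab; both are assumptions, and `rejoin_records` is a hypothetical helper.

```python
import re

# Assumption: every real record starts with a numeric post id and a tab.
RECORD_START = re.compile(r"^\d+\t")


def rejoin_records(lines):
    """Merge continuation lines into the record they belong to."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # No leading id: treat this as the tail of the previous
            # record and replace the embedded newline with a space.
            records[-1] += " " + line
    return records
```

This heuristic would misfire if a continuation line itself happened to begin with digits and a tab, so it is a repair for this kind of record rather than a general fix.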

posted by jjwiseman at 4:25 PM on February 1, 2013


seems punctuation throws it off.

Which explains why restless_nomad is traveling so under the radar?
posted by jessamyn (staff) at 5:14 PM on February 1, 2013


Maybe we hated Bush more than we love Obama?
posted by double block and bleed at 6:33 PM on February 1, 2013


Depends on the bush.
posted by cjorgensen at 7:08 PM on February 1, 2013


taters, tater, fedoras, fedora.

Man, I can't wait until we can n-gram the whole text corpus and not just the titles!
posted by barnacles at 7:22 PM on February 1, 2013


Looks like we hit peak cortex awhile ago.

And now we are down to seeds and brain stems.
posted by y2karl at 7:51 PM on February 1, 2013 [1 favorite]


Rad!
posted by rtha at 7:58 PM on February 1, 2013


Could someone please release an engram viewer next? I've got a bunch of thetans to audit and my e-meter's busted so an app or something would just be great.
posted by FAMOUS MONSTER at 8:40 PM on February 1, 2013 [1 favorite]


Thank Xenu, I'm not the only one. That is exactly where my brain goes when I hear about N-grams as well.
posted by maryr at 9:18 PM on February 1, 2013


This is fun.

Mefi:

What's your pleasure? (beer, apparently)

Dogs and cats about equally popular

Crisis points

Ask:

The family members who can't talk are the most puzzling

Questions for every occasion (but mainly for weddings and parties)
posted by Orinda at 11:42 PM on February 1, 2013


This is great!
posted by iamkimiam at 12:22 AM on February 2, 2013


I assume everything drops off sharply at the end because we have only had one month in 2013 in which to talk about stuff. Would it make sense to have an option to smooth the data? You could, for example, multiply the results for each year by 12/n where n is the number of months of that year which have elapsed. This would allow any emerging trends to become apparent, especially when the current month falls toward the beginning of a year.
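A minimal sketch of that 12/n adjustment, with a hypothetical function name:

```python
def annualize(count, months_elapsed):
    """Scale a partial-year count to a full-year estimate: count * 12 / n."""
    if not 1 <= months_elapsed <= 12:
        raise ValueError("months_elapsed must be between 1 and 12")
    return count * 12 / months_elapsed
```

So 5 matching titles in January alone would be plotted as an estimated 60 for the year. With only one month of data the estimate is noisy, since a single busy week gets multiplied by twelve.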
posted by tractorfeed at 2:30 AM on February 2, 2013 [1 favorite]




It gets better!
posted by Potomac Avenue at 5:33 AM on February 2, 2013


jessamyn, it looks like restless_nomad just hasn't been mentioned in any post titles yet.

tractorfeed, that's right. I have a bug open for that: https://github.com/wiseman/mefingram/issues/2
posted by jjwiseman at 10:52 AM on February 2, 2013


Awesomesauce! Also the most use of the word 'pony' so far happened in 2011. Whodathunkit?
posted by Faintdreams at 11:36 AM on February 2, 2013


Would it be trivial or possible to be able to click through to see the posts being referenced? Like, what was going on with BP in 2002?

Also, Scottish pedants might not care for this.
posted by cmoj at 12:42 PM on February 2, 2013


cmoj, yes, that is planned. Maybe I'll have a chance to do that this weekend, even: https://github.com/wiseman/mefingram/issues/4.

(Also I noticed that hate is stagnant, love is growing).
posted by jjwiseman at 1:53 PM on February 2, 2013


chicken chicken chicken chicken chicken chicken
posted by mendel at 5:17 PM on February 2, 2013


cmoj, You can now click on the data points for a year and see the first 30 posts that match your query in that year, e.g. http://mefingram.appspot.com/?content=bp&corpus=mefi#2002.
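If you want to generate links like that one yourself, something along these lines should work. The parameter names `content` and `corpus` and the `#year` fragment are read off the example URL above; the helper name is hypothetical, and how the viewer expects multi-term queries to be encoded is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "http://mefingram.appspot.com/"


def year_link(content, corpus, year):
    """Build a viewer URL for a query, with the year as the fragment."""
    query = urlencode({"content": content, "corpus": corpus})
    return f"{BASE_URL}?{query}#{year}"
```

Calling `year_link("bp", "mefi", 2002)` reproduces the example link.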
posted by jjwiseman at 6:21 PM on February 2, 2013


Sweet!
posted by cmoj at 9:58 PM on February 2, 2013


> Would it make sense to have an option to smooth the data? You could, for example, multiply the results for each year by 12/n where n is the number of months of that year which have elapsed.

What would be more useful is to correlate results with post volume, rather than time.

Otherwise, almost every phrase will trend upwards since Mefi has been growing continuously, and you can't gauge which terms are actually tailing off in general usage (in raw counts, 0.01% of all phrases in 2003 is probably significantly fewer than 0.001% of all phrases in 2013).

As an added benefit, there will be less tendency for results to drop to 0 at the end of the graph regardless of when the graph is generated.
posted by ardgedee at 6:35 AM on February 3, 2013




Displaying results in parts per million rather than as raw counts is the default approach to this sort of thing, and is actually what I was assuming this was doing, though I never did actually sanity check that.
posted by cortex (staff) at 7:48 AM on February 3, 2013


I should have been more explicit. The way I was planning on handling relative frequency is exactly how Google handles it. For example, if you search for "pony" in meta, what the chart will show for each year is what percentage of all unigrams in meta are "pony" for that year.
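In code, the calculation described above is roughly the following. This is a sketch, not the viewer's implementation: the function name, the whitespace tokenizer, and the sample data are assumptions.

```python
def relative_frequency(titles_by_year, word):
    """Return {year: share of that year's unigrams that equal word}."""
    shares = {}
    for year, titles in titles_by_year.items():
        tokens = [t for title in titles for t in title.lower().split()]
        # A year with no tokens gets 0.0 rather than a division by zero.
        shares[year] = tokens.count(word) / len(tokens) if tokens else 0.0
    return shares


# Hypothetical titles for illustration only.
sample = {
    2011: ["pony pony request", "a bug"],
    2012: ["pony request", "new feature idea please"],
}
shares = relative_frequency(sample, "pony")
```

Multiply a share by 100 for a percentage, or by 1,000,000 for parts per million. Because each year is divided by its own total, a partial year such as 2013 no longer drops toward zero just for being incomplete.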

By the way, being able to click through to posts has gotten me digging around in the early days of mefi, and it is interesting to see all the ways it was different. A few examples: permalinks had not yet become a big deal, the rules about double posts had not been formalized, and links were allowed in titles. Seeing all the broken links also makes me sad about how much has been lost from the earlier days of the web.
posted by jjwiseman at 12:03 PM on February 3, 2013


The viewer now shows relative frequencies.
posted by jjwiseman at 6:07 PM on February 3, 2013

