Now let's not everybody talk at once...
December 10, 2010 4:07 PM

Infodump request: what is the largest number of (non-deleted) posts over the shortest period of time at MeFi (inspired by the recent 6 posts in 16 minutes, of which mine was one)? Were there any memorable incidents, accompanied by Mods going OMGWTFBBQ? And have we ever broken out what hours are generally the busiest for Mefi posts (and if so, why do I not remember)? How about similar data for Ask, MeTa, etc. if it's not a bad thing to be doing on a Friday afternoon (it probably is, forget I asked).
posted by oneswellfoop to MetaFilter-Related at 4:07 PM (26 comments total) 3 users marked this as a favorite

I don't have time to dig into any of those questions right this moment, but since we're talking about infodumpery and I've been messing around with this stuff today, here is a 5 megabyte zip file of word-frequency information for about 96 million words from the metafilter comment tables.

Rough stuff, needs more work on a couple different fronts and I'll have something more comprehensive and some documentation for it in the near future, but for the seven people who are really into this sort of thing, there you go. There's about 976K distinct words in that thing, in descending order of frequency of appearance.
posted by cortex (staff) at 4:16 PM on December 10, 2010 [1 favorite]
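[A word-frequency list like the one cortex links could be built along these lines. This is a minimal sketch, not cortex's actual pipeline: the function name and the crude lowercase tokenization are illustrative assumptions, and a real corpus build would need much more careful token handling.]

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Count word occurrences across an iterable of comment strings.

    Tokenization here is a crude lowercase split on runs of letters
    and apostrophes; a real corpus build needs far more care.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Toy input standing in for the ~96 million words of comment text.
comments = ["Good post!", "A good, good day for MetaFilter."]
freqs = word_frequencies(comments)

# most_common() yields (word, count) pairs in descending frequency,
# the same ordering as the downloadable list.
print(freqs.most_common(3))
```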


Waxy used to keep track of this sort of thing. I miss those charts.
posted by Gator at 4:17 PM on December 10, 2010


No prob, cortex, I realized it was Friday halfway through writing it, but, hey, if you have some time next week... (I was also one of three posts in two minutes... it felt a little like the Cyriak video posted earlier today)
posted by oneswellfoop at 4:31 PM on December 10, 2010


The stats page in my Google Reader account says Thursday is the busiest day for the blue and 3-4:00 p.m. EST the busiest time, at least over the last month.

Here are some similar charts for AskMe from earlier this year the last time this sort of question came up (all times Central):

Average number of AskMe questions per hour

Average number of AskMe questions per day of the week

Daily AskMe activity over the last month
posted by Rhaomi at 4:43 PM on December 10, 2010


I did some graphs of site activity by hour, day of the week, and day of the month a while back, so that might answer part of your question.

This part of the question:
what is the largest number of (non-deleted) posts over the shortest period of time at MeFi
... is, I think, a little tricky. I think it's necessary to pick either a time interval, or a number of posts, to find the maximum of the other. If you just try to do something like maximum posts/hour, you find it's infinite (1 post in 0.0 hours). But something like fastest 10 posts, or most posts in 1 hour, could be done pretty easily.
posted by FishBike at 6:07 PM on December 10, 2010 [3 favorites]
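[FishBike's "fastest N posts" framing can be sketched as a sliding window over sorted timestamps: for a fixed run length k, the fastest run is the window of k consecutive posts with the smallest time span. The data below is hypothetical, not drawn from the actual post tables.]

```python
from datetime import datetime, timedelta

def fastest_run(timestamps, k):
    """Return (start_index, span) of the fastest run of k posts.

    Scans a sorted list of post timestamps with a window of width k
    and keeps the window covering the smallest time span.
    """
    times = sorted(timestamps)
    best_start, best_span = None, None
    for i in range(len(times) - k + 1):
        span = times[i + k - 1] - times[i]
        if best_span is None or span < best_span:
            best_start, best_span = i, span
    return best_start, best_span

# Hypothetical timestamps: six posts, three of them within two minutes.
base = datetime(2010, 12, 10, 16, 0)
posts = [base + timedelta(minutes=m) for m in (0, 30, 31, 32, 50, 90)]
start, span = fastest_run(posts, 3)
print(start, span)  # window starting at the second post, 2 minutes wide
```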


here is a 5 megabyte zip file of word-frequency information for about 96 million words from the metafilter comment tables.

There are some fantastic neologisms with a frequency of 1 at the end of that table...

manimalbut
manifistations
maleanswersyndrome
euro-mullet

etc
posted by unSane at 6:55 PM on December 10, 2010


FishBike: "I did some graphs of site activity by hour, day of the week, and day of the month a while back, so that might answer part of your question."

What's with the spike of users "leaving" in this graph? Did a lot of members suddenly start jumping ship in 2009?
posted by Rhaomi at 7:10 PM on December 10, 2010


What's with the spike of users "leaving" in this graph? Did a lot of members suddenly start jumping ship in 2009?

Nah, that's just kind of an artifact of the measurement process. There's no direct way of telling that a user has stopped being active on the site permanently. So I considered that a user has "left" the site in whichever month they were most recently active. So the right-hand side of that graph is really driven more by how often most users are active than by how many have actually left the site.

In fact, if you look at the very last month, it clearly shows that MetaFilter doesn't have any active users any more--all the remaining ones left in the last month! Which is true, given that at the time I made that graph, no one had been active in the month following that. But only because it hadn't happened yet.

I should probably try to link back those graphs to the original MetaTalk threads that prompted them, where a lot of these caveats are discussed.
posted by FishBike at 7:19 PM on December 10, 2010


So I worked up something that analyses bursts as a function of run length. Here's the result:

Fastest run of 2 posts was 1 second on 2000-03-30 [0.5 seconds per post]
    1158, 1159

Fastest run of 3 posts was 2 seconds on 2006-04-20 [0.7 seconds per post]
    51048, 51049, 51050

Fastest run of 4 posts was 1 minute and 21 seconds on 2006-10-01 [20.2 seconds per post]
    55200, 55201, 55202, 55203

Fastest run of 5 posts was 2 minutes and 45 seconds on 2006-04-20 [33.0 seconds per post]
    51046, 51047, 51048, 51049, 51050

Fastest run of 6 posts was 9 minutes and 5 seconds on 2004-12-16 [90.8 seconds per post]
    37880, 37881, 37882, 37883, 37884, 37885

Fastest run of 7 posts was 13 minutes and 1 second on 2006-04-20 [111.6 seconds per post]
    51046, 51047, 51048, 51049, 51050, 51051, 51052

Fastest run of 8 posts was 18 minutes and 53 seconds on 2006-04-20 [141.6 seconds per post]
    51046, 51047, 51048, 51049, 51050, 51051, 51052, 51053

Fastest run of 9 posts was 38 minutes and 55 seconds on 2001-09-12 [259.4 seconds per post]
    10114, 10115, 10116, 10117, 10118, 10119, 10120, 10121, 10122

Fastest run of 10 posts was 43 minutes and 58 seconds on 2001-09-12 [263.8 seconds per post]
    10113, 10114, 10115, 10116, 10117, 10118, 10119, 10120, 10121, 10122

Fastest run of 11 posts was 55 minutes and 22 seconds on 2001-09-13 [302.0 seconds per post]
    10198, 10199, 10200, 10201, 10202, 10203, 10204, 10205, 10206, 10207, 10208

Fastest run of 12 posts was 1 hour and 18 seconds on 2001-09-13 [301.5 seconds per post]
    10198, 10199, 10200, 10201, 10202, 10203, 10204, 10205, 10206, 10207, 10208, 10209



Essentially all runs of 9 posts and longer are dominated by the days after 9/11. I tried running it out to 20 but they were all the same bunch so I stopped the report at 12.
posted by Rhomboid at 1:29 AM on December 11, 2010 [3 favorites]


...which only confirms that mathowie was behind 9/11.
posted by gman at 5:31 AM on December 11, 2010


Cortex! How random, fortuitous and perfect! This is probably the most awesome unintentional holiday gift that I will ever receive (or download?)! I'm currently (like, this weekend) typing up a methodology for internet corpus research I'm (re)doing. I've been digging around in Westbury labs' word frequency lists and was hoping to find a way to emulate the same thing somehow with MetaFilter. Their usenet list looks very much like what you've created and that is so amazing, I can hardly stand it (because now I can compare their usenet list with the MeFi list with their Wikipedia list with a COCA speech data list with the RegEx/Amer. Heritage Dictionary list and OMG my head asplodes)!

Would you be able to elaborate a little bit more on how you went about creating the list, and maybe some meta data about it? (The info provided on the Westbury labs page is a good model of the background information I'm wondering about.) This will allow the 7 (and counting I hope!) of us to describe, use, and cite the list in our research. Also, I'm just curious. Thank you!!!

(I now reread your comment and see that you said you're going to be working on the documentation in the near future. So totally awesome...I'm so giddy about data! Thanks again!)
posted by iamkimiam at 7:19 AM on December 11, 2010


One more thing...would it be impossible to add a freq. per million column for each word in the list? Pretty please (if it's not a PITA)?
posted by iamkimiam at 7:21 AM on December 11, 2010


We could easily break all those records. Everybody post something [good] at noon today.
posted by beagle at 7:50 AM on December 11, 2010


Neat, Rhomboid. Interesting but I guess not really surprising that so many of those bests cluster to a couple distinct runs; aside from the 9/12 and 9/13 blasts, the 4/20/06 thing grabs four of those top twelve slots as well.

Which, well, hijinks: that was also the day of fustian, so at least some of the grouping was artificial. I remember that whole thing vaguely but fondly. If you check out the metatalk archive for April 2006, you can see that a metatalk about that was in fact only the first on a busy and weird day.

I've been digging around in Westbury labs' word frequency lists and was hoping to find a way to emulate the same thing somehow with MetaFilter.

Oh neat! For my part, having gotten some basic freq chart stuff proofed out now, I've been cramming for practical resources on actually approaching the project with some more care and finesse. If there's such a thing as Assembling A Corpus Out Of Electronic Conversational Text From Scratch For Dummies, I need to read it. I did trip across Developing Linguistic Corpora last night which looks like it might be helpful, but I've got a lot of details to consider still in how best to filter the input and present the output.

would it be impossible to add a freq. per million column for each word in the list? Pretty please (if it's not a PITA)?

It's totally possible. It will balloon the filesize somewhat, which isn't a huge deal but was the main reason I hadn't bothered yet. But since I've never done any real work with existing corpora, I don't know what's practically useful vs. not in presentation.
posted by cortex (staff) at 8:04 AM on December 11, 2010 [1 favorite]
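[The frequency-per-million column iamkimiam asks for is just each raw count scaled by the corpus size. A minimal sketch, using the two counts quoted later in the thread plus a made-up remainder so the toy corpus totals 96 million tokens:]

```python
def per_million(freqs):
    """Convert raw counts to (count, frequency-per-million) pairs.

    freqs maps word -> raw count; the per-million rate lets lists
    built from corpora of different sizes (MeFi vs. Usenet vs.
    Wikipedia) be compared directly.
    """
    total = sum(freqs.values())
    return {word: (count, count * 1_000_000 / total)
            for word, count in freqs.items()}

# "good" and "bad" counts are from the thread; "other" is padding
# chosen so the toy corpus sums to exactly 96 million tokens.
sample = {"good": 197281, "bad": 59975, "other": 95742744}
rates = per_million(sample)
# "good" works out to roughly 2055 occurrences per million tokens.
print(rates["good"])
```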


This is weird. According to cortex's list, the word "matt" shows up just once... that can't be right, eh?

Also, as per unsane's list:

ultrahypermegawidget
pachydermititis
microphone-in-the-pants
contentiousnesses


Also also, I find it very interesting that the word "good" is used 197281 times, while the word "bad" is used only 59975 times. I wonder if that's because we are lazy when talking about things we like or because there are more adequate alternatives to "bad" than there are to "good".

(I really don't understand why the period should have gone inside the quotation mark there.)
posted by Night_owl at 10:20 AM on December 11, 2010


the word "matt" shows up just once

More likely, the word "matt" with some unprintable unicode character attached to it shows up only once. I don't have the file handy at the moment but I'd assume "matt" actually shows up earlier in the list as well on the order of hundreds or thousands of times.

I wonder if that's because we are lazy when talking about things we like or because there are more adequate alternatives to "bad" than there are to "good".

It might also be an expression of "good" being more overloaded than "bad" in terms of kinds of use. We say "a good deal of discussion" but we never say "a bad deal of discussion", for example.
posted by cortex (staff) at 10:31 AM on December 11, 2010
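[cortex's diagnosis — "matt" plus an invisible character counting as a separate word — is the kind of thing token normalization catches. A sketch of one way to scrub tokens before counting; this is illustrative, not what the actual list does:]

```python
import unicodedata

def clean_token(token):
    """Strip control/format characters and case-fold a token.

    Characters in Unicode categories Cc (control) and Cf (format,
    e.g. the zero-width space) are the usual invisible culprits that
    make "matt" and "matt\u200b" count as two different words.
    """
    token = unicodedata.normalize("NFKC", token)
    return "".join(ch for ch in token
                   if unicodedata.category(ch) not in ("Cc", "Cf")).lower()

# Both variants collapse to the same dictionary key.
print(clean_token("matt\u200b") == clean_token("Matt"))
```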


I've always been fond of October 8th, 2006, or as I like to call it, Day of the Elephants.
posted by gwint at 11:13 AM on December 11, 2010 [2 favorites]


Out of curiosity I reran the analysis with 2006-04-20 and the days following 9/11 excluded to see if things changed much. I also let it run out to length 50, since that is the number of posts shown on the front page -- I wanted to see what the shortest time it took the entire front page to scroll off. Here are the results.
posted by Rhomboid at 3:29 PM on December 11, 2010


I remember the time everyone wanted to make a post about some new product that vibrated -- Was it a broomstick? -- and one by one the luckless members would go off and post their title of "This broomstick . . . it vibrates?" and one by one the posts would be vanquished and deleted by the mods. Then the next poster would search for a post about vibrating broomsticks and not find one, and make their own special snowflake post about "This broomstick . . . it vibrates?" and the cycle began again. The whole vibrating broomstick fiasco was eventually sidebarred. Good times!
posted by onlyconnect at 5:42 PM on December 11, 2010


We could easily break all those records. Everybody post something [good] at noon today.

Please do not break the mods.
posted by ninazer0 at 8:39 PM on December 11, 2010


This is neat stuff, many thanks cortex and rhomboid! A collocation analysis of the infodump would be really interesting I think. I'm actually trying to do this with much smaller documents (articles, etc.) on a Mac. I've been trying out Nisus as it has a macro for frequency counts you can use in the word processor, but I'd really like to branch out into collocation and also parts-of-speech (e.g. sorting for verbs, nouns, etc.). So if anyone has any ideas for Mac tools ... (I was going to post an AskMe question, but this thread seems a good place to do this ... apologies if not)
posted by carter at 4:24 AM on December 12, 2010


Heh. I'm sort of scrambling to get familiar with tools and techniques too, carter, and I'm considering an askme as well. A lot of eyes over there that won't necessarily see this one metatalk thread, is my reasoning. So, I think you're good on either front.

Collocation stuff is on my list for this mefi stuff as well; I've actually done that on a small scale indirectly a couple times, since that's more or less the brains of a markov chain analysis, but haven't tried it at the scale of the mefi db comment tables (which is, if you add it all up, getting close to one billion words).

Parts-of-speech tagging I've never tried.
posted by cortex (staff) at 6:12 AM on December 12, 2010
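[The collocation groundwork cortex describes — the same machinery as a first-order Markov chain — starts from counts of adjacent word pairs. A minimal bigram-counting sketch over toy input; collocation statistics proper would then compare these counts against the words' individual frequencies:]

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs: the raw material both for
    collocation statistics and for a Markov-chain text generator."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "a good deal of discussion and a good deal of noise".split()
pairs = bigram_counts(tokens)
print(pairs[("good", "deal")])  # the pair occurs twice
```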


I'm super busy atm, so I can't look into this just yet, but there are several great online tools for automatically tagging and parsing part-of-speech (POS) data in a corpus. I'm new to this as well, but I have run across these things before and I have a couple books on corpus linguistics that even step-by-step you thru it. When I have some more time I'll check it all out (I need to do that anyway, and this gives me super good motivation).

But an AskMe would be just swell, too, as that would probably return answers from people who have actually *done* it already and know the pitfalls.

(Come to think of it, I definitely have articles on this...the guy who worked on the Penn Treebank, Keith Johnson, is an amazing font of knowledge here. Also, Biber, Conrad. I think the way that that parser works is by dictionary comparisons and statistical analysis of the collocates...but there's still a shitton of cleanup...but what's very nice is that internet data doesn't have the disfluencies, interruptions, etc. But spelling errors, polysemy, acronyms, and symbols are a whole 'nother ball of wax tho.)
posted by iamkimiam at 6:46 AM on December 12, 2010
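[The "dictionary comparisons" part of the tagging approach iamkimiam describes can be sketched as a most-frequent-tag lookup: assign each word its usual tag from a lexicon and fall back to a default for unknowns. Real taggers (e.g. those trained on the Penn Treebank) add the statistical, context-sensitive layer on top of this baseline. The lexicon below is a tiny hand-built assumption, not a real resource.]

```python
def tag_tokens(tokens, lexicon, default="NN"):
    """Assign each token its most common tag from a lexicon, falling
    back to a default tag for unknown words: the baseline that
    statistical taggers improve on with contextual models."""
    return [(tok, lexicon.get(tok.lower(), default)) for tok in tokens]

# Tiny illustrative lexicon using Penn Treebank tag names.
lexicon = {"the": "DT", "this": "DT", "broomstick": "NN",
           "vibrates": "VBZ", "it": "PRP"}
tagged = tag_tokens("This broomstick vibrates".split(), lexicon)
print(tagged)  # [('This', 'DT'), ('broomstick', 'NN'), ('vibrates', 'VBZ')]
```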


I have direct personal experience with Weka, both as a standalone executable and as an API I've integrated into other Java code. It's pretty spiffy once you figure out the somewhat-quirky interface. Note that it's GPL, so anything you create that links to it is covered by the GPL. As far as I know, you can safely use the output of the standalone for other purposes, though (IANAL).
posted by Alterscape at 8:00 AM on December 12, 2010


Sounds good! I'll put together some ideas for an AskMe, if that's okay with everyone. One thing I'll emphasize is that we are working on very different scales; cortex mentions 10^9 words (drawn from the Metafilter logs/database), whereas I would consider 10^5 words (drawn from naturally occurring examples of real-time human interaction) as a large sample. Also we may be looking for different things in the data. Anyway I will think of something to post, post it, and then other folks can chime in with further requirements.

FWIW I was actually using this tool for analysis, it was useful but licensing issues were kind of a pain for various reasons, so a freeware version would be great.
posted by carter at 9:08 AM on December 12, 2010 [1 favorite]


More mefiwhacks:

zebraphiliacs
zestiness ("... full, glorious Unicode 3.0, now with the untamed beauty and Hellenic zestiness of polytonic!")
zhunk-zhu-zhunga-zhu-zhunk (Gotta love onomatopoeia.)
zillionfold
zionist-blindered
zipstrapping
zlotniks (Not just any zlotniks, but Swahili Zlotniks!)
zoidbergian
zombie-shuffling-through-molasses
zombiologists
zompocalyptic (Not just zompocalyptic, but post zompocalyptic.)
zomgfilter
zooish
zonkerspeak
zorofustianism (Bet you can't guess what day this one popped up on!)
zorped
zugula
posted by ErWenn at 8:47 PM on December 12, 2010


