Metafilter Infodump: more stats than you can shake a stick at. January 22, 2008 7:58 AM Subscribe
Nerds, start your engines: it's the new Metafilter Infodump.
There's been a lot of requests over the years for crunchy metafilter data (and another spike of requests lately). This is an attempt to make some of that available in a standard form in a reliable, officially sanctioned location. This is a compromise between the fantasy (a full db dump, or a live API for db requests) and the previous reality (nothing!); it has the advantage of actually existing, which I think is a big step up.
Hopefully, this will save enterprising number crunchers some scraping, and lower the bar for entry some for those folks who would like to play with data but don't want to scrape.
All of the files are plain text. Some are quite large. I've listed approximate file sizes for each file. They're not updated automatically at this point, but I can regenerate them as needed, and will if nothing else run the scripts once a month or so.
If there's something specific you'd like to see added to or tweaked on the Infodump, let me know, here or via mail. If you do something cool with the data (or have done cool things with mefi data in the past), also let me know; I'd love to put together a permanent collection of mefi analyses.
Have fun!
standard disclaimer: screw around with this for nefarious purposes and it will go away and never come back.
posted by jessamyn (staff) at 8:02 AM on January 22, 2008 [5 favorites]
She joined way after me!
posted by parmanparman at 8:05 AM on January 22, 2008
That usernames/numbers file reminds me unnervingly of the Vietnam Memorial.
posted by hermitosis at 8:06 AM on January 22, 2008
Slashdot has a tag for this: http://slashdot.org/tags/whatcouldpossiblygowrong
posted by NortonDC at 8:07 AM on January 22, 2008 [1 favorite]
Totally fun. So how do users get into the database? By paying the 5 bucks and becoming active?
posted by shothotbot at 8:07 AM on January 22, 2008
There are only 35000 usernames but the front page claims 65003 users...
Also: Cortex gets an extra hug.
posted by shothotbot at 8:10 AM on January 22, 2008
Myself, well, I could make a hat, or a broach, or a pterodactyl ...
Astrozombie, between Office Space and Airplane, you're on a roll. Welcome to my favorites. Have a seat, hang out.
posted by sneakin at 8:17 AM on January 22, 2008
There are only 35000 usernames but the front page claims 65003 users...
It's people who completed the sign-up process vs people who just started it.
posted by jessamyn (staff) at 8:24 AM on January 22, 2008
It's people who completed the sign-up process vs people who just started it.
Yeah, my interview was pretty rough, and the entrance exam was a killer. Glad I made it in, though.
posted by prophetsearcher at 8:27 AM on January 22, 2008 [2 favorites]
I was just glad I took that prep class. The instructor only spent three days discussing the plate of beans, though. I felt that could have been studied more in depth.
posted by winna at 8:30 AM on January 22, 2008 [8 favorites]
mmmm, 85 MB txt files... nothin' says "ease of use" like an 85MB txt file.
posted by shmegegge at 8:34 AM on January 22, 2008 [2 favorites]
screw around with this for nefarious purposes
::dials nefarious plan down to 'merely cruel' ::
posted by Brandon Blatcher at 8:35 AM on January 22, 2008
(grins, drools happily)
Minor, teeny-tiny pony request -- can we get a "last updated" date/time stamp next to each file on the web page?
posted by Doofus Magoo at 8:36 AM on January 22, 2008
How about some compression on these files? For me the 21MB plain text comment data file compresses to less than 5MB. And wouldn't comma-separated-value format (.csv) be better than just a text dump?
Also: pony.
posted by burnmp3s at 8:36 AM on January 22, 2008
65003 is false advertising! I call shenanigans!
posted by Dave Faris at 8:38 AM on January 22, 2008 [1 favorite]
nothin' says "ease of use" like an 85MB txt file.
You gotta want it, shmegegge.
(Realistically: chunked versions of some of the very large ones could be doable in the long run.)
How about some compression on these files?
That's not a bad idea.
And wouldn't comma-separated-value format (.csv) be better than just a text dump?
Marginally? I'm not sure if there's some secret sauce to the CSV format that would make it significantly more useful/slicker than the current setup, but if folks think it'd be worth the effort I can definitely look at that.
posted by cortex (staff) at 8:41 AM on January 22, 2008
I'm not sure what the hell is going on there, item, but user 23 links you too. Even though they don't exist. Bizarre.
posted by cortex (staff) at 8:45 AM on January 22, 2008
I'm not sure what the hell is going on there, item, but user 23 links you too. Even though they don't exist. Bizarre.
Probably from when I debuted the feature and people were fucking around with it before it got locked down.
posted by mathowie (staff) at 8:47 AM on January 22, 2008
CSV is nice if you do all of your analysis in Excel. But real men use awk | sort | uniq -c | sort -rn.
posted by Plutor at 8:48 AM on January 22, 2008 [3 favorites]
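[For readers who prefer Plutor's pipeline in script form: a sketch of the same frequency count (`sort | uniq -c | sort -rn`) in Python. The column layout here is a made-up example, not the Infodump's actual schema.]

```python
from collections import Counter

def top_posters(lines, column=1, n=5):
    """Count occurrences of one whitespace-delimited column,
    most frequent first -- the uniq -c | sort -rn idiom."""
    counts = Counter(line.split()[column] for line in lines if line.strip())
    return counts.most_common(n)

# Hypothetical rows: postid userid date
sample = [
    "100 42 2008-01-22",
    "101 42 2008-01-22",
    "102 7 2008-01-23",
]
print(top_posters(sample))  # -> [('42', 2), ('7', 1)]
```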
Will someone please crunch the numbers and tell us who is winning MetaFilter?
posted by brain_drain at 8:51 AM on January 22, 2008 [7 favorites]
Also: Cortex gets an extra hug.
Or maybe an extra poke.
The preceding comment inspired by text item from linked site: ("...poke cortex and he'll regenerate it for you") and does not necessarily represent the opinions of Flapjax Industries.
posted by flapjax at midnite at 9:01 AM on January 22, 2008
5435 of davey_darling's 6028 favorites were written by ThePinkSuperhero.
posted by Plutor at 9:07 AM on January 22, 2008
Will someone please crunch the numbers and tell us who is winning MetaFilter?
Matthew Haughey. You haven't figured out how the internet works yet?
posted by TheOnlyCoolTim at 9:11 AM on January 22, 2008
I like my numbers to sit in the milk for a bit so they're not so crunchy.
posted by not_on_display at 9:16 AM on January 22, 2008
This is cooler than rocket fuel. Can we see a postid + tags dump? eg, "1: foo, bar, baz; 2: fiz, foo, bap,..."
posted by ardgedee at 9:19 AM on January 22, 2008 [1 favorite]
I want to understand! What kind of nerd stuff can I look at if I decide to download this 85MB text file?
posted by iamkimiam at 9:28 AM on January 22, 2008
Mmm, tags. Yeah, that's a good idea; I'd love to see some tag mapping action.
So, a running TODO list from above:
- zip (and tar/gzip) compression for these mothers;
- last-updated timestamp on the index page
- tags stats
A few other things I've been thinking about, top-of-mind:
- expanded askme posts stats to include # best answer, category id
- category id for metatalk stats
- post-deletion index (with reasons where available) for various subsites
- flagging stats (listing what was flagged but not by whom; I think details about who is doing any given bit of flagging is a good example of the sort of thing that shouldn't go into the infodump).
posted by cortex (staff) at 9:29 AM on January 22, 2008
To format a file containing a text dump of comments which contain numerous commas in a comma-separated format is to stick your dick into a sausage grinder. Do not make such a n00b-ass mistake. Tab-delimited or pipe-delimited would be marginally better, null-delimited might be best.
posted by Horken Bazooka at 9:29 AM on January 22, 2008 [1 favorite]
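[Horken Bazooka's point, illustrated: a field containing free text with commas spills into extra columns under naive comma-splitting, while a properly quoted CSV row survives. Field values here are invented for the demo.]

```python
import csv
import io

row = ["123", "a comment, with a comma"]

# Write one CSV row; the writer quotes the comma-bearing field.
buf = io.StringIO()
csv.writer(buf).writerow(row)
line = buf.getvalue().strip()

naive = line.split(",")            # breaks: the comment splits in two
parsed = next(csv.reader([line]))  # correct: quoting is respected

print(len(naive), len(parsed))  # -> 3 2
```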
- expanded askme posts stats to include # best answer, category id
- category id for metatalk stats
- post-deletion index (with reasons where available) for various subsites
- flagging stats (listing what was flagged but not by whom; I think details about who is doing any given bit of flagging is a good example of the sort of thing that shouldn't go into the infodump).
I like all of the above.
posted by iconomy at 9:32 AM on January 22, 2008
I want to understand! What kind of nerd stuff can I look at if I decide to download this 85MB text file?
Imagine a question about when or how often (or rarely) or how much or by whom something on metafilter is done.
Try to frame that question in terms of the specific stats that are available—userid, datestamp, number of favorites, number of comments, etc.
Did you succeed? Okay! Download the file(s) that contain those stats. Throw them into Excel or write a little perl or python or chuck it into a graph program, and make it happen!
That's pretty much the deal.
posted by cortex (staff) at 9:33 AM on January 22, 2008 [1 favorite]
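[A minimal sketch of the workflow cortex describes, assuming a whitespace-delimited dump whose third column is a YYYY-MM-DD date; check the actual file's header before trusting the column positions.]

```python
from collections import Counter

def posts_per_month(lines):
    """Frame the question 'how often?': tally rows by YYYY-MM."""
    months = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip headers/blank lines
        months[parts[2][:7]] += 1  # "YYYY-MM" prefix of the date column
    return months

# Hypothetical rows: postid userid date time
sample = [
    "1 10 2008-01-22 07:58:00",
    "2 11 2008-01-23 08:00:00",
    "3 12 2007-12-01 12:00:00",
]
print(posts_per_month(sample))
```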
To format a file containing a text dump of comments which contain numerous commas in a comma-separated format is to stick your dick into a sausage grinder.
Agreed, but moot. Not to break any hearts, but I'm not too hot on the idea of just dumping the actual text of comments up on Infodump; it'd be enormous, and it'd be kind of weird, and I know Matt doesn't like the idea much at all. This is more of a guts/metrics project than a full text mirror.
If you have a specific text-oriented project in mind, that's probably the sort of thing to talk to me about via mail and we can see if it's reasonable/doable.
posted by cortex (staff) at 9:35 AM on January 22, 2008
cortex, thanks for this. I can't imagine why I want this or what will happen as a result - but more datasets = teh good.
*wanders off to download a few of the files*
posted by geminus at 9:37 AM on January 22, 2008
CSV would be very useful (mysql allows it to be automatically imported).
posted by null terminated at 9:38 AM on January 22, 2008
That usernames/numbers file reminds me unnervingly of the Vietnam Memorial.
Strange. I don't see any such file?
posted by Horken Bazooka at 9:41 AM on January 22, 2008
- flagging stats (listing what was flagged but not by whom...
Think carefully about this one. It would open you up to all manner of annoyance: "hey why was my psot deleted it only got 2 flags and this othr post got 3 and is still up".
If you do decide that's livable, don't forget to break out the flag-count by reason.
Also, if you're more the bird's-eye-view sort of person rather than the trenchfoot-and-muddy-fingernails sort of person, don't forget Waxy's MeFi Stat Page
posted by gleuschk at 9:43 AM on January 22, 2008
Oh, wait. Er. I just got to the part where you wrote "Throw them into Excel or write a little perl or python or chuck it into a graph program, and make it happen!" Haha. I'll let somebody else who knows how to "make it happen" better/quicker than I, and read their results. :)
posted by iamkimiam at 9:46 AM on January 22, 2008
Aye, good points on the flagging stuff. Nixed!
posted by cortex (staff) at 9:51 AM on January 22, 2008
Are waxy's stats broken, or am I reading that completely wrong? If that's supposed to be total post and comment traffic on the blue, it's waaaaay underreporting. Like, by an order of magnitude.
posted by cortex (staff) at 9:58 AM on January 22, 2008
null terminated: "CSV would be very useful (mysql allows it to be automatically imported)."
You can also import tab- or space-delimited files. LOAD DATA INFILE actually defaults to tab-delimited, and you can easily change it to spaces:
LOAD DATA INFILE 'some_file.txt'
INTO TABLE some_table
FIELDS TERMINATED BY ' ' ENCLOSED BY '' ESCAPED BY '\\';
posted by Plutor at 9:58 AM on January 22, 2008
Amazing! According to my calculations, user 36188 is the most awesomest.
posted by milarepa at 9:59 AM on January 22, 2008
Strange. I don't see any such file?
The page has been shifted about a bit in the past few minutes. It's under User data - the usernames.txt file.
posted by iconomy at 9:59 AM on January 22, 2008
I'd support tab-delimited, but I would really really hate CSV. It makes it hard to get what you want with command-line text tools like awk and sed. Even with Perl it's less than ideal (perl -lane is your friend).
posted by Plutor at 10:00 AM on January 22, 2008
Tags would be super-helpful for meta-meta projects like eatMe or readMe.
posted by shothotbot at 10:01 AM on January 22, 2008
Plutor: oh cool, thanks.
posted by null terminated at 10:03 AM on January 22, 2008
I compressed all the files and updated the infodump page. Everything is about 1/3 the size now.
posted by mathowie (staff) at 10:03 AM on January 22, 2008
Cortex: Looks like the last run was incomplete? I'm going to update my scripts to use these new dumps, and run it again.
One note: I'll need timestamps for the user creation date in the user file to do the stats for my new users per month. Also, I noticed your timestamps have microseconds... Was that deliberate?
posted by waxpancake at 10:19 AM on January 22, 2008
Ah, join date is a good idea. I should add that to the usernames file, I suppose.
The datestamp down-to-the-ms is what the SQL query dished up by default; not deliberate so much as blinked at, shrugged at, and left alone. If it's actually a data-processing headache for a lot of folks, I can put some polish on the query to keep it to seconds, but I was figuring it wasn't likely to cause harm.
posted by cortex (staff) at 10:25 AM on January 22, 2008
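[A sketch of handling the datestamp either way: parse it whether or not the fractional part is present, so a later change to the query wouldn't break scripts. The format string is an assumption based on the stamps discussed in this thread.]

```python
from datetime import datetime

def parse_stamp(s):
    """Parse 'YYYY-MM-DD HH:MM:SS' with an optional '.fff' tail."""
    s = s.split(".")[0]  # drop fractional seconds if present
    return datetime.strptime(s, "%Y-%m-%d %H:%M:%S")

print(parse_stamp("2008-01-22 10:25:00.123"))
print(parse_stamp("2008-01-22 10:25:00"))
```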
I am pleased with my newbieness. That is all.
posted by Nick Verstayne at 10:27 AM on January 22, 2008
Okay, I updated the post/comment stats up to the end of 2007. The microseconds were no problem at all, but lemme know when timestamp's in the username file and I'll update that too! THIS ROCKS.
posted by waxpancake at 10:37 AM on January 22, 2008
Nice! Looks like a lot of us take December off.
posted by mathowie (staff) at 10:37 AM on January 22, 2008
So did we ever figure out what timezone the times are in?
posted by smackfu at 10:39 AM on January 22, 2008
Weird, if you apply the right filters, you get the lyrics to Bohemian Rhapsody. Backward.
I smell conspiracy. Or bacon. Hard to tell those apart sometimes...
posted by pupdog at 10:39 AM on January 22, 2008
In the future, would it be possible to include the deleted status of posts and comments? Some of the recent stats-related MeTa posts seemed to hinge on that.
posted by jedicus at 10:40 AM on January 22, 2008
Although including the deleted status might open things up to "nefarious purposes"...
posted by jedicus at 10:42 AM on January 22, 2008
Nice! Looks like a lot of us take December off.
Yeah. I don't know whether to say November was a small dip too (Thanksgiving, after all) or if October was just a weird spike.
Of course, months aren't all the same length. It'd be interesting to see a normalized posts/comments-per-unit-time graph.
So did we ever figure out what timezone the times are in?
I'm pretty sure it's server time, GMT -8. Datestamps at the top of the files are just a call to localtime(), as well.
posted by cortex (staff) at 10:43 AM on January 22, 2008
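[If cortex's GMT -8 guess holds, normalizing the dump's server-local stamps to UTC is a one-liner per row. A sketch under that assumption; a careful script would also account for DST, which this deliberately does not.]

```python
from datetime import datetime, timedelta, timezone

SERVER_TZ = timezone(timedelta(hours=-8))  # assumed: server time, GMT -8

def to_utc(stamp):
    """Treat a naive 'YYYY-MM-DD HH:MM:SS' stamp as server time, return UTC."""
    naive = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return naive.replace(tzinfo=SERVER_TZ).astimezone(timezone.utc)

print(to_utc("2008-01-22 07:58:00"))  # 15:58 UTC
```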
Although including the deleted status might open things up to "nefarious purposes"...
I think including a 'deleted' flag on post data would be fine. I've been including those in the dumps; it's the sort of thing that would be trivial to scrape, anyway.
For that matter, I should incorporate "closed" status into the metatalk post data; I did a separate analysis of that a while back, anyway.
posted by cortex (staff) at 10:46 AM on January 22, 2008
If it's actually a data-processing headache for a lot of folks, I can put some polish on the query to keep it to seconds, but I was figuring it wasn't likely to cause harm.
It's not a problem at all, since it is always going to be the last four characters of that field. In case anyone is interested, here is some simple python code that I used to parse the 7 column format when I computed some stats in the other thread. (why, oh why, does <code> kill whitespace, and preview convert nbsps to actual spaces...)
import datetime, time, math, sys

class Post:
    def __init__(self, pid, uid, ts, ccount, fcount):
        self.postid = pid
        self.uid = uid
        self.timestamp = ts
        self.comments = ccount
        self.favorites = fcount
    def __str__(self):
        return "%s %s %s %s %s" % (self.postid, self.uid, self.timestamp, self.comments, self.favorites)

def parseline(line):
    data = line.split()
    postid = int(data[0])
    userid = int(data[1])
    rawdate = data[2] + " " + data[3][0:-4]
    timestamp = datetime.datetime(*(time.strptime(rawdate, "%Y-%m-%d %H:%M:%S")[0:6]))
    comments = int(data[4])
    favorites = int(data[5])
    return Post(postid, userid, timestamp, comments, favorites)
posted by advil at 10:46 AM on January 22, 2008 [2 favorites]
Every time I see something MeFi-related that relies on user numbers it makes me feel a little weird that I have mine memorized.
posted by danb at 11:23 AM on January 22, 2008
Can we get a file with a list of the paypal accounts that people used to pay for their membership, as well as passwords and email addresses? Thanks!
On an unrelated note, does anyone know where I could sell some "random data" that I found on the internet containing email addresses, paypal accounts, and passwords? It's for a friend.
posted by blue_beetle at 11:29 AM on January 22, 2008
Next priority: file detailing the status of blue_beetle's ban.
posted by cortex (staff) at 11:31 AM on January 22, 2008
rbs, we've formed a committee to look into the possibility.
posted by pb (staff) at 11:50 AM on January 22, 2008 [10 favorites]
Matt has 68 deleted posts! Nyah!
posted by Burger-Eating Invasion Monkey at 11:56 AM on January 22, 2008
Can I be on the XML DTD working committee? I'll bring watercress sandwiches.
posted by boo_radley at 11:57 AM on January 22, 2008
5435 of davey_darling's 6028 favorites were written by ThePinkSuperhero.
He's a very loyal fan :-D
posted by ThePinkSuperhero at 11:58 AM on January 22, 2008 [1 favorite]
Matt has 68 deleted posts! Nyah!
It's not as exciting as you think. "test", "another test", "just testing"...
posted by cortex (staff) at 12:04 PM on January 22, 2008
Will someone please crunch the numbers and tell us who is winning MetaFilter?
You silly! Victory can't be found by scratching and poking text files! There's too much subjectivity involved - tell me, brain_drain, where in those text files is LOVE? Where will you see who had the BEST INTENTIONS, and sent out the most POSITIVE ENERGY? Who has most consistently advocated the BEST COURSE OF ACTION, and brought the maximum amount of JOY and UTILITY into other members' lives?
posted by Meatbomb at 12:30 PM on January 22, 2008 [2 favorites]
pony request: a metric to measure love
get on it, cortex!
and we want graphs! pretty, pretty pie charts!
posted by Kattullus at 12:33 PM on January 22, 2008
yah, fulltext and tags please. I want to prove my hypothesis that most people's post tags could just be replaced with the top 3 tfidf terms in the post (I've never understood tags on text)
I realize that is just at the idea stage but I'm not so sure that it would be sufficient. Some of the posts I make (and like) never specifically use the kinds of words that would make the best tags: A post about a website for Delta Blues, 1920-30, might list musician's names and some locations, or even just be the site title, without ever using the words 'music,' 'mississippi,' or 'history'.
posted by Miko at 12:53 PM on January 22, 2008
I wish I were smart enough to be evil. However, I'm pretty squishy and like giving hugs, so here's a virtual one for everybody.
posted by Unicorn on the cob at 12:54 PM on January 22, 2008
I dumped the contact data into Many Eyes, and it gave me a diagram too large to look at efficiently on my machine. Anyway, here it is. I didn't bother converting userids into usernames.
posted by monju_bosatsu at 12:54 PM on January 22, 2008
Is this something you need an abstinent asocial life to understand or get excited over?
Seriously though, this is pretty cool. I've bookmarked it, and I'm going to dive in later tonight...after I party with the cool kids and get laid hardy har har.
Every time I see something MeFi-related that relies on user numbers it makes me feel a little weird that I have mine memorized.
I never seem to remember mine, no matter how many times I've checked it for one reason or another. I do remember that I'm in the 15k club, but the exact number always escapes me.
posted by Devils Slide at 1:04 PM on January 22, 2008
Thanks for this, cortex.
Plutor wrote...
CSV is nice [but ...] But real men use awk | sort | uniq -c | sort -rn.
Real men also have no problem using awk -F, | sort | uniq -c | sort -rn, so we're sort of indifferent about getting CSV data or not.
posted by tkolar at 1:11 PM on January 22, 2008
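For anyone who'd rather not pipe through awk, that frequency count translates directly to Python with collections.Counter; the sample rows and the field index here are made up:

```python
from collections import Counter

# Roughly equivalent to: awk -F, '{print $2}' file | sort | uniq -c | sort -rn
# i.e. count occurrences of the second field and list them most-frequent first.
lines = [
    "1,askme",
    "2,mefi",
    "3,askme",
    "4,askme",
    "5,meta",
]
counts = Counter(line.split(",")[1] for line in lines)
for field, n in counts.most_common():
    print(n, field)
```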
Man! What's with the big spike in September 2001?
Oh. Right.
posted by ColdChef at 1:16 PM on January 22, 2008 [2 favorites]
Since I'm not a real man, or any sort of man, can someone tell me where to put this "awk -F, | sort | uniq -c | sort -rn"? What program am I supposed to use?
posted by desjardins at 1:18 PM on January 22, 2008
-F, only separates by commas. Some of the data (usernames, for instance) will need quotes and complex quote-escaping rules since they can contain double-quotes and commas (see section 2 of RFC 4180).
posted by Plutor at 1:18 PM on January 22, 2008
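Python's csv module implements exactly those RFC 4180 quoting rules, so nobody has to hand-roll the escaping; the sample row here is hypothetical:

```python
import csv
import io

# A field containing commas and double-quotes gets quoted, with the
# embedded quotes doubled, per RFC 4180.
row = [42, "puke & cry", 'a name with, comma and "quotes"']
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)
encoded = buf.getvalue()
print(encoded.strip())

# Reading it back recovers the fields intact.
decoded = next(csv.reader(io.StringIO(encoded)))
print(decoded)
```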
...where in those text files is LOVE? Where will you see who had the BEST INTENTIONS, and sent out the most POSITIVE ENERGY? Who has most consistently advocated the BEST COURSE OF ACTION, and brought the maximum amount of JOY and UTILITY into other members' lives?
Do I have to register my old username again?
posted by waraw at 1:34 PM on January 22, 2008
-F, only separates by commas. Some of the data (usernames, for instance) will need quotes and complex quote-escaping rules since they can contain double-quotes and commas (see section 2 of RFC 3180).
The same is going to be true of any other separator that is chosen. If we're going to go that route we should demand that the separator be !@#$%^&**&^%%$#@! on the theory that it's statistically very unlikely to occur in a comment. Except this one.
In any case, commas will make things easier for Excel/SQL folks in the vast majority of data files and won't hurt us Unix-y folks in the least.
posted by tkolar at 1:37 PM on January 22, 2008
Nuts to the Excel folks.
Tab-delimited is easiest for MySQL users. (Don't know about other SQLs)
posted by Plutor at 2:06 PM on January 22, 2008
The same is going to be true of any other separator that is chosen.
Not tabs as far as I am aware. Tab-delimited data works great with Excel, and SQL, and works fine with Unix tools without having to specify the delimiter explicitly. Tabs are the most awesome delimiter ever.
Disclaimer: this comment contained a tab in preview. Let's see if it stays.
posted by grouse at 2:07 PM on January 22, 2008
Metafilter in graph form.
posted by Burger-Eating Invasion Monkey at 2:11 PM on January 22, 2008 [2 favorites]
If there's something specific you'd like to see added to or tweaked on the Infodump
yep - transform & load those suckers into a datamart.
posted by UbuRoivas at 2:15 PM on January 22, 2008
it has the advantage of actually existing, which I think is a big step up.
[NOT SOLIPSIST]
posted by ersatz at 2:15 PM on January 22, 2008 [1 favorite]
But I think a prepop field of the top n salient terms from the post would be sufficient, especially for people that don't get what tags are and put the first sentence of their post in the tag box...
Yeah, it's an interesting question. I thought about some of that when I was trying to decide how to prioritize keywords for Word Clouds, actually; what a key word is in a post is a weird combination of absolute frequency in the post, relative frequency in the corpus, and a subjective whammy factor.
And by my thinking, doing a tag analysis on just the text of a post is a pretty limiting constraint—there just isn't much text there to work with, and it's possible that some key tags would be implied rather than present in the text of the post. But when you're making a post, that's all there is to work with, so it's a useless complaint. However, doing a post-hoc tagging analysis by checking out not just the contents of the post text but also of all the comments in the thread? That could be a pretty neat approach to tagging not just the ostensible post topic but the actual conversation that resulted. Experiment for another day, I suppose.
Yahoo has an API for auto-tagging, actually; we experimented with it for the Backtagging project, though I'm not sure how much actual use it got. It wasn't bad, I can tell you that; the tags weren't perfect, but it was usually on the right track.
posted by cortex (staff) at 2:18 PM on January 22, 2008
postid->tags would be super-cool.
posted by MetaMonkey at 2:46 PM on January 22, 2008
Rainy day science project suggestion:
Graph comments-per-hour activity across a variety of threads, based on timestamp of thread and timestamps of comments in that thread. Is there a prototypical curve? Are there distinct modes that different groups of threads operate in, and what drives that difference?
Thread intensity vs. thread longevity: does a thread with lots of comments feature (a) more comments per hour during the thread's prime time, (b) more persistence over time, or (c) a little of column a and a little of column b? What about second winds?
What do the activity curves look like on the blue vs the green vs the grey?
Etc.
posted by cortex (staff) at 2:58 PM on January 22, 2008
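A minimal sketch of the first suggestion, bucketing comments by whole hours elapsed since the thread was posted. The timestamps below are invented; real code would parse them out of the Infodump's datestamp columns:

```python
from collections import Counter
from datetime import datetime

def activity_curve(thread_start, comment_times):
    """Bucket comments by whole hours elapsed since the thread was posted."""
    hours = Counter(
        int((t - thread_start).total_seconds() // 3600) for t in comment_times
    )
    last = max(hours) if hours else 0
    return [hours.get(h, 0) for h in range(last + 1)]

# hypothetical thread start and comment timestamps
start = datetime(2008, 1, 22, 7, 58)
comments = [
    datetime(2008, 1, 22, 8, 5),
    datetime(2008, 1, 22, 8, 40),
    datetime(2008, 1, 22, 9, 15),
    datetime(2008, 1, 22, 12, 30),
]
print(activity_curve(start, comments))
```

Averaging these curves across many threads (normalized by total comment count) would be one way to look for the prototypical shape.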
I would like to see favorite distribution graphs. I have a hypothesis that the more favorites you have on a particular comment, the more likely you are to get more favorites. Probably a lot of comments in the 1-2 favorite range, some in the 2-10, but then a large percentage of all favorites going to the relatively few comments with 10 or more.
posted by geoff. at 3:09 PM on January 22, 2008
What does the value of favetype (1-12) mean?
posted by Burger-Eating Invasion Monkey at 3:21 PM on January 22, 2008
Yeah, geoff., there's a lot of interesting detail to chew on there. As a preliminary examination, I talked about some of that a few months back, but my initial look was very broad.
I'd love to see an examination of, say, strong subgraph clumps (if they exist) where there is any grouping or categorical consistency to favoriting behavior. How does favoriting on Mefi compare to Askme? Who favorites posts, and who favorites comments? Who favorites both a post and comments within that post? Who favorites themselves?
Some folks are proportionally free with the faves, some are proportionally stingy. (I'm in the latter camp, as it turns out—I just don't favorite things too often). Are there users who get an uncommon percentage of the stingy folks' favorites, whether great or small?
How do the outliers change the picture? How does favoriting volume (both giving and getting) correspond to commenting volume? Posting volume? Does receiving favorites seem to lead to giving favorites, in a causal sense—is favoriting "paid forward", intentionally or not, does it resemble in some sense a synaptic network?
posted by cortex (staff) at 3:21 PM on January 22, 2008
I would like to see # of favorites vs. post length. I would expect a peak at short post lengths for pithy one-liners, and then a gradual slope up as the post length increases.
posted by Pyry at 3:22 PM on January 22, 2008
What does the value of favetype (1-12) mean?
It defines what is being favorited: posts and comments on various parts of the site, basically. I should put together a little bit of documentation for that, I guess.
[1 3 5] are posts to mefi, askme, and metatalk, respectively, I think.
[2 4 6] are comments to same.
The rest, I don't know offhand. I think [8 9] might be Music posts and comments, but don't quote me.
posted by cortex (staff) at 3:23 PM on January 22, 2008
Also, favorites still exist in Projects, for reasons completely nebulous to me.
posted by roll truck roll at 3:29 PM on January 22, 2008
i've found that once posts i've made hit the 'most popular in the past 7 days' page, the favorites get about another 20%-30% boost. I wonder how one would verify that?
posted by empath at 3:54 PM on January 22, 2008
cortex, would it be cool if I made a little webapp for exploring the data? I've been looking for a little test project to try out rapid app development. Nothing terribly nefarious, probably just basic stuff like input your username to find out who has favorited your posts the most, whose posts you've commented on... that sort of thing. Although it's quite possible I won't get the chance to do it anyway, as most of these sort of ideas don't make it past the wouldn't-it-be-interesting stage.
posted by MetaMonkey at 4:01 PM on January 22, 2008
I have been curious for a while about how many active members there really are here. That 35K vs. 65K was a big part of that question. Now I also know that 12431 individuals commented in 2007, and AskMe has quite a few more commenters than MeFi:
AskMe: 9925
MeFi: 7895
MeTa: 3218
Music: 595
posted by team lowkey at 4:12 PM on January 22, 2008 [1 favorite]
I'd be interested to know how much activity comes from people in each band of 1000. I have reason to believe that us 17Kers ask more AskMe questions than anyone else.
posted by grouse at 4:20 PM on January 22, 2008
grouse, here is a histogram of Mefi comments by userid, in bins of 1000. The 17k-ers are putting in a strong performance.
posted by Burger-Eating Invasion Monkey at 4:33 PM on January 22, 2008 [2 favorites]
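The binning behind a histogram like that is nearly a one-liner with Counter — integer-divide each userid by 1000 and count. The sample userids here are made up:

```python
from collections import Counter

def comments_per_kset(userids):
    """Count comments in bins of 1000 userids (the '17k club' buckets).

    userids: one entry per comment row's userid column.
    """
    bins = Counter(uid // 1000 for uid in userids)
    return dict(sorted(bins.items()))

# hypothetical sample: one userid per comment
sample = [17023, 17500, 1, 14999, 17001, 2040]
print(comments_per_kset(sample))
```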
And, wow, yes 17k-ers ask far, far more questions than any other group.
posted by Burger-Eating Invasion Monkey at 4:39 PM on January 22, 2008 [2 favorites]
I have a hypothesis that the more favorites you have on a particular comment, the more likely you will get more favorites.
I think this is true; would be interesting to see if there's data that proves/disproves it.
posted by ThePinkSuperhero at 4:48 PM on January 22, 2008 [1 favorite]
I can't look at the first image. But the patterns of question-asking in the antequindenarian period are very interesting. Mainly how little activity there is for the 2Kers-9Kers.
posted by grouse at 4:49 PM on January 22, 2008
BEIM, now run it again without the outlier to which grouse was obliquely referring.
posted by Partial Law at 4:51 PM on January 22, 2008 [1 favorite]
I'd be interested to know how much activity comes from people in each band of 1000. I have reason to believe that us 17Kers ask more AskMe questions than anyone else.
A couple things you could do to see if any given 17ker is actually on average more prone to asking than any given xxker:
- Divide the results for each k-set by the number of activated accounts in that set (based on the user count in the users listing);
- Divide the results for each k-set by amount of time that has passed between the average/approximate join date for members of that k-set (or from the inception of Askme, for folks who've been around longer than the green has) to current day.
posted by cortex (staff) at 5:03 PM on January 22, 2008
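Both of those normalizations combine into a single per-account, per-year rate. All of the figures and dates below are placeholders, not real cohort numbers:

```python
from datetime import date

def questions_per_account_year(questions, accounts, avg_join, askme_launch, today):
    """Normalize a k-set's question count by account count and exposure time.

    Exposure starts at the later of the cohort's average join date and
    AskMe's launch, since nobody could ask before the green existed.
    """
    start = max(avg_join, askme_launch)
    years = (today - start).days / 365.25
    return questions / accounts / years

# hypothetical figures for one k-set cohort
rate = questions_per_account_year(
    questions=5000,
    accounts=1000,
    avg_join=date(2005, 11, 1),
    askme_launch=date(2003, 12, 8),  # placeholder launch date
    today=date(2008, 1, 22),
)
print(round(rate, 2))
```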
Another thing that could be interesting is looking at account fall-off; when did any given account stop commenting/posting/favoriting? For every user who closes their account or gets banned, there's likely a lot more who just fade away. Approximate "death" of an account might be an interesting metric.
posted by cortex (staff) at 5:05 PM on January 22, 2008 [1 favorite]
These are the top self-favoriters:
puke & cry 594
loquacious 34
tehloki 23
reenum 21
McLir 21
spiderwire 20
AceRock 20
MythMaker 19
quonsar 18
posted by Burger-Eating Invasion Monkey at 5:06 PM on January 22, 2008 [7 favorites]
I feel like my tehloki favorites have been cheapened somehow.
posted by grouse at 5:13 PM on January 22, 2008
Why on earth did puke & cry favorite himself almost 600 times?
posted by "Tex" Connor and the Wily Roundup Boys at 5:14 PM on January 22, 2008 [1 favorite]
D'oh, I can't believe I didn't remember anonymous.
Here is a revised graph, without him/her.
Here is a non-ImageShack copy of the first graph, Mefi comments by userid.
posted by Burger-Eating Invasion Monkey at 5:14 PM on January 22, 2008
I'm proud to say I've never favorited myself. You can go blind that way.
posted by jonmc at 5:15 PM on January 22, 2008
Note: these are huge.
You've obviously never worked with genomic data. A paltry couple hundred megs? Ha!
posted by chrisamiller at 5:18 PM on January 22, 2008
That's not huge in "actually huge data" speak, chrisamiller. It's huge in "omg why didn't you tell me your link was a pdf" speak.
posted by cortex (staff) at 5:18 PM on January 22, 2008
BEIM, chop chop with the askme and meta versions.
posted by cortex (staff) at 5:19 PM on January 22, 2008
jonmc's telling the truth. Though the person you favourite the most often is your friend Divine_Wino. In nine years, you've only once favourited something mathowie wrote. Predictably, the user who favourites you most is tehloki, with a massive 121. Your next biggest fans are Blazecock and loquacious.
posted by Burger-Eating Invasion Monkey at 5:22 PM on January 22, 2008
Though the person you favourite the most often is your friend Divine_Wino.
Really? Well, I'd favorite him even if he weren't my RL friend because he's a fucking genius and a truly humbling talent as a writer. In fact, we first started hanging out together based on our commonality in tastes that we discovered through MeFi.
Your next biggest fans are Blazecock and loquacious.
well, loq's trying to convert me to electronica [*holds up crucifix*] and Blazecock is a gay guy, so he probably thinks I'm hot. They all do.
posted by jonmc at 5:34 PM on January 22, 2008
Dumping flags that are at least one year old, or something, wouldn't really cause any complaint problem.
posted by Chuckles at 5:41 PM on January 22, 2008
I think you vastly underestimate our ability to hold grudges, Chuckles.
posted by ThePinkSuperhero at 5:42 PM on January 22, 2008 [3 favorites]
What's all this nonsense about separators?
If you use commas, then any field containing a comma should be in quotes. That's how it's done. Quotes inside those fields get escaped.
posted by AmbroseChapel at 5:52 PM on January 22, 2008
You know, it'd be cool if we had lat/long data in the usernames file. Arguably, people have a reasonable expectation of privacy for this data, but in practice it's already been made public in the Google Earth KML file, so it wouldn't do any more harm to publish it.
Then you could, for instance, see if people disproportionately favorite people geographically close to them. If so, is it because nearby people tend to share opinions, or just because they met at meetups?
The trouble with all this data is the wild flights of fancy you end up on, dancing from API to API ( ... download a recently-tagged user photo off Flickr, post it to HotOrNot with cURL, scrape the rating ... what's the correlation between 'hotness' and favourites? ... are people who ask more AskMe's tagged 'fashion', 'makeup' and 'dating' rated 'hotter'? ... do people who listen to emo music on last.fm ask more human relations AskMe? ... etc)*
jon: Wasn't casting any aspersions about your favoriting habits.
* Don't actually do this, it's batshitinsane.
posted by Burger-Eating Invasion Monkey at 5:56 PM on January 22, 2008
I think in general I'd prefer to avoid making any particularly personal profile info a part of the canonical data sets, really. Coords are sort of a grey area, but I'd rather not, even though I can see the entertainment value in analyzing 'em.
posted by cortex (staff) at 6:03 PM on January 22, 2008
I have a hypothesis that the more favorites you have on a particular comment, the more likely you will get more favorites.
Hmm, that's one of the more interesting hypotheses to test. I'll have a go.
posted by Burger-Eating Invasion Monkey at 6:04 PM on January 22, 2008
Tracking the time over which favorites accrue would be an interesting way to crunch into that. Also, spike in that timeline after periods of dormancy would be a good indicator of sidebar or other site cross-reference. (I'd be curious, for example, to see the favorites timeline on scarabic's famous Body Disposal answer.)
posted by cortex (staff) at 6:12 PM on January 22, 2008
Do the comments files include actual text of the comments?
posted by delmoi at 6:29 PM on January 22, 2008
tkolar writes "The same is going to be true of any other separator that is chosen. If we're going to go that route we should demand that the separator be !@#$%^&**&^%%$#@! on the theory that it's statistically very unlikely to occur in a comment. Except this one."
This is easily handled: pick whatever weird delimiter you want (that one above is a doozy) and place the comment as the last field. As the other fields aren't user-editable, the delimiter is safe if you only look for it the first X times, where X = # of fields - 1.
posted by Mitheral at 6:36 PM on January 22, 2008
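Python's str.split takes a maxsplit argument that does exactly this trick; the row format below is invented for illustration:

```python
# A row with 4 fixed fields followed by free-form comment text. The
# delimiter may appear in the comment, so split at most 4 times and
# the remainder stays intact as the last field.
line = "123|456|2008-01-22|2|never use | as a delimiter, they said"
fields = line.split("|", 4)
print(fields)
```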
I have reason to believe that us 17Kers ask more AskMe questions than anyone else.
We're also prettier and smell nicer.
posted by pineapple at 6:47 PM on January 22, 2008
Yeah, I was thinking that the graph of favoriting of, say, the top 0.1% of comments would show a distinctive trend where there would be increasing returns for a time, as favoriting breeds more favoriting, as opposed to the decaying rate of favoriting of normal comments. But I don't have time right now to start dealing with all the practical software-related problems of working with dates and durations. In theory, it's perfectly possible though.
Here's the graph of scarabic's answer being favourited. It's not that interesting, or even very readable, to be honest.
posted by Burger-Eating Invasion Monkey at 6:50 PM on January 22, 2008
Why oh why did you do this when I'm swamped at work?
the data processing geek inside me cries
posted by davejay at 7:12 PM on January 22, 2008
AskMe questions by hour, server time.
When should you post your AskMe question in order to maximise the number of answers?
When should you post your AskMe question in order to maximise the number of favorites?
I don't have any webspace, so if anyone feels like putting my graphs somewhere more permanent than imageshack/picoodle, I'd be most grateful.
posted by Burger-Eating Invasion Monkey at 7:25 PM on January 22, 2008 [2 favorites]
The 17k-ers are putting in a strong performance.
I wonder why the 5kers are so conspicuously silent, relative to their neighbors anyway.
This is getting more and more interesting. Thanks B-EIM, and of course cortex.
posted by Devils Slide at 7:34 PM on January 22, 2008
astro zombie (comment #1) did you mean "brooch" instead of "broach"? sitting here, i'm having trouble with "broach" as a noun.
posted by bruce at 7:55 PM on January 22, 2008
There are only 35000 usernames but the front page claims 65003 users...
It's people who completed the sign-up process vs people who just started it.
posted by jessamyn
So it's actually like I am user number 10,000 or so!
Ya! Kickin' it O.G.-style byotches!
posted by The Deej at 8:01 PM on January 22, 2008
Do the comments files include actual text of the comments?
Nope. It'd be (a) huge and (b) kind of weird.
posted by cortex (staff) at 8:02 PM on January 22, 2008
User "deaths":
Total users who have ever commented on Mefi: 15684 (surprisingly low)
Users who have commented on Mefi since Jan 2007: 8058 (51.3%)
When did users last comment on Mefi? Note the Sep 2001 blip.
Right, that's enough for one day.
posted by Burger-Eating Invasion Monkey at 8:31 PM on January 22, 2008 [1 favorite]
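A sketch of the "account death" metric cortex suggested upthread: reduce (userid, comment date) rows to each user's last sighting, then count who has been seen since a cutoff. All the rows here are invented:

```python
from datetime import date

def last_activity(rows):
    """Map each userid to its most recent comment date (a 'death' proxy)."""
    last = {}
    for uid, day in rows:
        if uid not in last or day > last[uid]:
            last[uid] = day
    return last

# hypothetical (userid, comment date) pairs
rows = [
    (1, date(2000, 3, 1)),
    (1, date(2007, 6, 5)),
    (2, date(2001, 9, 14)),  # a Sep-2001-blip departure
    (3, date(2008, 1, 22)),
]
print(last_activity(rows))
active_2007 = sum(1 for d in last_activity(rows).values() if d >= date(2007, 1, 1))
print(active_2007)
```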
For the askme/userid histogram, I'd like to see a line separating those accounts made before askme existed and after, ie "which people may have joined solely for askme?"
posted by NortonDC at 8:47 PM on January 22, 2008
Burger-Eating Invasion Monkey: Total users who have ever commented on Mefi: 15684 (surprisingly low)
Is that all of MetaFilter or just The Blue?
Fascinating statistics, by the way.
posted by Kattullus at 9:12 PM on January 22, 2008
Note the Sep 2001 blip.
Wow. Can anyone parse that for me? (I wasn't on MeFi then.) Were people just heebed out with newsfilter or is this, like, when they dug out Pompeii and could make plaster people from the cavity they left in the ash?
posted by cowbellemoo at 9:42 PM on January 22, 2008
This is cool and great. I wish I was geeky enough to actually do something with this data instead of just watching to see what the rest of you do with it (although you've already done some neat stuff).
May I suggest two things:
1. Sidebar this shit. Obvs.
2. Maybe a little extreme, but how about an under-the-radar subsite (a la travel) -- data.metafilter.com. It could be both a repository for infodumps like this, as well as a place for people to make posts highlighting interesting/insightful/humorous analyses of the raw data. I know it may seem a bit of overkill, but I'd much rather see this all in one place, rather than scattered all over MeTa over the next few weeks.
posted by Rock Steady at 9:43 PM on January 22, 2008
Interestingly, the curves for average number of favorites given and average number of favorites gotten track each other very very closely.
That is to say, there are roughly the same number of people who have received 20 favorites as there are who have given 20 favorites.
Somehow I was expecting it to be more lopsided, although I can't say in which direction...
posted by tkolar at 10:05 PM on January 22, 2008
Blazecock is a gay guy, so he probably thinks I'm hot
My eyes, the beer goggles do nothing! just kidding, you're smokin in that mustachioed U2 Edge kind of way
posted by Blazecock Pileon at 10:28 PM on January 22, 2008
When should you post your AskMe question in order to maximise the number of answers?
Huh. It doesn't matter at all; you get 13-15 comments on average regardless. Interesting. Thanks, Burger-Eating Invasion Monkey.
posted by mediareport at 11:31 PM on January 22, 2008 [1 favorite]
Here's the graph of scarabic's answer being favourited. It's not that interesting, or even very readable, to be honest.
It'd probably look better as a cumulative line plot, where the y-axis is the cumulative number of favorites accrued by the date on the x-axis.
posted by grouse at 12:41 AM on January 23, 2008
When should you post your AskMe question in order to maximise the number of answers?
It's sad that there are so many mean comments on AskMe :(
posted by prophetsearcher at 12:47 AM on January 23, 2008 [1 favorite]
Since I'm not a real man, or any sort of man, can someone tell me where to put this "awk -F, | sort | uniq -c | sort -rn"? What program am I supposed to use?
--desjardins
Hello desjardins, I'm sorry no one answered your question (unless I missed it). I'll try to give a brief overview of what you have in quotes.
The program that you're supposed to use is any unix-like shell. Then you start three other programs, namely awk, sort and uniq. The | character between the program names means that whatever the program on the left outputs, that's what the program on the right gets as input (as if you had typed it all in). That is referred to as a pipeline.
awk is a general-purpose programming language designed for processing text-based data, sort sorts, and uniq removes duplicated lines. The flag for awk (-F) tells the program to separate the input into different fields based on whatever follows the flag, in this case a comma (which is appropriate for the CSV files). The flag for uniq (-c) tells the program to print the number of times each line occurred along with the line (as well as removing the duplicates). The flags for the last call to sort (-rn) tell it to sort in reverse of however it would have sorted otherwise, and to use numeric comparisons (instead of alphabetic).
I'll apologize ahead of time if I assumed you knew less than you actually do.
posted by philomathoholic at 1:01 AM on January 23, 2008 [4 favorites]
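A minimal sketch of that pipeline run against a toy two-column CSV (the postid/userid layout here is invented for illustration, not the Infodump's actual column order):

```shell
# Toy CSV: postid,userid -- hypothetical layout for demonstration only.
printf '1,42\n2,17\n3,42\n4,42\n5,17\n6,9\n' > toy.csv

# Print field 2 (the userid), then count occurrences and sort by frequency:
awk -F, '{print $2}' toy.csv | sort | uniq -c | sort -rn
# userid 42 appears three times, so it comes out on the first line
```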
If you have a Mac, you have a Unix shell ready to go. In your Applications folder, look for the "Terminal" app in the Utilities folder. From there, try these commands:
curl -o postdata_mefi.txt.zip http://stuff.metafilter.com/infodump/postdata_mefi.txt.zip
unzip postdata_mefi.txt.zip
From there, you can use awk and so on. But be warned, it gets tricky. This one shows you the most active users by the number of posts:
awk '{print $2}' postdata_mefi.txt | sort | uniq -c | sort -rn | less
posted by waxpancake at 1:51 AM on January 23, 2008 [1 favorite]
Some interesting probabilities from the postdata file:
1.89% chance of a MeFi post getting 0 comments
3.70% chance of a MeFi post getting >100 comments
0.05% chance of a MeFi post getting >500 comments
0.01% chance of a MeFi post getting >1000 comments
0.04% chance of a MeFi post getting >100 favorites
posted by roofus at 2:24 AM on January 23, 2008
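Percentages like these can come out of a one-line awk filter over a comment-count column; a sketch on toy data (field 4 holding the comment count is an assumption for illustration, which may not match the real postdata layout):

```shell
# Toy stand-in for the postdata file: one line per post, with the
# comment count in field 4 (an assumed position, for illustration).
printf 'p1 u1 d1 0\np2 u2 d2 12\np3 u3 d3 0\np4 u4 d4 150\n' > toy_posts.txt

# Percentage of posts with zero comments:
awk '$4 == 0 { zero++ } END { printf "%.2f%%\n", 100 * zero / NR }' toy_posts.txt
# -> 50.00% for this four-line toy file
```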
Someone try Tufte Sparklines, if only because I've been trying to use it but really haven't figured out how to get it to display in a meaningful way. It is not as intuitive as it first appears.
posted by geoff. at 4:58 AM on January 23, 2008
Could people who are doing analysis talk a bit about the tools/techniques they're using? And ponies?
Also, if anyone wants a pic hosted shoot me an email.
posted by Skorgu at 5:06 AM on January 23, 2008
Just had time for a quick look at who has posted how many of the 67492 posts from mefi (inc. deleted):
04387 or 06.5% posts submitted by the top .1% (6) of users
16001 or 23.7% posts submitted by the top 1% (67) of users
42501 or 63.0% posts submitted by the top 10% (675) of users
And here's the fairly boring graph of all users, and the marginally less boring graph for the top 675.
posted by MetaMonkey at 5:13 AM on January 23, 2008
Ah, just noticed I should have been using the top 6, 63 & 632 of the 6325 users who made 1 or more posts, but it works out the same in the end percentage-wise. The perils of half-asleep hacking.
posted by MetaMonkey at 5:26 AM on January 23, 2008
That usernames/numbers file reminds me unnervingly of the Vietnam Memorial.
Because the average age is 19?
posted by octobersurprise at 6:10 AM on January 23, 2008
skorgu, my entirely unsophisticated analysis technique consists of importing the data into Excel, sorting it by whatever looks interesting, and then using the COUNT function.
posted by roofus at 6:21 AM on January 23, 2008
Is that all of MetaFilter or just The Blue?
Just the blue.
I wondered if people with longer usernames (East Manitoba Kabaddi thing, Mr President, Tex Connor et al) get more attention, and hence more favorites (or is it that people who write interesting things tend to choose longer usernames for some reason; or is it just that the recent trend for longer usernames coincides with a natural increase in favoriting as more users join?). Anyhoo, this graph (with regression and 95% CI) shows that having a longer username increases, to some extent, your total favorite count for some reason.
posted by Burger-Eating Invasion Monkey at 6:31 AM on January 23, 2008
The previous graph only includes users who have ever been favorited.
posted by Burger-Eating Invasion Monkey at 6:33 AM on January 23, 2008
Actually, here's a much better graph proving the above point. Logging was unnecessary. In effect, what's happening here is the number of favorites given to users of a particular username length (1-50) is summed and then divided by the number of users with that username length. Again, only users who have received 1 or more favorites are included.
posted by Burger-Eating Invasion Monkey at 6:44 AM on January 23, 2008
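That sum-then-divide grouping can be sketched in awk over made-up (length, favorites) pairs; the numbers below are invented, just to show the mechanics:

```shell
# Toy data: username length in field 1, favorites received in field 2
# (invented numbers, for illustration only).
printf '5 10\n5 2\n12 30\n12 6\n12 12\n' > favlen.txt

# Average favorites per user, grouped by username length:
awk '{ sum[$1] += $2; n[$1]++ }
     END { for (len in sum) printf "%d %.1f\n", len, sum[len] / n[len] }' favlen.txt | sort -n
```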
Turns out I mistyped one of the commands in the username-length thing, and although the graph is correct, I ended up with the wrong t-values. Username length is almost but not quite significant at the 5% level (p=0.058). If you treat East Manitoba Kabaddi Champion as an outlier, there is no significant relationship. Can't rush this stuff, it seems.
posted by Burger-Eating Invasion Monkey at 6:56 AM on January 23, 2008
But did you do anything about the obvious heteroskedasticity? If you weren't using robust SEs, you'll probably drag the p-value back under 0.05 if you do.
posted by ROU_Xenophobe at 7:09 AM on January 23, 2008
Interesting graph, BEIM, but it looks like your odds of being favorited go down as your username lengthens. What it shows is if you get favorited at all, you'll be favorited a lot and that's what's yanking the curve upward.
There are a lot of people with names in the 5-15 character range, and fewer with more. There may only be one user each with 24 or 50 character names, and coincidentally their comments are popular. That's what's offsetting all the unique or nearly-unique users with long character names that everybody's ignoring.
posted by ardgedee at 7:22 AM on January 23, 2008
This is awesome. Can we put all the analysis in one place, so we can quickly figure out what's been done already?
posted by thrako at 7:28 AM on January 23, 2008 [1 favorite]
I too just saw the Boing Boing link. Shouldn't people have to be logged in to download? Or is that too paranoid?
posted by misterbrandt at 7:29 AM on January 23, 2008
This is awesome. Can we put all the analysis in one place, so we can quickly figure out what's been done already?
I think the Wiki would be an excellent place to organize this stuff.
I too just saw the Boing Boing link. Shouldn't people have to be logged in to download? Or is that too paranoid?
Heh. Good morning, Boing Boing!
For everything we're providing here, people wouldn't have to be logged in to scrape it. It's sort of a handwavy situation in either direction, on account of that. We talked about it a little yesterday in email, and we may do so at some point (pandora's box etc notwithstanding), but I'm more concerned with the spammers we actually deal with on a daily basis than I am with shadowy statisticians lurking in the middle distance.
posted by cortex (staff) at 7:40 AM on January 23, 2008
This is so awesome! Now I can be futzing with MeFish stuff all day and no one will know:
"What? It's just some database queries, it's totally work related! Leave me alone!"
posted by quin at 8:43 AM on January 23, 2008
"What? It's just some database queries, it's totally work related! Leave me alone!"
posted by quin at 8:43 AM on January 23, 2008
ROU_Xenophobe: Here is (touch wood) the final copy of the username-length graph. I used robust SE's, and got a coefficient of 2.836 on lenusername, with a t-value of 4.15. As an aside, don't robust SE's reduce t-values when heteroskedasticity is present?
posted by Burger-Eating Invasion Monkey at 8:47 AM on January 23, 2008
don't robust SE's reduce t-values when heteroskedasticity is present?
You know, I originally thought you were a real statistician and was intimidated by your graphs. But now I *know* you're just making up words :-)
posted by tkolar at 9:07 AM on January 23, 2008
shadowy statisticians lurking in the middle distance
Now I know what my nightmares tonight will be about. Thanks a lot.
posted by ook at 9:10 AM on January 23, 2008
Is there a list for most favorited comments (24 hours/week/month/all time)?
posted by starman at 9:50 AM on January 23, 2008
The most favourited comments on Metafilter (blue only) are:
Pastabagel on Mr Rogers 399
robocop on The Wheel 278
Pastabagel on Sears 277
The HD-DVD key 259
Unicorn on the cob on Raves 223
vito90 angry about something 192
robocop on House MD 190
This is a graph of the four top comments being favorited over time, and this [same colours as first graph] is a graph of the first 50% of their favourites over time (to cut off the 'long tail').
tkolar, I'm a long way away from being a real statistician...
posted by Burger-Eating Invasion Monkey at 10:53 AM on January 23, 2008 [1 favorite]
Awesome, BEIM. That set of "notches" in the Mr. Rogers graph is exactly the sort of thing I was imagining when I asked about the scarabic corpsedump question.
It's interesting (if not particularly surprising) how pithiness and early placement of a comment in a thread can supercharge the rapidity with which a comment picks up favorites. The HD-DVD comment is short, obvious, posted within the first few comments/minutes of a high-visibility thread. Aggressive fave curve.
Rogers, Wheel, Sears? Later in their threads, longer comments (worth the read, but requires some time to read). Slower curves, though still quite sharp. The proximity of a sidebarring (I think all of those were) to the comment is an interesting factor; was Rogers sidebarred significantly later than it was posted, while HD-DVD and Sears were sidebarred "in time" with the flow? I might have to look into getting sidebar link-and-datestamp data from the sidebar blog some time.
posted by cortex (staff) at 11:21 AM on January 23, 2008
My commenter stats were just through one of those little unix commands, which should probably work on a Mac terminal:
grep "2007-" commentdata_* | awk '{ print $3 }' | sort -n | uniq | wc
grep "2007-" commentdata_* gets all the lines that say "2007-" from all the commentdata files. That gives you all the comments from 2007, which then gets fed into awk '{ print $3 }', which just prints the 3rd column, which is the user ID. That gets handed to sort with a -n option to sort them numerically rather than alphabetically. That gets passed to uniq, which removes duplicates, leaving just a list of unique commenters. Pipe that into wc which is "word count", and you have your number. I then ran the same command on each file individually instead of all them at once to get the subsite statistics.
posted by team lowkey at 11:31 AM on January 23, 2008
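The same command can be checked on a toy file before downloading the real dumps (one nuance: bare wc prints lines, words and characters, so wc -l is the tidier way to get just the count):

```shell
# Toy comment file: date in field 1, userid in field 3, matching the
# column positions team lowkey's command assumes.
printf '2007-01-02 c1 300\n2006-05-05 c2 17\n2007-03-09 c3 300\n2007-04-01 c4 42\n' > commentdata_toy.txt

# Distinct users who commented in 2007:
grep "2007-" commentdata_toy.txt | awk '{ print $3 }' | sort -n | uniq | wc -l
# two distinct 2007 commenters (userids 300 and 42) in this toy file
```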
grep "2007-" commentdata_* | awk '{ print $3 }' | sort -n | uniq | wc
grep "2007-" commentdata_* gets all the lines that say "2007-" from all the commentdata files. That gives you all the comments from 2007, which then gets fed into awk '{ print $3 }', which just prints the 3rd column, which is the user ID. That gets handed to sort with a -n option to sort them numerically rather than alphabetically. That gets passed to uniq, which removes duplicates, leaving just a list of unique commenters. Pipe that into wc which is "word count", and you have your number. I then ran the same command on each file individually instead of all them at once to get the subsite statistics.
posted by team lowkey at 11:31 AM on January 23, 2008
Metafilter: shadowy statisticians lurking in the middle distance
I have some number crunching I want to do here myself, which will happen as soon as I'm less asleep. In the meantime, I couldn't resist.
posted by Arturus at 11:39 AM on January 23, 2008
This is really cool. I love txt.
I raise my hand for JSON formatted reply as an option for the «MeFi.infoDump Data API»
That would be the best format for how I'd use this data: To make sweet ass ajaxy things that are of no practical value.
(first up will be a number to name/name to number lookup with rad-ass bouncy effects that will devastate browser memory)
posted by Jeremy at 11:41 AM on January 23, 2008
cortex, I think we also mentioned the Rogers and Sears posts on the Podcast, which often adds another week or two of delay, which could have contributed to more favorites.
posted by mathowie (staff) at 11:41 AM on January 23, 2008
Ah, yeah, that could do it. The podcast thread is probably a pretty good way to bump something like that up, what with the best-of nature of the post text.
One of my secret plans (that the Infodump really won't make possible, it'll have to be an internal project) is to create a big map of cross-reference links on the site; just parse every comment and every post for links to the metafilter.com domain and create some sort of distilled navelgazery graph out of it. Some day...
posted by cortex (staff) at 11:53 AM on January 23, 2008
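A crude first pass at that cross-reference idea is just pattern-matching comment text for the domain; a sketch over toy input (the regex is deliberately rough and would miss relative links, redirects, and so on):

```shell
# Toy comment text containing a couple of site-internal links.
printf 'see http://www.metafilter.com/12345 and http://example.com/x\nalso http://metatalk.metafilter.com/678 here\n' > comments.txt

# Pull out every metafilter.com link for a crude cross-reference list:
grep -oE '[a-z.]*metafilter\.com/[0-9]+' comments.txt
# matches www.metafilter.com/12345 and metatalk.metafilter.com/678
```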
The interesting thing about the Mr Rogers comment is that the two noticeable spikes in favoriting after the initial burst has worn off occur roughly 145 and 165 days (graph of Mr Rogers comment in days) after it was posted. Did sidebarring or podcasting really happen that long after the comment was made?
posted by Burger-Eating Invasion Monkey at 11:55 AM on January 23, 2008
I added a MetaAnalysis page to the wiki to index what people have done with the data. Right now it is just some of the stuff people have linked to on this thread. People can add their analysis there, and check that I haven't botched the descriptions too badly.
posted by thrako at 12:04 PM on January 23, 2008
Is there a list for most favorited comments (24 hours/week/month/all time)?
Yes
Kind of. Filtering comments only and being able to do all time (and monthly) would be nice.
(Oh, I see... comments on the right.. the layout is a little confusing there... still, all time would be nice).
posted by starman at 12:27 PM on January 23, 2008
Here's an 11/14/07 primetime link to the Rogers comment. Explicit, right at the top of an askme thread that itself got 56 favorites. That accounts for the Day 165 bump.
posted by cortex (staff) at 12:30 PM on January 23, 2008
I'm number five! I'm number five!
/13-year-old joy dance
posted by Unicorn on the cob at 12:31 PM on January 23, 2008
So I took a gander at the data to see what were some other most-favorited comments (since revisiting 1-5 was such fun), but unfortunately it appears that info needs to be parsed from the raw document. Can someone post or MeMail some of the other most-fav'd comments of all time?
posted by yeti at 1:23 PM on January 23, 2008
The interesting thing about the Mr Rogers comment is that the two noticeable spikes in favoriting after the initial burst has worn off occur roughly 145 and 165 days (graph of Mr Rogers comment in days) after it was posted. Did sidebarring or podcasting really happen that long after the comment was made?
It was likely mentioned in another (MetaTalk?) thread. I've found some favorites that way. In fact, I just favorited robocop's post about "The Wheel" after seeing it linked in this thread.
posted by chrisamiller at 1:55 PM on January 23, 2008
Some folks are proportionally free with the faves, some are proportionally stingy.
Another question is if there's a number of favorites, over which users start favoriting more (i.e. when they can't find anything in their Favorites page anymore).
posted by ersatz at 1:55 PM on January 23, 2008
Oh, good question, ersatz. I suppose the thing to do would be to examine favorites-per-unit-time rates for folks with more than some threshold number of total favorites given, and see if there's some sort of inflection point.
I'd also be curious to see if there's a general inflection point for when faves went all AJAXy. Also: does the average individual rate of favoriting grow over time, or do folks by and large favorite at a steady pace individually with the significant and steady growth of favoriting over time being explained solely by more new users getting their fave on with each passing month?
posted by cortex (staff) at 2:03 PM on January 23, 2008
I did some distributional stuff with the AskMe comments. This graph has the number of answers for each post on AskMe on the x-axis and the complementary cumulative distribution on the y-axis (log-log scale). So each point represents a post, and for each point the x-coordinate is log(number of answers on that post), and the y-coordinate is log( (number of posts with more answers)/(total number of posts) ).
The distribution looks pretty smooth, maybe even a power-law thing, up to about exp(5)=150 answers. Above 150 answers the posts get more answers than we might expect based on the rest of the data. Those top three posts are:
"can i EVER get revenge on a fraudulent ebay seller?" (303 answers)
"Who wants a gmail account?" (402 answers)
"What Is This Creepy Site Advertising?" (752 answers)
This is a picture showing the distribution of answers per user.
posted by thrako at 2:17 PM on January 23, 2008 [1 favorite]
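The complementary-CDF tally can be sketched in shell/awk over a toy list of answers-per-post counts (the log-log plotting itself is left to whatever plotting tool you prefer):

```shell
# Toy answers-per-post counts, one per line (invented numbers).
printf '3\n5\n5\n8\n20\n' > answers.txt

# Empirical CCDF: for each distinct count x, the fraction of posts
# with strictly more than x answers.
sort -n answers.txt | uniq -c | awk '
  { count[NR] = $2; freq[NR] = $1; total += $1 }
  END {
    above = total
    for (i = 1; i <= NR; i++) {
      above -= freq[i]
      printf "%d %.3f\n", count[i], above / total
    }
  }'
```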
So I took a gander at the data to see what were some other most-favorited comments (since revisiting 1-5 was such fun), but unfortunately it appears that info needs to be parsed from the raw document. Can some post or MeMail some of the other most-fav'd comments of all time?
Comments with more than 100 favorites, by number of favorites. I apologize for the clumsy formatting but I'm really supposed to be doing something else right now :-)
102
102
103
105
105
107
108
108
111
111
112
113
115
115
124
129
131
134
134
134
134
142
142
148
149
153
154
160
162
174
181
186
188
190
192
195
213
223
259
271
277
278
319
399
posted by tkolar at 2:20 PM on January 23, 2008 [16 favorites]
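A minimal sketch of the kind of parsing tkolar did to get that list. The column layout here (commentid, userid, favorites, tab-separated) is a guess for illustration only; the real Infodump files describe their own columns in their header lines.

```python
# Hypothetical layout: commentid <tab> userid <tab> favorites.  The real
# Infodump columns may differ -- check the header lines in each file.
SAMPLE = "1001\t42\t134\n1002\t7\t99\n1003\t9\t402\n"

def top_favorited(text, threshold=100):
    """(favorites, commentid) pairs over the threshold, ascending by count."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and any commented header
        commentid, _userid, favorites = line.split("\t")
        rows.append((int(favorites), int(commentid)))
    return sorted(pair for pair in rows if pair[0] > threshold)

# top_favorited(SAMPLE) -> [(134, 1001), (402, 1003)]
```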
It was likely mentioned in another (metatalk?) thread. I've found some favorites that way. In fact, i just favorited robocop's post about "The Wheel" after seeing it linked in this thread.
Yeah, most of my stuff hasn't been sidebarred - any "legs" they have come from Metatalk, the Podcast, or from offsite blogs referring back to the comment. The stuff that was sidebarred usually focused more deservedly on the much more talented folks like chrismear or cortex who put lyrics to music and performed.
Not that I'm bitter, yo.
posted by robocop is bleeding at 4:02 PM on January 23, 2008
Of the comments labeled "134" in that list, the last two lead to posts and not actual comments. Were they deleted or mislinked or something?
posted by iamkimiam at 4:43 PM on January 23, 2008
This one time I linked to one of your comments (the writer's strike / House one) as a supplement to sidebarring someone else's straightfaced comment on the subject, but I did it like one minute after Jessamyn made her own sidebar entry that didn't mention your comment and then she erased my sidebar entry because it was, yes, a double post, and so robbed you of your deserved fame. Blame her!
posted by cortex (staff) at 4:45 PM on January 23, 2008
Of the comments labeled "134" in that list, the last two lead to posts and not actual comments. Were they deleted or mislinked or something?
Ah ha! tkolar didn't account for subsite correctly on those; one's from askme, the other from metatalk, and so linking to the blue as he did would just about ruin it. Here they are:
134 - Adam Savage drops SCIENCE
134 - occhiblue on Kate Harding on sexism
posted by cortex (staff) at 4:49 PM on January 23, 2008 [1 favorite]
No worries. We should write Metafilter: The Rock Opera. Then we'd get the respect we deserve.
posted by robocop is bleeding at 5:13 PM on January 23, 2008
To those of you who prefer SQL queries, the Metafilter Data Playground. You can run MySQL queries on all the data and export the results to tab delimited files to import into excel or whatever.
Only select queries are allowed. Enjoy!
posted by null terminated at 6:12 PM on January 23, 2008 [4 favorites]
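The playground takes ordinary SELECT statements. A toy stand-in using Python's sqlite3, just to show the shape of query it accepts; the table and column names here are assumptions, not the playground's actual schema:

```python
import sqlite3

# A couple of toy rows standing in for the playground's post table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (postid INTEGER, userid INTEGER)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(1, 42), (2, 42), (3, 7)])

# Posts per user, busiest first -- the same shape of SELECT works on the
# playground's MySQL tables.
rows = conn.execute("""
    SELECT userid, COUNT(*) AS n
    FROM posts
    GROUP BY userid
    ORDER BY n DESC
""").fetchall()
# rows -> [(42, 2), (7, 1)]
```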
I think your data or query may be broken? I have 88 posts but am not listed for users with 88 posts.
posted by Blazecock Pileon at 6:43 PM on January 23, 2008
Have you had some deleted? Here they all are, listed. It says you have 119....
The data appears to have been space-delimited, which screwed up some usernames that have spaces in them.
posted by null terminated at 6:52 PM on January 23, 2008
Have you had some deleted?
*blushes* They're all doubles, I swear! Thanks.
posted by Blazecock Pileon at 6:56 PM on January 23, 2008
Actually, that query is returning posts created by other users. Looking into it.
posted by null terminated at 6:57 PM on January 23, 2008
The post data was off. I reimported it. Thanks and let me know if there are any other irregularities.
posted by null terminated at 7:04 PM on January 23, 2008
It now says you have 95 posts, which makes sense (7 deleted).
posted by null terminated at 7:05 PM on January 23, 2008
Usernames have been fixed. (sorry to post so much in this thread)
posted by null terminated at 7:12 PM on January 23, 2008
(sorry to post so much in this thread)
No worries, it helps people forget that I screwed up some links in my post...
posted by tkolar at 7:16 PM on January 23, 2008
Can someone put these in CSV? Google Docs only imports CSV, and Bento only supports CSV.
Failing that, is there an easy way to convert this to a CSV file?
posted by empath at 7:28 PM on January 23, 2008
I added CSV export. For example, click "download as CSV" to get a csv copy of that query. Running "Select * from TABLE" will give similar results for all the tables.
posted by null terminated at 7:41 PM on January 23, 2008
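Besides the playground's export link, a tab-delimited Infodump file can be re-delimited locally with Python's csv module. A sketch; the filenames in the usage comment are hypothetical:

```python
import csv

def tsv_to_csv(tsv_path, csv_path):
    """Re-delimit a tab-separated file as comma-separated.  The csv writer
    handles quoting any fields that themselves contain commas."""
    with open(tsv_path, newline="") as src, \
         open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src, delimiter="\t"):
            writer.writerow(row)

# e.g. tsv_to_csv("somedump.txt", "somedump.csv")  # hypothetical names
```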
oh wow, Bento is completely worthless. I was looking for an excuse to try it out, but it's just a toy.. no relationships? No charts and graphs? What on earth would people use it for?
posted by empath at 8:05 PM on January 23, 2008
There are related record lists but they're very limited. If you want a real database product, it's not your app.
Bento has a lot more in common with an organizer than a database at this point. That's on purpose -- it's meant to be simple and attractive to people who would otherwise be intimidated.
Sales are through the roof and plenty of people are doing plenty of things with it, but if you are accustomed to working with data and graphs and charting, you are already well beyond the target demographic.
posted by tkolar at 1:03 AM on January 24, 2008
Also - Numbers sucks, too. 65,000 maximum records. Apple is full of fail.
posted by empath at 12:01 PM on January 24, 2008
Also - Numbers sucks, too. 65,000 maximum records. Apple is full of fail.
Don't know too much about Numbers myself, but I do know that Excel gets similarly weird around 65,000. As a programmer I have to say: WTF spreadsheet engineers?
posted by tkolar at 1:11 PM on January 24, 2008
Excel gets similarly weird around 65,000
Yeah, I was dealing with that yesterday while playing with some of these docs, I was befuddled, and I ended up using a low tech system (cutting and pasting in blocks of 65k), but I was amazed that the import system didn't offer to open up multiple documents automatically. (It suggests that this is a solution in the 'too much data' error, but it doesn't give you any simple way of getting past it.)
Stupid design.
posted by quin at 1:23 PM on January 24, 2008
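The cut-and-paste-in-blocks-of-65k workaround quin describes can be scripted instead. A sketch that splits a dump into pieces small enough for a 65k-row spreadsheet:

```python
def split_file(path, lines_per_chunk=65000):
    """Split a big text dump into pieces small enough for a 65k-row
    spreadsheet; returns the paths of the pieces written."""
    parts = []

    def flush(chunk):
        part = f"{path}.part{len(parts)}"
        with open(part, "w") as out:
            out.writelines(chunk)
        parts.append(part)

    with open(path) as src:
        chunk = []
        for line in src:
            chunk.append(line)
            if len(chunk) == lines_per_chunk:
                flush(chunk)
                chunk = []
        if chunk:
            flush(chunk)  # whatever is left over
    return parts
```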
Excel 2007 goes up to 1m rows per sheet. But it's rubbish for this kind of thing; it's painfully, painfully slow with large data sets involving formulae.
posted by Burger-Eating Invasion Monkey at 2:51 PM on January 24, 2008
Repeat after me, "Excel is NOT a database."
posted by blue_beetle at 3:33 PM on January 24, 2008
Average FPPs per user, grouped somewhat arbitrarily. Note the remarkable contributing power of the 20-per-day folks. Without them, us first-week-five-dollar-noobs would be the outliers.
Average questions per user, grouped the same way. In this chart, that same group is so goddamn overwhelming that I've created this chart without them.
posted by Plutor at 6:07 AM on January 25, 2008
Class of 2001: We're Not Hoggin' Metafilter.
Plutor, I'd love to see askme answers by this breakdown. Also, total number of users in each bracket, for normalizin' purposes.
posted by cortex (staff) at 6:39 AM on January 25, 2008
Average answers per user, grouped the same way.
Information on the groups:
• Bin 0 ("pre-2001") has 2698 users, first user is 1
• Bin 1 ("2001") has 9900 users, first user is 2853
• Bin 2 ("2002") has 3148 users, first user is 13335
• Bin 3 ("20-per-day and sneak-ins") has 252 users, first user is 17319
• Bin 4 ("$5 first week") has 1798 users, first user is 17578
• Bin 5 ("rest of 2005") has 5391 users, first user is 19779
• Bin 6 ("2006") has 5509 users, first user is 31596
• Bin 7 ("2007") has 5704 users, first user is 47767
There's probably a flaw in this analysis, since the groups vary in size so much. The two groups with the highest averages in every chart are also the two smallest groups.
Lousy armchair statisticians.
posted by Plutor at 7:05 AM on January 25, 2008 [1 favorite]
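Plutor's bins can be reproduced from the first-user-id boundaries listed above. A sketch using bisect; the posts-per-user input mapping is an assumed format, not an actual Infodump file:

```python
import bisect
from collections import Counter

# First user id in each of Plutor's bins, from the list above.
BIN_STARTS = [1, 2853, 13335, 17319, 17578, 19779, 31596, 47767]
BIN_NAMES = ["pre-2001", "2001", "2002", "20-per-day and sneak-ins",
             "$5 first week", "rest of 2005", "2006", "2007"]

def bin_for(userid):
    """Signup-era bin for a user id."""
    return BIN_NAMES[bisect.bisect_right(BIN_STARTS, userid) - 1]

def averages(posts_by_user):
    """Average posts per user in each bin, given {userid: post_count}."""
    totals, members = Counter(), Counter()
    for uid, n in posts_by_user.items():
        b = bin_for(uid)
        totals[b] += n
        members[b] += 1
    return {b: totals[b] / members[b] for b in totals}
```

Normalizing by the per-bin member counts, as cortex asked for, is exactly what the division in `averages` does.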
Ah, but users vs. dead users. Of the 9900 accounts in the 2001 bin, how many have actually, you know, made a comment or a post in the last year? The last three?
I have a vision of a graph 35K pixels high, with each row an account that was registered; and some three thousand pixels wide, with each pixel a day since day one. And on each of the 35K rows, a line from day of first activity to day of last recorded activity.
Fancy it up some and you could color each line on a spectrum to convey total amount of activity; so someone who made two comments, three years apart, would have a much "cooler" line than someone who made a thousand comments over three years.
posted by cortex (staff) at 7:19 AM on January 25, 2008
cortex: "Ah, but users vs. dead users. Of the 9900 accounts in the 2001 bin, how many have actually, you know, made a comment or a post in the last year? The last three?"
That was kind of the point of the graph, actually. But maybe it would have been more effective if the y-axis was "number of users with a post in this year" or some such.
posted by Plutor at 10:16 AM on January 25, 2008
Yeah, I dig it. I'm just saying I'm curious about the activity-adjusted view as well.
posted by cortex (staff) at 10:42 AM on January 25, 2008
Plutor: lots of us in the Bin 2 group are sneak-ins.
posted by timeistight at 10:43 AM on January 25, 2008
I used the user ID from the MeFi wiki timeline as the end of open signups. If that's wrong, I'd love to know the right one.
posted by Plutor at 11:15 AM on January 25, 2008
It looks like sign-ups were first closed in October, 2001, so most users ~12450 through ~14220 either begged, bribed, or snuck their way in.
posted by timeistight at 11:53 AM on January 25, 2008
Judging by the date gap from 17318 to 17319, I think the wiki (and thus your assumption) is spot on. Here's a thread referencing the fact that they're closed; I couldn't find any official announcement from Matt.
Related, and perhaps a big part of what prompted the closure decision: a couple days earlier, Matt admits he's thinking about closing the shop. A lot of good discussion and further comments from him in that thread.
I've added both of those to the Wiki.
Mostly, I'm kind of surprised to not have found more specific discussion of closures on or after the 11/08/02 event; I suppose it's partly topic fatigue—people had been talking about it, and Matt tweaking it, for a while at that point—but I expected a real storm of discussion. The difference six years makes, I suppose.
Tangent: this comment, from later in the Holy Shit thread, on Matt's dislike for the notion of shared admin duties. I'm glad things changed, there, for any number of reasons.
posted by cortex (staff) at 12:02 PM on January 25, 2008
Bin 3 r00lz!
posted by robocop is bleeding at 12:05 PM on January 25, 2008
Also, I'm loving the "bin" system. I wonder if we could refine it some more and have a really good epochal partitioning. Or maybe it's just fun.
Also also, I'm actually going to beat Matt to 10,000 comments in the grey. I don't know exactly how to feel about that. Good god.
posted by cortex (staff) at 12:19 PM on January 25, 2008
Yeah, that's not a bad idea. There's a bunch of things that I'm hoping to work on this weekend—some of the modifications and additions mentioned above, some documentation (however brief), etc—and getting all of that rolled into a big Here It Is seems reasonable enough for folks who are going to want to yank the works for updates on a regular basis.
posted by cortex (staff) at 12:41 PM on January 25, 2008
Another minor request which might have come up before: Could you please mark the top-of-the-file comments a little better? My first thought is to put a '#' at the beginning of the comment lines, but that might only be because I'm using Perl to parse them.
posted by Plutor at 12:44 PM on January 25, 2008
More closure stuff:
12/13/02 - Metatalk thread noting that scoundrels like timeistight are managing to sign up despite the gates being closed. Matt acknowledges not knowing what's up. An explication. This thread is the first time Matt has actually explicitly discussed closure stuff at all on the grey, since the hammer fell, and even here that's not really the topic.
And then, apparently, nothing from Matt on the subject for a great while. That's not to say no one discussed signups, but I don't see any metatalk comments from him on the subject, until...
4/7/03 - twine42 asks about signups. Matt puts it to resources delays; re-opening is on the table. Also, I am a pain in the ass.
5/14/03 - more sneaker-inners noticed. Matt points out that he's been letting a few people in, but that he's planning to do limited signups again soon.
I don't really have time to carry that on, but you can see how quiet it is, at least from Matt's perspective, on the subject. Kinda fascinating to me.
(Random aside: Migs asks for data dump in 2003. Well, hey, progress at last, Cardoso!)
posted by cortex (staff) at 12:57 PM on January 25, 2008
Yeah, Plutor, definitely. Header cleanup and standardization is part of the plan.
posted by cortex (staff) at 12:59 PM on January 25, 2008
Some earlier closure discussions:
December 26, 2001 - If sign-ups are closed, how come the member number keeps rising? First mention I found of closed sign-ups and back doors.
January 21, 2002 - About people posting under spouse's accounts because they couldn't get their own.
April 7, 2002 - New server; should Matt reopen sign-ups?
May 24, 2002 - If sign-ups are closed, how come the member number keeps rising? Mention of a 5K contest back door, which Matt says he's closed.
July 22, 2002 - Will sign-ups ever re-open? More talk of back doors.
July 24, 2002 - More on proxy users.
July 25, 2002 - If sign-ups are closed, how come the member number keeps rising? More talk of back doors.
July 26, 2002 - Limited sign-ups, 20 per day.
August 29, 2002 - Sign-ups have been open 4 - 6 weeks. When can we turn them back off?
November 8, 2002 - Sign-ups have been closed for a while.
posted by timeistight at 2:26 PM on January 25, 2008 [1 favorite]
Cortex, Here's the image you wanted, sort of [800K png, coral cached].
Sadly, an image 3100x34000 crashes Firefox, so this is the next best thing. It's compressed 10:1 vertically, so each row represents ten users, and each column represents one day. The color of the pixel is the total number of contributions (posts, plus comments) from those ten users in that day. There are 16 shades of grey, from white (0 contributions) to black (15 or more contributions).
posted by Plutor at 2:27 PM on January 25, 2008
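A rough sketch of how an image like Plutor's could be generated with nothing but the standard library, writing a plain-text PGM rather than a PNG. The activity matrix is an assumed input (users by days); darker means more contributions, as in his image:

```python
def activity_image(activity, path, rows_per_pixel=10, shades=16):
    """Write a plain-PGM greyscale image: one column per day, one row per
    `rows_per_pixel` users, darker = more contributions (capped at
    shades - 1).  `activity[user][day]` = that user's contributions that day."""
    days = len(activity[0])
    pixel_rows = []
    for start in range(0, len(activity), rows_per_pixel):
        block = activity[start:start + rows_per_pixel]
        row = []
        for day in range(days):
            total = min(sum(u[day] for u in block), shades - 1)
            # In PGM, 0 is black, so invert: white = no activity.
            row.append((shades - 1) - total)
        pixel_rows.append(row)
    with open(path, "w") as out:
        out.write(f"P2\n{days} {len(pixel_rows)}\n{shades - 1}\n")
        for row in pixel_rows:
            out.write(" ".join(map(str, row)) + "\n")
```

Most image tools (and ImageMagick) can convert the resulting PGM to PNG for posting.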
Bizarrely enough, my dayjob nannyfilter blocks your server. Look forward to checking it out when I get home.
And yeah, the image size struck me as a problem (I've choked programs on too-large images myself a couple times [NOT MEATBOMB'S COCKIST]). Another possibility would be to square it up a bit by doing weekly rather than daily increments and then lopping the now-even-skinnier image into a succession of adjacent cylinders. You could get it closer to something like 4000x4000 territory that way, I think.
posted by cortex (staff) at 2:37 PM on January 25, 2008
(Or rather, it blocks the cache site. plutor.org works just fine, whew. And hey, neat, yeah! It's not exactly what I'm imagining, you're right, but it's really pretty great. Love the down-time striping.)
posted by cortex (staff) at 2:39 PM on January 25, 2008
Huh. In the askme data you still get the comment data for deleted posts.
Not sure if that matters, but it's something to remember if you're striving for accuracy.
posted by tkolar at 10:44 PM on January 26, 2008
I think that might be the case in general; and I think I'll exclude it in future dumps, actually, just for consistency.
posted by cortex (staff) at 11:27 PM on January 26, 2008
Curiously enough, it stops happening after post 32999.
That may correspond to a policy change in February 2006, however. It seems like at some point problematic threads went from being deleted to being closed in a different fashion...
posted by tkolar at 11:35 PM on January 26, 2008
There are a lot of weird little epochs in the db, where code changes led to subtle differences in how stuff gets handled behind the scenes, even though there was no apparent difference in the view of the site. Documenting those changes is one of the sick pleasures I indulge in. I spent some of this morning trying to make sense of varying behavior in the assignment of datestamps to tag data (which is coming, just you wait).
posted by cortex (staff) at 11:42 PM on January 26, 2008
Some random wordcount stats, by indirect request:
Out of 77001 total askme questions, 3316 have been anonymous. Of the anonymous questions, the average character (not word) -count, including, where extant, the more-inside portion, is ~1302, nearly twice the non-anonymous average of ~702. That's roughly 300 vs. 175 words.
What this doesn't do is account for followup answers from the OP of non-anony questions. That might be interesting to factor in some time, but is slightly less trivial to calculate.
posted by cortex (staff) at 7:58 AM on January 27, 2008
Yeah, anon questions have to be longer because of all the extra explanation required. Good to see the data reflect that.
posted by mathowie (staff) at 9:12 AM on January 27, 2008
cortex writes "the average character (not word) -count,"
Count whitespace runs and add one to count words.
posted by orthogonality at 9:18 AM on January 27, 2008
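orthogonality's rule as a sketch, with a strip first so stray leading or trailing whitespace doesn't inflate the count:

```python
import re

def word_count(text):
    """Count whitespace runs and add one (orthogonality's rule).
    Stripping first keeps leading/trailing space from adding a run."""
    text = text.strip()
    if not text:
        return 0
    return len(re.findall(r"\s+", text)) + 1

# word_count("ask  metafilter\tis great") -> 4
```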
Yeah, if I sit down and do a more thorough run on the data, I'll include a proper word count and other fun stuff. This was the just-get-it-done rough draft.
posted by cortex (staff) at 9:36 AM on January 27, 2008
cortex writes "What this doesn't do is account for followup answers from the OP of non-anony questions. That might be interesting to factor in some time, but is slightly less trivial to calculate."
There is a significant bias there from before the built-in more-inside function was created. Lots of people used to roll their own MI in the first comment. On my pre-automagic-MI questions where I did an MI, the MI is larger than the front-page bit.
posted by Mitheral at 9:57 AM on January 27, 2008
Oh, hey. That's a really good point, Mitheral.
posted by cortex (staff) at 10:01 AM on January 27, 2008
Can someone suggest a program on mac I can use to get into this data?
I'd really like to use something like crystal reports, ideally.
posted by empath at 6:15 PM on January 27, 2008
It'd be interesting to see a year-by-year breakdown of which users dominated Metatalk, by number of comments. A sort of timeline of active "policy" users.
posted by vacapinta at 10:49 PM on January 27, 2008
A request, inspired by this comment in the most recent sexism thread: can someone cook up a graph for final site participation by a user by the length of the thread that they participated in last? You'd probably need to make a cutoff of recent user stuff by eyeballing the final participation by date chart so as to not let still potentially active users muck up the data.
posted by Arturus at 8:05 AM on January 28, 2008
Here's a list of all of the people who made at least 100 comments total, haven't participated since at least 1 Sept 2007 and whose last comment was in a thread with more than 100 comments:
• harmful's last comment was in a thread with 450 comments
• aaron's last comment was in a thread with 240 comments
• Cerebus's last comment was in a thread with 415 comments
• Aikido's last comment was in a thread with 269 comments
• catatonic's last comment was in a thread with 216 comments
• sylloge's last comment was in a thread with 365 comments
• donkeyschlong's last comment was in a thread with 365 comments
• mrmorgan's last comment was in a thread with 2686 comments
• rodney stewart's last comment was in a thread with 347 comments
• jam_pony's last comment was in a thread with 347 comments
• onegoodmove's last comment was in a thread with 347 comments
• Farengast's last comment was in a thread with 377 comments
• hummus's last comment was in a thread with 222 comments
• McBain's last comment was in a thread with 250 comments
• mowglisambo's last comment was in a thread with 303 comments
• hellinskira's last comment was in a thread with 257 comments
• maura's last comment was in a thread with 222 comments
• Stuart_R's last comment was in a thread with 216 comments
• Spoon's last comment was in a thread with 267 comments
• yarf's last comment was in a thread with 208 comments
Not nearly as huge (or useful) a list as I was expecting. I played with the inclusion criteria for a while, but it was hard to get numbers that excluded joke accounts in joke threads (like 9622) and included the truly contentious threads.
posted by Plutor at 12:01 PM on January 28, 2008
...last comment was in a thread with more than ~~100~~ 200 comments
posted by Plutor at 12:02 PM on January 28, 2008
Is there any way of getting a per user ratio of number of askme answers to "best answer" checkmarks?
posted by davey_darling at 6:56 PM on February 2, 2008
I'm working on a raw "Best Answer" dump that'll make that sort of thing possible, yeah.
posted by cortex (staff) at 8:12 PM on February 2, 2008
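Once such a Best Answer dump exists, davey_darling's per-user ratio would be straightforward to compute. A minimal sketch, assuming the dump has been reduced to two lists of userids — one entry per AskMe answer posted, one entry per best-answer checkmark received (the real file format is not yet published):

```python
from collections import Counter

def best_answer_ratio(answers, best_answers):
    """answers: iterable of userids, one per AskMe answer posted.
    best_answers: iterable of userids, one per best-answer checkmark.
    Returns {userid: fraction of that user's answers marked best}."""
    totals = Counter(answers)
    best = Counter(best_answers)
    # Counter returns 0 for missing keys, so users with no
    # checkmarks get a ratio of 0.0 rather than a KeyError.
    return {u: best[u] / totals[u] for u in totals}
```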
Here are some stats from last year to munch on while you wait.
posted by tkolar at 10:20 PM on February 2, 2008
I was worried that I commented too much on MetaTalk but was heartened to find I was only the 261st most frequent commenter on MeTa. Looking at the list of most frequent MeTa commenters I was completely unsurprised to find that cortex and mathowie occupied the top two spots, but what surprised me was that jessamyn came in at 7th. Four members have commented more in here than one of the moderators. I find that impressive. I doff my cap to the three metskateers who're still with us and pour down a forty for the brave cut-up who is no longer among us.
posted by Kattullus at 12:45 AM on February 3, 2008 [1 favorite]
At the last meet-up I went to, someone called me an 'A-lister'. But I'm not even in the top 200 MetaTalk commenters, I don't think.
posted by empath at 6:32 AM on February 3, 2008
If I counted right, you're #198.
posted by smackfu at 9:20 AM on February 3, 2008 [1 favorite]
If I counted right, you're #198.
And climbing!
posted by davey_darling at 10:56 AM on February 3, 2008
Myself, well, I could make a hat, or a broach, or a pterodactyl ...
posted by Astro Zombie at 8:00 AM on January 22, 2008 [16 favorites]