Users and tags and dendrograms, oh my! July 22, 2008 9:01 PM

MeFi clustering analysis.

I ran a clustering algorithm on some data from the Mefi InfoDump, matching each user with the tags used in her posts. I created a dendrogram (same image as in first link). Here's a plain text version that's not as pretty but is searchable.
The data was pared down by selecting the users with 25 or more posts and the tags that were used at least 10% as much as the most-used tag, which gives me 556 users and 80 tags.
I also inverted the data and clustered the tags themselves, which gives some idea of thematic areas.
If I accept users with >= 5 posts, and tags >= 1% max, I get 2268 users and 1274 tags, and a very tall dendrogram (and plain text).
The python script I used to extract the data from the dumps is here, the clustering algorithm was taken from Programming Collective Intelligence and the clustering & dendrogram drawing script is available from the author's site, under chapter 3.
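A minimal sketch of the paring-down step described above, not the actual extraction script. The input layout (one `(user, tags)` pair per post), the function name, and the order of the two filters are all assumptions made for illustration.

```python
from collections import Counter

def filter_users_and_tags(posts, min_posts=25, tag_fraction=0.10):
    """posts: iterable of (user, tags) pairs, one pair per post (hypothetical layout)."""
    posts = list(posts)
    # How many posts each tag appears on.
    tag_counts = Counter(tag for _, tags in posts for tag in set(tags))
    if not tag_counts:
        return set(), set()
    # Keep tags used at least tag_fraction as often as the most-used tag.
    cutoff = tag_fraction * max(tag_counts.values())
    kept_tags = {tag for tag, n in tag_counts.items() if n >= cutoff}
    # Count, per user, the posts that carry at least one kept tag (assumed order of filters).
    user_posts = Counter(user for user, tags in posts if kept_tags & set(tags))
    kept_users = {user for user, n in user_posts.items() if n >= min_posts}
    return kept_users, kept_tags
```

Calling it with `min_posts=5, tag_fraction=0.01` would correspond to the looser thresholds behind the tall dendrogram.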
posted by signal to MetaFilter-Related at 9:01 PM (81 comments total) 7 users marked this as a favorite

I don't understand this yet, but it looks like something that's going to make me go "whoa."

Whoa.
posted by Miko at 9:05 PM on July 22, 2008 [1 favorite]


Here's the wiki for Dendrogram. In the mefi analysis, 2 users are 'close' when they use similar tags.
posted by signal at 9:07 PM on July 22, 2008


It is very interesting. I'm honored to share a "branch" with the people I'm on the branch with, and it makes a lot of sense, since they are among the posters I frequently track and favorite. This is really cool in describing the subject-area 'neighborhoods' here.
posted by Miko at 9:09 PM on July 22, 2008


I propose that a part of the wiki be dedicated to quantitative (e.g. this) and qualitative (e.g. quartermass' thesis) studies of Mefi. I propose that I not be responsible for its creation or maintenance.
posted by unknowncommand at 9:12 PM on July 22, 2008 [2 favorites]


Just to finish explaining: it takes into account how many times you've used a tag, not just which tags you use. For the clustered tags dendrogram, closeness is based on which users use the tags.
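To illustrate what "how many times" means here, a rough sketch of a count-aware distance between two users. The thread does not say which metric was used; a Pearson-correlation distance is assumed, and the function name and example vectors are made up.

```python
from math import sqrt

def pearson_distance(v1, v2):
    """1 - Pearson correlation between two equal-length tag-count vectors."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(x * x for x in v1)
    sum2_sq = sum(x * x for x in v2)
    prod_sum = sum(x * y for x, y in zip(v1, v2))
    num = prod_sum - (sum1 * sum2) / n
    # max(..., 0.0) guards against tiny negative values from rounding.
    den = sqrt(max((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n), 0.0))
    if den == 0:
        return 1.0  # a constant vector is treated as uncorrelated (a choice made here)
    return 1.0 - num / den

# Each position is one tag; each value is how often the user used that tag.
same_pattern = pearson_distance([10, 0, 1], [20, 0, 2])   # same proportions
different    = pearson_distance([10, 0, 1], [0, 10, 1])   # different favourite tag
```

Two users with the same proportions of tags come out very close even if one posts twice as much, while users who favour different tags end up far apart.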
posted by signal at 9:16 PM on July 22, 2008


Does this data exclude posts to Mefi Music?
posted by micayetoca at 9:22 PM on July 22, 2008


After I got my bachelor's in Math a couple of decades ago, I ran as far away as I could from it, which is probably good, because I'm pretty sure that part of my brain is broken now. I have no clue how to interpret this, even as I applaud its creation.

*applauds*
posted by stavrosthewonderchicken at 9:22 PM on July 22, 2008


I heard we have to sleep with our clusters. Is this true?!
posted by dobbs at 9:26 PM on July 22, 2008


I love the inverted graph. I especially love the bottom left-hand corner.

This is very neat.
posted by cortex (staff) at 9:27 PM on July 22, 2008 [1 favorite]


Yeah, I'm reading Programming Collective Intelligence myself right now...it's a great and very entertaining read. I recommend it to everyone. It's well worth the price, and since it's a recent book even the used books will turn up in good condition.
posted by Deathalicious at 9:28 PM on July 22, 2008


"clustered the tags themselves"

Canada is right next to deadlink. I'm not sure what that means but it can't be good.
posted by Mitheral at 9:35 PM on July 22, 2008


Fascinating, but still over my head a bit. The old me is there. Now I'm trying to figure out what it all means.
posted by Fuzzy Skinner at 9:40 PM on July 22, 2008


OK. So, soundsofsuburbia and I have a high level of overlap ("close") in the tags that we use. Divabat has a high level of overlap with the tags used by the cluster of, like, five other users, and a higher overlap with that as a cluster than she does with any given member of that cluster individually. Right?

Or, from the tags, "literature" has a higher overlap with "gay"+everything else than it does with just everything else.

Right?
posted by klangklangston at 9:47 PM on July 22, 2008


I'm glad I'm not the only one who doesn't quite "get it" but it's fun to peruse. My branch mate on the diagram is grumblebee, who seems like a cool person. I guess this means we get to copy each other's homework, or share gossip about the people on the other branches, or trade lunches, or something.
posted by amyms at 9:53 PM on July 22, 2008


Neato!
posted by ThePinkSuperhero at 9:54 PM on July 22, 2008 [1 favorite]


So basically, if I understand you right, this dendrogram thingamajigger uses a whole bunch of fancy book-learnin' math to group users with other users who have used the same tags as each other in their posts, yes?

I'm also guessing that the further right on the dendrogram you are, the less often a user has used a tag?

Brain hurts... retreating now...
posted by Effigy2000 at 9:54 PM on July 22, 2008


But I barely know those people!

Um, or something like that. *stares*
posted by jokeefe at 10:13 PM on July 22, 2008


Could you run this on patterns of favouriting-- who tends to favourite particular user's posts? Would any useful clusters appear, I wonder?
posted by jokeefe at 10:18 PM on July 22, 2008 [1 favorite]


Huh. My closest person is hermitosis. Who's been banhammered, as I recall. (I may be wrong.) I feel like I should edge out of this section of the cafeteria and work my way towards the table to where the cool kids are sitting. You don't mind if I sit with you, right, Miko?
posted by jokeefe at 10:21 PM on July 22, 2008 [1 favorite]


Oh shit, disregard the comment about hermitosis. How could I forget the 500 favourites? Oy. *smacks forehead*
posted by jokeefe at 10:24 PM on July 22, 2008


So if I wanted to dendrogrammically partner with a certain someone, then I should find posts that use their same tags. I am so going to post about gerbils.
posted by netbros at 10:31 PM on July 22, 2008


jokeefe writes "I feel like I should edge out of this section of the cafeteria and work my way towards the table to where the cool kids are sitting. You don't mind if I sit with you, right, Miko?"

I'm sitting at the cool kids table?! Alert Doc Brown I think a time rift or something is imminent.
posted by Mitheral at 10:44 PM on July 22, 2008


I heard we have to sleep with our clusters. Is this true?!

Clusterfucks are not generally a good thing.
posted by Rumple at 11:28 PM on July 22, 2008 [1 favorite]


Burhanistan and desjardins, I am your father. /vader_breathing
posted by Blazecock Pileon at 11:57 PM on July 22, 2008 [1 favorite]


The problem with dendrograms is that they are difficult to read or interpret with more than a handful of classes (here, "users").

After traversing beyond a small distance between clusters, the between-cluster distances become unreadable.

It might be more communicative to put tags into a small set of color-coded classes ("computing"->red, "newsfilter"->green, etc.), label usernames in color based on majority of overlap, and use a heuristic to shuffle colors near to each other.
posted by Blazecock Pileon at 12:05 AM on July 23, 2008


So euphorb is the person I should be targeting for spamming with religious texts and requests for political donations?
posted by Abiezer at 2:44 AM on July 23, 2008


This is rad.
posted by allkindsoftime at 2:58 AM on July 23, 2008


Ah, swell?
posted by oxford blue at 3:32 AM on July 23, 2008


At least I'm not dead last like outlawyr.
posted by emelenjr at 3:38 AM on July 23, 2008 [1 favorite]


Peter H and interrobang, where's my money?!
posted by Brandon Blatcher at 3:39 AM on July 23, 2008


Does that mean Iridic's my long-lost brother? 'Cause if he is my parents have got some 'splainin to do.
posted by Kattullus at 4:57 AM on July 23, 2008


In space, no one can hear you go batshitinsane.
posted by BrotherCaine at 5:10 AM on July 23, 2008


I don't understand this. (Okay, no, I do, sort of.)

*note to self: make more posts! must appear in next dendothingy update!*
posted by rtha at 5:33 AM on July 23, 2008


I'm situated near Steve_at_Linnwood? Jesus Christ.
posted by orthogonality at 5:43 AM on July 23, 2008


I'm not sure I understand this, but I'm sure it's nice to be on a little twig with peacay, being shaded by the leaves of jonson, madamjujujive and fandango_matt on the branch above.
posted by jack_mo at 6:27 AM on July 23, 2008


OK. So, soundsofsuburbia and I have a high level of overlap ("close") in the tags that we use. Divabat has a high level of overlap with the tags used by the cluster of, like, five other users, and a higher overlap with that as a cluster than she does with any given member of that cluster individually. Right?

Right.

I'm also guessing that the further right on the dendrogram you are, the less often a user has used a tag?

No, the further right means that your minimum distance to another person or group of persons is lower. unique snowflake <> common snowflake.

Could you run this on patterns of favouriting-- who tends to favourite particular user's posts? Would any useful clusters appear, I wonder?

That would be a different kind of metric, though interesting.

It might be more communicative to put tags into a small set of color-coded classes ("computing"->red, "newsfilter"->green, etc.), label usernames in color based on majority of overlap, and use a heuristic to shuffle colors near to each other.

Not sure how readable that would be with 80 different tags in the 'small' version.

I'm situated near Steve_at_Linnwood? Jesus Christ.

Keep in mind that this relates to tags, not comments. So maybe you both use 'bush' a lot, for example, but for opposite reasons.
posted by signal at 6:39 AM on July 23, 2008


Ok...but just wondering -what would this information be useful for?
posted by konolia at 6:43 AM on July 23, 2008


"Ok...but just wondering -what would this information be useful for?"

I for one am pretty happy whenever I see my name mentioned, so it was useful in that context.
posted by mr_crash_davis at 6:50 AM on July 23, 2008 [3 favorites]


Here's a less pretty but simpler to interpret 2d 'cloud', where, again, the distance between 2 users relates to the similarity in their tag use.
posted by signal at 6:51 AM on July 23, 2008


I CANNOT BE DEFINED OR CATEGORIZED! I AM UNIQUE!




(well as unique as klangklangston, soundofsuburbia, divabat, liam, blazecock pileon, burhanistan and desjardins).

Pretty cool, Signal!
posted by KevinSkomsvold at 7:10 AM on July 23, 2008


Dammit! I only have 24 posts!

Pay close attention to when I do another one so you can run this all over again, ok signal?

Thanks for doing this, it's kinda neat. I'm disappointed to see that its use hasn't taken off like I'd hoped, so I don't have a branch with other users of the "tutsnuts" tag.
posted by ibmcginty at 7:13 AM on July 23, 2008


This is somehow related to the cabal isn't it? Like an org-chart or hit-list or something.
posted by quin at 7:20 AM on July 23, 2008


signal writes "So maybe you both use 'bush' a lot,"

"Use"? I don't objectify bush.
posted by orthogonality at 8:11 AM on July 23, 2008


signal writes "Here's a less pretty but simpler to interpret 2d 'cloud', where, again, the distance between 2 users relates to the similarity in their tag use."

And in that one, I'm nowhere near Steve_at_Linnwood, but I'm right next to mathowie? WTF? (And I overlap with ericb, which does make sense.)
posted by orthogonality at 8:16 AM on July 23, 2008


Here, try this: reduce your tag-space by clustering it alone, then cluster contributors by that, the clustered group tags. See if that's clearer.
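Roughly, that reduction could look like the sketch below. It assumes the tag clustering has already been cut into named groups (the `tag_to_group` mapping is hypothetical), and it only shows the collapsing step, not the clustering itself.

```python
from collections import Counter

def collapse_tags(user_tag_counts, tag_to_group):
    """Sum each user's per-tag counts into per-group counts."""
    collapsed = {}
    for user, counts in user_tag_counts.items():
        grouped = Counter()
        for tag, n in counts.items():
            # Tags without a group keep their own name (an arbitrary choice).
            grouped[tag_to_group.get(tag, tag)] += n
        collapsed[user] = dict(grouped)
    return collapsed
```

The collapsed counts have far fewer columns, so the user clustering runs on a much smaller matrix.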
posted by orthogonality at 8:19 AM on July 23, 2008


emelenjr writes "At least I'm not dead last like outlawyr."

Or first, on these graphs first and last place are kind of arbitrary.
posted by Mitheral at 8:31 AM on July 23, 2008


Someday, I will exist!
posted by lunit at 8:36 AM on July 23, 2008


One of these images totally kills my Firefox, dead. Even relaunching with tabs kills it over and over, faster than I can kill the tab.

Why am I not in the user dendrogram? First post in 2/2007, so I should be in the infodump.
posted by DU at 8:45 AM on July 23, 2008


DU you only have 8 posts, the cut off is 25.
posted by Mitheral at 8:58 AM on July 23, 2008


OIC, I thought that was a different graph.
posted by DU at 9:10 AM on July 23, 2008


If the entire infodump was graphed in that manner it would be enormous...
posted by Burhanistan at 9:13 AM on July 23, 2008


Yeah, doing a full networked graph of pretty much any of the Infodump files is crazy talk. We're talking huuuuuge.
posted by cortex (staff) at 9:19 AM on July 23, 2008


That sounds like a challenge!
posted by DU at 9:35 AM on July 23, 2008


That sounds like a challenge!

Do it to it! Really, the problem comes in a couple forms: display space and computational complexity.

Display:

If you want to graph a network of, say, 30,000 usernames, you need to find a way to show that off. Showing it off in a .gif means 30K * the number of pixels an average node and its surrounding space take up. Imagining a very tightly, very uniformly packed graph where each node-and-buffer is only about 10*10 pixels, you're talking a big image already: 3 million pixels, or a largish photograph.

Your nodes are going to be glorified dots, though, so identifying them is going to be awfully hard; and you'd better hope you don't have too heavily connected a graph, because that leaves you nearly zero wiggle room for a lot of edges between nodes.

If you use usernames instead of dots, your size requirements blossom by an order of magnitude at least, all else remaining equal, and you're suddenly in the questionable territory of a, say, 6000*6000 pixel image. Doable, but a pain in the ass to navigate even on a huge monitor.

Add more space to accommodate less-than-uniform distribution or to open up pathways for a less-than-minimally-connected graph and things just get worse.

So there's that. You can say fuck it and just make a big image anyway—if it's neat, it's worth it, and smaller sizes can look pretty even if they're not so great for reading. You could also do something clever like use flash to make it a dynamic interface to surf through. But it's an issue.
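The back-of-envelope numbers above can be checked with a few lines. This is only a sketch of the arithmetic, assuming a perfectly packed square grid; the function names are made up.

```python
import math

def grid_image_size(num_nodes, cell_px):
    """Side and pixel count of the smallest square grid image holding num_nodes cells."""
    if num_nodes <= 0:
        return 0, 0
    cells_per_side = math.isqrt(num_nodes - 1) + 1  # integer ceil(sqrt(num_nodes))
    side = cells_per_side * cell_px
    return side, side * side

def pixels_per_node(width, height, num_nodes):
    """Average pixel budget per node in an image of the given size."""
    return width * height / num_nodes

side, pixels = grid_image_size(30_000, 10)       # dots in 10*10 cells
budget = pixels_per_node(6000, 6000, 30_000)     # the 6000*6000 username case
print(side, pixels, budget)
```

With 10*10 cells the grid comes to 1740 pixels on a side, a little over 3 million pixels. A 6000*6000 image leaves 1200 pixels per node, about a 35*35 square, which is not much room for a username.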

Computational complexity:

Calculating networked graph relations gets expensive fast. Somebody knee-deep in this stuff can probably talk about some of the clever ways to reduce this cost, but we're generally talking O(n^2) for laying out a bunch of nodes and edges with any kind of weighting relationships between individual nodes. For small n, that doesn't matter, but for big n it's a problem. And some layout issues are worse yet, cubic rather than quadratic; I think splined edges are probably a good common example of this.

So if you want a big graph, you need to have the patience to calculate it. (And look, and tear your hair out, and tweak your graphing method, and recalculate it. And look, and tear your hair out...) If you want a big pretty graph, you have to have a lot more patience. Memory requirements for all this are also an issue, though that depends a lot on the tools you're using, and the good news is that the tools available (even to lazy armchair dabblers like me) have been getting better over time.
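For a feel of the quadratic part, here is a small sketch that counts how many node pairs there are to consider at a few sizes. The sizes are the two user counts from the post plus the 30,000 figure above; the helper name is made up.

```python
import math

def pair_count(n):
    """Number of distinct unordered pairs among n nodes: n * (n - 1) / 2."""
    return math.comb(n, 2)

for n in (556, 2268, 30_000):
    print(n, pair_count(n))
```

Going from 556 to 2268 nodes is about four times the nodes but nearly seventeen times the pairs, and 30,000 nodes means roughly 450 million pairs.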
posted by cortex (staff) at 9:51 AM on July 23, 2008 [1 favorite]


In the cloud I'm right next to owhydididoit while on the dendrogram I'm next to Iridic. Now, looking at the top 10 tags of each I seem to have a lot more in common, posting wise, with Iridic. Why is there such a radical difference and why am I so close to owhydididoit in the cloud? While we have some things in common our tags don't seem to be all that similar.
posted by Kattullus at 9:52 AM on July 23, 2008


Really awesome, thanks signal! Also, I'm clearly not pulling my weight post-wise.
posted by Kwine at 10:01 AM on July 23, 2008


orthogonality: Here, try this: reduce your tag-space by clustering it alone, then cluster contributors by that, the clustered group tags. See if that's clearer.

How do you define when to stop clustering? In the linked examples, the whole thing ends up as one big cluster.

So if you want a big graph, you need to have the patience to calculate it.

In the 'small' case, calculating the clustering took about 1/2 hour, and the big one about 12 hours. Drawing the dendrograms is fast, though, as it's O(n).

I worked with a 10000 node highly connected graph for my master's thesis. Calculating stuff like Clustering or Betweenness Centrality on that took about 1/2 hour - 1 hour each. I didn't try to make pictures of the graph, as they would have been unintelligible.

Why is there such a radical difference

The 'cloud' algorithm doesn't have a single, definite solution, so it looks for a 'best' solution where everybody is sort of in the right neighborhood. If you run it again, it comes up with a different solution.
The dendrogram is always the same.
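To show why the dendrogram comes out the same every time, here is a naive agglomerative clustering sketch. It is not the book's code: the Euclidean distance, the size-weighted centroid merge, and the tuple-based tree are assumptions chosen to keep it short.

```python
from math import dist  # Euclidean distance, Python 3.8+

def hcluster(vectors, distance=dist):
    """Naive agglomerative clustering; needs at least one vector.

    vectors: dict mapping a name to a list of numbers.
    Returns a nested tuple tree whose leaves are the names.
    """
    # Each cluster is (tree, centroid, member_count); sorting fixes the start order.
    clusters = [(name, list(vec), 1) for name, vec in sorted(vectors.items())]
    while len(clusters) > 1:
        best = None
        # Find the closest pair; strict '<' keeps the first pair on ties.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (ti, vi, ni), (tj, vj, nj) = clusters[i], clusters[j]
        # Merge the pair into one cluster with a size-weighted centroid.
        centroid = [(a * ni + b * nj) / (ni + nj) for a, b in zip(vi, vj)]
        merged = ((ti, tj), centroid, ni + nj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

tree = hcluster({"a": [0, 0], "b": [0, 1], "c": [10, 10], "d": [10, 11]})
```

There is no random starting point anywhere, so the same input always gives the same tree. The loop only stops when a single cluster is left, which is why everything ends up under one root. Rescanning every pair on every merge is also what makes this naive version slow on large inputs.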
posted by signal at 10:45 AM on July 23, 2008


So how do I use this to get people to sleep with me.
posted by Astro Zombie at 11:15 AM on July 23, 2008


Explain how it is accomplished, in detail, and then snuggle up under the blanket once they've nodded off.
posted by cortex (staff) at 11:21 AM on July 23, 2008


That reminds me.

Note to self: Remember to stop falling asleep when Cortex is under the blankets with you.
posted by Astro Zombie at 11:30 AM on July 23, 2008


Not sure how readable that would be with 80 different tags in the 'small' version.

Not very readable at all. But if you further classify tags into larger and more abstract categories, say seven or eight "main" tag classes, then you could color code users. You might also use font size to communicate another dimension of information.
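A sketch of the color-coding step, assuming the tags have already been sorted by hand into a few classes. The mapping, the palette, and the alphabetical tie-break are all hypothetical.

```python
CLASS_COLORS = {"computing": "red", "newsfilter": "green"}  # hypothetical palette

def majority_class(tag_counts, tag_to_class):
    """Class that accounts for most of a user's tag uses, or None if no tag is classified."""
    totals = {}
    for tag, n in tag_counts.items():
        cls = tag_to_class.get(tag)
        if cls is not None:
            totals[cls] = totals.get(cls, 0) + n
    if not totals:
        return None
    # Sorting first makes ties resolve alphabetically instead of by dict order.
    return max(sorted(totals), key=totals.get)

def user_color(tag_counts, tag_to_class, default="black"):
    """Color for a username label, falling back to a default for unclassified users."""
    return CLASS_COLORS.get(majority_class(tag_counts, tag_to_class), default)
```

For example, a user with `{"linux": 3, "apple": 2, "bush": 4}` would be labeled red if "linux" and "apple" both map to "computing".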

If you use usernames instead of dots, your size requirements blossom by an order of magnitude at least, all else remaining equal, and you're suddenly in the questionable territory of a, say, 6000*6000 pixel image.

If you use a vector representation, like SVG or Flash, this becomes a non-issue.
posted by Blazecock Pileon at 11:45 AM on July 23, 2008


Does this mean Smedleyman left because he didn't want to sit next to me?

*cries*
posted by homunculus at 11:45 AM on July 23, 2008


Can we get a >= 3 posts thingy? pretty please?
posted by cowbellemoo at 11:48 AM on July 23, 2008


Tell you what, cowbellemoo. Make 2 more posts, and I'll poke cortex with a stick so he re-dumps the data and I'll run the analysis again.
posted by signal at 12:21 PM on July 23, 2008


Can we get a >= 3 posts thingy? pretty please?

That'd be an awful lot of data to process, I think.

(And besides, it clearly needs to be >= 2.)
posted by Sys Rq at 12:25 PM on July 23, 2008


Hey, wait, what? Apparently I have made three crappy posts.
posted by Sys Rq at 12:27 PM on July 23, 2008


So if I'm reading this correctly, I'm an only child who acquired a passel of smart, good looking children via parthenogenesis. Alrighty.

But on the other hand, stavrosthewonderchicken is either my Grandpa or my Dad, and I have a serious mother *zeoslap* sister *zeoslap* mother *zeoslap* sister *zeoslap* dilemma.
posted by maudlin at 1:23 PM on July 23, 2008


Tell you what, cowbellemoo. Make 2 more posts, and I'll poke cortex with a stick so he re-dumps the data and I'll run the analysis again.

Oh, man. You had me at stick-poking.

It's a deal.
posted by cowbellemoo at 1:33 PM on July 23, 2008


I have learned that I need a much, much bigger monitor.
posted by adamrice at 4:25 PM on July 23, 2008


Taller at the least, adamrice. Like a moni-scroll or something.
posted by brundlefly at 5:21 PM on July 23, 2008


I don't really understand this, but I'm on the same branch as Miko, so I approve.
posted by languagehat at 5:56 PM on July 23, 2008


Heh, me and plep and hama7. That's pretty good company :) The 'japan' tag by any chance?

This is really cool, thanks signal!
posted by carter at 6:00 PM on July 23, 2008


languagehat writes "I don't really understand this, but I'm on the same branch as Miko"

I'm there too along with LarryC, Rumple and marxchivist and we all have History as our number one tag. So does our first cousin Carter.
posted by Mitheral at 7:13 PM on July 23, 2008


An idea:

Plot posting user's name against user names in that user's posts, ie. who responds to whom.
posted by five fresh fish at 10:27 PM on July 23, 2008


I'm not sure I understand this, but I'm sure it's nice to be on a little twig with peacay jack_mo, being shaded by the leaves of jonson, madamjujujive and fandango_matt on the branch above.
posted by peacay at 10:36 PM on July 23, 2008


Heh, I'm in a cluster with (among others) andrew cooke, saucy intruder and mecran01 so, even though I'm never going to sleep with them, I'm OK with their company.
posted by dg at 1:51 AM on July 24, 2008


I can't be bothered to read the comment stream, but signal: oooooooooh, very pretty. I love the colors.

Not sure what it means, man, but I'm sure it's quite good, so thank you very much.
posted by Marie Mon Dieu at 8:57 PM on July 24, 2008


Thanks, signal!
posted by nthdegx at 10:25 AM on July 25, 2008


"selecting the users with 25 or more posts ... which gives me 556 users"

So only 550 of us have created 25 or more posts? Out of 25,000 or whatever?

MAN! Pitch in, n00bzorz!
posted by mwhybark at 2:41 PM on July 25, 2008


25*556 = at least 13,900 posts. That's about a fifth of the total posting history, ever, on the front page. If everybody else pitched in at that rate, we'd be dead.
posted by cortex (staff) at 2:48 PM on July 25, 2008


So only 550 of us have created 25 or more posts?

Actually, it's 25 or more posts that each use at least one of the top 80 tags.
posted by signal at 6:47 PM on July 25, 2008

