Data set of links to comments made in other posts? May 16, 2014 9:06 AM

Is there a data set of links to comments from other threads, ranked by the number of times the comment is linked to? Or, could one be created?

For example, this comment in the Who Gets to Graduate thread links to Grumblebee's comment here, in an entirely different post about empathy and failure.

There are other famous comments, such as the ask vs. guess culture comment and the how to break up with someone comment, that routinely get linked to from other discussions as well.

I think this would be a fun resource to have, as it signals quality and provides insight into MetaFilter's culture in a way that favorites do not--a snarky one-liner in a busy thread could get a hundred favorites quickly, but never be linked to again. Grumblebee's comment has only 50 favorites right now, but I bet it's been linked to many times.

Could we add this to the infodump?
posted by jsturgill to Feature Requests at 9:06 AM (19 comments total) 1 user marked this as a favorite

There's no way to pull that out of the existing Infodump files, no. It's something I could look at running as a one-off job, but I don't think we'd add it as a standard recurring job because (a) it's kind of niche and (b) it's probably gonna be sort of a bear to run since it requires parsing the actual content of several million comments to work.
posted by cortex (staff) at 9:17 AM on May 16, 2014


Ok, what do we need to bribe cortex with in order to do this one-off job?
posted by Melismata at 9:21 AM on May 16, 2014 [3 favorites]


Maybe do it in small, spread out batches up to the present day, then ongoing additions for every thread as it closes? Use a cron job that runs once a day or week or whatever to keep it current?
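
For the scheduling part, the crontab line itself could be as simple as this (the script path and the weekly Sunday 3 AM slot are just placeholders):

# hypothetical: rebuild the comment-link index every Sunday at 3 AM
0 3 * * 0 /usr/local/bin/rebuild_comment_link_index.py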
posted by jsturgill at 9:21 AM on May 16, 2014


The site search sort of has a way to see where comments are linked. If I search for 135487 (part of the URL of the thread grumblebee commented in) it shows up with these search results: one comment, and two posts (one of which is this one; the other is a false positive—it has a link to some other site that happens to have the same number in its url). Similarly, searching for 55153 (the thread where the Ask vs. Guess comment appears) catches many links to that comment.
posted by ocherdraco at 9:22 AM on May 16, 2014 [1 favorite]


Maybe do it in small, spread out batches up to the present day, then ongoing additions for every thread as it closes? Use a cron job that runs once a day or week or whatever to keep it current?

I'd be more inclined to just do it all in a lump on a Sunday night and then rerun it every great once in a while. But either way step one will be sitting down to make it happen, which I'll think about.
posted by cortex (staff) at 9:24 AM on May 16, 2014


If I set up a script to grab threads and pull the information out myself, is there an acceptable request rate that wouldn't degrade the servers or cause problems? Like, process 100 posts an hour or whatever? Use a certain user agent? Only run during a certain time?
posted by jsturgill at 9:26 AM on May 16, 2014


Ok, what do we need to bribe cortex with in order to do this one-off job?

$20, same as in town?
posted by sockermom at 9:29 AM on May 16, 2014 [2 favorites]


If you want to DIY a one-off scrape like that, doing it well throttled is the main thing, yeah. 100 threads an hour is totally reasonable and doesn't at that rate need to be an off-hours thing either.
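
If anyone does go that route, a minimal throttled-fetch sketch in Python might look like the following; the user agent string, post ID range, and exact delay are all placeholders rather than anything official:

import time
import requests

# ~100 threads/hour works out to one request every 36 seconds
USER_AGENT = "mefi-link-index script (your-contact-info-here)"
DELAY_SECONDS = 36

def fetch_thread(post_id):
    url = "http://www.metafilter.com/%d/" % post_id
    resp = requests.get(url, headers={"User-Agent": USER_AGENT})
    resp.raise_for_status()
    return resp.text

for post_id in range(1, 1001):      # whatever range of posts you're after
    html = fetch_thread(post_id)
    # ... extract and store comment links from html here ...
    time.sleep(DELAY_SECONDS)       # stay at roughly 100 threads/hour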
posted by cortex (staff) at 9:42 AM on May 16, 2014


On a related note, sometimes an old comment of mine gets a favorites bump and I don't know why. I assume it was mentioned in someone else's comment, but, if so, how do I track it down? Help, please?
posted by MonkeyToes at 9:52 AM on May 16, 2014


This is a great idea. Cortex, I'd really appreciate it if you did this, though I could see this being computationally intensive. Do we have a count of the number of links that exist in the metafilter corpus?

My first (probably naïve) guess would be to use a query with LIKE '%a href%metafilter.com%' to get the pertinent comments into a new table, then use the scripting language of your choice to iterate through them with a regex to extract the links themselves into yet another table. I'm not sure how to then make each link unique with a count of how many there were in a non-traveling-salesman way.
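
Something like this rough sketch, maybe, with the database, table, and column names being pure guesses at the real schema:

import re
import sqlite3
from collections import Counter

# hypothetical database/table/column names
conn = sqlite3.connect("metafilter.db")
rows = conn.execute(
    "SELECT comment_text FROM comments "
    "WHERE comment_text LIKE '%a href%metafilter.com%'")

# rough guess at what a comment-anchor URL looks like
link_re = re.compile(r'href="(https?://[a-z.]*metafilter\.com/\d+/[^"]*#\d+)"')

counts = Counter()
for (text,) in rows:
    counts.update(link_re.findall(text))

for url, n in counts.most_common(100):
    print(n, url)

The Counter (or an equivalent GROUP BY on the extracted link) would handle the make-each-link-unique-with-a-count step without anything fancy.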

Surely someone else has a better idea.
posted by double block and bleed at 9:53 AM on May 16, 2014


I'd be inclined to do some simple xpath scraping:

//*/div[@class="comments"]/a[contains(@href, "metafilter.com")]/@href

The fiddly bit is probably cleaning it up afterwards to catch stuff that has 'metafilter.com' in the URL but not as the domain, and to canonicalize the comment links.
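
In Python that could look roughly like this with lxml (the thread URL is just an example):

import lxml.html

# parse a single thread page; in practice you'd loop over all of them
tree = lxml.html.parse("http://www.metafilter.com/135487/")
hrefs = tree.xpath(
    '//*/div[@class="comments"]/a[contains(@href, "metafilter.com")]/@href')

for href in hrefs:
    print(href)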
posted by zamboni at 9:55 AM on May 16, 2014


MonkeyToes: "On a related note, sometimes an old comment of mine gets a favorites bump and I don't know why. I assume it was mentioned in someone else's comment, but, if so, how do I track it down? Help, please?"

You could always ask the people who are favoriting it how they came across it. Many people have asked me that over the years via memail, because I have a tendency to favorite comments in old, closed threads.
posted by zarq at 11:11 AM on May 16, 2014


On a related note, sometimes an old comment of mine gets a favorites bump and I don't know why. I assume it was mentioned in someone else's comment, but, if so, how do I track it down? Help, please?

There's this Why Favorited greasemonkey script:

"Adds "why?" links to old stuff in Recent Favorites that were newly favorited. Links to a search for the URL, which will hopefully show you a link from another comment."

I've never used it, so can't vouch for it.
posted by inigo2 at 11:43 AM on May 16, 2014


I have a tendency to favorite comments in old, closed threads.

Oh, you're that guy.
posted by desjardins at 11:59 AM on May 16, 2014


Guilty as charged. :)
posted by zarq at 12:02 PM on May 16, 2014


Wouldn't looking for comments which have accumulated, say, more than 10% of their favourites post thread-closure be a good proxy for identifying heavily-linked comments?

I'm no programmer, but it seems like it might be easier to tally the number of favourites a comment has received after x date than to retrace the entire network of intra-site links.
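
As a rough illustration only, that proxy could be sketched against the Infodump favorites data like this; the file and column names are guesses, and "more than 30 days after the comment was posted" stands in for "after the thread closed":

import pandas as pd

# file and column names are guesses; check the real Infodump headers
faves = pd.read_csv("favoritesdata.txt", sep="\t", skiprows=1)
comments = pd.read_csv("commentdata_mefi.txt", sep="\t", skiprows=1)

faves = faves[faves["type"] == "comment"].copy()
faves["faved_at"] = pd.to_datetime(faves["datestamp"])

comments = comments.rename(columns={"commentid": "target", "datestamp": "posted_at"})
comments["posted_at"] = pd.to_datetime(comments["posted_at"])

merged = faves.merge(comments[["target", "posted_at"]], on="target")

# favorites arriving more than 30 days after the comment was posted
late = merged["faved_at"] > merged["posted_at"] + pd.Timedelta(days=30)
late_ratio = late.groupby(merged["target"]).mean()

# comments with more than 10% of their favourites arriving "late"
candidates = late_ratio[late_ratio > 0.10].sort_values(ascending=False)
print(candidates.head(50))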
posted by Diablevert at 12:20 PM on May 16, 2014


Cortex: it's probably gonna be sort of a bear to run since it requires parsing the actual content of several million comments to work.

This seems trivially parallelizable with a MapReduce-like framework.

In Apache Spark, a Python script to find the most-linked comments would look something like this:
import re

# sc is an existing SparkContext
# Load comments from some data source
comments = sc.textFile("hdfs://...")

def extractLinks(comment):
    # IDs of comments linked from this comment (possibly empty); the regex is
    # just a guess at what MetaFilter comment-anchor URLs look like
    return re.findall(r'metafilter\.com/\d+/[^"#]*#(\d+)', comment)

# Distributed collection of (linked comment, [linking comments]) pairs:
linksByComment = comments.flatMap(
    lambda linker: [(linked, linker) for linked in extractLinks(linker)]).groupByKey()

# Transform it into a collection of (comment, numLinks) pairs
numLinksByComment = linksByComment.mapValues(lambda linkers: sum(1 for _ in linkers))

# Grab the top 100 most-linked comments, as (linkCount, comment) pairs:
topLinked = numLinksByComment.map(lambda pair: (pair[1], pair[0])).sortByKey(ascending=False).take(100)

Most comments don't link to other comments, so I bet that the initial link-extraction step would whittle the dataset down to a reasonable size.
posted by Jotnbeo at 1:11 PM on May 16, 2014 [3 favorites]


Because you can never spend too much time gazing at your own navel.
posted by spitbull at 12:47 AM on May 17, 2014 [1 favorite]


On the subject of metafilter analysis, a relevant resource that isn't very well known is Common Crawl, an open repository of web crawl data. It's not complete, and they usually do one crawl per year so it gets stale, but for some purposes it might be a relatively easy way to access a large sample of metafilter content without any worries about putting load on metafilter servers (because they've already done the scraping).

I have some examples here of downloading metafilter URLs from their repository (they have at least 120000 metafilter.com URLs scraped): https://github.com/wiseman/common_crawl_index

The data is in a format that is designed to be used easily from hadoop, so it's pretty straightforward to run map-reduce jobs on it too. It's stored in S3, and I've written trivial jobs that can examine all their metafilter data in about 30 minutes, for a cost of $4 using Amazon Elastic MapReduce.
posted by jjwiseman at 9:59 AM on May 17, 2014 [1 favorite]

