Near-match link checker? July 29, 2010 11:21 AM

Double post detectiong link checker pony/bug?

In this (soon-to-be deleted double of this) post, I assume the link checker didn't detect the double due to the arguments (the part after the ?) being different. Obviously sometimes different arguments will truly result in different pages (e.g., http://example.com/article?id=xxxx), but shouldn't the code that checks for potential doubles at least suggest the possibility of a double when the only difference between links is the part after the question mark?
posted by axiom to Bugs at 11:21 AM (26 comments total) 1 user marked this as a favorite

I also didn't get a warning about the use of the "hospice" tag in both posts. At least, I don't think I did.
posted by chunking express at 11:24 AM on July 29, 2010


The link checker is just a tool; we'll never reach perfection with it. Sometimes stories get posted to multiple sites with completely different domain names. We don't see this type of double-post every day, and it's not hurting the site much. I think trying to parse every potential variation of every URL would cause more problems than it would fix.

I completely understand the frustration with double-posts, but I don't think they can be completely eliminated with technology. Checking tags first often works better than the link-checker.
posted by pb (staff) at 11:25 AM on July 29, 2010


Detectiong! It's midway between 'detection' and 'detecting' -- it's like my fingers couldn't decide which one I meant.
posted by axiom at 11:25 AM on July 29, 2010


Is the tag checker case sensitive?

And yeah, I find people on the site usually catch doubles shockingly fast.
posted by chunking express at 11:30 AM on July 29, 2010


I also didn't get a warning about the use of the "hospice" tag in both posts.

We only warn about similar tags on posts within the past 48 hours. So it's tough with a feature article like this that could be posted at any time. And we have 11 posts tagged hospice, so it's not helpful to simply warn that the tag has been used in the past.
posted by pb (staff) at 11:31 AM on July 29, 2010


Is the tag checker case sensitive?

Nope. HOSPICE would match Hospice and hospice.
posted by pb (staff) at 11:33 AM on July 29, 2010


I think trying to parse every potential variation of every URL would lead to more problems than it would fix things.

I wasn't really suggesting every potential variation (not sure how you'd even predict potential variations) so much as "ignore everything after the ?", but I get your point that it's something of a solution in search of a problem. I was just thinking it'd be super easy to implement, so hey, why not? OTOH I suppose the double got detected tout de suite anyway, so all's well etc.
posted by axiom at 11:36 AM on July 29, 2010


speaking as a web coder, "ignore everything after the ?" would be severely problematic. in some instances, that stuff is extra info-garbage, but sometimes it's actually the part that tells you which article to go to.
posted by epersonae at 11:40 AM on July 29, 2010


For some sites, everything after the ? is critical, like this:

http://example.com/article?id=1010101

And for Google search results, everything after the ? is what makes the URL unique. It would work for the New York Times, but not for every site. And trying to keep track of sites it works for vs. those it doesn't would be a never-ending game of whack-a-URL with a bunch of false positives along the way.
posted by pb (staff) at 11:40 AM on July 29, 2010
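
To make that concrete, here is a minimal sketch (in Python, purely illustrative and not MetaFilter's actual code) of what comparing URLs while ignoring the query string would do. The single-page-view case collapses the way axiom wants, but pb's article?id example collapses too, which is the false-positive problem.

```python
# Illustrative sketch only -- not the real MeFi checker. Compares URLs by
# scheme + host + path, ignoring everything after the "?".
from urllib.parse import urlsplit

def strip_query(url):
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}{parts.path}"

# Good: a single-page view of the same (hypothetical) article matches its base URL.
print(strip_query("http://example.com/longread?pagewanted=all")
      == strip_query("http://example.com/longread"))            # True

# Bad: two completely different articles also "match" -- a false positive.
print(strip_query("http://example.com/article?id=1010101")
      == strip_query("http://example.com/article?id=2020202"))  # True
```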


I used to often see people accidentally linking to the thread they were commenting in, instead of the obviously intended link. Reading this MeTa post reminded me that I haven't seen this happening as much lately. Was that something that was fixed sometime in the last 6-12 months?
posted by marsha56 at 11:47 AM on July 29, 2010


It would be impossible to distinguish between arguments that are required (e.g., the article ID, as pb pointed out) and all the extraneous settings/tracking arguments without which the link would still work. This would also have to be done on a site-by-site basis, rendering this pony completely unviable.

*tosses this on top of the dead pony pile*
posted by special-k at 11:47 AM on July 29, 2010


Could be, marsha56. We recently tweaked the link button under comment forms to automatically add http:// if people forget it.
posted by pb (staff) at 11:51 AM on July 29, 2010


Maybe that could be tweaked without too much trouble?

Again, I think this is trying to come up with a blanket rule based on one or two events. If this were really happening all the time and bogging down the site, it might be worth the false positives and headaches involved in checking for two versions of every URL (one with www and one without). We aren't going to be able to catch every permutation of a URL.
posted by pb (staff) at 12:16 PM on July 29, 2010


The other thing we can't come up with a technological solution for is people choosing to ignore the double-link warning. We can hone the double-checker as much as we want, and some people out there are still going to post anyway. I don't know why.
posted by pb (staff) at 12:25 PM on July 29, 2010


"Other URLs beginning with http://www.youtube.com/watch have been used recently, so this is probably a double and you shouldn't bother posting it."
posted by Wolfdog at 12:37 PM on July 29, 2010 [1 favorite]


You would get WAY too many false positives if you ignored the query parameters. It would just swing the problem from being the occasional false negative to the very frequent false positive, and people would just have even more grounds to ignore what it says.

Besides, this is a problem you can solve yourself -- remove extraneous stuff like www and any query parameters that you can tell from inspection don't matter and plug it into the checker.
posted by Rhomboid at 1:57 PM on July 29, 2010


I meant "ignore everything after the ?" as a means of detecting possible doubles when comparing the links for potential doubles, not remove it from the actual link that goes into the post, which, yeah, would obviously break them. Basically, the code would just allow potential matches to include any recent FPPs with links that almost match one in the post you're creating (almost meaning "but for everything after the ?").
posted by axiom at 3:18 PM on July 29, 2010


Uh yeah, we got that. Please re-read Wolfdog's comment for an example of why this would cause the duplicate checker to spit out mountains of false positives which would render it useless.
posted by Rhomboid at 3:35 PM on July 29, 2010


Sorry, I should've made it more clear I was referring to special-k's comment, which seemed to suggest that I thought the arguments could be stripped from the actual link. Which on re-reading, I actually think can be interpreted either way, so what do I know?

And yes, you would get a lot of false positives if you ran it against every post ever. I was suggesting more like "in the past 7 days" or something along those lines. Obviously this would not solve the YouTube problem, so yeah, at this point (as pb has already pointed out) it's not worth the work of continuing to tweak the heuristic just to catch an occasional double.
posted by axiom at 3:43 PM on July 29, 2010


You could, if you were really invested in this pony, scan the infodump for recurring URLs with query strings to get a feel for popular sites that use that format (like YouTube). Then implement the pony with a blacklist that exempts those sites but checks for anything else. Would reduce the false positive issue quite a bit.
posted by The Winsome Parker Lewis at 3:48 PM on July 29, 2010
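
If anyone did want to chase that, a rough sketch of the Infodump idea might look like the following (hypothetical Python; the only assumption is that you can get a flat list of previously posted URLs out of the Infodump).

```python
# Hypothetical sketch of the suggestion above: scan a list of previously
# posted URLs and flag domains that almost always carry query strings
# (YouTube, Google, etc.), so a query-ignoring dupe check can skip them.
from collections import Counter
from urllib.parse import urlsplit

def query_heavy_domains(urls, min_posts=20, ratio=0.9):
    total, with_query = Counter(), Counter()
    for url in urls:
        parts = urlsplit(url)
        total[parts.netloc] += 1
        if parts.query:
            with_query[parts.netloc] += 1
    # Domains seen often enough, where nearly every link has a query string.
    return {dom for dom, n in total.items()
            if n >= min_posts and with_query[dom] / n >= ratio}
```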


I know nothing about high-end coding, but I often wonder if it's possible to check a percentage of a link... I posted a MeTa a while back after having a link deleted because there was a referrer code tacked onto the original link, which was in every other way identical. I was remiss in not checking the tags, obviously, but... can the checker try to match a portion of the link? The actual ".com" portion plus 8 characters? Sure, you can remove it yourself, but doesn't that presume some level of familiarity with this stuff that some users might not possess?

As I said, my coding experience is limited to building some of my own stuff and some WordPress massaging, nothing that takes any knowledge, so I'm asking more out of curiosity.
posted by nevercalm at 3:54 PM on July 29, 2010


Wolfdog wrote what I meant to say much more succinctly.

but shouldn't the code that checks for potential doubles at least suggest the possibility of a double when the only difference between links is the part after the question mark?

You fail to realize that for most sites the stuff after the ? is the only thing that distinguishes different pages. If you ignore that, then every single link to such a site (Wolfdog's example) would get flagged as a false positive.

Sorry, I should've made it more clear I was referring to special-k's comment, which seemed to suggest that I thought the arguments could be stripped from the actual link.

No, I didn't mean that arguments should be stripped from a URL. What I meant to say was that it would take an insane amount of work to determine which arguments are the page identifiers and which are not, both of which vary among (and within) sites. So pb would have to code a different rule for each site and constantly keep an eye out for changes, which entirely defeats the purpose of a simple link checker.


tl;dr: The bottom line is that people on MeFi are smart and flag double posts within minutes, and the posts get taken down shortly after. We don't need a complicated link checker.
posted by special-k at 3:56 PM on July 29, 2010


I think you're all missing the point, which is that the admins do not want to be in a position of maintaining a bunch of site-specific rules -- and that's the ONLY way the current situation can be improved, by hand-coding rules that list in detail what parts of the URL matter for every site. For example, with YouTube, the only way to know that http://www.youtube.com/watch?v=r2PM0om2El8&feature=player_embedded and http://www.youtube.com/watch?v=r2PM0om2El8&feature=related and http://www.youtube.com/watch?v=r2PM0om2El8 are all the same link is by writing a rule that says "if domain=youtube.com, then look for the 'v' query parameter, take its value and compare it to that of other YouTube links." But even that is not foolproof, as there are other styles of links to YouTube videos, such as http://www.youtube.com/user/TwoTurntablesNMic#p/a/u/0/r2PM0om2El8, which is the same video but linked from the user page.

There is no way to automate this knowledge; you have to write rules for each site because every site does it differently. That is why they have said that the occasional false negative is worth not having to maintain a long list of rules, and why I said that if you're worried about it, you can do this URL canonicalization yourself before composing your post (the crunchland method) to give the link checker the best chance of working.
posted by Rhomboid at 4:05 PM on July 29, 2010
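
For what it's worth, the kind of hand-written, per-site rule being described would look something like this (hypothetical Python, reusing the YouTube URLs from the comment above); the user-page style link is exactly the case it still misses.

```python
# Sketch of a single hand-coded site rule: for youtube.com links, two URLs
# count as the same video only if their "v" query parameters match.
# Illustrative only -- nobody is claiming this is what MeFi runs.
from urllib.parse import urlsplit, parse_qs

def youtube_video_id(url):
    parts = urlsplit(url)
    if "youtube.com" not in parts.netloc:
        return None
    return parse_qs(parts.query).get("v", [None])[0]

a = "http://www.youtube.com/watch?v=r2PM0om2El8&feature=player_embedded"
b = "http://www.youtube.com/watch?v=r2PM0om2El8&feature=related"
print(youtube_video_id(a) == youtube_video_id(b))  # True -- same video

# Still misses the user-page style link, which carries the ID in the fragment:
c = "http://www.youtube.com/user/TwoTurntablesNMic#p/a/u/0/r2PM0om2El8"
print(youtube_video_id(c))  # None
```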


Or, y'know, just check posts with similar tags. Seriously.
posted by unSane at 9:05 PM on July 29, 2010


Rhomboid: "There is no way to automate this knowledge, you have to write rules for each site because every site does it differently."

I can imagine two general tools for link dup checking:
* fuzzy matching. Given two URLs, calculate an edit distance or something, and if it comes up below a threshold, alert the poster. I haven't yet posted in the blue, so I'm not sure how the existing dupe checker works.
* parameter matching. Given two URLs with matching paths, convert their parameters into parameter lists. If there's a common variable name with different values, it's not a dupe.

The latter would work in axiom's example, but isn't perfect.
posted by pwnguin at 10:16 PM on July 29, 2010
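
For the sake of discussion, here is roughly what those two tools could look like (hypothetical Python; the 0.95 similarity threshold and the use of difflib for the "edit distance or something" are my own guesses, not anything from the actual checker).

```python
# Sketch of the two proposed tools. Illustrative only.
from difflib import SequenceMatcher
from urllib.parse import urlsplit, parse_qs

def fuzzy_match(url_a, url_b, threshold=0.95):
    """Possible dupe if the two URLs are nearly identical as strings."""
    return SequenceMatcher(None, url_a, url_b).ratio() >= threshold

def parameter_match(url_a, url_b):
    """Possible dupe if host and path match and no shared query parameter
    disagrees; a shared name with different values (?id=12345 vs ?id=56790)
    means two different pages."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    if (a.netloc, a.path) != (b.netloc, b.path):
        return False
    qa, qb = parse_qs(a.query), parse_qs(b.query)
    return all(qa[name] == qb[name] for name in qa.keys() & qb.keys())
```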


You don't have to post to test the duplicate checker; see the crunchland method. The current dupe checker works on full string matching of URLs, i.e., a dupe is detected if the exact URL has appeared in a post previously.

Your proposals don't work in a number of situations. Case 1: a New York Times article has the URL "$foo", but the single-page version of the same article has the URL "$foo?pagewanted=all". The same is true of The New Yorker, except the base URL is "$url" and the single-page version is "$url?currentPage=all". Neither of these can be detected by parameter matching, because the base version has no parameters at all while the single-page version does, and there's a 14- or 15-character edit distance between them. You have to code a rule that says for domain=nytimes.com or newyorker.com, it's okay to ignore query parameters.

Case 2: for sites like www.guardian.co.uk and www.economist.com, the base URL is "$url" and the printable version is "$url/print". In these cases there are no query parameters at all, but the last 6 characters differ. But if you let the last 6 characters differ, then you also consider things like "example.com/article.php?id=12345" and "example.com/article.php?id=56790" to be the same when they are completely different articles, so clearly that's not viable. There are even sites where the printable version of the article has a different hostname, starting with "printability.something.com" where the base version is "www.something.com".

Again, every site does it differently. The only way to truly improve on the current status quo would be to code a bunch of per-site rules, and I can sympathize with the desire to avoid that kind of micromanagement.
posted by Rhomboid at 11:16 PM on July 29, 2010
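
In other words, the "improvement" ends up being a hand-maintained rule table, something like the hypothetical sketch below (the nytimes/newyorker/guardian/economist entries simply restate the cases listed above, and each would need rechecking whenever a site changed its URL scheme).

```python
# Hypothetical per-site canonicalization table -- exactly the maintenance
# burden the comments above are arguing against. Rules restate the cases
# described: ignore query params for some sites, strip a trailing /print
# for others.
from urllib.parse import urlsplit, urlunsplit

SITE_RULES = {
    "www.nytimes.com":    {"drop_query": True},
    "www.newyorker.com":  {"drop_query": True},
    "www.guardian.co.uk": {"strip_suffix": "/print"},
    "www.economist.com":  {"strip_suffix": "/print"},
}

def canonicalize(url):
    parts = urlsplit(url)
    rule = SITE_RULES.get(parts.netloc, {})
    path, query = parts.path, parts.query
    if rule.get("drop_query"):
        query = ""
    suffix = rule.get("strip_suffix", "")
    if suffix and path.endswith(suffix):
        path = path[: -len(suffix)]
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))
```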

