Make all links a hole in one. (sorry, that was terrible) July 16, 2012 2:43 PM

Pony Request - is it possible to automatically fix broken URLs that omit the "http://"?

I occasionally see malformed links that are posted without the http:// at the beginning, which ends up giving you a relative link like:
http://ask.metafilter.com/12345/www.amazon.com/

I will usually flag these as an HTML error and move on, but I was wondering if some protection could be put in place, since I can't see the need for having relative links on Metafilter.

Would it be possible to have a regex or something similar that looks for http(s):// and ftp:// and other relevant protocols, and adds an http:// at the beginning if nothing's found?
posted by Magnakai to Feature Requests at 2:43 PM (23 comments total)

We have protection for this at the "link" button level—it gets fixed there. But trying to use regex to figure out what people meant is difficult. There have been times when people have used this very shortcut for in-thread links, and we don't want to break those.

I completely understand the frustration. But right now flagging as HTML Error is the best system we have for this that won't unintentionally break things further.
posted by pb (staff) at 2:50 PM on July 16, 2012 [1 favorite]


Yeah, they don't come up a whole lot and if you flag 'em when you see 'em a human mod can fix them with something approaching a zero percent failure rate.
posted by cortex (staff) at 3:29 PM on July 16, 2012


Question: Why do malformed links turn into relative links? Would it be possible to have them not form a link at all, so that it's more obvious there's a problem?
posted by Conrad Cornelius o'Donald o'Dell at 4:38 PM on July 16, 2012


Fair enough! I generally wouldn't suggest a regex for general-purpose use - as you've rightly said, it's dangerous and it always feels clunky. I don't know how ColdFusion works, but if it were PHP, I'd do something like
$firstfourletters = substr($link, 0, 4);
if ($firstfourletters !== 'http' && $firstfourletters !== 'ftp:') {
    // do some magic that's seeming increasingly complicated as I try and work this out
}


Yeah, okay, point totally seen. You'd then start having to watch for lots of cases, like htp://, htt::/, and doing complicated string removal and replacement, etc etc.

Tbh, it only really bothers me on mobile where it's much more clunky to copy, paste and edit a link - it just seemed like an unnecessary untidiness.

I never thought about using it for in-thread links. Do you just do something like this then? (For that link, I deleted the entire content of the link and just pasted in #1008771, which refers to pb's post.)
posted by Magnakai at 4:38 PM on July 16, 2012


Question: Why do malformed links turn into relative links?

It's actually default browser behavior. When a browser parses an <a href="xxx"> tag, it looks at that path string xxx and tries to decide whether it's a relative or an absolute link based on how it starts. If there's a protocol (most commonly "http://" but there's lots of others as well), it's an absolute link: a link that tells you what protocol to use, what domain to hit, and what path (directories and filename) at that domain to request.

If there is no protocol at the front, the browser will always assume it's a relative link, and will treat whatever you did put in for xxx as a directory-and-filename to try and request. That'll get you either a 404 error (because it's a nonsense path) or the current page (if xxx as interpreted by the web server just gets read as a weird way of requesting the page you're already on).

Note that the case where xxx is actually just an empty string (href="") is just a special version of this. A path of "" is just a request for the exact page you're already on.

This confuses folks sometimes when they think someone has literally, explicitly linked to the current page in a post or comment; actually, they generally didn't successfully link to anything at all, but the browser interprets that in a way that fails somewhat gracefully.
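
If you want to poke at the resolution rules yourself, Python's urljoin follows roughly the same logic browsers do; this is just an illustration of the behavior described above, nothing MetaFilter-specific:

from urllib.parse import urljoin

page = "http://ask.metafilter.com/12345/some-thread"
print(urljoin(page, "http://www.amazon.com/"))  # has a protocol: absolute, kept as-is
print(urljoin(page, "www.amazon.com/"))         # no protocol: relative, resolves to
                                                # http://ask.metafilter.com/12345/www.amazon.com/
print(urljoin(page, ""))                        # empty href: the page you're already on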

Would it be possible to have them not form a link at all, so that it's more obvious there's a problem?

Nope, not in any way that'd be less tricky than trying to automatically fix them to be proper links. Either way it'd be a matter of trying to parse and make guesses about the intentions of people's tag construction and errors.
posted by cortex (staff) at 5:35 PM on July 16, 2012 [1 favorite]


Do you just do something like this then? (For that link, I deleted the entire content of the link and just pasted in #1008771, which refers to pb's post.)

We don't use any relative links on the site for server-generated stuff, and it's actually not a great idea to use them in comments or posts, because people may be reading the post or comment content from a view other than the thread in which it was made. User profile page activity views, Recent Activity, etc. Better to use a fully-qualified path.

One of the quote-type scripts used to have this problem, actually, which led to some folks routinely creating, without realizing it, floating anchor tag links that would then seem busted on Recent Activity. Caused a lot of confusion.
posted by cortex (staff) at 5:38 PM on July 16, 2012


Yeah, they don't come up a whole lot and if you flag 'em when you see 'em a human mod can fix them with something approaching a zero percent failure rate.

Not every one that is flagged gets fixed, though. I'm pretty sure I flagged this comment of mine for HTML/display error, as I see I've flagged it and don't imagine it was for any other reason.

Do HTML/display error flags have any kind of priority notification on the mod side?
posted by 6550 at 6:07 PM on July 16, 2012


There's no special priority alert for them or anything, but we tend to get on top of them. If something slips through the cracks you can always drop us a line, of course.

That one's fixed now.
posted by cortex (staff) at 6:17 PM on July 16, 2012


Yeah, they don't come up a whole lot and if you flag 'em when you see 'em a human mod can fix them with something approaching a zero percent failure rate.

SO YOU ADMIT YOU ARE DEVELOPING A ROBOMODERATOR! DAMN YOU CABA
posted by Salvor Hardin at 7:04 PM on July 16, 2012 [1 favorite]


So you're balancing extra confusion against extra coldfusion?
posted by elephantday at 8:09 PM on July 16, 2012 [1 favorite]


If I were looking for issues like this, I'd look for links containing www., .com, .org, etc. but no http:, https:, ftp:, etc.
posted by Pronoiac at 9:20 PM on July 16, 2012


when I link to comments in-thread I use a relative link (because that's what it is), so this would break things for me.

well I could always change my behaviour, but I'm old and stuck in my ways.
posted by russm at 2:23 AM on July 17, 2012


I had the same thought as Pronoiac -- a regex that found links consisting of "stuff-that's-not-a-slash followed by .com|.net|.org followed by a slash or the end of the link" would probably catch 90% of the mistakes with no false positives. Probably not worth the extra complexity, but if fixing the links manually became an annoyance it would be a viable option.
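Something quick and dirty along these lines, maybe (a totally untested Python sketch with a deliberately tiny TLD list; handling every TLD is exactly the part I'm waving away):

import re

# hrefs with no scheme (http:, https:, ftp:, ...) but something domain-ish in them
missing_scheme = re.compile(
    r'href="(?![a-z][a-z0-9+.\-]*:)'           # skip anything that already has a scheme
    r'([^"/]+\.(?:com|net|org)(?:/[^"]*)?)"',  # not-a-slash, then .com/.net/.org, then / or the end
    re.IGNORECASE)

def add_scheme(html):
    return missing_scheme.sub(r'href="http://\1"', html)

print(add_scheme('<a href="www.amazon.com/">oops</a>'))      # gains http://
print(add_scheme('<a href="#1008771">in-thread anchor</a>'))  # left alone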
posted by jhc at 7:16 AM on July 17, 2012


We already use a good bit of regex on the site for a bunch of different things. It's very tempting to feel like I can write a bulletproof method to change what people enter for the better. The part of me that loves coding things would love to write this and implement this. The part of me that plays whack-a-mole with the ingenious ways people accidentally route around our existing regex knows that the path isn't as straightforward as it seems.

We have some admin tools to help us spot link shorteners. We try to change those to the full URL when we see them. If this becomes very annoying we could build some tools to help us humans spot and fix these quickly.
posted by pb (staff) at 8:29 AM on July 17, 2012


And of course, required link.
posted by pb (staff) at 8:36 AM on July 17, 2012 [1 favorite]


Pronoiac: “If I were looking for issues like this, I'd look for links containing www., .com, .org, etc. but no http:, https:, ftp:, etc.”

jhc: “I had the same thought as Pronoiac -- a regex that found links consisting of ‘stuff-that's-not-a-slash followed by .com|.net|.org followed by a slash or the end of the link’ would probably catch 90% of the mistakes with no false positives. Probably not worth the extra complexity, but if fixing the links manually became an annoyance it would be a viable option.”

So what you're talking about is a brief, handy little regular expression that checks for two hundred and seventy-five different top-level domains, probably including the several dozen internationalized variants. Please let me know when you get done writing that regular expression, as I imagine it'll be fun reading.

Regex should never, never be used to parse HTML. Never. That is not what it is for.
posted by koeselitz at 9:30 AM on July 17, 2012


Regex should never, never be used to parse HTML. Never. That is not what it is for.

I completely, wholeheartedly agree with that. But. Using an HTML parser for user-generated HTML is also a headache. Unless we force valid HTML at the time of entry we're dealing with messy code. And parsers choke on messy code. So sometimes regular expressions just get things done.
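
To make that concrete with a toy example (Python, with a strict XML parser standing in for "parser" here; not anything we actually run):

import re
import xml.etree.ElementTree as ET

messy = '<a href="www.amazon.com/">great deal'   # unclosed tag, the kind of thing people type

try:
    ET.fromstring(messy)                          # a strict parser gives up on it...
except ET.ParseError as err:
    print('parser choked:', err)

print(re.search(r'href="([^"]*)"', messy).group(1))  # ...while a dumb regex still finds the href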
posted by pb (staff) at 9:35 AM on July 17, 2012 [1 favorite]


True. There are probably cases where it's just easier, so "never" is going too far. I mostly mean that, in this case, the rule holds, because it's about a billion times more complex than it might seem when you first think of it.
posted by koeselitz at 9:43 AM on July 17, 2012


russm: when I link to comments in-thread I use a relative link (because that's what it is), so this would break things for me.

Why are you doing that? To save Metafilter a few bytes of html? It seems like you'd have to carefully edit after copying and pasting, and then the links don't work in Recent Activity or on your own comments page.


koeselitz: So what you're talking about is a brief, handy little regular expression that checks for two hundred and seventy-five different top-level domains, probably including the several dozen internationalized variants.

When I wrote that, I considered the new TLDs, like .app, kinda shrugged, and figured that if I were doing it, I'd bother with maybe five, and I may not even include .uk.

How about just looking for "." in the non-regex sense? I don't think any relative links on Mefi have them, though I could be wrong.


I'm getting a strong feeling of deja vu here. I think we've discussed small "hey, you might have faulty HTML here" warnings before - maybe this post. I wonder if a Greasemonkey script would be useful here.
posted by Pronoiac at 12:39 PM on July 17, 2012


For this particular issue, you might be able to solve some portion of it by making the 404 page smarter. This, for example, could either automatically redirect, or, if that is viewed as too confusing, just add a properly formatted link with some message like "perhaps you meant this?"

And/or it could automagically notify the mods, who could then fix the offending link.
posted by contrarian at 3:49 PM on July 17, 2012


Yeah, I think there are a couple of issues there. First of all, the link you wrote there should be sending people to our 404 page, not a blank page like that.

And yeah, we can definitely use our 404 reports to spot bad links.
posted by pb (staff) at 4:11 PM on July 17, 2012


After looking a bit more: the issue is that we allow any link stub in thread URLs. So when you have a URL that doesn't start with http://, it looks like a link stub.

For example, the URL stub for this page is: Make-all-links-a-hole-in-one-sorry-that-was-terrible. When you have a link without http://, it just looks like an alternate link stub; in your example, www.amazon.com. So when the site gets a request with an alternate link stub, it just serves up the page. There's no 404 involved.

So this URL (the real URL) is just as valid as this URL or even this URL. It doesn't matter what the link stub is. So there's no 404 page that gets served up where we could add some extra info or log the missed hit.

We can think about changing this. But having very permissive link stubs is useful in many cases. It ends up with more people at the thread, even if the URL is munged in some way. As long as the thread ID is accurate, people get to where they're going.
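
In rough pseudocode terms (a Python sketch purely for illustration; the real routing is ColdFusion and I'm handwaving the details), it works something like this:

import re

# only the numeric thread ID matters; the trailing "link stub" is ignored entirely
route = re.compile(r'^/(\d+)(?:/[^?]*)?$')

for path in ('/12345/Make-all-links-a-hole-in-one-sorry-that-was-terrible',
             '/12345/www.amazon.com/',
             '/12345/literally-anything-else'):
    print(path, '-> thread', route.match(path).group(1))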
posted by pb (staff) at 4:28 PM on July 17, 2012


Interesting and cool feature with the link stubs (which I guess was broken for metatalk, thus confusing me). Since I'd bet that the number of people who end up at this is higher than the number of badly formatted links, my solution is pretty much not useful. Thanks for looking!
posted by contrarian at 5:27 AM on July 18, 2012

