Is there any clever way to make sure any href has a URI? March 2, 2006 12:50 PM

In this post the malformed URL comes from leaving off the http:// from the address. Is there any clever way to make sure any href has a URI? I have no clever solution, I just thought I would ask.
posted by mzurer to Feature Requests at 12:50 PM (34 comments total)

I have done that before and caught it on preview - Perhaps it happens some other way as well...
posted by mzurer at 12:52 PM on March 2, 2006


u r rite
posted by thirteenkiller at 12:57 PM on March 2, 2006


# Flag the post if ANY href lacks a known scheme; the lookahead tests each link, not just the first.
if ($post =~ /<a\s+href="(?!(?:https?|ftp|gopher):\/\/)/i) {
     # discard it
}

Of course, I don't know if RegExps are possible or easy in ColdFusion, but it should still be possible without them.
posted by Plutor at 1:09 PM on March 2, 2006


Problems that stand between automation and nirvana:

1. People might actually be trying to make a relative link to on-site content.
2. Prepending http:// to, say, an https:// document could be problematic.
posted by cortex at 1:11 PM on March 2, 2006


(and using a prompted warning rather than auto-fixing might be a better idea, regardless -- informs and doesn't overrule)
posted by cortex at 1:12 PM on March 2, 2006


I struggled with this problem a while ago when I was writing some little PHP thing. The solution I came up with was basically:

1. look in the a href tag for ://
2. if there's no :// there, add http://

Simple, but effective.
posted by reklaw at 2:05 PM on March 2, 2006
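
A minimal Python sketch of the two steps described above, not the original PHP. The function name and the regular expression are assumptions made for illustration, and it only handles double-quoted href values on `<a>` tags. Note that it would also rewrite relative links and schemes without `//` such as mailto:, which is the concern raised elsewhere in this thread.

```python
import re

# Hypothetical pattern: captures the value of each double-quoted href on an <a> tag.
HREF_RE = re.compile(r'(<a\s+href=")([^"]*)(")', re.IGNORECASE)

def add_missing_scheme(html: str) -> str:
    """Prepend http:// to any href value that contains no '://'."""
    def fix(match: re.Match) -> str:
        before, url, after = match.groups()
        if "://" not in url:
            url = "http://" + url
        return before + url + after
    return HREF_RE.sub(fix, html)

print(add_missing_scheme('<a href="www.example.com">x</a>'))
```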


/agrees with cortex
posted by ijoshua at 2:34 PM on March 2, 2006


1. People might actually be trying to make a relative link to on-site content.

Eh? You mean, like linking to a comment in a thread, or even the same thread? But the whole URL is still needed even then, no?
posted by Gator at 2:36 PM on March 2, 2006


Eh? You mean, like linking to a comment in a thread, or even the same thread? But the whole URL is still needed even then, no?

No, actually relative URLs don't require the entire URL and will work on MeFi. It's a bit pathological, but one could link to the immediately preceding MetaTalk thread (i.e. 11401) with the following markup:

<A HREF="11401">

Note that the above link won't work when you view the comment anywhere other than in the main thread. In the other places that it might appear (such as My Comments) the relative URL will lead to a bad location.
posted by RichardP at 2:44 PM on March 2, 2006
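
To illustrate why the same relative href lands somewhere different depending on the page it is viewed from, here is a small Python sketch using the standard library's `urljoin`. Both page addresses are hypothetical stand-ins chosen for the example, not confirmed MetaFilter URLs.

```python
from urllib.parse import urljoin

# Hypothetical page addresses, used only to illustrate the resolution rule.
THREAD_PAGE = "http://metatalk.metafilter.com/mefi/11402"
COMMENTS_PAGE = "http://www.metafilter.com/contribute/mycomments.mefi"

def resolve(base: str, href: str) -> str:
    """Resolve href against the page it appears on, as a browser would."""
    return urljoin(base, href)

# Same href, two different contexts, two different targets.
print(resolve(THREAD_PAGE, "11401"))
print(resolve(COMMENTS_PAGE, "11401"))
```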


I'm highly embarrassed that I didn't know/had forgotten that. Thanks.

Still, does anyone actually use them here?
posted by Gator at 2:49 PM on March 2, 2006


If you view the source to this page, you'll see that my above comment actually makes use of the relative markup that I mentioned. The 11401 link that appears in the comment is a relative URL.
posted by RichardP at 2:51 PM on March 2, 2006


Oops, I guess you were asking if anyone used them for some legitimate purpose. Not that I've seen.
posted by RichardP at 2:52 PM on March 2, 2006


While I agree in principle with cortex, in not making fixes that can break other cases, the objections don't cut much ice: There is no real reason to make a relative link; and in the case of a missing https://, the URL won't be any more borken with http:// than with http://www.metafilter.com/ preceding it.

But a detect and warn sounds like a great idea.
posted by mzurer at 2:56 PM on March 2, 2006


Yeah, I just wondered if anyone actually uses relative links on MetaFilter, aside from the example you provided. Even though it seems like a nifty shortcut (now that I know/have been reminded of it), I'll probably continue to use full URLs for consistency's sake, myself.
posted by Gator at 2:57 PM on March 2, 2006


mzurer: granted, I was arguing principle to see what stuck. I doubt more than a handful of weirdos have ever bothered to use relative links, and a borked link is, indeed, a borked link.

Detect and warn! Educate the masses!
posted by cortex at 5:30 PM on March 2, 2006


ColdFusion having regexes would mean that it was good for something.
posted by cellphone at 8:19 PM on March 2, 2006


There is no real reason to make a relative link

Sites that use relative links can be moved.
posted by George_Spiggott at 8:56 PM on March 2, 2006


Sites that use relative links can be moved.

But they also bork things up if you're reading off-site (in an RSS reader, for example).
posted by joshuaconner at 11:27 PM on March 2, 2006


Sites that use relative links can be moved.

Yes, but relative links in this context are only relative to Metafilter.com, and I don't think it's going anywhere.

I thought of a solution. Turn off the prepending of the metafilter domain in all cases. Don't know if that is easy or clever though. I also don't know if that breaks relative (to metafilter) hrefs.
posted by mzurer at 8:21 AM on March 3, 2006


mzurer: "I thought of a solution. Turn off the prepending of the metafilter domain in all cases. Don't know if that is easy or clever though. I also don't know if that breaks relative (to metafilter) hrefs."

It's a browser feature, not part of the MeFi code. When it doesn't see http:// (or any protocol), the browser assumes it's a relative URI. That's the problem here. There's no way (or rather, it's very difficult) to determine if something without a protocol is a relative URI or is just a malformed absolute one.
posted by Plutor at 10:29 AM on March 3, 2006
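
The point about a missing protocol can be seen with Python's standard URL parser, which splits references the same general way. This is only a sketch; the href is a sample value, and the parser has no way to tell a forgotten http:// from an intentional relative path.

```python
from urllib.parse import urlparse

def describe(href: str) -> tuple:
    """Return the (scheme, host, path) that a generic URL parser sees."""
    parts = urlparse(href)
    return (parts.scheme, parts.netloc, parts.path)

# Without a scheme, the whole string is treated as a relative path.
print(describe("www.cat-scan.com"))
# With a scheme, the same name is recognized as a host.
print(describe("http://www.cat-scan.com"))
```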


But they also bork things up if you're reading off-site (in an RSS reader, for example).

Then the RSS reader is doing it wrong. Relative or partial URLs are part of the www specification (rfc1630, among others). It's not up to the implementers of a new technology to decide that standards are void because they can't be bothered to support them. And it's not like it's difficult.

Yes, but relative links in this context is only relative to Metafilter.com, and I don't think it's going anywhere.

It doesn't matter what you think, various things happen. Sometimes webmasters have to rehost due to temporary hardware or DNS issues; or you may rehost to ease the transition to a new host while DNS propagation occurs (I've done this several times, it allows continuity of service irrespective of visitors' DNS cache times). Disallowing relative URLs breaks the web.
posted by George_Spiggott at 10:45 AM on March 3, 2006


Actually RFC1808 has the first detailed form of the spec.
posted by George_Spiggott at 10:47 AM on March 3, 2006


George_Spiggott: "But they also bork things up if you're reading off-site (in an RSS reader, for example).

Then the RSS reader is doing it wrong. Relative or partial URLs are part of the www specification (rfc1630, among others). It's not up to the implementers of a new technology to decide that standards are void because they can't be bothered to support them. And it's not like it's difficult."

Say you're an RSS reader. You get an RSS feed, which as part of the content has, say <a href="foo.html">Click here!</a>. Where would you send a click on that? Relative to the feed's link (i.e. the blog's home page)? Relative to the item's link (i.e. the entry page)? Which follows the "WWW spec"? Which is DWIM? Which will prevent people accusing you of "doing it wrong"?

Not all is black and white, you know.
posted by Plutor at 11:25 AM on March 3, 2006


Good point -- it's not the reader that's broken, it's the feed. Since the reader isn't in a position to know the original context, the feed will have to rewrite relative links to be absolute, since it's the only component of the process that knows where it goes. As I've written RSS feeds, I'm now going to go and make sure that mine actually do that.

This does mean that in some of the scenarios described above, syndicated content that's cached on the client could be wrong for a time, but situations like this abound, and as long as they're temporary (RSS readers should not cache for long periods and certainly not permanently) it becomes a marginal case.
posted by George_Spiggott at 11:36 AM on March 3, 2006
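
A rough Python sketch of what "the feed rewrites relative links to absolute" could look like. The function name, the regular expression, and the example.com addresses are all assumptions for illustration; a production feed generator would more likely use a real HTML parser than a regex.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: any double-quoted href attribute value.
HREF_RE = re.compile(r'(href=")([^"]*)(")', re.IGNORECASE)

def absolutize_links(html: str, item_url: str) -> str:
    """Rewrite every double-quoted href in html so it is absolute,
    resolving relative values against the item's own page address."""
    return HREF_RE.sub(
        lambda m: m.group(1) + urljoin(item_url, m.group(2)) + m.group(3),
        html,
    )

# The feed knows where the item lives, so it can do the resolution itself.
print(absolutize_links('<a href="foo.html">Click here!</a>',
                       "http://example.com/blog/entry/1"))
```

Already-absolute links pass through `urljoin` unchanged, so the rewrite can be applied to every item without checking first.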


Not unexpectedly, this issue has been explored before. Here's a good discussion of the issue and standards-based approaches to mitigating it.
posted by George_Spiggott at 11:42 AM on March 3, 2006


It's a browser feature, not part of the MeFi code.
That possibility was bouncing around in my head somewhere, but I never looked into it deeply. Interesting.

It doesn't matter what you think, various things happen.
...
Disallowing relative URLs breaks the web.

So what is your suggestion on how to address the problem?
posted by mzurer at 12:07 PM on March 3, 2006


I encounter this problem all the time, and there's no provably correct solution; you have to code up something that more or less fits your site and catches the majority of cases.

To reiterate the problem, how can the server detect the intent to post a link to another site, i.e. an absolute URL, when the scheme portion ("http://") is missing? About the only thing you can do is craft some regexps (ColdFusion apparently supports a function called "REFind" for this purpose) that match pretty well on domain names. As long as your site doesn't have relative paths that look like domain names from a pattern matching perspective, you can pretty well be sure the poster was trying to post an absolute URL and you can prepend the scheme as needed.
posted by George_Spiggott at 1:07 PM on March 3, 2006
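
A sketch of the domain-name heuristic in Python rather than ColdFusion. The pattern and function name are assumptions for illustration, and the pattern is deliberately rough: it accepts dot-separated labels ending in an alphabetic suffix and is not a full hostname validator.

```python
import re

# Hypothetical heuristic: dot-separated labels ending in an alphabetic
# top-level label, optionally followed by a path, query, or fragment.
DOMAIN_LIKE = re.compile(
    r'^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}(?:[/?#].*)?$',
    re.IGNORECASE,
)

def prepend_scheme_if_domain(href: str) -> str:
    """Add http:// only when the href has no scheme and looks like a domain."""
    if "://" in href:
        return href
    if DOMAIN_LIKE.match(href):
        return "http://" + href
    return href

print(prepend_scheme_if_domain("www.cat-scan.com"))
print(prepend_scheme_if_domain("mefi/11401"))
```

The caveat from the comment above applies directly: a bare relative path such as mycomments.mefi also looks like a domain to this pattern and would be rewritten.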


Well, there's another possible approach: give the server a heuristic to decide whether there was an intent to post a non-external (and hence relative) link.

In other words, given a url string, test it for plausibility as an internal link, according to some site-specific grammar of possible internal links.

(Mefi's grammar would include, for example:
  • mefi/\d+
  • mefi/\d+\#\d+
  • user/\d+
  • contribute/mycomments.mefi
  • tags/.*
and so on.)

Which is not to say that the above is any more practical than the regex matching of external domains.

(An alternate take on the above: attempt to actually resolve the suspected relative link, and throw a warning if it doesn't return healthy content.)
posted by cortex at 3:10 PM on March 3, 2006
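
The site-specific grammar idea could be sketched in Python as a list of patterns that must match the whole href. The patterns are taken from the list above; the function name is an assumption, and a real grammar would need the "and so on" entries that are not spelled out here.

```python
import re

# Patterns copied from the list above; the real site grammar would be longer.
INTERNAL_PATTERNS = [
    re.compile(r'mefi/\d+'),
    re.compile(r'mefi/\d+#\d+'),
    re.compile(r'user/\d+'),
    re.compile(r'contribute/mycomments\.mefi'),
    re.compile(r'tags/.*'),
]

def looks_internal(href: str) -> bool:
    """True if the entire href matches one of the known internal link shapes."""
    return any(p.fullmatch(href) for p in INTERNAL_PATTERNS)

print(looks_internal("mefi/11401"))
print(looks_internal("www.cat-scan.com"))
```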


cortex: That last is actually completely practical (barring ColdFusion weaknesses; I don't recall offhand if CF includes an HTTP client function, but it would surprise me if it didn't). If you displayed the full URL that you were attempting to resolve in the warning message it would tip the user off that they'd entered a relative link when they'd meant to enter an absolute, by saying "Could not open http://metafilter.com/mefi/www.cat-scan.com, please re-enter."
posted by George_Spiggott at 3:35 PM on March 3, 2006
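
A Python sketch of the resolve-and-warn idea, under stated assumptions: the function names are hypothetical, the base address is a stand-in, and the reachability check is passed in as a parameter so the logic can be tried without touching the network. `http_ok` is one possible check built on the standard library.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional
from urllib.parse import urljoin

def http_ok(url: str, timeout: float = 5.0) -> bool:
    """One possible reachability check: True if the URL opens without error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        return False

def check_link(href: str, base: str,
               is_reachable: Callable[[str], bool] = http_ok) -> Optional[str]:
    """Resolve href against base; return a warning if it cannot be opened."""
    full_url = urljoin(base, href)
    if is_reachable(full_url):
        return None
    # Showing the resolved URL is what tips the poster off to the mistake.
    return f"Could not open {full_url}, please re-enter."

# Stubbed check standing in for a failed request, so no network is needed.
print(check_link("www.cat-scan.com", "http://metafilter.com/mefi/12345",
                 is_reachable=lambda url: False))
```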


The real fun, of course, would be in using both techniques—match for a legal domain name, match for a valid relative link—and then do something clever with the edge cases that return both false (mysterious href!) or both true (valid-looking external domain that is also a valid mefi relative link!).

Heh.
posted by cortex at 3:45 PM on March 3, 2006
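
Combining the two tests gives four possible outcomes, which could be sketched in Python as below. The enum, the function, and the two predicate parameters are all hypothetical names; the predicates stand for whatever domain test and internal-link test a site settles on.

```python
from enum import Enum
from typing import Callable

class LinkKind(Enum):
    EXTERNAL = "external"    # looks like a domain only
    INTERNAL = "internal"    # matches the site's link grammar only
    AMBIGUOUS = "ambiguous"  # both tests pass
    UNKNOWN = "unknown"      # neither test passes

def classify(href: str,
             looks_like_domain: Callable[[str], bool],
             looks_internal: Callable[[str], bool]) -> LinkKind:
    """Run both heuristics and report which of the four cases applies."""
    domain, internal = looks_like_domain(href), looks_internal(href)
    if domain and internal:
        return LinkKind.AMBIGUOUS
    if domain:
        return LinkKind.EXTERNAL
    if internal:
        return LinkKind.INTERNAL
    return LinkKind.UNKNOWN

# Toy predicates, just to exercise the four branches.
print(classify("www.cat-scan.com", lambda h: "." in h, lambda h: h.startswith("mefi/")))
```

Only the EXTERNAL case is a candidate for an automatic fix; the AMBIGUOUS and UNKNOWN cases are where a prompt to the poster seems safer than guessing.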


Aside from the example used in this thread, has anyone actually ever used relative links on this site?
posted by mzurer at 3:58 PM on March 3, 2006


I cannot categorically state that I have never done so, but, practically speaking, no. Why would they? It'd be a nutso thing to do, and should not be encouraged.
posted by cortex at 3:59 PM on March 3, 2006


If you've ever implemented the tiny_mce editor plugin, it automatically converts absolute links to relative ones if they refer to the current site, unless you tell it not to in the configuration options.

Relative links are not a nutso thing to do ("are too!" "are not!") at least if you're linking to another thread. And they'd continue to work if Matt ever rearranged the site, let alone had to move it.
posted by George_Spiggott at 4:10 PM on March 3, 2006


I mean user-generated relative links [with very few exceptions] on Metafilter, specifically. Relative links in general are, indeed, awesome.
posted by cortex at 4:32 PM on March 3, 2006

