Duplicate search issues with x.com vs. www.x.com December 15, 2002 9:24 AM

This is a double-post. Presumably the author wasn't told because he used "http://idea-a-day.com" while the first post used "http://www.idea-a-day.com". Perhaps a little bit of code could be implemented to solve this problem?
posted by Pretty_Generic to Feature Requests at 9:24 AM (30 comments total)

A search for "idea-a-day" found it quite easily, and without any extra coding involved.
posted by ecrivain at 9:33 AM on December 15, 2002


You can't make things too easy for people. Actually, I think mathowie should come and make me pancakes right now.
posted by Pretty_Generic at 9:40 AM on December 15, 2002


Yeah, but the double-post check really should be looking at the domain and what follows it, not what comes before the first dot, for exactly this reason. It's not that big a deal to prune whatever comes before the first dot, compared to hoping people do searches correctly.

Something like:
<cfset firstDot = FindNoCase(".", urlVariable)>
<!--- keep everything after the first dot --->
<cfset urlVariable = Right(urlVariable, Len(urlVariable) - firstDot)>

Completely untested: I just feel presumptuous saying it's not a big deal for someone to do something for us for free. Also assumes "http://" has already been removed.
posted by yerfatma at 9:47 AM on December 15, 2002


Yes, OK, it was a double post. I emailed Matt as soon as I found out and it's gone now. But it was first up a year ago. Have you got a really long memory? How did you know?
posted by feelinglistless at 10:10 AM on December 15, 2002


Thanks for the code snippet, yerfatma. I'll give it a try and see if it works; it certainly looks like it will.
posted by mathowie (staff) at 10:48 AM on December 15, 2002


I don't understand cold fusion mark-up, but if you strip out whatever is to the left of the first dot, then wouldn't metafilter.com end up searching on just com?
posted by willnot at 11:41 AM on December 15, 2002


Just checking n-1 levels of domains if n are given will give false positives, for instance people posting first "foo.a-free-webhost.com/some/much-used/directory/and/filename" and then "bar.a-free-webhost.com/some/much-used/directory/and/filename", which could well be entirely different sites (and of course the "foo.com" and "bar.com" case willnot mentioned).

And as you cannot easily determine at which points different sites begin (as in: "foo.com", "foo.co.uk" or "foo.free-webhost.co.uk"), this is a non-trivial problem.

Still, as long as it's just used as a warning I guess it'd be a worthwhile heuristic.
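
Something vaguely like this, maybe (my ColdFusion is guesswork and completely untested; strippedUrl stands for whatever is left after removing "http://"):
<!--- warning-only heuristic: compare the last two domain labels plus the path --->
<cfset hostPart = ListFirst(strippedUrl, "/")>
<cfset pathPart = ListRest(strippedUrl, "/")>
<cfset labelCount = ListLen(hostPart, ".")>
<cfif labelCount GT 1>
  <cfset domainKey = ListGetAt(hostPart, labelCount - 1, ".") & "." & ListLast(hostPart, ".")>
<cfelse>
  <cfset domainKey = hostPart>
</cfif>
<!--- posts that end up with the same compareKey get flagged for a human to check --->
<cfset compareKey = domainKey & "/" & pathPart>
It would still collapse foo.co.uk and bar.co.uk (or the two free-webhost urls above) into the same key, so again: warning only.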
posted by fvw at 12:17 PM on December 15, 2002


you could check if either url is a substring of the other (you have to do it both ways)
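
untested, but something like this (where newUrl and oldUrl are the two urls after you've already lowercased them and stripped "http://"):
<cfset possibleDupe = false>
<!--- flag as a likely dupe if either url contains the other --->
<cfif Find(newUrl, oldUrl) OR Find(oldUrl, newUrl)>
  <cfset possibleDupe = true>
</cfif>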
posted by andrew cooke at 2:37 PM on December 15, 2002


To keep it simple, why not just discard "http://www." in all searches?
posted by Pretty_Generic at 2:44 PM on December 15, 2002


To keep it simple, why not just discard "http://www." in all searches?

Subdomains. http://www.metafilter.com/mefi/22333 is the same as http://seetheproblem.metafilter.com/mefi/22333, but that test wouldn't catch it.

fvw, your point is more troubling. I should really be checking for the last dot before the first slash (again, assuming http:// has been hacked off) and finding the dot before that for cases like www.froogle.google.com, but that doesn't fix the .co.uk example you mentioned. Crap. I'm not sure what the answer is, but obviously my snippet causes as much trouble as it fixes.

andrew, your idea is a lot more robust than mine, but it seems resource-intensive (if I understand it correctly) and I'm not sure it addresses every case. Maybe discarding "http://" and "www." separately really is the best improvement; while it doesn't solve a lot of problems, I don't think it'll cause any new ones either.
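
Something like this, maybe (untested as ever, with urlVariable holding the submitted link):
<!--- lowercase, then strip the protocol and a leading "www." --->
<cfset cleaned = LCase(Trim(urlVariable))>
<cfset cleaned = REReplace(cleaned, "^http://", "")>
<cfset cleaned = REReplace(cleaned, "^www\.", "")>
Both the stored links and the search term could go through the same cleanup.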
posted by yerfatma at 3:02 PM on December 15, 2002


don't forget there's also this case:

http://blabla.com
and
http://blabla.com/
posted by quonsar at 4:43 PM on December 15, 2002


...not to mention
http://blabla.com/index.html
and
http://012.345.678.901 (blabla.com's IP address).
posted by timeistight at 4:56 PM on December 15, 2002


A rule that identifies the same content through different urls can only use de facto standards. Because it's de facto, and urls are just strings, you're forever patching the method and you can only reasonably expect it to work most of the time. For example,
  • Some web servers are case-sensitive;
  • Depending on the encoding, spaces might be + or %20, I think. Some sites use &amp; when & will reach the same content;
  • The www. subdomain isn't obeyed by all web servers;
  • Sites could have different content at users.metafilter.com and partners.metafilter.com. Discarding subdomains will catch some and ignore others;
  • Files hosted on Akamai have their url change;
  • Sites mirror others;
  • Some sites have referrer ids;
  • Some sites have https:// and http:// ;
  • http:/ (with a single slash) works in some versions of Mozilla;
  • Redirects;
If I had more time I could probably double that list. Without actually downloading the content and saving it (or an MD5) you can't be sure because content doesn't have a unique url. Even then, if they correct a typo you don't want to regard it as different content.

What I'm ultimately saying is that using urls this way is as difficult as trying to parse human language. Catch the basics (remove the protocol and www. from the search), but doing any more will make your hair go white.
posted by holloway at 4:58 PM on December 15, 2002


the problem is that things aren't well defined - there's a mish-mash of conventions that can't be simplified into one set of rules. from a technical point of view, www.foo.com, foo.com, bar.foo.com and 123.456.789.10 are all distinct. if they produce identical results it's only because someone is following a convention. and discarding the protocol ("http://") is also dangerous (particularly if other web devices eventually take off).

personally, when something like this happens, i try to fall back on someone else's work - see what solutions others have tried (in java this often means using the library classes i should have been using anyway...). so, for example, does cold fusion have a method that compares urls (i suspect not, unfortunately)? is there code that already exists written by someone else?

there is one other solution, but it's a fair amount of work - construct a hash of some kind for the contents of each link and then compare those. this is the kind of thing that errr, some startup is doing. the name escapes me. they were focussing on audio files, iirc, and they had a tree-like hash system with some nice properties, but which didn't (oddly, it seemed to me) work well with streams. you'd think it would make sense to have some kind of progressive hash that flagged differences early in the data. but i digress...
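
for metafilter's purposes even a plain md5 of the fetched page might do as a first pass - vaguely like this (completely untested, and i think Hash() needs the mx release; newUrl and storedHash are made-up names):
<!--- fetch the linked page and hash its contents --->
<cfhttp url="#newUrl#" method="get">
<cfset newHash = Hash(cfhttp.FileContent)>
<!--- storedHash would be the hash saved when the earlier post went up --->
<cfif newHash EQ storedHash>
  <cfset probablySameContent = true>
</cfif>
though any tiny edit to the target page changes the hash, so it'd need a fuzzier scheme to be robust.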

on preview: bugger. and now holloway has said it all anyway.
posted by andrew cooke at 5:09 PM on December 15, 2002


hollo: &amp; in urls? Don't you mean %26? (&amp; is html)

(This, incidentally, is something the specs are pretty clear about. Encoding with %'s is always equivalent to the plain-text version, apart from + and & - iirc there may be others; always check the rfc first.)
posted by fvw at 6:33 PM on December 15, 2002


As I understand it, ampersands have two encodings:

When the ampersand is a semantic character then &amp; is correct. By semantic I mean that &amp; should be used for url parameters.
When an ampersand is literal content then %26 is correct.

I follow the teachings of DPawson.
posted by holloway at 7:17 PM on December 15, 2002


not sure i'm going to help clarify, but there are two levels of encodings.

first you have the url: http://foo.com?a=and&c=%26 where parameter a is equal to the string "and" and parameter c is equal to the string "&". here the "&" is encoded following http/1.0 (ie 8 bit latin-1 values).

second, when you want to include that url as text inside an xml or xhtml (i suspect it was ok in html) document, you need to escape the remaining "&" using &amp; because an "&" in xml/xhtml indicates the start of a "character entity". so inside an anchor tag, you'd use: <a href="http://foo.com?a=and&amp;c=%26">

so again, i'm repeating holloway, but this time i hope i'm adding something - the &amp; is only necessary when the url exists inside an xml document.
posted by andrew cooke at 3:28 AM on December 16, 2002


Wouldn't it be (relatively, compared to doing deep heuristic stuff) simple to a) get the IP address from the name server and compare that rather than the server name, and then b) compare the actual filepath, not including the server (which you could presumably do just by finding the third slash and getting everything beyond it)?
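
For (b), maybe something like the following (a total guess at the ColdFusion, untested, with fullUrl being the whole link):
<!--- the third slash marks the end of "http://servername" --->
<cfset thirdSlash = Find("/", fullUrl, Find("/", fullUrl, Find("/", fullUrl) + 1) + 1)>
<cfset pathOnly = Right(fullUrl, Len(fullUrl) - thirdSlash)>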

Not that it seems terribly worth it to me, but....
posted by IshmaelGraves at 10:11 AM on December 16, 2002


That wouldn't work because many smaller sites share IP addresses.
posted by timeistight at 10:22 AM on December 16, 2002


For example, all the sites on Matt Haughey's server (metafilter.com, a.wholelottanothing.org, megnut.com, blogroots.com, haughey.com) share one IP address.
posted by timeistight at 10:29 AM on December 16, 2002


andrew cooke: Although I hate those who say that "the validator says"... the validator says that &amp; is necessary in sgml like html 4. & has long been reserved for entities and you don't get a weird exemption for ampersands in urls, or xhtml / html.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values. - HTML 4
posted by holloway at 10:41 AM on December 16, 2002


ok! my summary was wrong - i used "xml document" when i'd already made it clear that i wasn't sure about html.

the important distinction was/is between "text within a page" (where you'd use &amp;, assuming it's sgml) and a URL in some other context (eg as a string specifying a URL in a "programming" (eg Java - not "markup") language, where you would use &). i think that's still correct(?).
posted by andrew cooke at 10:59 AM on December 16, 2002


It works, but it's not correct. Programming languages aren't different because & is still ambiguous.
posted by holloway at 12:00 PM on December 16, 2002


no, or at least, that's where we differ. as far as i understand things (including that article you referenced, although it could have been clearer):

the & is a problem inside sgml markup because it's used to mark character entities. but this is specific to writing any & inside sgml. it's related to sgml, not to urls in particular.

in contrast, urls can exist outside sgml documents. in their "natural" or "platonic" state, they have a plain &. when you write them in a particular context then, depending on that context, you might have to change some characters. sgml is funny about & so there you need to replace & with &amp;. in another context (say you're writing some C code) "\" might be the tricky character and any plain "\" needs to be replaced by a pair ("\\").

if another context/language, apart from sgml, also uses & to access non-ascii/unicode characters then you'd have to follow the appropriate rules for encoding a plain & - they needn't be the same as sgml.
posted by andrew cooke at 12:24 PM on December 16, 2002


I think we're arguing the same point. I have a different take on natural state, I think -- you're saying that an url's natural state has & encoded as %26 already, so & isn't ambiguous, right?
posted by holloway at 12:44 PM on December 16, 2002


...make that "when an url's natural state has literal content &s as %26 already"
posted by holloway at 12:51 PM on December 16, 2002


well, i wrote a big explanation, but preview got rid of all the &amp;s and then i saw your extra bit. yes, we agree...
posted by andrew cooke at 1:19 PM on December 16, 2002


Where are we on &?
posted by yerfatma at 2:38 PM on December 16, 2002


Oh you sumbitch-- that was supposed to read, "Where are we on &#38;?" but I was too stupid to think through the preview then post thing.
posted by yerfatma at 2:39 PM on December 16, 2002


Worthwhile discussion. Should double post more often ...
posted by feelinglistless at 3:03 PM on December 16, 2002

