Search is broken again. December 28, 2004 5:03 PM

Search is broken, part 532. But this time it's a Google bug. Here's an example: I couldn't find a recent excellent AskMe until iconomy kindly answered my MetaAsk, noting it was called MySonHatesMeFilter. But using Google, even with that unique string, I can't find the original question; Yahoo and MSN find it. I'm afraid Google is no longer necessarily "your best bet". How about adding Yahoo and MSN search boxes?
posted by Turtle to Bugs at 5:03 PM (28 comments total)

It worked for me.

Have you got SafeSearch turned on?
posted by mr_crash_davis at 5:10 PM on December 28, 2004


It's like right there dude.... safesearch?
posted by dabitch at 5:34 PM on December 28, 2004


Looks like I wasn't clear enough: the Google search does return two results, but one is from a list of all of mathowie's comments, and the other is my MetaAsk question. It doesn't return the original question, but Yahoo and MSN do.

Another example? Sorry, it's even more of a self-link. I titled a FPP "Dream Bloat", which I was happy to notice was an original coinage. Google doesn't find the FPP (only RSS-type lists pointing to it). Yahoo and MSN do, in first place, as they should.

There's a possible explanation for that last one: "Dream Bloat" only appears in the title. Maybe Google doesn't index titles. Which would be silly, but not tragic. But recently, I've noticed other disturbances in the Force, that is, surprising failings by Google. When it happens once, I think, OK, maybe one of the servers on the farm is off-line. When it happens a bunch of times, I start to suspect (heresy!) that Google isn't working right.

I have Safe Search off.

How many results does Google give you for MySonHatesMeFilter? I get 2 (but neither is the right one), 11 with Yahoo (right one in first place), and 1 with MSN (but it's the right one).

posted by Turtle at 5:43 PM on December 28, 2004


Mysonhatesmefilter
posted by azazello at 5:43 PM on December 28, 2004


ooh, the force is definitely out of whack there.
posted by dabitch at 5:52 PM on December 28, 2004


Hm, azazello, that is an odd bug. I never use Froogle, dunno why it's saying the list of all of mathowie's AskMe replies costs $77.00. I got it for free!

By the way, it's the same link that regular Google lists as its top choice, and it's not the right one. Can someone confirm that? Otherwise I'm going to start feeling paranoid that it's because I'm using Google from France or something.

On preview: dabitch, I'll take that as a confirmation? :-)
posted by Turtle at 5:55 PM on December 28, 2004


Something similar happened before in connection with some discussion of Moab.
posted by weston at 6:31 PM on December 28, 2004


More evidence Google sucks, unrelated to Metafilter, this time:

An article was written two months ago in Libération, a major French daily, about three bars in my neighborhood. Google doesn't find it. Yahoo and MSN do. Starting to see a pattern?

I now believe it when double-posters say, "I searched, and didn't find it". So, why not have three different engines on the search pages? It's a tiny little pony (I'd be happy to provide the necessary HTML code, Matt, to save time).

OK, I'll shut up now.
posted by Turtle at 6:59 PM on December 28, 2004


MSN is a little too good:
Results 1-15 of about 1686046 containing "Mercure Folies "Neuf billards" Liberation"

More than one and a half million results? I don't think so. What's going on there?
posted by languagehat at 7:21 PM on December 28, 2004


Turtle, I found it right away using the second option on the search page - the first one never seems to work correctly, for me, anyway. It's useless. Try using the second one exclusively, the one with the grammatically incorrect phrasing: "If you've prefer to do an exact phrase search on questions or answers, try this search instead". You'll get much better results!
posted by iconomy at 7:25 PM on December 28, 2004


I think MSN is skipping the "neuf billards" bit.
posted by dhruva at 7:36 PM on December 28, 2004


yep, confirmation turtle, that's completely outtawhack. Oh, and here's a Xmas jingle that suits you. my turtle, he is. cute isn't he?

*snort* Half a million "Neuf billards" results? Everything is outtawhack.
posted by dabitch at 7:39 PM on December 28, 2004


languagehat: I think unlike Google and Yahoo, MSN searches for any word rather than all words. If you specify the all option you get the same result as the others.

iconomy: OK. But my point is also for general metafilter searches. I think people often recommend using the Google search. By the way, the "customized" Google search sucks, because (I think) there's no way to get to the "Advanced Search" options, such as limiting the period over which you're searching (which also doesn't work well on Google, but that's another issue).

dabitch: Thank you! But it also made me sad, because my turtle died in mysterious circumstances this summer, and I still miss it, strange as it may seem. Also because it never had such a clean and pretty tank.
posted by Turtle at 7:46 PM on December 28, 2004


I agree. Google is rubbish for finding things on Metafilter. I'm not sure exactly why that is, but on occasion I've been looking for a single word, and Google turns up nothing. The internal Metafilter search would then turn up exactly the page I was looking for.
posted by seanyboy at 12:45 AM on December 29, 2004


As a side note: I've had remarkable failure in using google to find things on my blog that I KNOW that I've written about.
posted by dhruva at 1:13 AM on December 29, 2004


Something to consider is that freshness and depth are difficult to maintain simultaneously (although in this case, the thread in question was only from the middle of November).

To maintain freshness, you have to keep scanning the web over and over again, and adapt to sites based on their popularity as well as their frequency of update. If you re-scan a frequently changing site too often- such as an article on MeFi- you kill their bandwidth by re-requesting the same 20k page over and over just to get the latest 500 bytes of new text, or the latest post on that thread. You may also adopt depth limitations, so you don't spend too much time crawling one site or get trapped in link farms bouncing from one dynamic URL to the next. The net result is that you won't be able to keep a complete and up-to-date searchable index of a site like Metafilter or Fark, or others of that nature.

By comparison, if you have a slowly changing site, any web crawler will (or should) start crawling it less and less often, and as such start missing updates for several days. Ideally, you begin to intelligently weight sites and pages through a combination of their popularity- i.e., how important the freshness of the site is to search users- and their frequency of update, to determine how often you go back to get the pages (you can also use HTTP tricks like 304 responses to determine whether the page has been updated at all, before you download it).
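
(A rough sketch of that last trick- what a conditional re-fetch might look like in Python. Purely illustrative, not any engine's actual crawler code.)

    # Hypothetical sketch: re-fetch a page only if it has changed since
    # the last crawl, using If-Modified-Since and the 304 Not Modified reply.
    import urllib.request
    from urllib.error import HTTPError

    def refetch_if_modified(url, last_modified=None):
        # last_modified is the Last-Modified header saved from the
        # previous visit (None on the first visit).
        req = urllib.request.Request(url)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read(), resp.headers.get("Last-Modified")
        except HTTPError as err:
            if err.code == 304:   # unchanged: nothing to download
                return None, last_modified
            raise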

All search engines use the same basic process to crawl the web, namely that something else has to link to or reference the next page; that page gets added to a URL list, and crawled as appropriate. In the case of Turtle's original post, that would have to be either a) found when it was a front-page post at AskMeFi, or b) found by digging several levels deep through links from an archives page, then the monthly archives, etc.
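
(To illustrate that discovery step, here's a toy link extractor- hypothetical and simplified- that just pulls hrefs out of a fetched page and queues the new ones.)

    # Toy illustration of link discovery: pull hrefs out of a fetched page
    # and add any we haven't seen to the list of URLs to crawl later.
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(urljoin(self.base_url, value))

    def discover(page_html, page_url, url_list, seen):
        parser = LinkExtractor(page_url)
        parser.feed(page_html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                url_list.append(link)   # crawled later, "as appropriate"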

A worthwhile back-of-the-napkin exercise: if you wanted to maintain, say, an 8 billion document collection- not including image/binary files, etc- at an average page size of 20k (roughly), you need, before any redundancy, storage for about 160 terabytes of data (~32GB of hd space per server over 5,000 servers, say).

But imagine not only storing, but collecting those files. If you want an average page to get refreshed every two weeks or so (in reality you'll update some pages far more frequently, almost never refresh others that rank highly but rarely change, and discover new pages along the way- but we'll keep the math simple and say you refresh each page every 14 days), you will have to be downloading around 6,500 web pages every second. Not including overhead, that is more than 1 gigabit per second of ingress, 24 hours a day, 365 days a year.
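
(Spelling out that back-of-the-napkin math, with the same assumptions as above and nothing more:)

    # Same back-of-the-napkin figures as above, spelled out.
    docs = 8_000_000_000        # documents in the collection
    page_kb = 20                # average page size, in KB
    refresh_days = 14           # target: refresh every page every two weeks

    storage_tb = docs * page_kb / 1_000_000_000             # KB -> TB
    pages_per_sec = docs / (refresh_days * 86_400)          # 86,400 s per day
    ingress_gbps = pages_per_sec * page_kb * 8 / 1_000_000  # KB/s -> Gbit/s

    print(round(storage_tb))       # 160 TB, before any redundancy
    print(round(pages_per_sec))    # ~6,614 pages per second
    print(round(ingress_gbps, 2))  # ~1.06 Gbit/s, sustained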

So, you've got ungodly amounts of new and/or refreshed pages coming in every second, and your crawling system has to keep building and updating your index all the time. In doing so, it has to again be keeping track of a page's or site's intrinsic "value", namely how popular it is, how often it updates, etc. Those pages that don't have nearly enough value will start getting forced out by new pages streaming in all the time that may have more value.

So to sum it up:
  • The faster you crawl/recrawl the web, the quicker you will discover new documents and be forced to replace older ones, out of necessity (you really can't store everything out there). Fresh => Older/less linked documents disappear
  • The larger your index, the older the docs you can store before expiring, but the slower you will be to keep pages up-to-date because of limitations on how quickly you can re-crawl. Depth => Less freshness per page/site
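
(A toy version of that "forced out by value" idea- purely illustrative, real engines use far more elaborate scoring: keep only the N highest-value documents as new ones stream in.)

    # Toy version of "pages without enough value get forced out":
    # keep only the N highest-scoring documents as new ones stream in.
    import heapq

    class BoundedIndex:
        def __init__(self, capacity):
            self.capacity = capacity
            self.heap = []            # min-heap of (value, doc_id)

        def add(self, doc_id, value):
            if len(self.heap) < self.capacity:
                heapq.heappush(self.heap, (value, doc_id))
            elif value > self.heap[0][0]:
                # The new doc outranks the weakest one: evict and replace.
                heapq.heapreplace(self.heap, (value, doc_id))
            # else: the new doc never makes it into the index at all

        def doc_ids(self):
            return {doc_id for _, doc_id in self.heap}
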
In all likelihood- and don't take it personally, Turtle- your thread was not well linked or highly ranked: before today it was probably linked only from the Archives page on AskMe and from user pages, which are so numerous that they cancel each other out in terms of value. So, being more than a month old, it was expired out of Google's index in favor of newer pages.

However, the presence of this page itself might very well put your Nov. 18th link right back into Google's index, since it's now better linked-to than most AskMe entries, and will be likely found on the next recrawl of the MeTa page. :)

MSN/Yahoo (mostly the same index behind the scenes, at least for now) may simply not crawl MeFi as much (mathowie can tell us for sure from his web logs) and/or the web as much, and thus aren't forced to expire that Nov. 18th page yet. And naturally, the user page of a dude like M. Haughey, with its reference to your Nov. 18th post, is far more prominently linked and thus keeps sticking around in the cache- which is why the various engines also return that link, and in the case of Google (for now) only that link.

All that said, I can say that yeah... maybe adding another couple of search engine choices might not be the worst thing in the world.

Doesn't wholly solve the problem, though- the only place that can offer true and complete search of MeFi is MeFi itself, say by doing transaction replication of the main MeFi databases (likely quite large, with some 20,000 users, 60,000 posts and 60,000*N total entries, where N = average comments per post) to a dedicated read-only server used just for searching and indexing. The reverse index- the word table that links keywords to specific article/post IDs- could probably be held entirely in memory, allowing search results to come back phenomenally fast. The caption lookup- the part MeFi's own search doesn't do but the big engines all have, showing the specific section of the page with the keywords highlighted/bolded- would require heavier disk I/O streaming, although with enough RAM you could probably keep that whole DB in memory too....
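
(A bare-bones sketch of that kind of reverse index- hypothetical and simplified- just to show why lookups get so fast once the word table fits in memory.)

    # Bare-bones in-memory reverse index: keyword -> set of post IDs.
    # Queries then become set intersections, which is why results can
    # come back so quickly once the whole table fits in RAM.
    from collections import defaultdict

    index = defaultdict(set)

    def add_post(post_id, text):
        for word in text.lower().split():
            index[word].add(post_id)

    def search(query):
        words = query.lower().split()
        if not words:
            return set()
        results = set(index.get(words[0], set()))
        for word in words[1:]:
            results &= index.get(word, set())
        return results

    add_post(12345, "my son hates me MySonHatesMeFilter")
    print(search("mysonhatesmefilter"))   # {12345}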

So it's a very solvable technical problem, just not one the H-man necessarily wants to spend his time or money on, making the multi-engine choice an easier one to implement: last I heard MeFi is still his side-hobby and not his full-time job.
posted by hincandenza at 2:21 AM on December 29, 2004


Thanks hincandenza. That cleared up some questions I've had for a while too.
posted by rooftop secrets at 2:30 AM on December 29, 2004


I can't believe you saw my post that quickly, read it, and responded- in 9 minutes. Man, I thought I was the only weirdo night owl... screw it, I'm going to bed!

But thanks for the appreciation, r.s. ;)
posted by hincandenza at 2:34 AM on December 29, 2004


hincandenza, that post deserves a link on the sidebar. Thank you.
posted by Apoch at 6:27 AM on December 29, 2004


damn, I'm only worth $10. I feel sad.
posted by shepd at 10:01 AM on December 29, 2004


$5103.50!
posted by keli at 11:55 AM on December 29, 2004


$39.95 plus tax, shipping and handling not included.

And the Euphorb antidote is only $2.50.
posted by euphorb at 12:18 PM on December 29, 2004


Thanks for the answer, hincandenza, though I'll admit I'm still a bit mystified. I can see how if you have a large index, it would limit how often you can re-crawl URLs, but not how it would limit crawling and indexing every URL at least once. And I'd think there's ways to keep track of popular pages that consistently change over time, such as the various MeFi front pages, and only crawl those regularly, without overwhelming the web site, which for the most part doesn't change. I'd be very surprised if Google wasn't crawling MeFi's front page several times a week, so it shouldn't miss any article URLs.

Also I don't get why you can't keep everything, at least for a few months. And I doubt that transient URLs are such a problem that if the search engine doesn't see a URL several times, it deletes all trace of it within a month.

But I'm no expert, so I'd be happy to read more explanations. Any links on this stuff? Search Engine Watch seems to be one place to look, but I'd be interested in some recent comparative analysis.

An update: as you predicted, Google now returns the original Mysonhatesmefilter article. And indeed, Google is very reactive: this article we're in right now has already been indexed by Google, but not by the other search engines (though your contribution hasn't made it into the cache yet).
posted by Turtle at 6:05 AM on December 30, 2004


It's kind of amazing that all of that can happen in only 2 days. That an automated process can do all that. For those of you more tech savvy than myself, perhaps you take it for granted. But I think it's pretty impressive.
posted by raedyn at 11:01 AM on December 30, 2004


Turtle: Also I don't get why you can't keep everything, at least for a few months...

See, it's not up to you as a person. We're talking about numbers of pages, and links, beyond human comprehension, so decisions about what to crawl, and how often, have to be completely automated. Outside of potential nudges, like URL submission tools or URL blocking for legal reasons, the system has to decide the rank of pages, their frequency of update, and thus what to recrawl, how often, etc...

All search engines that have their own index must spider- they download a page, look for links in that page, then follow those links, download those pages, find links in them, etc, etc. The crawler doesn't know what value the page has yet, after all- it acts simply, just gathering pages, with only a basic ability to throttle itself or respect robots.txt. This is how Google et al can find new pages- just keep digging around, not stopping until the leash is yanked and the crawler is told "enough docs! We gotta recrawl the popular ones, and rank these new ones".
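
(A toy version of that spidering loop, including the "leash"- a hard cap on how many documents to gather before turning back to recrawling. Purely illustrative: there's no politeness, robots.txt handling, or ranking here, all of which a real crawler needs.)

    # Toy spider: download a page, find its links, follow them, and stop
    # when the document budget ("the leash") runs out.
    import re
    import urllib.request
    from collections import deque

    LINK_RE = re.compile(r'href="(http[^"]+)"')

    def spider(seed_urls, max_docs=1000):
        frontier = deque(seed_urls)
        seen = set(seed_urls)
        pages = {}
        while frontier and len(pages) < max_docs:
            url = frontier.popleft()
            try:
                with urllib.request.urlopen(url) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except Exception:
                continue                      # dead link, timeout, etc.
            pages[url] = html
            for link in LINK_RE.findall(html):
                if link not in seen:          # a newly discovered document
                    seen.add(link)
                    frontier.append(link)
        return pages                          # handed off for ranking/indexing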

So to answer your "why can't they store everything?" question: again, imagine how many servers it takes to store not only an index of keywords, but all those cached copies of the pages themselves- that amount is measured in petabytes, which is millions of gigabytes. It would take 10,000 servers with 100GB hard drives at full capacity just to store one petabyte of data- which may hold only a little more than a single copy of your 5-10 billion document index! You'll need multiple copies for failover/fault tolerance/scalability of querying. Buying, powering, managing, repairing, upgrading, and deploying that many servers is a monumental task of operational support. Asking Google or others to then store an order of magnitude more, much less crawl an order of magnitude more, just to have a complete copy of Metafilter is rather unfair to them. :)

And I'd think there's ways to keep track of popular pages that consistently change over time, such as the various MeFi front pages, and only crawl those regularly, without overwhelming the web site...

It's only later, once a bunch of pages are downloaded, that the search engine will rank them (in the case of Google, with PageRank, etc.) and categorize them when refreshing the index. When that happens, less popular/valuable pages will get bumped in favor of new, fresh, cool pages. After all, the engine can't store every page of Metafilter from day 1- that data is in the many gigabytes, and that's just one site among hundreds of thousands!
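
(For the curious, here's a tiny power-iteration version of that link-based ranking idea- a drastic simplification of anything a real engine runs; dangling pages simply leak rank in this sketch.)

    # Tiny power-iteration PageRank over a link graph (drastically
    # simplified; dangling pages just leak rank mass here).
    def pagerank(links, damping=0.85, iterations=50):
        # links: dict mapping each page to the list of pages it links to
        pages = set(links) | {p for outs in links.values() for p in outs}
        rank = {p: 1.0 / len(pages) for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1 - damping) / len(pages) for p in pages}
            for page, outs in links.items():
                if outs:
                    share = damping * rank[page] / len(outs)
                    for target in outs:
                        new_rank[target] += share
            rank = new_rank
        return rank

    graph = {"mefi_front": ["thread_a", "thread_b"],
             "thread_a": ["mefi_front"],
             "thread_b": []}
    print(pagerank(graph))   # the well-linked front page scores highest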

Think of it this way: let's say you have a server farm dedicated just to crawling, hundreds of machines getting dozens of documents a second. They pour these newly found documents- from links on pages they've just gotten- into a massive, distributed list of URLs to crawl. Knowing that there is a limit on how many total docs to store- see above on how much it takes to store just 5-8 billion documents- the crawling servers will basically each create only as many segments of the search engine index as the system can hold.

After that, it'll stop digging deeper to find new sites or explore the hidden nooks and crannies of existing ones like Metafilter, and start re-crawling existing pages that are probably a little stale. As it finds new docs, it'll add them to be ranked and considered for inclusion in the index.

To answer again the question about caching a copy of an old page: documents that don't get bumped by newer, cooler, more popular documents- say a popular FAQ page on logical fallacies that is frequently linked to as a resource but is rarely updated- will stick around indefinitely. However, if the crawlers no longer find the page in the future, or if it stops being as popular, the indexing engine will eventually expire it out of the index to make room for newer pages (this is why Google's and other engines' cached pages need to be saved to disk if you want a copy; soon enough the page will be re-crawled and the cached copy updated, or just expired altogether).

I can't speak for Google, but it's quite likely they can and do have those popular pages- home pages of popularly linked sites like www.cnn.com, www.metafilter.com, www.fark.com, or whatever- being frequently recrawled, without manually forcing it. Grouping blocks of URLs by various types- highly ranked pages that update infrequently, like online resources; highly ranked pages that update frequently, like the Metafilter home page; middle-ranked pages that update only moderately; etc.- would allow them to adjust crawls accordingly: the frequently updated, popular sites will get recrawled perhaps several times a day, but the unpopular, infrequently updated sites may get recrawled only every few weeks. But these decisions will be made by the system; they literally cannot be made by humans.
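
(A crude illustration of that bucketing idea- the thresholds and intervals below are made up, not anyone's real policy.)

    # Crude illustration of bucketing pages by rank and update frequency
    # to pick a recrawl interval (all numbers are made up).
    def recrawl_interval_days(rank, updates_per_week):
        popular = rank >= 0.7            # rank normalized to 0..1
        changes_often = updates_per_week >= 3
        if popular and changes_often:    # e.g. a busy site's front page
            return 0.25                  # several times a day
        if popular:                      # stable, well-linked resources
            return 7
        if changes_often:                # active but obscure pages
            return 3
        return 21                        # unpopular and static: every few weeks

    print(recrawl_interval_days(0.9, 20))   # 0.25
    print(recrawl_interval_days(0.2, 0))    # 21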
posted by hincandenza at 12:57 AM on December 31, 2004


Wow, sorry for being so verbose... I guess I just find it an interesting topic (and raedyn: trust me, no one takes this for granted, it's ferociously difficult to do. The equipment costs, the elegance of engineering to make it all work... none of it is done handily!). And yeah, if it's not patently obvious by now, I work for one of those "other" search engines that isn't Google. =)
posted by hincandenza at 1:01 AM on December 31, 2004


Hey, hincandenza, no need to apologize, I think at this point it's OK to give long answers and to be talking about stuff that is only marginally related to MetaTalk... I doubt many people are reading this thread at this point who aren't into this stuff. And since everyone does rely on searching, it's worth trying to understand it better and go beyond the "Google is God" idea.

Further update: MSN and Yahoo have now indexed this thread. Google, however, no longer has it, and the cached copy of this thread mentioned above no longer works. So your comparative freshness theory holds.

I'm still amazed that the cache changes so quickly. I appreciate that indexing is a big job. Still, Google does its job so well that it seems like a bug to me that a fairly popular article like the original Mysonhatesmefilter gets discarded so quickly. Obviously, a lot depends on the algorithms used: the trick is getting them to do what a user would expect.

By the way, let's say that a MeFi article is 15k on average, and that there are currently less than 40k articles: that's half a gigabyte, not "many gigabytes". Ask and Meta probably add another 200 meg(*). Furthermore, most of those URLs will never change again, so the crawler can spare itself the trouble by keeping track of when a URL has last changed and adjusting its crawl frequency, not per site as you suggested, but for each URL. Oh, also, I never suggested any of this was being done manually.

One more thing: why do you say Yahoo and MSN mostly have the same index? Yahoo has apparently indexed 128,000 www.metafilter.com pages, MSN far fewer: 27,254. Google indexes 137,000.

(*) Since Google seems to index user comment pages as well, the actual total is probably roughly twice that. Still not many gigs that I can see.

PS: Happy New Year! (French time) Yes, I don't have a life at the moment
posted by Turtle at 2:56 PM on December 31, 2004


Yahoo owns Overture and Inktomi, as of last/this year. MSN Search gets their results, both paid and unpaid, largely from Overture and Inktomi. MSN Search, like Yahoo, is trying to "beat" Google and has their own engine in development at beta.search.msn.com; Yahoo has revamped their search engine with a larger index (which may now account for the difference: MSN may be getting a smaller index via Inktomi).

15KB-20KB per article, times 38,000+ MeFi articles... but also 8,700+ MeTa articles, 13,000+ AskMe articles and ~20,000 user pages = ~80,000 pages, not counting the fact that many pages contain redundant data (which does make one wonder how Google and Yahoo can claim 128,000-137,000 results for site:www.metafilter.com- that would require having nearly every user page, comments page, etc. linked, and yet they don't return mysonhatesmefilter often enough!). That's about 1.2-1.6GB by my calculations.
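
(Spelled out, with the same figures as above and nothing new:)

    # The same back-of-the-envelope figures as above, spelled out.
    pages = 38_000 + 8_700 + 13_000 + 20_000     # MeFi + MeTa + AskMe + user pages
    for kb in (15, 20):
        print(round(pages * kb / 1_000_000, 2))  # ~1.2 to ~1.6 GB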

And yes, the recrawl rate is probably done per page, or at least per site; but you're still asking for a lot of storage and bandwidth to keep every page, especially pages that haven't been updated in a long time! Why waste storage on pages from years ago that probably nothing but Metafilter's archive and user pages still links to, when you could be spending that space on the latest news about the tsunami?

All pages are (usually) ranked against all other pages, so individual pages from the same site can stick around forever, or disappear and reappear quickly because they're on the margins.

Believe me, Google, Yahoo, MSN, and every other search engine would love to boast that they have the most complete engine out there. But it's exceptionally hard to store that much content, or query it speedily for users.

Anyway... happy new year! :)
posted by hincandenza at 7:51 PM on December 31, 2004

