To Index or Not? January 4, 2002 11:58 AM   Subscribe

The site is getting pummelled lately, so I ran stats on the past few days to see if there was a national news story or something. Of the 300k page views in the past four days, 100k (a third of the traffic) were due solely to the googlebot.

It appears that having 13k threads filled with 200k comments of google-loving ascii is acting as some sort of honeypot, attracting the google indexers like mad. Broken down by day, the Googlebot appears to visit over 25k pages at metafilter.com PER DAY. If you look at browser/OS stats, the googlebot visits metafilter more often than all Netscape clients combined. Also, the googlebot exceeds all visits by people using Mac operating systems.

Although I'm impressed with the results (google searches are the #1 referrer), is it worth basically bringing down the machine and keeping humans from being able to access it? If I were to include a robots exclusion file and block all search bots, would the net community be at a loss for not being able to find information discussed here?
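
For reference, shutting out every well-behaved crawler would just mean serving a two-line robots.txt, per the robots exclusion standard:

User-agent: *
Disallow: /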

I guess the big question is, does the utility of having the site indexed outweigh the problems the indexing causes?
posted by mathowie (staff) to MetaFilter-Related at 11:58 AM (32 comments total)

FWIW, I'll be contacting someone at google today as well. It appears there may be a dark side to google putting fresh stuff at the top of their results: it requires their bots to visit, and revisit, constantly checking for new comments to index.
posted by mathowie (staff) at 12:00 PM on January 4, 2002


I'm not going to try to tell you what to do, but I will say that I have personally found real value in Google's indexing of MeFi. I've had more success finding half-remembered passages via Google than MeFi's search apparatus.
posted by NortonDC at 12:04 PM on January 4, 2002


I guess it depends what the bottleneck is, horsepower or bandwidth. If older threads (say, more than a year old) were archived to a static format, wouldn't that alleviate a lot of the DB overhead caused by bots thrashing the site? I would say that a robots.txt file would be a good tactical solution, but in a strategic sense, I'd like to see something that minimizes the rendering of pages from the DB. And a pony.
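
Something like this could handle the one-time archiving, though I'm only guessing at how the CF templates are organized, so treat it as pseudocode (and I believe cfsavecontent needs CF 5):

<!--- archive_thread.cfm: render an old thread once and save it as static HTML --->
<cfsavecontent variable="pageHTML">
  <cfinclude template="thread_display.cfm">
</cfsavecontent>
<cffile action="write"
  file="d:\metafilter\static\#url.threadid#.html"
  output="#pageHTML#">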
posted by machaus at 12:05 PM on January 4, 2002


Can't you have Google set it up so it only indexes the site say once a week?
posted by riffola at 12:07 PM on January 4, 2002


after seeing some of my profile information (email, aim screenname, etc. -- all since taken down, not that it helps much at this point) indexed on google after searching for "moz", i would really like for google to be kicked out of metafilter if possible.
posted by moz at 12:12 PM on January 4, 2002


moz, you can ask google to remove any cached page from the index. I've emailed the googlebot group (question 9), so we'll see what happens.

My guess is the problem is two-fold. One problem is that all the pages are built on the fly, even though threads more than 7 days old rarely get new information added to them. Then, there are all the *.metafilter.com domains being seen in google as discrete sites, so microsoft.sucks.metafilter.com/foo gets indexed along with www.metafilter.com/foo

I hope I can get this worked out, without having to exclude all search bots (though I've had to ban one web.archive.org index bot, which was hitting the site hundreds of times per day).
posted by mathowie (staff) at 12:24 PM on January 4, 2002


NortonDC, I've got to agree with you on the usefulness of the Google search compared to MeFi's own. Matt - is there some way you could arrange to send updates to Google at your end (i.e. trigger an update), without having to resort to a "robots not welcome" sign? Sort of like push, rather than pull...
posted by dlewis at 12:31 PM on January 4, 2002


This can't be a problem unique to MeFi. It sounds like something that could gather a lot of press, and in turn produce a whole new modification to the RFCs for robot exclusion. Just trying to cast a positive light on this.
posted by machaus at 12:39 PM on January 4, 2002


matt:

thanks for the tip. however, the page you link to requires the use of meta tags; will those work outside of the opening and closing head tags? browsers might care, though i suppose certain indexing robots might not. worth a try at any rate.
posted by moz at 12:46 PM on January 4, 2002


Matt: Could you run an SQL script that plugged <META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW"> into threads older than x days?
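
Or, since the pages are generated on the fly anyway, maybe the template could emit the tag conditionally instead of touching the stored data; something like this, with made-up variable names and an arbitrary 30-day cutoff (untested):

<!--- in the thread template: tell googlebot to skip stale threads --->
<cfif DateDiff("d", thread.lastCommentDate, Now()) GT 30>
  <META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">
</cfif>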
posted by machaus at 12:57 PM on January 4, 2002


just an aside. if you've never done it before, do a google search for your mefi user name--"jpoulos" or "machaus" or "dlewis"--and nothing else. Your mefi profile will likely be on top of the list. MeFi is clearly being overindexed, and since we're a self-selecting demographic, we're prime spam targets. I don't know if I'm getting a little paranoid, but [knock, knock, knock].... hold on, someone's at the door.... hello? can I help y--AAAAAAHHHHHHHHHH!!!!!
posted by jpoulos at 12:59 PM on January 4, 2002


One problem is that all the pages are built on the fly, even though threads more than 7 days old rarely get new information added to them.

So if I request an old thread today, I'll get a Last-Modified header saying January 4, 2002. Is there some way you could get CF to change this header field to the date of the last entry?
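
At a guess it's a one-liner near the top of the thread template, something like this (untested, and lastCommentDate is a name I invented):

<!--- report the newest comment's date instead of "now" --->
<cfheader name="Last-Modified"
  value="#GetHttpTimeString(lastCommentDate)#">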
posted by dlewis at 12:59 PM on January 4, 2002


Are they using more than one bot, and, if so, any chance you could exclude *some* of the Google bots rather than all?

Littlegreenfootballs had a similar problem with Avantgo; here's what they did.
posted by metrocake at 12:59 PM on January 4, 2002


moz, you can remove individual pages right from their site: Remove URL.
posted by riffola at 12:59 PM on January 4, 2002


Although I'm impressed with the results (google searches are the #1 referrer)

Remember too, Matt, that you have a Google Search built right into the site--which will account for much of that referrer traffic. I'm sure google is the most popular engine among MeFites, but I'd bet that the step before the Google referrer is often MeFi's search page itself.

Not that that has anything to do with the indexing bot...
posted by jpoulos at 1:33 PM on January 4, 2002


riffola, you're right, but the webpage states that "[i]n order for this automated process to work, your webmaster must first insert the appropriate meta tags into the page's HTML code." that's why i asked if it mattered that the meta tags were not in the head of the webpage. moreover, the remove-URL page requires an email address, and there's the added complication that while it may be my profile page, it is not my website.
posted by moz at 1:39 PM on January 4, 2002


Remember too, Matt, that you have a Google Search built right into the site--which will account for much of that referrer traffic.

I don't think so, since the results page shows a long query string that sticks out of the search results in the access log. If you check the referrer report, searches from google.com/custom (where the search script is pointing) accounted for only 223 referrers, whereas the standard google query URL accounted for over 5,600.
posted by mathowie (staff) at 1:41 PM on January 4, 2002


So if I request an old thread today, I'll get a Last-Modified header saying January 4, 2002. Is there some way you could get CF to change this header field to the date of the last entry?

I haven't checked this out, but you may be right. I'll see if I can put in a header hack to stop this.
posted by mathowie (staff) at 1:42 PM on January 4, 2002


do a google search for your mefi user name ... and nothing else. Your mefi profile will likely be on top of the list.

Metafilter comes up as #3 for mine.
posted by crunchland at 1:52 PM on January 4, 2002


I'm adding a last-modified header in coldfusion, setting the date to the last comment on a thread page. That should cut down on the revisits.
posted by mathowie (staff) at 1:52 PM on January 4, 2002


Matt - I was trying to retrieve a copy of the home page by connecting directly to the server. In particular, I wanted to see the HTTP headers. Unfortunately, the server puked saying:

Error resolving parameter USER_AGENT

(Sure enough, I didn't give it one. But it seems odd that it should choke on what is otherwise an optional request header.)

Anyway, I was doing this to see if there might be something that could be done with the HTTP/1.1 dynamic content headers to encourage a crawler to be better behaved. Offhand, I don't know the answer to that.

Another thought I had was looking for their USER_AGENT and throttling those requests as necessary to keep the load manageable. Pretty easy with Apache; don't know about IIS.
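
For the Apache case, outright blocking is only a few lines of config; something like the following, though note it blocks rather than rate-limits (true throttling would take an add-on module), and the directory path and variable name are placeholders:

# flag requests whose User-Agent matches, then refuse them
SetEnvIf User-Agent "Googlebot" crawler
<Directory "/var/www/html">
  Order Allow,Deny
  Allow from all
  Deny from env=crawler
</Directory>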

Bottom line is, I think the primary goal has to be to preserve server operations. Attracting new users is secondary to that. After all, if new users come to a site that doesn't work, that's no help. Therefore, it's completely reasonable to block misbehaving user agents.
posted by chipr at 1:56 PM on January 4, 2002


chipr - it's much easier to run "ngrep *" and then point a browser at the site (maybe ngrep is linux-only?).

i'm surprised google aren't already noticing that pages aren't changing as quickly as last-modified says (after all, they have a cache to check against) and they've got limited resources too (so want to minimize their work). presumably i'm wrong, or this problem wouldn't be so bad - but why?

maybe i'm an exception here, but i've not found google useful with metafilter - i use google for looking up "hard" information, not wild opinions... :o)

how complicated is it to cache old pages? does cold fusion do anything helpful (i suspect not)? if the problem continues after the last-modified date fix then you might try piping all requests through a local instance of squid or similar.
posted by andrew cooke at 2:26 PM on January 4, 2002


Andrew, I'm guessing that Google either uses the If-Modified-Since request header, or else does a pre-check using a HEAD request. If either of these is the case (and if old page traffic is indeed the problem), then Matt's header fix should solve the problem.
posted by dlewis at 2:37 PM on January 4, 2002


I can't click into the comments on the top post. Could this be a side effect of the new last-modified header? (The thread has 0 comments, perhaps the new code isn't handling that case properly.)
posted by sudama at 2:42 PM on January 4, 2002


oops, you were right sudama, I fixed that.
posted by mathowie (staff) at 2:56 PM on January 4, 2002


Why has no one noticed...

Listing the top 50 queries by the number of requests, sorted by the number of requests.

reqs: search term
----: -----------
190: metafilter
63: triacetone triperoxide
54: moose porn
38: shittiest fucking website ever
30: ann coulter
24: bukakke
20: palandrome
18: armenian porn
18: simpsons porno

... I knew there was some reason I was visiting here every day ...
posted by feelinglistless at 4:04 PM on January 4, 2002


Is there any particular reason that *.metafilter.com resolves to the front page? If google is re-indexing just because it thinks each one is a different site, then wouldn't it be worthwhile putting up a redirect instead of showing the front page?
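
e.g. something like this at the top of the main template (untested, and you'd want to let legitimate subdomains like metatalk through too):

<!--- bounce wildcard hosts to the canonical domain --->
<cfif CGI.HTTP_HOST NEQ "www.metafilter.com">
  <cfheader statuscode="301" statustext="Moved Permanently">
  <cfheader name="Location"
    value="http://www.metafilter.com#CGI.SCRIPT_NAME#">
  <cfabort>
</cfif>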
posted by gi_wrighty at 4:49 PM on January 4, 2002


dlewis - Bingo! You nailed it. Googlebot is not using HEAD operations; it is using If-Modified-Since headers. I confirmed this by looking at my web logs: lots of 304 (Not Modified) responses there. So Matt, maybe there is some way you can set the server to key off of that?
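
If CF exposes the request headers through its CGI scope, the server side could look something like this (untested, and I'm assuming ParseDateTime's "pop" conversion copes with the HTTP date format):

<!--- answer 304 when the thread hasn't changed since the bot's last visit --->
<cfif Len(CGI.HTTP_IF_MODIFIED_SINCE)
      AND ParseDateTime(CGI.HTTP_IF_MODIFIED_SINCE, "pop") GTE lastCommentDate>
  <cfheader statuscode="304" statustext="Not Modified">
  <cfabort>
</cfif>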

andrew cooke - I haven't used ngrep but I'll check it out. I was just trying to telnet to port 80 and issue a command manually.
posted by chipr at 5:05 PM on January 4, 2002


What gi_wrighty said, sorta.

At least part of the problem could be Google indexing everything multiple times. But I'm not sure how to fix that without disabling the wildcard domain names... though I suppose that wouldn't be the worst thing ever.

The last-modified header sounds pretty promising, though.
posted by whatnotever at 9:08 PM on January 4, 2002


Matt, I spider MeFi as well for a project of mine. I can see why Google loves MeFi: lots of links to branch out from, a high degree of correlation. However, I think that for pages with a large number of comments, Google is not getting out of MeFi what they would like: large threads create too many links, often not relevant. Plus, for MeFi, a large thread getting hit again and again by bots is more bandwidth and more CPU time...

I would either disallow indexing of the threads (not the front page, please :-), or stick a no-archive meta-tag in a thread after it has grown past some threshold. Even better, dynamically create your robots.txt file, shutting down access to popular or large threads.
posted by costas at 1:58 AM on January 5, 2002


Matt: If you can serve robots.txt dynamically, you can get all of the goofy MetaFilter aliases out of Google by serving the following lines in robots.txt whenever the visited domain is not "official" (i.e. www.metafilter.com, metatalk.metafilter.com, etc.):

User-agent: Googlebot
Disallow: /
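
In CF that could be one tiny template, assuming robots.txt can be mapped to it (untested):

<!--- robots.cfm: only the canonical hosts get crawled --->
<cfcontent type="text/plain">
<cfif ListFindNoCase("www.metafilter.com,metatalk.metafilter.com", CGI.HTTP_HOST) EQ 0>
User-agent: Googlebot
Disallow: /
</cfif>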
posted by rcade at 5:56 AM on January 5, 2002

