Suggested use of robots.txt for better searching of MeFi and related subdomains January 15, 2006 8:40 AM

Robots.txt for the <tagname>.metafilter.com domains should exclude all robots. Currently the robots reindex the entire site for each subdomain.
posted by Sharcho to Bugs at 8:40 AM (23 comments total)

I believe that fixing this and implementing "304 Not Modified" would significantly improve the search results from search engines.
posted by Sharcho at 8:47 AM on January 15, 2006
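
For readers unfamiliar with "304 Not Modified": a crawler can send the date of its cached copy in an If-Modified-Since header, and the server can answer 304 with no body if the page has not changed since then. A minimal sketch of that decision follows, assuming the server knows each page's last-modified time; the function name and return shape are illustrative, not how MetaFilter implements it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(last_modified, if_modified_since=None):
    """Return (status, headers) for a page last changed at `last_modified`.

    `last_modified` must be a datetime with tzinfo=timezone.utc.
    `if_modified_since` is the raw request header value, or None.
    """
    # HTTP dates have one-second resolution, so drop microseconds.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: send the full page
        # Only compare when the header carried a time zone (e.g. "GMT").
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers  # crawler's copy is still current
    return 200, headers
```

A crawler that revisits an unchanged thread would then receive only headers, which is where the bandwidth and freshness benefit would come from.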


Mod note: fixed the brackets in the post; once you do a hard preview, the &gt; and &lt; entities get literalized
posted by jessamyn (staff) at 9:05 AM on January 15, 2006


I think it's too early to tell if search results will be at all affected by tagname subdomains.
posted by mathowie (staff) at 10:02 AM on January 15, 2006


*sadly shakes head*
posted by quonsar at 10:21 AM on January 15, 2006


mathowie, I think it's pretty logical: if a crawler now has to index 50,000,000 pages instead of 50,000, you won't get better or fresher results. It also puts unneeded load on the server and adds repeated results.
posted by Sharcho at 10:49 AM on January 15, 2006
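
The arithmetic behind those two figures can be spelled out. The sketch below only restates the numbers quoted in the comment above; the subdomain count is not given in the thread and is merely what the two figures imply.

```python
PAGES_ON_WWW = 50_000        # figure quoted in the comment above
PAGES_CRAWLED = 50_000_000   # figure quoted in the comment above


def crawl_surface(pages, tag_subdomains):
    """URLs a crawler sees if every tag subdomain serves every page."""
    return pages * tag_subdomains


# Number of tag subdomains implied by the two quoted figures.
implied_subdomains = PAGES_CRAWLED // PAGES_ON_WWW
print(implied_subdomains, crawl_surface(PAGES_ON_WWW, implied_subdomains))
```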


Better yet, all of the thread links, etc. from foo.metafilter.com should explicitly go to www. That way, just the "home pages" on the tag subdomains would be retrieved.
posted by Plutor at 10:54 AM on January 15, 2006


Ok, I can make sure all front page links are hardcoded to www...
posted by mathowie (staff) at 11:06 AM on January 15, 2006


Also, a few people volunteered to create an offsite search. How about one big RSS file of every post to the front page to date (in RSS 2.0, with usernames and dates shown) and then search engine authors could just use the daily rss feed for updates and offer a search of the 45k threads using that data?
posted by mathowie (staff) at 11:08 AM on January 15, 2006
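
As a rough illustration of what such an archive feed could look like, here is a sketch that builds an RSS 2.0 document with one item per post, including username and date. The channel title, the post dictionary keys, and the sample post are assumptions made for the example. RSS 2.0's own author element is meant for an email address, so the username goes in a Dublin Core dc:creator element instead.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_rss(posts):
    """Serialize posts (dicts with title, link, user, date) as RSS 2.0."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # Channel metadata below is a placeholder, not the site's real feed header.
    ET.SubElement(channel, "title").text = "MetaFilter front page archive"
    ET.SubElement(channel, "link").text = "http://www.metafilter.com/"
    ET.SubElement(channel, "description").text = "Every front page post to date"
    for post in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, "link").text = post["link"]
        ET.SubElement(item, "{%s}creator" % DC_NS).text = post["user"]
        # pubDate uses the RFC 822 style date format that RSS 2.0 expects.
        ET.SubElement(item, "pubDate").text = format_datetime(post["date"], usegmt=True)
    return ET.tostring(rss, encoding="unicode")


# Hypothetical sample post, for illustration only.
sample = [{
    "title": "Example post",
    "link": "http://www.metafilter.com/mefi/11111",
    "user": "exampleuser",
    "date": datetime(2006, 1, 15, 8, 40, tzinfo=timezone.utc),
}]
feed_xml = build_rss(sample)
```

A search engine author could load the full file once and then append items from the daily feed.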


mathowie, once you create that huge RSS file, I recommend that you submit that RSS file to Google Sitemaps (which can work with RSS feeds). Shouldn't take more than 10 minutes once you have the RSS feed.
posted by Sharcho at 11:27 AM on January 15, 2006


You'd have to change the links across the entire site to point to www., not just on the home page. E.g. the page at http://art.metafilter.com/mefi/11111 still contains relative links (previous, older/newer, etc.), so a crawler will again start to crawl the entire subdomain. It would be rather difficult to get rid of all the relative links. The robots.txt solution is much simpler.
posted by Sharcho at 11:41 AM on January 15, 2006
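
To see why relative links keep a crawler on the tag subdomain, the sketch below resolves a relative href the way a crawler would and then rewrites the result to the www host. The href value and the choice of www as the canonical host are assumptions for the example.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

CANONICAL_HOST = "www.metafilter.com"  # assumption: www is the canonical host


def resolve(page_url, href):
    """Resolve an href relative to the page it appears on."""
    return urljoin(page_url, href)


def canonicalize(url, host=CANONICAL_HOST):
    """Rewrite an absolute URL so it points at the canonical host."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


page = "http://art.metafilter.com/mefi/11111"
relative = resolve(page, "/mefi/11110")  # hypothetical "older" link
fixed = canonicalize(relative)
print(relative)  # stays on art.metafilter.com
print(fixed)     # points at www.metafilter.com
```

Every relative link on every page would need this treatment (or a base href), which is the effort the robots.txt approach avoids.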


i only just understood what you're advocating (after staring at the robots.txt standard defn for ages) - have a robots.txt that blocks everything and return that for queries on X.metafilter.com where X is not a member of (www,ask,meta), right? because you don't specify the URL in robots.txt (which is what i thought you meant).

not sure why i'm posting this, really. in case anyone else was as stupid, i guess...
posted by andrew cooke at 12:48 PM on January 15, 2006


andrew cooke, exactly. For www/ask/meta/projects it should return the current robots.txt, but for all the other subdomains it should return a different robots.txt that blocks everything, (or everything except the root page)
posted by Sharcho at 2:03 PM on January 15, 2006


Sharcho, the bummer of that idea is the foo.metafilter.com page is just the index page of www.metafilter.com. I can't think of a simple way to make the robots.txt file available for all non-www domains.

I could just set a base ref on the subdomain headers, so every link goes to www.mefi.com/mefi/123, etc.
posted by mathowie (staff) at 4:45 PM on January 15, 2006


mathowie, you could do it easily with an htaccess rewrite,
e.g.
RewriteRule ^robots\.txt$ /robots.cfm [NC,L]

And in the /robots.cfm do (pseudo-code):

if (subdomain is www/ask/projects/metatalk) then {
    output current_robots.txt
} else {
    output other_robots.txt
}

posted by Sharcho at 6:05 PM on January 15, 2006
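
The pseudo-code above can be sketched in runnable form. The version below is in Python rather than ColdFusion, and the contents of the "current" robots.txt are not shown in the thread, so an allow-everything file stands in for it. It uses the standard library's robots.txt parser to confirm the effect on a crawler.

```python
from urllib.robotparser import RobotFileParser

# Subdomains that keep the normal robots.txt, per the pseudo-code above.
MAIN_SUBDOMAINS = {"www", "ask", "projects", "metatalk"}

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # stand-in for the current robots.txt
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # served on every tag subdomain


def robots_txt_for(host):
    """Pick the robots.txt body based on the first label of the Host header."""
    subdomain = host.lower().split(":")[0].split(".")[0]
    return ALLOW_ALL if subdomain in MAIN_SUBDOMAINS else BLOCK_ALL


def can_crawl(host, path):
    """Check whether a generic crawler may fetch `path` on `host`."""
    parser = RobotFileParser()
    parser.parse(robots_txt_for(host).splitlines())
    return parser.can_fetch("*", "http://%s%s" % (host, path))


print(can_crawl("www.metafilter.com", "/mefi/11111"))
print(can_crawl("art.metafilter.com", "/mefi/11111"))
```

Because robots.txt rules contain only paths, not hostnames, the choice has to be made per request from the Host header, which is what the rewrite to a script enables.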


Hmm, that Apache directive didn't seem to work; it still loaded up my existing robots.txt. Instead I just set Apache/CF to parse .txt files as cfm and made a custom robots.cfm file that seems to work: www, foo robots.
posted by mathowie (staff) at 7:31 PM on January 15, 2006


robots.txt on metatalk/projects/ask subdomains brings an error
posted by Sharcho at 7:34 PM on January 15, 2006


It's a 404 error, so it doesn't actually matter.
posted by Sharcho at 8:09 PM on January 15, 2006


yeah, it never existed for those other sites, nor is there a need really.
posted by mathowie (staff) at 10:18 PM on January 15, 2006


maybe i don't understand, but why can't [tagname].metafilter.com just list FPPs that match [tagname] and have the links to the comments, username, etc. go to the sites that host them (ask, meta, www)? won't that solve all this?

e.g. batshitinsane.metafilter.com lists amongst other things:

Voters in the US state of Minnesota may find a self-proclaimed vampire on the ballot for the office this year when Jonathon "The Impaler" Sharkey of the Vampyres, Witches and Pagans Party announces his plan to run for Governor, expected later today. Acknowledging that "politics is a cut-throat business", Sharkey has let voters know that whilst he is a Satanist, he doesn't hate Jesus, "just God, the Father."
posted by Effigy2000 at 9:16 PM EST - 39 comments


(mouseover the links)
posted by jojomnky at 9:45 AM on January 16, 2006


Huh. What jojomnky said.
posted by cortex at 10:24 AM on January 16, 2006


mathowie, another thing that needs a 301 redirect is
http://www.metafilter.com/comments.mefi/12345 ->
http://www.metafilter.com/mefi/12345

http://www.metafilter.com/user.mefi/12345 ->
http://www.metafilter.com/user/12345
posted by Sharcho at 11:39 AM on January 16, 2006


They're the exact same file requests, Sharcho; one (/mefi/foo) is just an Apache mapping to the other (/comments.mefi/foo).
posted by mathowie (staff) at 2:53 PM on January 16, 2006


mathowie, yes, I know, and that's the problem. You should change the mapping so it becomes a 301 redirect (like you already did for metafilter.com -> www.metafilter.com).

Duplicate content causes problems for the search engines, for the same reasons that are mentioned above.

The .htaccess rewrite rule should use [R=301,L]

BTW: MeFi thread 12345 above returns a 500 error
posted by Sharcho at 3:31 PM on January 16, 2006
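
The two mappings requested above can be sketched as a small lookup that turns a legacy path into a permanent redirect. This is only an illustration of the URL logic in Python; on the real site it would presumably live in the Apache rewrite rules, and the function name is made up for the example.

```python
import re

# Legacy path patterns and the canonical paths they should redirect to.
LEGACY_PATTERNS = [
    (re.compile(r"^/comments\.mefi/(\d+)$"), "/mefi/{}"),
    (re.compile(r"^/user\.mefi/(\d+)$"), "/user/{}"),
]


def redirect_for(path):
    """Return (301, new_path) for a legacy path, or None if no redirect applies."""
    for pattern, template in LEGACY_PATTERNS:
        match = pattern.match(path)
        if match:
            return 301, template.format(match.group(1))
    return None


print(redirect_for("/comments.mefi/12345"))
print(redirect_for("/mefi/12345"))
```

With a 301 in place, search engines would see a single URL per thread or user page instead of two copies of the same content.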

