Suggested use of robots.txt for better searching of MeFi and related subdomains January 15, 2006 8:40 AM   Subscribe

Robots.txt for the <tagname> subdomains should exclude all robots. Currently the robots reindex the entire site for each subdomain.
posted by Sharcho to Bugs at 8:40 AM (23 comments total)

I believe that fixing this and implementing "304 Not Modified" would significantly improve the search results from search engines.
posted by Sharcho at 8:47 AM on January 15, 2006
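As a rough illustration of the "304 Not Modified" idea (a hypothetical sketch, not MeFi's actual server code): when a crawler sends an If-Modified-Since header, the server compares it against the page's Last-Modified time and skips sending the body when nothing has changed, which saves bandwidth and lets crawlers revisit more often.

```python
from email.utils import parsedate_to_datetime

def conditional_get(last_modified, if_modified_since):
    """Return the status code for a conditional GET.

    last_modified: the page's Last-Modified timestamp (HTTP-date string)
    if_modified_since: the If-Modified-Since header the crawler sent, or None
    """
    if if_modified_since is None:
        return 200  # no validator sent; serve the full page
    page_time = parsedate_to_datetime(last_modified)
    client_time = parsedate_to_datetime(if_modified_since)
    # Unchanged since the crawler's copy: send 304 with no body
    return 304 if page_time <= client_time else 200
```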

Mod note: fixed the brackets in the post, once you do a hard preview the gt; lt; get literalized
posted by jessamyn (staff) at 9:05 AM on January 15, 2006

I think it's too early to tell if search results will be at all affected by tagname subdomains.
posted by mathowie (staff) at 10:02 AM on January 15, 2006

*sadly shakes head*
posted by quonsar at 10:21 AM on January 15, 2006

mathowie, I think it's pretty logical that if instead of indexing 50,000 pages a crawler now needs to index 50,000,000, you obviously won't get better or fresher results. It also causes unneeded load on the server and adds duplicate results.
posted by Sharcho at 10:49 AM on January 15, 2006

Better yet, all of the thread links, etc. from the tag subdomains should explicitly go to www. That way, only the "home pages" on the tag subdomains would be retrieved.
posted by Plutor at 10:54 AM on January 15, 2006

Ok, I can make sure all front page links are hardcoded to www...
posted by mathowie (staff) at 11:06 AM on January 15, 2006

Also, a few people volunteered to create an offsite search. How about one big RSS file of every post to the front page to date (in RSS 2.0, with usernames and dates shown) and then search engine authors could just use the daily rss feed for updates and offer a search of the 45k threads using that data?
posted by mathowie (staff) at 11:08 AM on January 15, 2006

mathowie, once you create that huge RSS file, I recommend that you submit that RSS file to Google Sitemaps (which can work with RSS feeds). Shouldn't take more than 10 minutes once you have the RSS feed.
posted by Sharcho at 11:27 AM on January 15, 2006
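For reference, one front-page post in such a feed might look like this in RSS 2.0 (all titles, links, usernames, and dates below are placeholders, not real MeFi data):

```xml
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>MetaFilter</title>
    <link>http://www.metafilter.com/</link>
    <description>Front page posts</description>
    <item>
      <title>Example post title</title>
      <link>http://www.metafilter.com/mefi/12345</link>
      <dc:creator>example_user</dc:creator>
      <pubDate>Sun, 15 Jan 2006 08:40:00 GMT</pubDate>
      <description>Post text goes here...</description>
    </item>
  </channel>
</rss>
```

(Strict RSS 2.0 reserves `<author>` for an email address, so `dc:creator` is the usual way to show a username.)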

You'd have to change links across the entire site to point to www., not just on the home page. E.g. thread pages (the older/newer links, etc.) still contain relative links, so crawlers will again start to crawl the entire subdomain. It would be rather difficult to get rid of all the relative links. The robots.txt solution is much simpler.
posted by Sharcho at 11:41 AM on January 15, 2006

i only just understood what you're advocating (after staring at the robots.txt standard definition for ages) - have a robots.txt that blocks everything and return that for queries on subdomains X where X is not a member of (www, ask, meta), right? because you don't specify the URL in robots.txt (which is what i thought you meant).

not sure why i'm posting this, really. in case anyone else was as stupid, i guess...
posted by andrew cooke at 12:48 PM on January 15, 2006

andrew cooke, exactly. For www/ask/meta/projects it should return the current robots.txt, but for all the other subdomains it should return a different robots.txt that blocks everything (or everything except the root page).
posted by Sharcho at 2:03 PM on January 15, 2006
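Spelled out, the two variants Sharcho describes would be two separate files, roughly like this (comments added for illustration; "allow only the root page" needs the nonstandard Allow extension, so the simple block-everything form is shown):

```
# Variant served on www/ask/meta/projects — crawl normally
User-agent: *
Disallow:

# Variant served on every other subdomain — block all robots
User-agent: *
Disallow: /
```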

Sharcho, the bummer of that idea is the page is just the index page. I can't think of a simple way to make a different robots.txt file available for all non-www domains.

I could just set a base ref in the subdomain headers, so every link goes to the www version.
posted by mathowie (staff) at 4:45 PM on January 15, 2006
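The base-ref approach would mean emitting something like this in the head of every page served on a tag subdomain (hypothetical markup, assuming www.metafilter.com is the canonical host):

```html
<head>
  <!-- make every relative link on this subdomain resolve against www -->
  <base href="http://www.metafilter.com/">
</head>
```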

mathowie, you could do it easily with an .htaccess rewrite:
RewriteRule ^robots\.txt$ /robots.cfm [NC]

And in /robots.cfm do (pseudo-code):

if (subdomain is www/ask/projects/metatalk) then {
    output current_robots.txt
} else {
    output other_robots.txt
}

posted by Sharcho at 6:05 PM on January 15, 2006

Hmm, that apache directive didn't seem to work; it still loaded up my existing robots.txt. Instead I just set apache/cf to parse .txt files as cfm and made a custom robots.cfm file that seems to work (www robots, foo robots).
posted by mathowie (staff) at 7:31 PM on January 15, 2006
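The logic mathowie describes — a dynamic robots.txt keyed off the request's Host header — can be sketched like this, in Python rather than the site's ColdFusion, with the hostnames assumed for illustration:

```python
# Subdomains that search engines should crawl normally
CRAWLABLE = {"www", "ask", "metatalk", "projects"}

ALLOW_ALL = "User-agent: *\nDisallow:\n"   # permissive robots.txt
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # block-everything robots.txt

def robots_body(host):
    """Pick the robots.txt body from the Host header, e.g. 'foo.metafilter.com'."""
    subdomain = host.split(".")[0].lower()
    return ALLOW_ALL if subdomain in CRAWLABLE else BLOCK_ALL
```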

robots.txt on the metatalk/projects/ask subdomains brings up an error
posted by Sharcho at 7:34 PM on January 15, 2006

It's a 404 error, so it doesn't actually matter.
posted by Sharcho at 8:09 PM on January 15, 2006

yeah, it never existed for those other sites, nor is there a need really.
posted by mathowie (staff) at 10:18 PM on January 15, 2006

maybe i don't understand, but why can't [tagname] just list fpp's that match [tagname] and have the links to the comments, username, etc. go to the sites that host them (ask, meta, www)? won't that solve all this?

e.g. it lists, amongst other things:

Voters in the US state of Minnesota may find a self-proclaimed vampire on the ballot for the office this year when Jonathon "The Impaler" Sharkey of the Vampyres, Witches and Pagans Party announces his plan to run for Governor, expected later today. Acknowledging that "politics is a cut-throat business", Sharkey has let voters know that whilst he is a Satanist, he doesn't hate Jesus, "just God, the Father."
posted by Effigy2000 at 9:16 PM EST - 39 comments

(mouseover the links)
posted by jojomnky at 9:45 AM on January 16, 2006

Huh. What jojomnky said.
posted by cortex at 10:24 AM on January 16, 2006

mathowie, another thing that needs a 301 redirect is -> ->
posted by Sharcho at 11:39 AM on January 16, 2006

They're the exact same file requests, Sharcho; one (/mefi/foo) is just an apache mapping to the other (/comments.mefi/foo).
posted by mathowie (staff) at 2:53 PM on January 16, 2006

mathowie, yes, I know, that's the problem: you should change the mapping so it will be a 301 redirect (like you already did for the other redirect).

Duplicate content causes problems for the search engines, for the same reasons that are mentioned above.

The .htaccess rewrite rule should use [R=301,L]

BTW: MeFi thread 12345 above returns a 500 error
posted by Sharcho at 3:31 PM on January 16, 2006
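For the duplicate-path case, the mod_rewrite rule would look something like this (the exact paths and redirect direction are assumptions, since the URLs above were lost; the essential part is the R=301,L flags, which make Apache send a permanent redirect instead of serving the same content at both URLs):

```
RewriteEngine On
# Permanently redirect the duplicate path to the canonical one
RewriteRule ^comments\.mefi/(.*)$ /mefi/$1 [R=301,L]
```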

