Suggested use of robots.txt for better searching of MeFi and related subdomains (January 15, 2006 8:40 AM)
Robots.txt for the <tagname>.metafilter.com domains should exclude all robots. Currently, crawlers re-index the entire site once for each subdomain.
Mod note: fixed the brackets in the post; once you do a hard preview, the &gt; and &lt; get literalized
posted by jessamyn (staff) at 9:05 AM on January 15, 2006
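For reference, the "exclude all robots" file proposed above is just two lines of standard robots.txt syntax:

```
User-agent: *
Disallow: /
```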
I think it's too early to tell if search results will be at all affected by tagname subdomains.
posted by mathowie (staff) at 10:02 AM on January 15, 2006
mathowie, I think it's pretty logical that if, instead of indexing 50,000 pages, a crawler now needs to index 50,000,000 pages, you obviously won't get better or fresher results. It also causes unneeded load on the server and adds repeated results.
posted by Sharcho at 10:49 AM on January 15, 2006
Better yet, all of the thread links, etc. from foo.metafilter.com should explicitly go to www. That way, only the "home pages" on the tag subdomains would be retrieved.
posted by Plutor at 10:54 AM on January 15, 2006
Ok, I can make sure all front page links are hardcoded to www...
posted by mathowie (staff) at 11:06 AM on January 15, 2006
Also, a few people volunteered to create an offsite search. How about one big RSS file of every post to the front page to date (in RSS 2.0, with usernames and dates shown) and then search engine authors could just use the daily rss feed for updates and offer a search of the 45k threads using that data?
posted by mathowie (staff) at 11:08 AM on January 15, 2006
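A single entry in that feed might look something like this (a sketch only: the thread number, title, and username are placeholders, and the dc:creator extension — which requires declaring the Dublin Core namespace on the rss element — is the usual way to carry a username, since RSS 2.0's own author element expects an email address):

```xml
<item>
  <title>Example front page post</title>
  <link>http://www.metafilter.com/mefi/12345</link>
  <dc:creator>example_user</dc:creator>
  <pubDate>Sun, 15 Jan 2006 11:08:00 GMT</pubDate>
  <description>The text of the post...</description>
</item>
```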
mathowie, once you create that huge RSS file, I recommend that you submit that RSS file to Google Sitemaps (which can work with RSS feeds). Shouldn't take more than 10 minutes once you have the RSS feed.
posted by Sharcho at 11:27 AM on January 15, 2006
You'd have to change links across the entire site to point to www., not just on the home page. E.g., the page at http://art.metafilter.com/mefi/11111 still contains relative links (previous/older/newer, etc.), so crawlers would again start to crawl the entire subdomain. It would be rather difficult to get rid of all the relative links; the robots.txt solution is much simpler.
posted by Sharcho at 11:41 AM on January 15, 2006
i only just understood what you're advocating (after staring at the robots.txt standard defn for ages) - have a robots.txt that blocks everything and return that for queries on X.metafilter.com where X is not a member of (www,ask,meta), right? because you don't specify the URL in robots.txt (which is what i thought you meant).
not sure why i'm posting this, really. in case anyone else was as stupid, i guess...
posted by andrew cooke at 12:48 PM on January 15, 2006
andrew cooke, exactly. For www/ask/meta/projects it should return the current robots.txt, but for all the other subdomains it should return a different robots.txt that blocks everything (or everything except the root page).
posted by Sharcho at 2:03 PM on January 15, 2006
Sharcho, the bummer of that idea is the foo.metafilter.com page is just the index page of www.metafilter.com. I can't think of a simple way to make the robots.txt file available for all non-www domains.
I could just set a base ref on the subdomain headers, so every link goes to www.mefi.com/mefi/123, etc.
posted by mathowie (staff) at 4:45 PM on January 15, 2006
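The base-ref idea is a one-line addition to the head of pages served on tag subdomains; a sketch:

```html
<head>
  <!-- Force every relative link on foo.metafilter.com to resolve against www -->
  <base href="http://www.metafilter.com/">
</head>
```

One caveat: the base element affects every relative URL on the page, including images and stylesheets, so any assets referenced relatively would also load from www.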
mathowie, you could do it easily with an htaccess rewrite,
e.g.
RewriteRule ^robots\.txt$ /robots.cfm [NC]
And in the /robots.cfm do (pseudo-code):
if (subdomain is www/ask/projects/metatalk) then {
output current_robots.txt
} else {
output other_robots.txt
}
posted by Sharcho at 6:05 PM on January 15, 2006
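The pseudo-code above can be sketched concretely. Here is a minimal Python version of the same per-subdomain check (the helper name and the two robots.txt bodies are placeholders for illustration, not MetaFilter's actual rules):

```python
# Subdomains that keep the normal robots.txt, per the thread above.
ALLOWED_SUBDOMAINS = {"www", "ask", "projects", "metatalk"}

# Placeholder robots.txt bodies; the real "current" rules would go here.
CURRENT_ROBOTS = "User-agent: *\nDisallow:\n"       # allow-all
BLOCKING_ROBOTS = "User-agent: *\nDisallow: /\n"    # block everything

def robots_for(host: str) -> str:
    """Return the robots.txt body to serve for a given Host header."""
    subdomain = host.split(".", 1)[0].lower()
    if subdomain in ALLOWED_SUBDOMAINS:
        return CURRENT_ROBOTS
    return BLOCKING_ROBOTS
```

Any unrecognized tag subdomain falls through to the blocking file, so new tags need no configuration.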
Hmm, that apache directive didn't seem to work; it still loaded up my existing robots.txt. Instead I just set apache/cf to parse .txt files as cfm and made a custom robots.cfm file that seems to work: compare the www and foo robots files.
posted by mathowie (staff) at 7:31 PM on January 15, 2006
robots.txt on the metatalk/projects/ask subdomains returns an error
posted by Sharcho at 7:34 PM on January 15, 2006
yeah, it never existed for those other sites, nor is there a need really.
posted by mathowie (staff) at 10:18 PM on January 15, 2006
maybe i don't understand, but why can't [tagname].metafilter.com just list FPPs that match [tagname] and have the links to the comments, username, etc. go to the sites that host them (ask, meta, www)? won't that solve all this?
e.g. batshitinsane.metafilter.com lists amongst other things:
Voters in the US state of Minnesota may find a self-proclaimed vampire on the ballot for the office this year when Jonathon "The Impaler" Sharkey of the Vampyres, Witches and Pagans Party announces his plan to run for Governor, expected later today. Acknowledging that "politics is a cut-throat business", Sharkey has let voters know that whilst he is a Satanist, he doesn't hate Jesus, "just God, the Father."
posted by Effigy2000 at 9:16 PM EST - 39 comments
(mouseover the links)
posted by jojomnky at 9:45 AM on January 16, 2006
mathowie, another thing that needs a 301 redirect is
http://www.metafilter.com/comments.mefi/12345 ->
http://www.metafilter.com/mefi/12345
http://www.metafilter.com/user.mefi/12345 ->
http://www.metafilter.com/user/12345
posted by Sharcho at 11:39 AM on January 16, 2006
They're the exact same file requests, Sharcho; one (/mefi/foo) is just an Apache mapping to the other (/comments.mefi/foo).
posted by mathowie (staff) at 2:53 PM on January 16, 2006
mathowie, yes, I know; that's the problem. You should change the mapping so it will be a 301 redirect (like you already did for metafilter.com -> www.metafilter.com).
Duplicate content causes problems for the search engines, for the same reasons that are mentioned above.
The .htaccess rewrite rule should use [R=301,L]
BTW: MeFi thread 12345 above returns a 500 error
posted by Sharcho at 3:31 PM on January 16, 2006
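A sketch of mod_rewrite rules for the two redirects Sharcho lists (the URL patterns are assumptions based on the paths quoted above; the thread/user IDs are numeric placeholders):

```
# Permanently redirect the old comments.mefi and user.mefi paths
RewriteRule ^comments\.mefi/(\d+)$ /mefi/$1 [R=301,L]
RewriteRule ^user\.mefi/(\d+)$ /user/$1 [R=301,L]
```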
posted by Sharcho at 8:47 AM on January 15, 2006