I still think that 304 HTTP headers should be implemented. February 21, 2006 12:14 PM

"304 Not Modified" redux: I still think that 304 HTTP headers should be implemented. (previous thread)
posted by Sharcho to Bugs at 12:14 PM (24 comments total)

Doing a search for site:metafilter.com in Google returns over a million results. That means the search engine must crawl millions of pages on every pass. Since most crawlers place limits on how many pages are to be crawled, this results in search results that are incomplete and not fresh. With 304 headers, the search engines will only need to retrieve the pages that have changed since the last crawl (thousands instead of millions).
posted by Sharcho at 12:21 PM on February 21, 2006

Describe how to go about doing this. I know how to send HTTP headers as 304, but how could/should an app figure out whether a page has been modified when bots are randomly hitting it all the time and comments are coming in all the time?
posted by mathowie (staff) at 12:30 PM on February 21, 2006

Also, this comment in the previous thread is spot on. Everything is dynamic, so I don't see much gain if I have to track all sorts of new bits of data just to provide this.
posted by mathowie (staff) at 12:32 PM on February 21, 2006

There's a pretty simple 304 tutorial for ColdFusion here

Once the browser/crawler receives a Last-Modified header, it will send an If-Modified-Since header on its next request; the server returns the page as normal if it has been modified. Otherwise the server will return a 304 Not Modified response.
posted by Sharcho at 12:36 PM on February 21, 2006
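The handshake described above can be sketched in a few lines. This is an illustration only, not MetaFilter's actual code (the site runs ColdFusion); Python stands in, and `thread_last_modified` is a hypothetical timestamp the app would already have on hand:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def respond(thread_last_modified, if_modified_since):
    """Handle a (possibly conditional) GET for a thread page.

    Returns (status, headers, body). If the client sent If-Modified-Since
    and the thread hasn't changed since then, answer 304 with no body.
    """
    headers = {"Last-Modified": format_datetime(thread_last_modified)}
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        if thread_last_modified <= since:
            return 304, headers, b""        # a 304 carries no message body
    return 200, headers, b"<html>...full thread page...</html>"

# First visit: no conditional header, so the full page comes back.
last_mod = datetime(2006, 2, 21, 12, 14, tzinfo=timezone.utc)
status, headers, body = respond(last_mod, None)

# Revisit, echoing back the Last-Modified we were given: empty 304.
status2, _, body2 = respond(last_mod, headers["Last-Modified"])
```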

Yeah, but that seems like a waste of cycles, requiring every pageview to do a last-comment-date lookup and check to see if the page is new.

Perhaps that extra processor and db hit would negate any savings from reduced search bot visits. How about I just force a 304 header on all content no longer getting comments (>1 month old)?
posted by mathowie (staff) at 12:41 PM on February 21, 2006

mathowie, I suggest adding the Last-Modified header just to threads that are older than 30 days. In the worst case, people looking at old threads will see the older cached version. This won't affect logged-in users anyway due to cookies.
posted by Sharcho at 12:50 PM on February 21, 2006

It would be more efficient to add a last-updated field to the table, and to update it every time someone posts a comment, rather than scan all the comments in the thread for the last date.
posted by Sharcho at 12:57 PM on February 21, 2006
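Sketching that suggestion (with hypothetical table and column names, since the real schema is unknown): the cached timestamp gets refreshed in the same transaction as the comment insert, so a Last-Modified lookup never has to scan the comments table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE threads  (id INTEGER PRIMARY KEY, last_updated TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, thread_id INTEGER,
                           posted_at TEXT, body TEXT);
    INSERT INTO threads (id, last_updated) VALUES (1, '2006-01-21 00:00:00');
""")

def post_comment(thread_id, posted_at, body):
    # Insert the comment and refresh the cached last_updated together,
    # so both succeed or both roll back.
    with conn:
        conn.execute("INSERT INTO comments (thread_id, posted_at, body) "
                     "VALUES (?, ?, ?)", (thread_id, posted_at, body))
        conn.execute("UPDATE threads SET last_updated = ? WHERE id = ?",
                     (posted_at, thread_id))

post_comment(1, '2006-02-21 12:57:00', 'example comment')
last_updated = conn.execute(
    "SELECT last_updated FROM threads WHERE id = 1").fetchone()[0]
```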

That would be duplicated data in multiple tables! The horror!
posted by smackfu at 1:51 PM on February 21, 2006

Think of it as a cache instead, smackfu.
posted by grouse at 1:54 PM on February 21, 2006

grouse: "Think of it as a cache instead, smackfu."

A terrible, hideous, burning cache. The DBMS should do all the caching you need.
posted by Plutor at 2:24 PM on February 21, 2006

The DBMS should do all the caching you need.

But apparently there are fears that it isn't.
posted by grouse at 2:40 PM on February 21, 2006

Plutor: The DBMS can only cache what it knows about (e.g., it doesn't know about HTML changes), and multiple levels of cache are inevitable. Any system will have to trade off the hassle of maintaining cache information against "if it ain't broke, don't fix it".

The comment mathowie linked is presumably going under the assumption that it'll be a new field in the database, and that it'll be a lot of work. It could be as simple as a bit of CFMX code that sends it for 30-day-old stories. It all depends on how granular you want to go.

Also, once you figure out the headers (both HTTP status codes and Vary headers) you can put Squid in front of your server and it'll take the load off your webserver.
posted by holloway at 3:22 PM on February 21, 2006
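For a rough idea of what "Squid in front" means, here is a two-line accelerator sketch in Squid 2.x-era syntax (treat the exact directives and hostnames as assumptions to verify against the Squid documentation): Squid answers on port 80 and forwards cache misses to the real webserver on port 8080.

```
# squid.conf sketch: Squid as a reverse proxy (accelerator)
http_port 80 accel defaultsite=www.metafilter.com
cache_peer 127.0.0.1 parent 8080 0 no-query originserver
```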

^^ as in, it's not just about browsers and search engines. It could be a useful bit of infrastructure.
posted by holloway at 3:24 PM on February 21, 2006

Weird, if I set a 304 on pages older than 1 month on MetaTalk, the pages show up as unparsed HTML in Firefox, even after forcing a content type of text/html. Weird.
posted by mathowie (staff) at 4:15 PM on February 21, 2006

Huh, I didn't know that...
10.3.5 304 Not Modified

If the client has performed a conditional GET request and access is allowed, but the document has not been modified, the server SHOULD respond with this status code. The 304 response MUST NOT contain a message-body, and thus is always terminated by the first empty line after the header fields.
Is the 304 based on If-Modified-Since / ETag headers?

(my use has always been via Squid, which I guess fixed my mistake)
posted by holloway at 5:32 PM on February 21, 2006
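The MUST NOT clause quoted above is the likely culprit for the unparsed-HTML symptom: if an app sets the 304 status but still streams the page body, clients receive bytes they were told not to expect. A minimal sketch of the correct shape, with Python's `http.server` standing in for the ColdFusion app and a made-up Last-Modified value (real code would compare dates rather than strings):

```python
import threading
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer

LAST_MODIFIED = "Tue, 21 Feb 2006 12:14:00 GMT"

class ThreadPage(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.headers.get("If-Modified-Since") == LAST_MODIFIED:
            # A 304 ends at the blank line after the headers: no body.
            self.send_response(HTTPStatus.NOT_MODIFIED)
            self.send_header("Last-Modified", LAST_MODIFIED)
            self.end_headers()              # nothing is written after this
        else:
            body = b"<html><body>the thread</body></html>"
            self.send_response(HTTPStatus.OK)
            self.send_header("Content-Type", "text/html")
            self.send_header("Last-Modified", LAST_MODIFIED)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):           # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), ThreadPage)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()
```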

mathowie, you can use http://web-sniffer.net/ and the Cacheability engine to compare the HTTP response headers to see what went wrong.
posted by Sharcho at 5:33 PM on February 21, 2006

mathowie, make sure that after you've fixed the content type header you do a Shift-Reload. Otherwise, even after you fix it, it might show the cached version.

The Live HTTP Headers Firefox extension would also be useful for debugging the problem.
posted by Sharcho at 5:42 PM on February 21, 2006

Here are more tips along the same lines, including 304 information.
posted by kcm at 8:14 PM on February 21, 2006

Sharcho, I have the FF liveheaders extension and I am pulling up new files. The server is serving up plain text files instead of HTML.
posted by mathowie (staff) at 8:42 PM on February 21, 2006

You know, it really fucking turns me on when you people talk all that technical computer shit.
posted by the quidnunc kid at 3:10 AM on February 22, 2006

Press it again then, it'll turn off.
posted by NinjaPirate at 3:41 AM on February 22, 2006

mathowie, here's a long shot: you might have accidentally added whitespace that is sent before the HTTP headers, and that might cause the response to be interpreted as plain text.
posted by Sharcho at 2:16 PM on February 22, 2006

The logic here is suspect.

First of all, search engines don't crawl sites all at once. They do it spread out over time, so that the load is essentially constant. To Google, it really matters little whether it's got to check a thousand or a million pages on *.metafilter.com. They index billions and billions of pages, and they have the infrastructure to do it. As long as it's not excluded in robots.txt and it's linked from somewhere, they will spider it.

Second of all, being able to reply with 304 doesn't save googlebot any requests. The spider still has to make the same number of page requests. The only thing that changes is whether the metafilter server responds with the page itself or responds with an empty response and "Not Modified." Either way, the costs of making the connection and serving a response are still there; they don't magically go away. All you save is the bandwidth of sending the page and the CPU+DB hit of having to generate the page. However, you still have a DB hit, since you have to check the DB to see whether the requested page has been modified or not.

Third, if you start making dynamic pages cacheable then you run the risk of people's browsers -- not to mention transparent proxy caches -- getting in the way. I know the HTTP standard is very precise about when and how you can cache a page but it seems that browsers are overzealous about this. If you start making pages return Last-Modified or ETag headers then that is a signal to the browser that it's cacheable. And they will often cache it more aggressively than a standard dynamic page. This can result in pages getting "stuck" where someone posts a comment but it doesn't show up until you refresh a bunch of times or until you force a refresh with Ctrl-F5. Nobody wants that because it makes participating in threads annoying.

The only real argument that could be made for savings is bandwidth, and that is much better dealt with by using mod_gzip (or however your server implements HTTP compression / Content-Encoding: deflate). In fact, many sites are nowhere near hitting their bandwidth limits but are close to running out of CPU or database resources, and in those cases the extra bookkeeping of making pages cacheable can often work in the opposite direction. We can't know for sure without knowing the details of MetaFilter's hosting arrangement.
posted by Rhomboid at 7:51 PM on February 22, 2006
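The bandwidth point above is easy to quantify: HTML full of repeated markup compresses very well, which is why gzip usually buys more than conditional GETs do. A rough sketch, using a synthetic page, so the exact ratio is illustrative only:

```python
import gzip

# A synthetic comment page: highly repetitive markup, like most HTML.
page = (b"<html><body>"
        + b"<div class='comment'><p>some comment text</p></div>" * 200
        + b"</body></html>")

compressed = gzip.compress(page)
savings = 1 - len(compressed) / len(page)   # fraction of bytes saved
```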

* It isn't going to affect logged-in users (pages with cookies are always stale), and it's only for pages over 30 days old.
* There's no overhead involved, and anyway it will save a lot of bandwidth and CPU.
* It makes it possible to put a proxy server (e.g. Squid) in front of the server to reduce the load on the webserver.
* The crawler reduces its crawling speed every time it sees slow responses, timeouts, 500 errors, etc., which happen very frequently here.
* Once the crawler knows the last changed date, it can better prioritize the crawling.
* It is recommended by Yahoo and Google.
posted by Sharcho at 2:38 AM on February 23, 2006
