Doesn’t anyone use static web pages anymore? I was playing around with some code, just doing a quick prototype of an app for my Coding4Fun column and (inspired by the cool and useful PAINT tool) I decided to create an app that would check for changes to a website and give you a little quick-launch menu of updated sites on your system tray (who couldn’t use one more icon there?). I thought it would be cool to show how to use the HTTP If-Modified-Since request header combined with the Last-Modified response header to show how quick and easy it is to check for new content using the System.Net classes. Well, the code works fine, it is the web that is broken 😉
Every site I tried this on (Penny Arcade, PVP Online, even MSDN) happens to be a dynamic page (php, asp, aspx, whatever), so there is either no Last Modified header and the web server always says it has been modified (when it hasn’t) or there is a Last Modified header but it is always equal to the date/time of your request. Either way, you can’t use those headers to check if a site has been updated. Fine and dandy, I’m already losing interest in this as a sample for my column but I still want to finish it; so I figure that a quick hash of the HTML will accomplish the same thing. Save the hash and compare it against a hash of the new HTML later, and we’ll be set. Nope, no go. If there is a single bit of HTML on the page that is dynamic (such as the page request time being embedded into the HTML somewhere) then, of course, the hash will change and I’ll think the page has been updated.
I realize that by carefully scraping the page you could certainly get around these problems, but my two goals had been:
- to show how to use the headers to avoid having to even look at the response body (the HTML in this case) and
- to create a general purpose utility that would work with most websites.
On to the next idea!
Oh, and if you are wondering how that PAINT tool handles this? They check a special (likely static!) file on the web site that is automatically updated when the page is changed. That is a perfectly good solution if you are dealing with a single site, but it doesn’t really help me with the general case. And, if you are going to spit out a file that helps people (and their tools) know when your page has changed… why not just a RSS feed with links back to your page so that the advertisers are still getting all their hits?