I recently delivered a presentation to PHP London about scaling web applications. One of the most interesting things to come out of it was talk of Edge Side Includes (ESI), a technology invented and implemented by the content delivery network Akamai to enable content publishers to set different expiry times on different portions of their web content.
For example, if you have a page with a 'hot news' section that changes every few minutes, but the rest of the page remains static for periods of days or weeks, then in order to cache the page as a whole at the CDN level, you would need to set an expiry time of just a few minutes. You then suffer an unnecessarily high number of requests from the CDN for your page.
The problem is, we don't use Akamai at Assanka; we use Edgecast, who don't support ESI, at least not yet. So how can you mitigate this problem? There are other levels of caching and includes that you could use to avoid having to regenerate your entire page in order to serve the Hot News section. On your own network you could implement a cache that supports ESI, like Varnish or Squid, so that even though you're still receiving an annoyingly large number of requests, they are not making it all the way to your application servers (except the requests for hot news). Or you could do it purely at the application level. For example, if you have a WordPress blog that uses WP-Cache, you could set the cache time to an hour, and then hook in a process that runs after the cache lookup to add the Hot News section uncached. In WordPress, this is actually quite a common thing to do when you want to print a login status message like 'Welcome Andrew'.
So how about browser side includes?
For those of us who don't have Akamai's CDN, and still want the benefits of those really long expiry times, you can always use JavaScript for that Hot News section:
<script type='text/javascript' src='/hotnews'></script>
The hotnews script then outputs nothing more complicated than this:
document.write('Hot news section html');
The key to this method is that the main page is served with Cache-Control and Expires headers that instruct caches such as CDNs, ISP caches and the end user's browser cache, to consider the page's content valid for a long time, say an hour (could be a lot longer):
Expires: Sat, 07 Mar 2009 13:50:57 GMT
Cache-Control: max-age=3600, must-revalidate, public
The hotnews script, by contrast, returns its JavaScript with headers that make it expire very quickly, say after 2 minutes:
Expires: Sat, 07 Mar 2009 12:52:57 GMT
Cache-Control: max-age=120, must-revalidate, public
Or perhaps make the JS completely uncacheable (if it contains information personal to the end user's session, for example). The effect is that while the main page will remain cached for the hour, the hot news section will update once it is more than 2 minutes old.
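To make the two sets of headers concrete, here is a minimal sketch of how they might be generated on the server. It assumes a JavaScript (Node.js-style) backend, which is not necessarily what you run; the function names `cacheHeaders` and `uncacheableHeaders` are made up for this example, and `private, no-store` is just one common way to mark a response as uncacheable.

```javascript
// Build Expires and Cache-Control headers for a response that caches may
// reuse for maxAgeSeconds, counted from `now`.
// (Hypothetical helper, not part of any library.)
function cacheHeaders(maxAgeSeconds, now = new Date()) {
  const expires = new Date(now.getTime() + maxAgeSeconds * 1000);
  return {
    // toUTCString() yields the HTTP date format, e.g. "Sat, 07 Mar 2009 13:50:57 GMT"
    'Expires': expires.toUTCString(),
    'Cache-Control': `max-age=${maxAgeSeconds}, must-revalidate, public`,
  };
}

// Headers for an include that carries per-user data and must not be stored
// by shared caches or the browser. (Assumed policy for this sketch.)
function uncacheableHeaders() {
  return { 'Cache-Control': 'private, no-store' };
}

// Main page: valid for an hour. Hot news include: valid for two minutes.
const mainPageHeaders = cacheHeaders(3600);
const hotNewsHeaders = cacheHeaders(120);
```

With the clock at 12:50:57 GMT on 7 March 2009, `cacheHeaders(3600, now)` and `cacheHeaders(120, now)` produce the two header pairs shown above.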
I've decided to call this browser side includes. It's not a new technique - ad networks have been doing it for years, and more recently it's been used for blog badges and other embeddable widgets. There are two big problems with BSI:
- Search engines don't load linked resources, so they won't see the content loaded by the include, and people will not be able to find your site using any terms contained in your hot news content.
- Browsers load SCRIPT includes synchronously, so loading of your page will stall while the browser waits for a response from the hotnews script.
Dealing with each of these in turn: first, if the included content is private (session-based and personal to the user), you don't want a search engine to see it anyway. If it's public data that you just want to serve with a different expiry time to the rest of the page, then include a version of it with the page itself, set it to be hidden with CSS, and then replace it with your BSI. Search crawlers will then see it - it's out of date but good enough for Google - while end users will see the latest data, courtesy of the BSI.
Of course this isn't ideal. It adds to your bandwidth. Ideally, we could add a behaviour attribute to our script includes that makes search crawlers download them as includes:
<script type='text/javascript' bsi='include' src='/hotnews'></script>
But clearly I'm not in a position to dictate standards to search engines. :-)
The other issue is the synchronous loading. We can get around this by lazy loading, which simply involves moving the script to the bottom of the document and putting a placeholder DIV where you want the content to be included. The hotnews script on your server then outputs:
document.getElementById('placeholderdiv').innerHTML = 'HTML of hot news content';
This replaces one problem with another - namely that the page will fully render before the additional content is loaded, so to avoid the layout 'rejigging' itself when your include is loaded, it's worth using this technique only for content that has a fixed size.
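One detail the lazy-loading one-liner glosses over is that the hot news HTML has to be escaped before it is dropped into a JavaScript string, otherwise a stray quote or line break in the content breaks the script. Here is a minimal sketch of the server side, again assuming a JavaScript backend; `renderIncludeScript` is a hypothetical name, and using `JSON.stringify` to produce the string literal is my choice for this example rather than the only way to do it.

```javascript
// Generate the body of the /hotnews response: a small script that finds the
// placeholder element and fills it with the given HTML.
// (Hypothetical helper for illustration.)
function renderIncludeScript(placeholderId, html) {
  // JSON.stringify emits a double-quoted literal with quotes, backslashes
  // and newlines escaped, which is also a valid JavaScript string literal.
  const id = JSON.stringify(placeholderId);
  const content = JSON.stringify(html);
  // The generated code checks that the placeholder exists, so a page
  // without it does not throw an error.
  return `(function () {
  var el = document.getElementById(${id});
  if (el) { el.innerHTML = ${content}; }
})();`;
}

// Example: the response body for the placeholder used above.
const body = renderIncludeScript('placeholderdiv', '<p>It\'s "hot" news</p>');
```

The same escaping applies to the `document.write` version of the include. If the placeholder ships with a stale copy of the content for crawlers, this script simply overwrites it.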
A good rule of thumb is that the closer to the end user you can cache content, the cheaper that content is to deliver in volume. That's the mantra of most CDN sales departments. Browser side includes are another tool in that armory.