“The new image is showing but I think it’s using the old stylesheet!”

Sound familiar?

Caching?

Caching of a web page’s assets such as CSS and image files can be a double-edged sword. On the one hand, if done right, it can lead to much faster load times with less strain on the server. On the other hand, if it is done incorrectly or, worse, not considered at all, developers are opening themselves up to all kinds of synchronisation issues whenever files are modified.

In a typical web application, certain assets rarely change. Common theme images and JavaScript libraries are good examples of this. CSS files and the site’s core JavaScript functionality, on the other hand, are prime candidates for frequent change, but this is not an exact science and is generally impossible to predict.

Caching of assets is the browser’s default behaviour. If an expiry time is not specifically set, it is up to the browser to decide how long to wait before checking the server for a new version. Once a file is in a browser’s cache you’re at the mercy of the browser as to when the user will see the new version. Minutes? Hours? Days? Who knows. Your only option is to rename the asset in order to force the new version to be fetched.

So caching is evil, right? Well, no. With a little forethought, caching is your friend. And the user’s friend. And the web server’s friend. Treated right, it’s the life of the party.

Imagine your site is deployed once and nothing changes for eternity. The optimal caching strategy here is to instruct the browser to cache everything indefinitely. This means that, after the first visit, a user may never have to contact the server again. Load times are speedy. Your server’s relaxed. All is well. The problem, of course, is that any changes you do inevitably make will never be shown to users who have the site in their cache. At least, not without renaming the changed asset so the browser considers it a new file.

So the problem is that we want the browser to cache everything forever. Unless we change something. And we want the browser to know when we do this. Without asking us. And it’d be nice if this was automated. Ideas?

Option One – Set an expiry date in the past for all assets

Never cache anything!

Not really an option, but it does solve half of the problem. The browser will never cache anything and so the user will always see the latest version of all site assets. It works, but we’re completely missing out on one of the main benefits of caching – faster loading for the user and less stress on the server. Next.
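As a rough illustration of Option One, here is a minimal PHP sketch of the headers involved. The function name no_cache_headers is made up for this example, and the exact set of headers you need may vary with the browsers and proxies you care about.

```php
<?php
// Hypothetical helper (not part of the deployment script in this article):
// returns header lines asking the browser not to reuse a cached copy.
function no_cache_headers() {
    return array(
        // an expiry date in the past marks the response as already stale
        'Expires: Thu, 01 Jan 1970 00:00:00 GMT',
        // the HTTP/1.1 equivalent, for clients that honour Cache-Control
        'Cache-Control: no-store, no-cache, must-revalidate, max-age=0',
        // legacy HTTP/1.0 header
        'Pragma: no-cache',
    );
}

// In a real request these would be sent before any output, e.g.:
// foreach (no_cache_headers() as $str_header) { header($str_header); }
```

This only applies to responses generated by PHP. For static files the same headers would normally be set in the web server configuration instead.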

Option Two – Include a site version string in every URL

One commonly used strategy is to include a unique identifier in every URL which is changed whenever the site is deployed. For example, an image at the following URL:

/images/logo.png

Would become:

/images/logo.82.png

Here, 82 is a unique identifier. With some Apache mod_rewrite trickery, we can transparently map this to the original URL. As far as the browser is concerned, this is a different file to the previous logo.81.png image and so any existing cache of this file is ignored.

Generally, this technique is employed in a semi-automated way. The version number can either be set manually in a configuration file (for example) or pulled from the repository version number. With this technique, all assets can be set to cache indefinitely.
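To make the idea concrete, a template helper along these lines could build the versioned URLs. This is only a sketch: the name add_site_version is hypothetical, and where the version number comes from (config file, repository revision) is left to you.

```php
<?php
// Hypothetical helper: insert a site-wide version number before the file extension.
function add_site_version($str_url, $int_version) {
    // "/images/logo.png" with version 82 becomes "/images/logo.82.png";
    // URLs without an extension are returned unchanged
    return preg_replace('/(\.[a-zA-Z0-9]+)$/', '.' . $int_version . '$1', $str_url);
}

echo add_site_version('/images/logo.png', 82), "\n"; // /images/logo.82.png
```

Because every URL shares the same number, bumping it on deployment changes every asset URL at once, which is exactly the trade-off discussed next.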

The above is a pretty good solution. I’ve used it myself. But it’s not optimal. Every time a new version of the site is deployed, every asset in the user’s cache is invalidated and the whole site needs to be downloaded again. If site updates are infrequent, this isn’t too much of a problem. It sure as hell beats never caching anything or, worse, leaving the browser to decide how long to cache each item.

Option Three – Fine grained caching + Automated!

Clearly, the solution is to include a unique version string per file. This means that every file is considered independently and will only be re-downloaded if it has actually changed. One technique for doing this is to use the file’s last-modified timestamp. This gives a unique ID for the file which will change every time the file contents change. If the file is under version control (your projects are versioned, right?) we can’t use the modified timestamp as-is since it will change whenever the file is checked out. But we can find out what revision the file was changed in (under SVN at least) so we’re still good to go.
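For SVN, the number we are after is the "Last Changed Rev" line reported by svn info, not the working copy’s overall revision. Here is a small sketch of pulling it out of the command’s output. The function name is hypothetical and the sample text is a trimmed, illustrative stand-in for real svn info output.

```php
<?php
// Hypothetical helper: extract the last changed revision from `svn info` output.
// Returns null when the line is missing (e.g. the file is not under version control).
function parse_last_changed_rev($str_svn_info) {
    if (preg_match('/^Last Changed Rev: ([0-9]+)/m', $str_svn_info, $arr_m)) {
        return (int) $arr_m[1];
    }
    return null;
}

// Illustrative sample only: "Revision" is the working copy revision,
// "Last Changed Rev" is the revision in which this file last changed.
$str_sample = "Path: www/images/logo.png\n"
            . "Revision: 90\n"
            . "Last Changed Rev: 82\n";

echo parse_last_changed_rev($str_sample), "\n"; // 82
```

Using 82 rather than 90 is what makes the caching fine grained: the file’s URL only changes when the file itself does.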

The goal is as follows: To instruct the browser to cache all assets (in this case, JavaScript, CSS and all image files) indefinitely. Whenever an asset changes, we want the URL to also change. The result of this is that whenever we deploy a new version of the site, only assets that have actually changed will be given a new URL. So if you’ve only changed one CSS file and a couple of images, repeat visits to the site will only need to re-download these files. We’d also like it to be automated. Only a masochist would attempt to manually change URLs whenever something changes on any sufficiently complex site.

Presented here is an automated solution for efficient caching using a bit of PHP and based on a site in an SVN repository. It’s also based around Linux. It could easily be adapted to other scripting languages, operating systems and/or version control systems – these technologies are merely presented here as an example.

To achieve the automated part, we need to run a script on the checked out version of the site prior to its deployment. The script will search the project for URLs (for a specific set of assets) and will rewrite the URL for any that it finds including a unique identifier. In our case, we’ll use the svn info command to find out the last revision the file actually changed in. Another approach would be to simply take a hash of the file contents (md5 would be a good candidate) and use this as its last-changed-identifier.
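If you are not on SVN, the hash approach mentioned above might look something like this. It is a sketch under my own assumptions: the function name content_version and the choice to keep only the first 8 characters are arbitrary.

```php
<?php
// Hypothetical helper: derive a version identifier from the file contents.
// Returns null if the path is not a readable regular file.
function content_version($str_fs_path, $int_length = 8) {
    if (!is_file($str_fs_path)) {
        return null;
    }

    // md5_file returns a 32 character hex string, or false on failure
    $str_hash = md5_file($str_fs_path);

    if ($str_hash === false) {
        return null;
    }

    // a short prefix keeps URLs readable; a longer one lowers the odds of a clash
    return substr($str_hash, 0, $int_length);
}
```

Note that a hash contains letters as well as digits, so a rewrite rule that only accepts a numeric version segment would need to be loosened to accept hexadecimal characters.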

Rather than renaming each file to match the included identifier we set in the URL, we’ll use mod_rewrite within Apache to match a given format of URL back to its original. So myasset.123.png will be transparently mapped back to its original myasset.png filename.
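The mapping itself is just a pattern substitution. The following PHP sketch mirrors it purely for illustration (the function name strip_version is made up; in practice Apache does this work, not PHP).

```php
<?php
// Hypothetical helper mirroring the URL mapping: drop a numeric version
// segment that sits directly before the file extension.
function strip_version($str_url) {
    return preg_replace('/\.[0-9]+(\.[a-zA-Z0-9]+)$/', '$1', $str_url);
}

echo strip_version('myasset.123.png'), "\n"; // myasset.png
echo strip_version('myasset.png'), "\n";     // myasset.png
```

One thing this makes obvious: a file whose real name already ends in a numeric segment before the extension looks versioned to a pattern like this, so it is worth checking your asset names for such cases.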

Here’s a quick script I knocked up in PHP to facilitate this process. It should be run on a checked out working copy. It scans a given directory for files of a given type (in my case, .tpl (HTML templates) and .css files). Within each file it finds, it looks for any assets of a given type referenced in applicable areas (href and src attributes in HTML, url() in CSS). It then converts each URL to a filesystem path and checks the working copy for its existence. If it finds it, the URL is rewritten to include the last modified version number (pulled from svn info). Once this is done we just need to include an Apache mod_rewrite rule as discussed above.
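To give a feel for the scanning step on its own, here is a cut-down sketch that only extracts candidate URLs from a string. The function name and the sample markup are invented for the example, and the patterns are my own tolerant variants (they accept quoted and unquoted url() values) rather than a definitive set.

```php
<?php
// Hypothetical helper: collect URLs from src/href attributes and CSS url() values.
function extract_asset_urls($str_content) {
    $arr_patterns = array(
        '/(?:src|href)="([^"]*)"/i',              // double-quoted src and href attributes
        '/url\(\s*["\']?([^"\')]+)["\']?\s*\)/i', // url(...), with or without quotes
    );

    $arr_urls = array();

    foreach ($arr_patterns as $str_pattern) {
        if (preg_match_all($str_pattern, $str_content, $arr_m)) {
            // index 1 holds the captured URL for each match
            $arr_urls = array_merge($arr_urls, $arr_m[1]);
        }
    }

    return $arr_urls;
}

// Invented sample mixing template markup and a CSS rule
$str_sample = '<link href="/css/site.css"><img src="/images/logo.png">'
            . ' body { background: url("../images/bg.gif"); }';

print_r(extract_asset_urls($str_sample));
```

Each URL found this way still has to be filtered by extension and resolved to a file on disk before it can be versioned.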

The PHP

<?php
 
//
// config
//
$arr_config = array(
 
    // file types to check within for assets to version
    'file_extensions' => array('tpl', 'css'),
 
    // asset extensions to version
    'asset_extensions' => array('jpg', 'jpeg', 'png', 'gif', 'css', 'ico', 'js', 'htc'),
 
    // filesystem path to the webroot of the application (so we can translate
    // relative urls to the actual path on the filesystem)
    'webroot' => dirname(__FILE__) . '/../www',
 
    // regular expressions to match assets
    'regex' => array(
        '/(?:src|href)="(.*)"/iU', // match assets in src and href attributes
        '/url\(\s*["\']?([^"\')]+)["\']?\s*\)/i' // match assets in CSS url() properties (quoted or unquoted)
    )
);
 
//
// arguments
//
 
// we require just one argument, the root path to search for files
if(!isset($_SERVER['argv'][1])) {
    die("Error: first argument must be the path to your working copy\n");
}
 
//
// execute
//
version_assets($_SERVER['argv'][1], $arr_config);
 
 
 
 
/**
 * Checks each file in the passed path recursively to see if there are any assets
 * to version.
 *
 * Only file extensions defined in the config are checked and then only assets matching
 * a particular filetype are versioned.
 *
 * If an asset referenced is not found on the filesystem or is not under version control
 * within the working copy, the asset is ignored and nothing is changed.
 *
 * @param str $str_search_path    Path to begin scanning of files
 * @param arr $arr_config         Configuration params determining which files to check, which
 *                                asset extensions to check etc.
 * @return void
 */
function version_assets($str_search_path, $arr_config) {
 
    // pull in filenames to check
    $arr_files = get_files_recursive($str_search_path, $arr_config['file_extensions']);
 
    foreach($arr_files as $str_file) {
 
        // load the file into memory
        $str_file_content = file_get_contents($str_file);
 
        // look for any matching assets in the regex list defined in the config
        $arr_matches = array();
 
        foreach($arr_config['regex'] as $str_regex) {
 
            if(preg_match_all($str_regex, $str_file_content, $arr_m)) {
                $arr_matches = array_merge($arr_matches, $arr_m[1]);
            }
        }
 
        // filter out any matches that do not have an extension defined in the asset list
        $arr_matches_filtered = array();
 
        foreach($arr_matches as $str_match) {
 
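            $arr_url = parse_url($str_match);

            // skip anything parse_url cannot handle or that has no path (e.g. "#top")
            if(empty($arr_url['path'])) {
                continue;
            }

            $str_asset = $arr_url['path'];

            // the group is closed before "$" so the end anchor applies to every extension
            if(preg_match('/\.(' . implode('|', $arr_config['asset_extensions']) . ')$/i', $str_asset)) {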
                $arr_matches_filtered[] = $str_asset;
            }
        }
 
        // if we've found any matches, process them
        if(count($arr_matches_filtered)) {
 
            // flag to determine if we need to write any changes back once we've processed
            // each match
            $boo_modified_file = false;
 
            foreach($arr_matches_filtered as $str_url_asset) {
 
                // use parse_url to extract just the path
                $arr_parsed = parse_url($str_url_asset);
                $str_url_path = $arr_parsed['path'] . @$arr_parsed['query'] . @$arr_parsed['fragment'];
 
                // if this is a relative url (e.g. beginning ../) then work out the filesystem path
                // based on the location of the file containing the asset ($str_file is already a
                // filesystem path, so the webroot is not prepended here)
                if(strpos($str_url_path, '../') === 0) {
                    $str_fs_path = dirname($str_file) . '/' . $str_url_path;
                }
                else {
                    $str_fs_path = $arr_config['webroot'] . '/' . $str_url_path;
                }
 
                // normalise path with realpath
                $str_fs_path = realpath($str_fs_path);
 
                // only proceed if the file exists
                if($str_fs_path) {
 
                    // execute the svn info command to retrieve the change information
                    $str_svn_result = @shell_exec('svn info ' . escapeshellarg($str_fs_path));
                    $arr_svn_matches = array();
 
                    // extract the last changed revision to use as the version
                    preg_match('/Last Changed Rev: ([0-9]+)/i', $str_svn_result, $arr_svn_matches);
 
                    // only proceed if this file is in version control (e.g. we retrieved a valid match
                    // from the regex above)
                    if(count($arr_svn_matches)) {
 
                        $str_version = $arr_svn_matches[1];
 
                        // add version number into the file url (in the form asset.name.VERSION.ext)
                        $str_versioned_url = preg_replace('/(.*)(\.[a-zA-Z0-9]+)$/', '$1.' . $str_version . '$2', $str_url_asset);
                        $str_file_content = str_replace($str_url_asset, $str_versioned_url, $str_file_content);
 
                        // flag the file as modified so the changes are written back below
                        $boo_modified_file = true;
 
                        echo 'Versioned: [' . $str_url_asset . '] referenced in file: [' . $str_file . ']' . "\n";
                    }
                    else {
                        echo 'Ignored: [' . $str_url_asset . '] referenced in file: [' . $str_file . '] (not versioned)' . "\n";
                    }
                }
                else {
                    echo 'Ignored: [' . $str_url_asset . '] referenced in file: [' . $str_file . '] (not on filesystem)' . "\n";
                }
            }
 
            if($boo_modified_file) {
                echo '-> WRITING: ' . $str_file . "\n";
 
                // write changes to this file back to the file system
                file_put_contents($str_file, $str_file_content);
            }
        }
    }
}
 
/**
 * Utility method to recursively retrieve all files under a given directory. If
 * an optional array of extensions is passed, only these filetypes will be returned.
 *
 * Ignores any svn directories.
 *
 * @param str $str_path_start  Path to begin searching
 * @param mix $mix_extensions  Array of extensions to match or null to match any
 * @return array
 */
function get_files_recursive($str_path_start, $mix_extensions = null) {
 
    $arr_files = array();
 
    if($obj_handle = opendir($str_path_start)) {
 
        // compare strictly against false so an entry named "0" does not end the loop early
        while(false !== ($str_file = readdir($obj_handle))) {
 
            // ignore meta files and svn directories
            if(!in_array($str_file, array('.', '..', '.svn'))) {
 
                // construct full path
                $str_path = $str_path_start . '/' . $str_file;
 
                // if this is a directory, recursively retrieve its children
                if(is_dir($str_path)) {
 
                    $arr_files = array_merge($arr_files, get_files_recursive($str_path, $mix_extensions));
                }
 
                // otherwise add to the list
                else {
 
                    // only add if it's included in the extension list (if applicable)
                    if($mix_extensions == null || preg_match('/.*\.(' . implode('|', $mix_extensions) .')$/Ui', $str_file)) {
                        $arr_files[] = str_replace('//', '/', $str_path);
                    }
                }
            }
        }
 
        closedir($obj_handle);
    }
 
    return $arr_files;
}

This is then executed like so:

php version_assets.php "/path/to/project/checkout"

The Apache config

#
# Rewrite versioned asset urls
#
RewriteEngine on
RewriteRule ^(.+)\.[0-9]+\.(js|css|jpg|jpeg|gif|png|ico|htc)$ $1.$2 [L]
 
#
# Set near indefinite expiry for certain assets
#
<FilesMatch "\.(css|js|jpg|jpeg|png|gif|htc)$">
    ExpiresActive On
    ExpiresDefault "access plus 5 years"
</FilesMatch>

Note: You’ll need the rewrite and expires modules enabled in Apache. This is for Apache 2; the syntax above may be somewhat different for Apache 1.3. On Debian-based systems, you can enable the modules in Apache 2 with:

a2enmod rewrite
a2enmod expires

Done! Now, whenever the site is deployed, only changed assets will be downloaded. Fast, efficient and headache free. Well, unless…

Caveats

The above script is purely to illustrate the process. Your specific needs may well need a slightly different approach. For example, there may be other areas it needs to look for URLs. If you do a lot of dynamic construction of URLs or funky script includes with JavaScript, you may need a secondary deployment script or procedure in order to accommodate such features. Using this technique, you must be careful to add the unique version to all the file types looked for in the deployment script, otherwise you’re telling the browser to cache a file indefinitely without the URL changing on new versions being deployed.

Another area to watch out for would be if you serve assets from different domains. Again, this technique will work in principle but will need some modification. It’s an exercise left to you, dear reader.

So, there you have it. A reasonably hassle free, efficient and optimised caching policy for your web applications. I hope you find this helpful – good luck.