Throttling uploads on Linux

I have recently been developing some fancy AJAX upload progress meters for a project I’m working on. This is using the new(ish) hooks in PHP which, when coupled with an extension such as APC, allow for polling of the upload progress as a file uploads in a standard HTML form.
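For context, the polling side of this typically looks something like the sketch below. It assumes APC's upload hooks are enabled (apc.rfc1867 = 1 in php.ini) and that the upload form contains a hidden APC_UPLOAD_PROGRESS field, placed before the file input, holding a unique key; the endpoint name and JSON shape here are my own invention:

<?php
// progress.php - polled via AJAX while the upload is in flight
// (hypothetical endpoint; assumes apc.rfc1867 = 1)

// the value of the hidden APC_UPLOAD_PROGRESS form field
$str_key = $_GET['key'];

// APC stores the upload status under "upload_" plus the form key
// (the prefix is configurable via apc.rfc1867_prefix)
$arr_status = apc_fetch('upload_' . $str_key);

if($arr_status === false) {
    // no data yet - the upload hasn't started (or the key is wrong)
    echo json_encode(array('percent' => 0));
}
else {
    // 'current' and 'total' are byte counts maintained by APC as the file arrives
    echo json_encode(array(
        'percent' => round($arr_status['current'] / max(1, $arr_status['total']) * 100),
        'done'    => !empty($arr_status['done'])
    ));
}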

Developing on a local server, however, means that file uploads are near instantaneous which makes testing… problematic. How best to simulate a real user’s experience?

My first instinct was to see if there were any suitable modules for Apache to enable bandwidth throttling. Apache 1.3 has mod_throttle which seems to be up to the task but I’m using Apache 2 and I don’t believe mod_throttle has been ported yet.

There also seem to be some extensions for Firefox which enable bandwidth limiting but these, unfortunately, are written for Windows environments.

The solution? trickle. Trickle is a portable lightweight userspace bandwidth shaper. It allows bandwidth limiting on a per-program basis and can be simply called with the executable as one of its parameters. Even better, it’s available in the Ubuntu repositories:

apt-get install trickle

So, to restrict Firefox’s upload bandwidth we can run the following command:

trickle -s -d 1000 -u 10 firefox

This limits Firefox’s upload rate to 10KB/s. Perfect for testing form uploads.

Note: The -d flag shouldn’t be necessary (according to the docs) but without this arbitrarily high setting, download bandwidth seems to be hampered. The -s flag merely instructs trickle to run in standalone mode (as opposed to running through the trickle daemon).
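Incidentally, the same trick works for any dynamically linked program, so one-off test uploads can be throttled too. For example, a multipart form upload via curl (the URL and field name here are illustrative):

trickle -s -u 10 curl -F "userfile=@testfile.bin" http://localhost/upload.php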

Efficient caching of versioned JavaScript, CSS and image assets for fun and profit

“The new image is showing but I think it’s using the old stylesheet!”

Sound familiar?

Caching?

Caching of a web page’s assets such as CSS and image files can be a double-edged sword. On the one hand, if done right, it can lead to much faster load times with less strain on the server. If done incorrectly, or worse not even considered, developers are opening themselves up to all kinds of synchronisation issues whenever files are modified.

In a typical web application, certain assets rarely change. Common theme images and JavaScript libraries are a good example of this. On the other hand, CSS files and the site’s core JavaScript functionality are prime candidates for frequent change, but this is not an exact science and is generally impossible to predict.

Caching of assets is the browser’s default behaviour. If an expiry time is not specifically set, it is up to the browser to decide how long to wait before checking the server for a new version. Once a file is in a browser’s cache you’re at the mercy of the browser as to when the user will see the new version. Minutes? Hours? Days? Who knows. Your only option is to rename the asset in order to force the new version to be fetched.

So caching is evil, right? Well, no. With a little forethought, caching is your friend. And the user’s friend. And the web server’s friend. Treated right, it’s the life of the party.

Imagine your site is deployed once and nothing changes for eternity. The optimal caching strategy here is to instruct the browser to cache everything indefinitely. This means that, after the first visit, a user may never have to contact the server again. Load times are speedy. Your server’s relaxed. All is well. The problem, of course, is that any changes you do inevitably make will never be shown to users who have the site in their cache. At least, not without renaming the changed asset so the browser considers it a new file.

So the problem is that we want the browser to cache everything forever. Unless we change something. And we want the browser to know when we do this. Without asking us. And it’d be nice if this was automated. Ideas?

Option One – Set an expiry date in the past for all assets

Never cache anything!

Not really an option, but it does solve half of the problem. The browser will never cache anything and so the user will always see the latest version of all site assets. It works, but we’re completely missing out on one of the main benefits of caching – faster loading for the user and less stress on the server.
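In Apache, this anti-strategy might look something like the following (a sketch using mod_headers, which isn’t otherwise used in this article):

<FilesMatch "\.(css|js|jpg|jpeg|png|gif)$">
    Header set Cache-Control "no-cache, no-store, must-revalidate"
</FilesMatch>

Next.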

Option Two – Include a site version string in every URL

One commonly used strategy is to include in every URL a unique identifier which changes whenever the site is deployed. For example, an image at the following URL:

/images/logo.png

Would become:

/images/logo.82.png

Here, 82 is a unique identifier. With some Apache mod_rewrite trickery, we can transparently map this to the original URL. As far as the browser is concerned, this is a different file to the previous logo.81.png image and so any existing cache of this file is ignored.

Generally, this technique is employed in a semi-automated way. The version number can either be set manually in a configuration file (for example) or pulled from the repository version number. With this technique, all assets can be set to cache indefinitely.
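As a sketch, the rewriting itself is trivial. A hypothetical helper (the function name is my own) might look like this:

// insert a single site-wide version number into an asset URL
// e.g. /images/logo.png -> /images/logo.82.png
function versioned_url($str_url, $str_site_version) {
    return preg_replace('/(\.[a-z0-9]+)$/i', '.' . $str_site_version . '$1', $str_url);
}

echo versioned_url('/images/logo.png', '82'); // /images/logo.82.png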

The above is a pretty good solution. I’ve used it myself. But it’s not optimal. Every time a new version of the site is deployed, any assets in the user’s cache are invalidated. The whole site needs to be downloaded again. If site updates are infrequent, this isn’t too much of a problem. It sure as hell beats never caching anything or, worse, leaving the browser to decide how long to cache each item.

Option Three – Fine grained caching + Automated!

Clearly, the solution is to include a unique version string per file. This means that every file is considered independently and will only be re-downloaded if it has actually changed. One technique for doing this is to use the file’s last-modified timestamp. This gives a unique ID for the file which will change every time the file contents change. If the file is under version control (your projects are versioned, right?) we can’t use the modified timestamp as-is since it will change whenever the file is checked out. But we can find out which revision the file last changed in (under SVN at least) so we’re still good to go.

The goal is as follows: To instruct the browser to cache all assets (in this case, JavaScript, CSS and all image files) indefinitely. Whenever an asset changes, we want the URL to also change. The result of this is that whenever we deploy a new version of the site, only assets that have actually changed will be given a new URL. So if you’ve only changed one CSS file and a couple of images, repeat visits to the site will only need to re-download these files. We’d also like it to be automated. Only a masochist would attempt to manually change URLs whenever something changes on any sufficiently complex site.

Presented here is an automated solution for efficient caching using a bit of PHP and based on a site in an SVN repository. It’s also based around Linux. It could easily be adapted to other scripting languages, operating systems and/or version control systems – these technologies are merely presented here as an example.

To achieve the automated part, we need to run a script on the checked out version of the site prior to its deployment. The script will search the project for URLs (for a specific set of assets) and will rewrite any URL it finds to include a unique identifier. In our case, we’ll use the svn info command to find out the last revision in which the file actually changed. Another approach would be to simply take a hash of the file contents (md5 would be a good candidate) and use this as its last-changed identifier.
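If you prefer the hash approach, the svn info lookup in the script below could be swapped for something along these lines (a sketch; $str_fs_path is the resolved filesystem path, as used in the script):

// derive the identifier from the file contents rather than svn metadata;
// a short prefix of the digest is plenty for cache busting
$str_version = substr(md5_file($str_fs_path), 0, 8);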

Rather than renaming each file to match the included identifier we set in the URL, we’ll use mod_rewrite within Apache to match a given format of URL back to its original. So myasset.123.png will be transparently mapped back to its original myasset.png filename.

Here’s a quick script I knocked up in PHP to facilitate this process. It should be run on a checked out working copy. It scans a given directory for files of a given type (in my case, “.tpl” (HTML templates) and .css files). Within each file it finds, it looks for any assets of a given type referenced in applicable areas (href and src attributes in HTML, url() in CSS). It then converts each URL to a filesystem path and checks the working copy for its existence. If it finds it, the URL is rewritten to include the last-changed revision number (pulled from svn info). Once this is done we just need to include an Apache mod_rewrite rule as discussed above.

The PHP

<?php
 
//
// config
//
$arr_config = array(
 
    // file types to check within for assets to version
    'file_extensions' => array('tpl', 'css'),
 
    // asset extensions to version
    'asset_extensions' => array('jpg', 'jpeg', 'png', 'gif', 'css', 'ico', 'js', 'htc'),
 
    // filesystem path to the webroot of the application (so we can translate
    // relative urls to the actual path on the filesystem)
    'webroot' => dirname(__FILE__) . '/../www',
 
    // regular expressions to match assets
    'regex' => array(
        '/(?:src|href)="(.*)"/iU', // match assets in src and href attributes
        '/url\((.*)\)/iU'          // match assets in CSS url() properties
    )
);
 
//
// arguments
//
 
// we require just one argument, the root path to search for files
if(!isset($_SERVER['argv'][1])) {
    die("Error: first argument must be the path to your working copy\n");
}
 
//
// execute
//
version_assets($_SERVER['argv'][1], $arr_config);
 
 
 
 
/**
 * Checks each file in the passed path recursively to see if there are any assets
 * to version.
 *
 * Only file extensions defined in the config are checked and then only assets matching
 * a particular filetype are versioned.
 *
 * If an asset referenced is not found on the filesystem or is not under version control
 * within the working copy, the asset is ignored and nothing is changed.
 *
 * @param str $str_search_path    Path to begin scanning of files
 * @param arr $arr_config         Configuration params determining which files to check, which
 *                                asset extensions to check etc.
 * @return void
 */
function version_assets($str_search_path, $arr_config) {
 
    // pull in filenames to check
    $arr_files = get_files_recursive($str_search_path, $arr_config['file_extensions']);
 
    foreach($arr_files as $str_file) {
 
        // load the file into memory
        $str_file_content = file_get_contents($str_file);
 
        // look for any matching assets in the regex list defined in the config
        $arr_matches = array();
 
        foreach($arr_config['regex'] as $str_regex) {
 
            if(preg_match_all($str_regex, $str_file_content, $arr_m)) {
                $arr_matches = array_merge($arr_matches, $arr_m[1]);
            }
        }
 
        // filter out any matches that do not have an extension defined in the asset list
        $arr_matches_filtered = array();
 
        foreach($arr_matches as $str_match) {
 
            $arr_url = parse_url($str_match);
            $str_asset = $arr_url['path'];
 
            if(preg_match('/\.(' . implode('|', $arr_config['asset_extensions']) . ')$/i', $str_asset)) {
                $arr_matches_filtered[] = $str_asset;
            }
        }
 
        // if we've found any matches, process them
        if(count($arr_matches_filtered)) {
 
            // flag to determine if we need to write any changes back once we've processed
            // each match
            $boo_modified_file = false;
 
            foreach($arr_matches_filtered as $str_url_asset) {
 
                // use parse_url to extract just the path
                $arr_parsed = parse_url($str_url_asset);
                $str_url_path = $arr_parsed['path'] . @$arr_parsed['query'] . @$arr_parsed['fragment'];
 
                // if this is a relative url (e.g. beginning ../) then work out the filesystem path
                // based on the location of the file containing the asset
                if(strpos($str_url_path, '../') === 0) {
                    $str_fs_path = $arr_config['webroot'] . '/' . dirname($str_file) . '/' . $str_url_path;
                }
                else {
                    $str_fs_path = $arr_config['webroot'] . '/' . $str_url_path;
                }
 
                // normalise path with realpath
                $str_fs_path = realpath($str_fs_path);
 
                // only proceed if the file exists
                if($str_fs_path) {
 
                    // execute the svn info command to retrieve the change information
                    $str_svn_result = @shell_exec('svn info ' . escapeshellarg($str_fs_path));
                    $arr_svn_matches = array();
 
                    // extract the last changed revision to use as the version
                    preg_match('/Last Changed Rev: ([0-9]+)/i', $str_svn_result, $arr_svn_matches);
 
                    // only proceed if this file is in version control (e.g. we retrieved a valid match
                    // from the regex above)
                    if(count($arr_svn_matches)) {
 
                        $str_version = $arr_svn_matches[1];
 
                        // add version number into the file url (in the form asset.name.VERSION.ext)
                        $str_versioned_url = preg_replace('/(.*)(\.[a-zA-Z0-9]+)$/', '$1.' . $str_version . '$2', $str_url_asset);
                        $str_file_content = str_replace($str_url_asset, $str_versioned_url, $str_file_content);
 
                        // flag as modified so the changes are written back to disk below
                        $boo_modified_file = true;
 
                        echo 'Versioned: [' . $str_url_asset . '] referenced in file: [' . $str_file . ']' . "\n";
                    }
                    else {
                        echo 'Ignored: [' . $str_url_asset . '] referenced in file: [' . $str_file . '] (not versioned)' . "\n";
                    }
                }
                else {
                    echo 'Ignored: [' . $str_url_asset . '] referenced in file: [' . $str_file . '] (not on filesystem)' . "\n";
                }
            }
 
            if($boo_modified_file) {
                echo '-> WRITING: ' . $str_file . "\n";
 
                // write changes to this file back to the file system
                file_put_contents($str_file, $str_file_content);
            }
        }
    }
}
 
/**
 * Utility method to recursively retrieve all files under a given directory. If
 * an optional array of extensions is passed, only these filetypes will be returned.
 *
 * Ignores any svn directories.
 *
 * @param str $str_path_start  Path to begin searching
 * @param mix $mix_extensions  Array of extensions to match or null to match any
 * @return array
 */
function get_files_recursive($str_path_start, $mix_extensions = null) {
 
    $arr_files = array();
 
    if($obj_handle = opendir($str_path_start)) {
 
        while($str_file = readdir($obj_handle)) {
 
            // ignore meta files and svn directories
            if(!in_array($str_file, array('.', '..', '.svn'))) {
 
                // construct full path
                $str_path = $str_path_start . '/' . $str_file;
 
                // if this is a directory, recursively retrieve its children
                if(is_dir($str_path)) {
 
                    $arr_files = array_merge($arr_files, get_files_recursive($str_path, $mix_extensions));
                }
 
                // otherwise add to the list
                else {
 
                    // only add if it's included in the extension list (if applicable)
                    if($mix_extensions == null || preg_match('/.*\.(' . implode('|', $mix_extensions) .')$/Ui', $str_file)) {
                        $arr_files[] = str_replace('//', '/', $str_path);
                    }
                }
            }
        }
 
        closedir($obj_handle);
    }
 
    return $arr_files;
}

This is then executed like so:

php version_assets.php "/path/to/project/checkout"
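Output looks something like this (the paths and assets shown are purely illustrative):

Versioned: [../images/logo.png] referenced in file: [/path/to/project/checkout/templates/header.tpl]
Ignored: [/images/banner.png] referenced in file: [/path/to/project/checkout/templates/home.tpl] (not on filesystem)
-> WRITING: /path/to/project/checkout/templates/header.tpl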

The Apache config

#
# Rewrite versioned asset urls
#
RewriteEngine on
RewriteRule ^(.+)(\.[0-9]+)\.(js|css|jpg|jpeg|gif|png)$ $1.$3 [L]
 
#
# Set near indefinite expiry for certain assets
#
<FilesMatch "\.(css|js|jpg|jpeg|png|gif|htc)$">
    ExpiresActive On
    ExpiresDefault "access plus 5 years"
</FilesMatch>

Note: You’ll need the rewrite and expires modules enabled in Apache. This is for Apache 2. The syntax above may be somewhat different for Apache 1.3. To enable the modules in Apache 2 you can simply use:

a2enmod rewrite
a2enmod expires

Done! Now, whenever the site is deployed, only changed assets will be downloaded. Fast, efficient and headache free. Well, unless…

Caveats

The above script is purely to illustrate the process. Your specific needs may well require a slightly different approach. For example, there may be other areas it needs to look for URLs in. If you do a lot of dynamic construction of URLs or funky script includes with JavaScript, you may need a secondary deployment script or procedure to accommodate such features. Whichever technique you use, be careful to add the unique version to all the file types the deployment script looks for, otherwise you’re telling the browser to cache a file indefinitely without the URL changing when new versions are deployed.

Another area to watch out for would be if you serve assets from different domains. Again, this technique will work in principle but will need some modification. It’s an exercise left to you, dear reader.

So, there you have it. A reasonably hassle free, efficient and optimised caching policy for your web applications. I hope you find this helpful – good luck.

Screen scraping sales from Createspace with Zend_Http_Client

More and more of our data is hidden behind login forms in online apps. When this data updates frequently, and the site provides no API to access the information, keeping on top of it can be a laborious task.

One such example is Createspace. Createspace are a company who provide on-demand manufacturing for products such as books, DVDs and CDs. This allows individuals and smaller publishers to get their products to market without investing in heavy up-front printing costs. Any orders for the product go directly to Createspace, who manufacture and ship the product and finally allocate the profit to the seller.

I have recently been involved in helping to get a book to market and am using Createspace’s services to produce it. Keeping track of sales, however, is time consuming: it means logging in to Createspace each time and navigating to the relevant area to retrieve the data. The book is also being produced by another company in the UK with a similar setup, so twice the time is now required whenever I wish to check for sales.

So what to do with no API? No real choice but to screen scrape. Presented here is a quick script I knocked up using PHP and the Zend Framework to scrape sales data from Createspace. Whilst the implementation is Createspace specific, the general process is not, so I hope this will give some pointers for similar tasks.

To do this we’re using Zend_Http_Client from the Zend Framework. This offers similar functionality to PHP’s basic cURL extension but with a nicer (IMHO) API. The basic (generic) steps required are:

  1. Post required credential details to the application login URL
  2. Store any authentication details (likely a session cookie) sent back from the process
  3. Use the obtained credentials to retrieve the page we wish to scrape the data from
  4. Sprinkle some regex magic on the retrieved HTML to extract the figures we require

Here’s the script:

    //
    // General config
    //
 
    // revenue per product (so we can calculate totals)
    define('REVENUE_PER_PRODUCT', 100.00);
 
    // login url and credentials
    define('CREATESPACE_LOGIN_URL', 'https://www.createspace.com/LoginProc.do');
    define('CREATESPACE_LOGIN_EMAIL', 'email');
    define('CREATESPACE_LOGIN_PASSWORD', 'password');
 
    // reports url
    define('CREATESPACE_REPORTS_URL', 'https://www.createspace.com/Member/Report/MemberReport.do');
 
    //
    // Retrieve CreateSpace sales data
    //
    $obj_client = new Zend_Http_Client(CREATESPACE_LOGIN_URL);
 
    // fake the useragent in the request to make it look more authentic
    $obj_client->setConfig(array(
        'useragent' => 'Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.0.6) Gecko/2009020911 Ubuntu/8.10 (intrepid) Firefox/3.0.6'
    ));

    // we want to retrieve any cookies posted back to us to use in the next step
    $obj_client->setCookieJar();

    // login parameters (entries to the login form fields)
    $obj_client->setParameterPost(array(
       'login' => CREATESPACE_LOGIN_EMAIL,
       'password' => CREATESPACE_LOGIN_PASSWORD,
       'action' => 'Log In'
    ));

    // send the POST data
    $obj_client->request('POST');

    // we're now "logged in" so we can retrieve the reports page
    $obj_client->setUri(CREATESPACE_REPORTS_URL);
    $obj_client->request('GET');

    // extract the content from the request
    $str_page = $obj_client->getLastResponse()->getBody();
 
    //
    // Now we have the raw report HTML, it's simply a case of extracting the sales figures
    //
 
    // first grab the table data rows (tbody)
    preg_match('/<tbody>(.*?)<\/tbody><\/table>/is', $str_page, $arr_matches);
    $str_table_body = $arr_matches[1];

    // then extract each row's data (the date from the first cell, the volume from the last)
    preg_match_all('/<tr[^>]*>.*?<td[^>]*>(.*?)<\/td>.*?<td[^>]*>-<\/td>.*?<td[^>]*>-<\/td>.*?<td[^>]*>-<\/td>.*?<td[^>]*>(.*?)<\/td>/is', $str_table_body, $arr_matches, PREG_SET_ORDER);

    // merge into a more sane array, indexed by date in the form Ym (e.g. 200901 for January, 2009)
    $arr_data = array();

    foreach($arr_matches as $arr_match) {

        $str_date = date('Ym', strtotime($arr_match[1]));

        if(!isset($arr_data[$str_date])) {
            $arr_data[$str_date] = array();
        }

        $int_volume = (int)$arr_match[2];
        $arr_data[$str_date] = array('volume' => $int_volume, 'revenue' => $int_volume * REVENUE_PER_PRODUCT);
    }

    // $arr_data now contains sales data (volume and revenue) for each month found
    // in the sales table, indexed by the month
    print_r($arr_data);
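The resulting structure looks something like this (the figures are invented; revenue is simply volume multiplied by the 100.00 set in REVENUE_PER_PRODUCT):

Array
(
    [200901] => Array
        (
            [volume] => 3
            [revenue] => 300
        )

    [200902] => Array
        (
            [volume] => 5
            [revenue] => 500
        )

)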

A few things to note. Firstly, we’re faking the useragent to a generic “real looking” example (as opposed to the default “Zend_Http_Client”). Morally we’re doing nothing wrong here but I suspect “automated crawling” is frowned upon in the T&Cs somewhere so best not to make it too obvious.

It should also be mentioned that this method (like all screen scraping) is vulnerable to breaking if Createspace change their login system or HTML structure. There are certainly cleverer parsing methods which are more adaptable to change, but only up to a point: if things change dramatically, there’s not a lot you can do except adapt the script to accommodate.
