More and more of our data is hidden behind login forms in online apps. When this data updates frequently, and the site provides no API to access the information, keeping on top of it can be a laborious task.

One such example is Createspace. Createspace are a company who provide produce-on-demand manufacturing for products such as books, DVDs and CDs. This allows individuals and smaller publishers to get their products to market without heavy up-front printing costs. Any orders for the product go directly to Createspace; they manufacture and ship the product and finally allocate the profit to the seller.

I have recently been involved in helping to get a book to market and am using Createspace’s services to produce the book. Keeping track of sales, however, is time consuming because I have to log in to Createspace each time and navigate to the relevant area to retrieve the data. The book is also being produced by another company in the UK which has a similar setup, meaning twice the time is required each time I wish to check for sales.

So what to do with no API? No real choice but to screen scrape. Presented here is a quick script I knocked up using PHP and the Zend Framework to scrape sales data from Createspace. Whilst the implementation is Createspace specific, the general process is not, so I hope this will give some pointers for similar tasks.

To do this we’re using the Zend_Http_Client from the Zend Framework. This offers similar functionality to the basic PHP cURL extension but with a nicer (IMHO) API. The basic (generic) steps required are:

  1. Post required credential details to the application login URL
  2. Store any authentication details (likely a session cookie) sent back from the process
  3. Use the obtained credentials to retrieve the page we wish to scrape the data from
  4. Sprinkle some regex magic on the retrieved HTML to extract the figures we require

Here’s the script:

    //
    // General config
    //
 
    // Zend Framework must be on the include path
    require_once 'Zend/Http/Client.php';
 
    // revenue per product (so we can calculate totals)
    define('REVENUE_PER_PRODUCT', 100.00);
 
    // login url and credentials
    define('CREATESPACE_LOGIN_URL', 'https://www.createspace.com/LoginProc.do');
    define('CREATESPACE_LOGIN_EMAIL', 'email');
    define('CREATESPACE_LOGIN_PASSWORD', 'password');
 
    // reports url
    define('CREATESPACE_REPORTS_URL', 'https://www.createspace.com/Member/Report/MemberReport.do');
 
    //
    // Retrieve CreateSpace sales data
    //
    $obj_client = new Zend_Http_Client(CREATESPACE_LOGIN_URL);
 
    // fake the useragent in the request to make it look more authentic
    $obj_client->setConfig(array(
        'useragent' => 'Mozilla/5.0 (X11; U; Linux i686; en-GB; rv:1.9.0.6) Gecko/2009020911 Ubuntu/8.10 (intrepid) Firefox/3.0.6'
    ));
 
    // we want to retrieve any cookies posted back to us to use in the next step
    $obj_client->setCookieJar();
 
    // login parameters (entries to the login form fields)
    $obj_client->setParameterPost(array(
       'login' => CREATESPACE_LOGIN_EMAIL,
       'password' => CREATESPACE_LOGIN_PASSWORD,
       'action' => 'Log In'
    ));
 
    // send the POST data
    $obj_client->request('POST');
 
    // we're now "logged in" so we can retrieve the reports page
    $obj_client->setUri(CREATESPACE_REPORTS_URL);
    $obj_client->request('GET');
 
    // extract the content from the request
    $str_page = $obj_client->getLastResponse()->getBody();
 
    //
    // Now we have the raw report HTML, it's simply a case of extracting the sales figures
    //
 
    // first grab the table data rows (tbody)
    preg_match('/<tbody[^>]*>(.*?)<\/tbody>\s*<\/table>/is', $str_page, $arr_matches);
    $str_table_body = $arr_matches[1];
 
    // then extract each row's data: the date cell, three "-" cells, then the volume cell
    preg_match_all(
        '/<tr[^>]*>.*?<td[^>]*>(.*?)<\/td>.*?<td[^>]*>\s*-\s*<\/td>.*?<td[^>]*>\s*-\s*<\/td>.*?<td[^>]*>\s*-\s*<\/td>.*?<td[^>]*>(.*?)<\/td>/is',
        $str_table_body,
        $arr_matches,
        PREG_SET_ORDER
    );
 
    // merge into a more sane array, indexed by date in the form Ym (e.g. 200901 for January, 2009)
    $arr_data = array();
    foreach ($arr_matches as $arr_match) {
        $str_date = date('Ym', strtotime($arr_match[1]));
        if (!isset($arr_data[$str_date])) {
            $arr_data[$str_date] = array();
        }
        $int_volume = (int)$arr_match[2];
        $arr_data[$str_date] = array('volume' => $int_volume, 'revenue' => $int_volume * REVENUE_PER_PRODUCT);
    }
 
    // $arr_data now contains sales data (volume and revenue) for each month found in the sales table,
    // indexed by the month
    print_r($arr_data);
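For comparison, the login and fetch steps (1 to 3 in the list above) could be sketched with the plain cURL extension instead of Zend_Http_Client. The function name `fetch_page_after_login` and its parameters are my own invention for illustration, and I haven't run this against Createspace itself, so treat it as a rough outline rather than a drop-in replacement:

```php
<?php
/**
 * Hypothetical helper: POST a login form, then GET a second page with the
 * cookies the login handed back. Returns the page body, or false on failure.
 */
function fetch_page_after_login($str_login_url, array $arr_login_fields, $str_target_url, $str_useragent)
{
    // temporary file used as the cookie store for both requests
    $str_cookie_file = tempnam(sys_get_temp_dir(), 'cookies');

    $obj_curl = curl_init();
    curl_setopt_array($obj_curl, array(
        CURLOPT_RETURNTRANSFER => true,             // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,             // follow any post-login redirect
        CURLOPT_USERAGENT      => $str_useragent,
        CURLOPT_COOKIEFILE     => $str_cookie_file, // switches on cURL's cookie handling
        CURLOPT_COOKIEJAR      => $str_cookie_file, // where cookies are written on close
    ));

    // steps 1 and 2: post the credentials; the session cookie is kept on the handle
    curl_setopt($obj_curl, CURLOPT_URL, $str_login_url);
    curl_setopt($obj_curl, CURLOPT_POST, true);
    curl_setopt($obj_curl, CURLOPT_POSTFIELDS, http_build_query($arr_login_fields));
    $mix_login = curl_exec($obj_curl);

    // step 3: reuse the same handle (and cookies) to fetch the page we want
    $mix_page = false;
    if ($mix_login !== false) {
        curl_setopt($obj_curl, CURLOPT_URL, $str_target_url);
        curl_setopt($obj_curl, CURLOPT_HTTPGET, true);
        $mix_page = curl_exec($obj_curl);
    }

    curl_close($obj_curl);
    unlink($str_cookie_file);

    return $mix_page;
}
```

It would be called with the same login URL, form fields, reports URL and useragent string used in the script. Note that cURL returning a body doesn't prove the login worked, only that the requests completed.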

A few things to note. Firstly, we’re faking the useragent with a generic “real looking” example (as opposed to the default “Zend_Http_Client”). Morally we’re doing nothing wrong here but I suspect “automated crawling” is frowned upon in the T&Cs somewhere so best not to make it too obvious.

It should also be mentioned that this method (like all screen scraping) is vulnerable to breaking if Createspace change their login system or HTML structure. There are certainly cleverer parsing methods that can be employed which are more adaptable to change but only up to a point. There’s not a lot you can do if things dramatically change except for adapting the script to accommodate.
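As one example of a more adaptable approach, here is a rough sketch that walks the table with PHP's DOM extension and XPath rather than regex. The sample HTML is made up, as is the function name `extract_sales_from_html`; I'm also assuming the date sits in the first cell of each row and the volume in the fifth, so the real report page may well need different column positions.

```php
<?php
/**
 * Hypothetical helper: pull monthly sales out of a report table.
 * Assumes (not confirmed against the real page) that the date is in the
 * first cell of each data row and the sales volume in the fifth.
 */
function extract_sales_from_html($str_html, $flt_revenue_per_product)
{
    $obj_doc = new DOMDocument();

    // real-world HTML is rarely valid, so collect parser warnings quietly
    $bool_previous = libxml_use_internal_errors(true);
    $obj_doc->loadHTML($str_html);
    libxml_clear_errors();
    libxml_use_internal_errors($bool_previous);

    $obj_xpath = new DOMXPath($obj_doc);
    $arr_data = array();

    // every table row that has <td> cells (this skips <th> header rows)
    foreach ($obj_xpath->query('//table//tr[td]') as $obj_row) {
        $obj_cells = $obj_xpath->query('td', $obj_row);
        if ($obj_cells->length < 5) {
            continue; // not a data row in the layout we expect
        }

        $int_time = strtotime(trim($obj_cells->item(0)->textContent));
        if ($int_time === false) {
            continue; // first cell isn't a date we can parse
        }

        $int_volume = (int) trim($obj_cells->item(4)->textContent);
        $arr_data[date('Ym', $int_time)] = array(
            'volume'  => $int_volume,
            'revenue' => $int_volume * $flt_revenue_per_product,
        );
    }

    return $arr_data;
}

// made-up sample markup standing in for the real report page
$str_sample_html =
    '<html><body><table>'
    . '<thead><tr><th>Month</th><th>A</th><th>B</th><th>C</th><th>Units</th></tr></thead>'
    . '<tbody>'
    . '<tr><td>January 2009</td><td>-</td><td>-</td><td>-</td><td>12</td></tr>'
    . '<tr><td>February 2009</td><td>-</td><td>-</td><td>-</td><td>7</td></tr>'
    . '</tbody></table></body></html>';

print_r(extract_sales_from_html($str_sample_html, 100.00));
```

This copes with changes to attributes, whitespace and nesting that would trip up a regex, but it still breaks if the columns are reordered or the table is replaced with something else entirely.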