Building a Second Life Ticker Plant
There is a lot of data available on the web that could be scrubbed and analyzed to give firms competitive advantages. Lars has worked with weather contracts, and I spoke with him about methods of gathering and scrubbing that data. I’ve spoken about Second Life as an emerging marketplace with data that could be scrubbed and analyzed as well. It provides a good starting point to explore how to scrub and analyze online data.
In my post about foreign exchange trading in Second Life, I asked about getting real-time data from Second Life. I mentioned http://secondlife.com/xmlhttp/secondlife.php as a source for some information in XML format that could be used. Yesterday, I wrote about 70 lines of PHP code that retrieves, parses, and stores the Second Life statistics in an SQL database. Below is a slightly geeky description of what I did, which illustrates some of the issues you run into.
The first snippet of code retrieves the data from Second Life in XML format.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://secondlife.com/xmlhttp/secondlife.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the response instead of printing it
$xmlstr = curl_exec($ch);                     // raw XML as a string
curl_close($ch);
Using cURL, the data is retrieved from the Second Life website. The CURLOPT_RETURNTRANSFER option tells cURL to return the result as a string, which gets stored in $xmlstr instead of simply being displayed.
$xml = new SimpleXMLElement($xmlstr);            // parse the XML string
$signups = str_replace(",", "", $xml->signups);  // strip the thousands separators
…
SimpleXMLElement takes the string and puts it into a structure usable by PHP. The str_replace call takes out the commas so the value can be used in numeric calculations. At this point, all that is left is to build an SQL statement to insert the results into a database.
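As a rough sketch of that last step, using mysqli and assuming a table named sl_stats with signups and a timestamp column (the table and column names are hypothetical here, not from my actual script), the insert could look something like this:

$db = new mysqli('localhost', 'user', 'password', 'secondlife');
$stmt = $db->prepare('INSERT INTO sl_stats (signups, recorded_at) VALUES (?, NOW())');
$stmt->bind_param('i', $signups);   // bind the cleaned-up numeric value
$stmt->execute();
$stmt->close();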
I threw all of this into a loop to retrieve the data every ten seconds and store it. However, this raises a bunch of interesting issues.
First, the data isn’t really ‘real time’. Instead, it is pseudo-real-time, pulled every ten seconds. I don’t want to overburden the Second Life server, but I don’t want to miss key data either. Initial analysis of the data showed that the number of people ‘inworld’ updates only about every three minutes. So, instead of checking every 10 seconds, I have backed off to every 30 seconds. I also added a little code to only add records if the data has changed.
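Here is a minimal sketch of that loop, where fetch_stats() and store_stats() are hypothetical wrappers around the retrieval/parsing and SQL code above:

$previous = null;
while (true) {
    $current = fetch_stats();               // pull and parse the XML feed
    if ($current !== null && $current !== $previous) {
        store_stats($current);              // only insert when something changed
        $previous = $current;
    }
    sleep(30);                              // back off to one pull every 30 seconds
}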
However, since this was quick and dirty, I set up a table that stores all of the data retrieved. While the number of people currently logged in changes on average every three minutes, the number of people who have logged in over the past 60 days hasn’t changed over my initial half day of gathering data. This illustrates a few different things. First, you need to be careful about not storing extra data. The next pass of the program will store each data element only when it changes, instead of storing the whole data structure each time any of it changes. In addition, the number of people who have logged in over the past 60 days has most likely changed over the past 12 hours; the data source simply does not appear to be providing up-to-date information. You always need to consider lag in your data source.
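One way to store elements individually, sketched here with a hypothetical sl_observations table holding (field_name, field_value, observed_at) rows, is to compare each field against the last stored value and insert only the ones that changed:

function store_changed_fields(mysqli $db, array $current, array $previous) {
    $stmt = $db->prepare('INSERT INTO sl_observations (field_name, field_value, observed_at) VALUES (?, ?, NOW())');
    foreach ($current as $name => $value) {
        if (!isset($previous[$name]) || $previous[$name] !== $value) {
            $stmt->bind_param('ss', $name, $value);   // one row per changed field
            $stmt->execute();
        }
    }
    $stmt->close();
}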
While my program was quick and dirty, you always need to keep in mind that the data you receive may also be dirty and is going to need some sort of cleansing. My first draft assumed that the numeric fields would always include numeric values. However, in some cases, the data returned the value ‘Loading …’, which threw an exception.
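A small sketch of that cleansing step, with clean_number() as a hypothetical helper rather than anything from the original script:

function clean_number($raw) {
    $value = str_replace(',', '', trim((string) $raw));   // strip commas and whitespace
    if (!is_numeric($value)) {
        return null;                                       // e.g. the feed returned 'Loading …'
    }
    return $value + 0;                                     // numeric string becomes int or float
}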
PHP probably isn’t the best language for building programs like this, but with routines already available to retrieve data from websites, parse XML, and store the results in SQL, quick prototypes can be built. There is considerable data available on the Internet that could be scrubbed and analyzed, and it is much easier to get started than many people may imagine.
(Cross posted at Toomre Capital Markets)