One of the lesser understood aspects of what you can do with massive stockpiles of data is the ability to use data that would traditionally have been overlooked or in some cases even considered rubbish.
This whole new category of data is known as “exhaust” data – data generated as a byproduct of some other process.
Much financial market data is a result of two parties agreeing on a price for the sale of an asset. The record of the price of the sale at that instant becomes a form of exhaust data. Not that long ago, this kind of data wasn’t of much interest, except to economic historians and regulators.
A massive moment-by-moment archive of sale prices for shares and other securities is now key to many major banks and hedge funds as a “training ground” for their machine-learning algorithms. Their trading engines “learn” from that history and this learning now powers much of the world’s trading.
Traditional transactions such as house price sales history or share trading archives are one form of time-series data, but many other less conventional measures are being collected and traded too.
There are also other categories of unconventional data that are not time-series-based. For example, network data outlines relationships and other signals from social networks, geospatial data lends itself to mapping, and survey data concerns itself with people’s viewpoints. Time series or longitudinal data is, however, the most common form and the easiest to integrate with other time-series data.
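To illustrate why time-series data is comparatively easy to integrate, here is a minimal sketch that lines up two series on their shared dates. The function name `align_by_date` and all of the figures are made up for illustration; real data sets would also need decisions about missing days, time zones and reporting lags.

```python
from datetime import date

def align_by_date(series_a, series_b):
    """Return (date, a_value, b_value) rows for dates present in both series."""
    shared_dates = sorted(series_a.keys() & series_b.keys())
    return [(d, series_a[d], series_b[d]) for d in shared_dates]

# Hypothetical daily share prices and store foot-traffic counts
share_price = {
    date(2016, 1, 4): 10.2,
    date(2016, 1, 5): 10.4,
    date(2016, 1, 6): 10.1,
}
foot_traffic = {
    date(2016, 1, 5): 1350,
    date(2016, 1, 6): 1100,
    date(2016, 1, 7): 1600,
}

aligned = align_by_date(share_price, foot_traffic)
for day, price, visitors in aligned:
    print(day.isoformat(), price, visitors)
```

Only the dates that appear in both series survive the join, which is the simplest possible policy. Network or survey data has no such obvious shared key, which is part of why it is harder to combine.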
Consistent Longitudinal Unconventional Exhaust Data or CLUE data sets, as I’m calling them, are many, varied and growing. They include:
- foot traffic data
- consumer spending data
- satellite imaging data
- ecommerce parcel flow data
- technology usage data
- employee satisfaction data.
Say, for example, you are interested in the seasonal profitability of supermarkets over time. Foot traffic data may not be the cause of profitability, as more store visitors don’t necessarily translate directly into profit or even sales. But foot traffic may be statistically related to sales volume and so may be one useful clue, just as body temperature is one useful signal of a person’s overall well-being. And when combined with massive amounts of other signals using data analytics techniques, this can provide valuable new insights.
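As a rough sketch of what “statistically related” can mean in practice, the snippet below computes a Pearson correlation between weekly foot traffic and weekly sales. The numbers are invented for a hypothetical store, and a high correlation here would say nothing about cause or about profit.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Made-up weekly figures for one hypothetical store
foot_traffic = [1200, 1350, 1100, 1600, 1750, 1400]   # visitors per week
weekly_sales = [30.1, 33.8, 29.0, 38.5, 41.2, 34.0]   # sales in $ thousands

r = pearson(foot_traffic, weekly_sales)
print(f"correlation between foot traffic and sales: {r:.2f}")
```

A value near 1 would suggest the two series move together, near 0 that they are unrelated, and near -1 that they move in opposite directions. In practice an analyst would test many such signals together rather than rely on one.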
Rise of ‘Quantamental’ investment funds
Leading asset manager BlackRock, for example, is using satellite images of China taken every five minutes to better understand industrial activity and to give it an independent reading on reported data.
Traditionally, there have been two main types of actors in the financial world – traders (including high-frequency traders), who look to make money from massive volumes on many small transactions, and investors, who look to make money from a smaller number of larger bets over a longer time. Investors tend to care more about the underlying assets involved. In the case of company stocks, that usually means trying to understand the underlying or fundamental value of the company and future prospects based on its sales, costs, assets and liabilities and so on.
A new type of fund is emerging that combines the speed and computational power of computer-based Quants with the fundamental analysis used by investors: Quantamental. These funds use advanced machine learning combined with a huge variety of conventional and unconventional data sources to predict the fundamental value of assets and mismatches in the market.
Some of these new-style funds, including Two Sigma in New York and Winton Capital in London, have been spectacularly successful. Winton was founded in 1997 by David Harding, a physics graduate from Cambridge University. After less than two decades it ranks in the [top ten hedge funds worldwide](http://www.relbanks.com/rankings/top-hedge-funds) with US$33 billion in assets under advice and more than 400 people, many with PhDs in physics, maths and computer science. Not far behind and with US$30 billion in assets, Two Sigma also glistens with top tech talent.
New ones are emerging too, including Taaffeite Capital Management, run by computational biologist and University of Melbourne alumnus Professor Desmond Lun. Understanding the complex data dynamics of many areas of natural science, including biology and ecology, is turning out to be excellent training for understanding financial market dynamics.
Weird data for all
But it’s not only the world’s top hedge funds that can use, or are using, alternative data. A number of startups are on a mission to democratise access to new sources. Michael Babineau, co-founder and CEO of Bay Area startup Second Measure, aims to offer a Bloomberg-terminal-like approach to consumer purchase data. This will transform massive amounts of inscrutable text in card statements into more structured data, thus making it accessible and useful to a wide business and investor audience.
Other companies, like Mattermark in San Francisco and CB Insights in New York, are intelligence services that provide fascinating and valuable data insights into company “signals”. These can be indicators and potential predictors of success, especially in the high-stakes game of technology venture capital investment.
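To give a flavour of what turning statement text into structured data might involve, here is a minimal sketch. It is not Second Measure’s method, which the article does not describe. The line layout, the `parse_line` function, the `Purchase` record and the merchant alias table are all hypothetical.

```python
import re
from typing import NamedTuple, Optional

class Purchase(NamedTuple):
    merchant: str
    store_id: str
    location: str
    amount: float

# Assumed layout: "<MERCHANT> #<store id> <LOCATION> <amount>"
LINE_PATTERN = re.compile(
    r"^(?P<merchant>.+?)\s+#(?P<store>\d+)\s+(?P<location>.+?)\s+(?P<amount>\d+\.\d{2})$"
)

# Hypothetical mapping from abbreviated statement text to a readable name
MERCHANT_ALIASES = {"ACME GROC": "Acme Grocery"}

def parse_line(line: str) -> Optional[Purchase]:
    """Parse one statement line, or return None if it doesn't fit the layout."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    raw_merchant = match.group("merchant")
    return Purchase(
        merchant=MERCHANT_ALIASES.get(raw_merchant, raw_merchant),
        store_id=match.group("store"),
        location=match.group("location"),
        amount=float(match.group("amount")),
    )

purchase = parse_line("ACME GROC #0042 SPRINGFIELD IL 54.17")
print(purchase)
```

Real card statements are far messier than this single assumed layout, so a production system would need many patterns and a much larger alias table.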
Akin to Adrian Holovaty’s pioneering work a decade ago mapping crime and many other statistics in Chicago online, Microburbs in Sydney provides a granular array of detailed data points on residential locations around Australia. It allows potential residents and investors to compare schooling, restaurants and many other amenities in very specific neighbourhoods within suburbs.
We Feel, designed by CSIRO researcher Dr Cecile Paris, is an extraordinary data project that explores whether social media – specifically Twitter – can provide an accurate, real-time signal of the world’s emotional state.
Weird small data has its benefits
More than simply pop-economics, Freakonomics (2005) showed how unusual yet good-quality data sources can be valuable in creating insights. Assiduous record-keeping of the accounts of an honesty system cookie jar in an office place revealed that people stole most during certain holidays (perhaps due to increased financial and mental stress at these times); access to drug gangster book-keeping accounts explained why many drug dealers live with their grandparents (they are too poor to move out); and massive public school records from Chicago showed parental attention to be a key factor in students’ academic success.
Many of the examples in Freakonomics were based on small quirky data samples. However, as many academics are aware, studies with small samples can present several problems. There’s the question of sampling: whether the sample is large enough to be robust and whether it’s a random selection of the population the study aims to understand.
Then there’s the problem of errors. While one could expect fewer errors with smaller sample sizes, a recent meta-study of academic psychology papers found half the papers tested showed significant data inconsistencies and errors. In a small number of cases this may be due to authors fudging the results, whereas in others it may be due to transcription or other simple mistakes.
Weird data is getting easier to find
More and more large-scale unconventional data collections are becoming readily available. There are three blast furnaces driving their proliferation:
- the interaction furnace: our own growing interactions with the web and web services (ecommerce, web mail, social media etc)
- the transaction furnace: the increasingly online ledger of commerce
- the automation furnace: an explosion of web-connected sensors.
While large data collections can’t help with avoiding fabrication, they can sometimes help with sample size and representation issues. When combined with machine learning they can:
- provide accurate insights from incomplete, noisy and even partially erroneous data
- offer associations, patterns and connections — blindly with no a priori assumptions
- help eliminate bias — by invoking multiple perspectives.
What can we expect from more clues?
We may see unexpected results and be surprised at how predictable many factors, such as social and personal information, turn out to be from unexpected data signals. Michal Kosinski and his colleagues showed the predictive power of social media data in the analysis they published in PNAS in 2013. They demonstrated that highly personal traits such as religion, politics and even whether your parents were together when you were 21 were highly predictable using Facebook likes alone.
We will see a plethora of applications emerge that take advantage of processing unconventional data sources. One rich area is biometrics. Australian tech startup Brain Gauge has shown that people’s voices can be used as a signal of cognitive load, and used for real-time detection of stress levels and for reducing absenteeism in call-centre staff, for example.
We can also expect to see a lot more meta-analysis of communities, populations and industries. Increasingly ambitious studies are now possible that combine and link massive, often disparate data sets together to yield new insights into economics, law, health and many other areas of research. One example is the recent meta-study published in the Journal of the American Medical Association that combined nine other studies and found that walking speed in older adults is indeed a predictor of longevity.
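One common way to combine results across studies is inverse-variance weighting, where more precise studies count for more. The sketch below shows that general technique only. It is not a reconstruction of the JAMA analysis, and the function name `pooled_estimate` and the study figures are invented.

```python
from math import sqrt

def pooled_estimate(estimates, std_errors):
    """Fixed-effect pooled estimate and its standard error,
    weighting each study by 1 / (standard error squared)."""
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need one standard error per estimate")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    return pooled, pooled_se

# Made-up effect sizes and standard errors from three hypothetical studies
effects = [0.30, 0.45, 0.38]
errors = [0.10, 0.20, 0.05]

estimate, estimate_se = pooled_estimate(effects, errors)
print(f"pooled effect: {estimate:.3f} (standard error {estimate_se:.3f})")
```

The pooled standard error is smaller than that of any single study, which is the statistical payoff of linking data sets together. It also assumes the studies measure the same underlying effect, which is not always a safe assumption.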