When Small Data beats Big Data

In October 2015, an artificial intelligence system developed by the University of Cincinnati repeatedly and comprehensively beat a retired USAF colonel in an aerial combat simulation.

On the face of it, this comes as no big surprise. Armed with enough computing power, it must be possible to crunch real-time data quicker than a bunch of grey cells. So, like me, you probably assumed that ‘Top Gun’ was blown away by data overload and superior processing firepower.

Well, nothing could be further from the truth. This simulation was fought using surprisingly small amounts of data and equally modest processing power – the equivalent of a Raspberry Pi.

So why did the air ace go down in flames? Well, the simple answer is that he was beaten by superior human logic. The AI system used a technique called ‘Genetic Fuzzy Tree’ – a type of fuzzy logic algorithm that reduces each control decision to a manageable number of sub-decisions. Concentrating only on the variables for each sub-decision, the system deliberately imitated the way humans focus on processing small amounts of data. It just did it better and faster!
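To make the ‘sub-decision’ idea concrete, here is a minimal Python sketch of a fuzzy decision tree. Everything in it (the variable names, membership shapes and thresholds) is invented for illustration and is not taken from the Cincinnati system; it only shows how one big control decision can be reduced to leaf sub-decisions that each look at just a couple of fuzzy variables.

```python
# Toy sketch of the fuzzy-tree idea: one big control decision is split into
# small sub-decisions, each driven by only a couple of fuzzy variables.
# All names, shapes and thresholds here are invented for illustration.

def tri(x, a, b, c):
    """Triangular membership function: 0 at a and c, peaking at 1 at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def threat_level(distance_km, closing_speed_kts):
    """Leaf sub-decision: how threatening is this contact?"""
    near = tri(distance_km, -1, 0, 10)             # 'near' peaks at 0 km
    fast = tri(closing_speed_kts, 0, 500, 1000)    # peak concern at 500 kts
    return min(near, fast)                         # fuzzy AND = min

def evade_decision(distance_km, closing_speed_kts, fuel_fraction):
    """Root decision: combine leaf outputs, again over very few variables."""
    threat = threat_level(distance_km, closing_speed_kts)
    can_manoeuvre = tri(fuel_fraction, 0.1, 1.0, 1.5)
    return "evade" if min(threat, can_manoeuvre) > 0.5 else "hold course"
```

Each leaf processes a tiny slice of the state, which is exactly why the full tree needs so little data and compute.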

But if the future for AI lies in mimicking human reasoning and exploiting small datasets, it raises a fascinating and pretty fundamental question: why are we so preoccupied with Big Data?

The power of small data vs the promise of big data

Big Data is one of the most touted and talked about concepts in technology. But is all the hype and hyperbole justified?

The term was coined about twenty years ago, when large search companies started wrestling with ways to process the huge volumes of data generated by the internet. Driven by the belief – or, to be more honest, the optimistic hope – that the magic of analytics would someday unlock valuable insights, most large companies started creating vast data lakes.

And make no mistake, the theoretical value of these data reservoirs is huge. A 2015 article in Forbes magazine predicted that, for a Fortune 1000 company, a mere 10% increase in data availability could result in $65 million worth of additional net income.

So why are there so few examples of this value actually being realised? Despite the fact that we have created more data in the last two years than in the entire history of the human race, less than 0.5% of this information is ever analysed or exploited.

We have become so focused on capturing vast amounts of historical data that we’ve lost sight of one fundamental flaw in the big data dream.

Humans are not programmed to process huge volumes of information. We are not wired to be Cray super-computers. We need our data sliced-and-diced into bite-sized portions; digestible information that we can use to directly improve our performance.

Now fortunately for us, that’s very good news. Not only do humans digest small data more efficiently but (to stretch the analogy) small data also happens to be very rich in food value. Frequently, the most fruitful insights are drawn from the thinnest, most unpromising datasets.

And I am not just talking about esoteric air-to-air combat simulations; I am referring to real-life projects where small data delivers substantial, actionable insights…

This surprising truth is beautifully illustrated by some work undertaken last year by one of our Clustre member firms…

Case Study – keeping track of rolling assets

Railway engines are extremely costly and complex assets, and several stakeholders are involved in their operation:

  • Train owners, typically finance companies, who bank-roll the locomotives;
  • Train operators who hold the franchises to operate particular UK routes;
  • Third party contractors who service the leased engines;
  • Network Rail who own and manage the tracks.

The net result of this complexity is that the train owners have no direct control over the care and maintenance of their costly assets. They don’t know how many miles their engines are covering, whether the operators are taking good care of them, or even whether the operators are keeping to their contractual terms and conditions. And they have no idea whether their depreciation assumptions are accurate.

The only real data available comes from the locomotives’ on-board fuel-tank sensors and GPS. Every five minutes, these transmit information on changing fuel levels to a central data store, so, in theory, it is possible to track diesel consumption for each engine. There is also some data from Network Rail which reveals when trains depart from and arrive at stations, but this does not identify which locomotive is operating at any particular time or on which specific route.
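As a back-of-envelope check on how little data this really is, the arithmetic below works through one engine’s daily feed. Only the five-minute cadence comes from the description above; the payload size per reading is an assumption.

```python
# Back-of-envelope: why the per-engine feed amounts to only kilobytes.
# Only the five-minute cadence is from the text; the payload size is assumed.
readings_per_day = 24 * 60 // 5      # one fuel/GPS reading every five minutes
bytes_per_reading = 16               # assumed: timestamp + fuel level + GPS fix
daily_volume_kb = readings_per_day * bytes_per_reading / 1024
# 288 readings per day, roughly 4.5 KB per engine per day
```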

Even though the total amount of data available is only a few kilobytes per engine, the team were able to use powerful data analytics and machine learning to generate insights into the condition and use of each engine.

The first task was to clean the data, because the fuel-tank readings were pretty ‘noisy’. The read-out graph of the diminishing diesel levels shows big spikes and troughs caused by fuel surges as the trains corner, accelerate and brake. To remove this noise from the time series, the team used an L1 Kalman filter. Then, using a support vector machine, they started to look for recognisable patterns. Focusing on small datasets, they were able to identify when, for example, fuel levels rose vertically, indicating stationary refuelling stops, or periods when engines were in “Hotel Mode” (stationary but still using fuel to heat and light carriages).
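The denoise-then-detect step can be sketched as follows. The team used an L1 Kalman filter, which is more robust to the fuel-surge spikes described above; this illustrative version substitutes a plain scalar Kalman filter with assumed noise parameters, then flags refuelling stops as sharp upward jumps in the smoothed level.

```python
import numpy as np

# Illustrative denoise-then-detect step. The project used an L1 Kalman
# filter (more robust to fuel-surge spikes); this plain scalar Kalman
# filter with assumed noise parameters shows the same shape of computation.

def kalman_smooth(z, process_var=0.1, measurement_var=1.0):
    """Smooth a noisy fuel-level series z (litres, one reading per interval)."""
    x, p = float(z[0]), 1.0            # state estimate and its variance
    out = []
    for meas in z:
        p += process_var               # predict: the true level drifts slowly
        k = p / (p + measurement_var)  # Kalman gain
        x += k * (meas - x)            # correct towards the measurement
        p *= 1 - k
        out.append(x)
    return np.array(out)

def refuelling_stops(smoothed, jump_threshold=40.0):
    """Indices where the smoothed level jumps sharply upwards (refuelling)."""
    return np.flatnonzero(np.diff(smoothed) > jump_threshold)
```

On a synthetic series that declines steadily and then jumps by several hundred litres, the detector flags only the samples around the jump; spikes from noise are absorbed by the filter before the threshold is applied.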

The analysis then became even more forensic. The team studied engine movements over particular time periods. From this data it was possible to deduce stationary periods when trains were recommissioned or refuelled, running periods between stops, the frequency of station stops, the distances between those stations and even the probable routes and types of service (local, commuter, express, etc.).

With the use of standard machine-learning techniques and the support vector machine, a very credible and revealing picture of train usage emerged. Engines with less than 10% movement were deemed ‘unused’; engines with less than 50% movement were considered ‘lightly used’; and those with more than 50% movement were classified as ‘heavily used’.
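The usage buckets above amount to simple thresholding. A sketch, using the percentages quoted in the analysis:

```python
def usage_class(moving_fraction):
    """Bucket an engine by the fraction of time it spends in motion,
    using the thresholds quoted in the analysis."""
    if moving_fraction < 0.10:
        return "unused"
    if moving_fraction < 0.50:
        return "lightly used"
    return "heavily used"
```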

The team then turned the spotlight on the critical issue of fuel efficiency. By classifying the data in a different way, they could see exactly when and where fuel usage was high (more than half a litre per minute), medium or low.
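The fuel-rate classification can be sketched the same way. Only the half-litre-per-minute ‘high’ threshold is quoted above; the boundary between medium and low is an assumed, illustrative value.

```python
def fuel_class(litres_per_min, high=0.5, low=0.1):
    """'high' (> half a litre per minute) is from the analysis;
    the low/medium boundary is an assumed illustrative value."""
    if litres_per_min > high:
        return "high"
    return "medium" if litres_per_min > low else "low"
```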

The team then used weighted maximum-matching algorithms to discover which engines had been paired for different routes (typically trains are driven by two engine units, front and rear, with carriages in between). By aggregating the data, the team discovered which engines were routinely paired.
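A sketch of the pairing step, with invented engine IDs and co-occurrence data. For a handful of engines a brute-force search over disjoint pairs is enough; a real implementation would use a proper weighted-matching algorithm such as the blossom method.

```python
from collections import Counter
from itertools import combinations

# Invented observations: each entry records two engines seen working the
# same service window. Weights count how often a pair is seen together.
observations = [("E01", "E07"), ("E01", "E07"), ("E01", "E03"),
                ("E03", "E09"), ("E03", "E09"), ("E03", "E09")]
weights = Counter(tuple(sorted(p)) for p in observations)

def best_matching(edges):
    """Disjoint set of pairs with the highest total weight (brute force)."""
    best, best_total = [], 0
    items = list(edges.items())
    for r in range(1, len(items) + 1):
        for subset in combinations(items, r):
            engines = [e for pair, _ in subset for e in pair]
            if len(engines) != len(set(engines)):  # an engine used twice
                continue
            total = sum(wt for _, wt in subset)
            if total > best_total:
                best, best_total = [pair for pair, _ in subset], total
    return best
```

Here `best_matching(weights)` pairs E01 with E07 and E03 with E09: even though E01 and E03 have been seen together once, each engine can belong to only one routine pairing, and those two pairs maximise the total co-occurrence weight.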

Finally, it was possible to map engine journeys with a high degree of accuracy. By matching the pairings classification to Network Rail’s movement data, once again using the weighted maximum-matching algorithms, the team tracked one train throughout a hard-working day. Leaving Harrogate at 07:34, it travelled to Kings Cross, then on to Newark North Gate, before returning to Kings Cross and finishing the day in Sunderland.

The take-home message here is clear: you don’t need big data to undertake meaningful data analysis. These fuel time series amount to only a few kilobytes per engine, yet they generated insights that would be impossible to develop manually. Even on these very small datasets, machine-learning techniques outperform anything that humans can deliver.

Big data has the potential to be a massive asset. But that potential is purely theoretical until it is distilled into small data. That is the moment when data assumes new and usable value.

This article is an abridged version of a more in-depth point of view which can be found on the Clustre website at: www.clustre.net/big-data/

More information on the University of Cincinnati AI project can be found at: www.magazine.uc.edu/editors_picks/recent_features/alpha.html

Andrew Simmonds is Consulting Director at Clustre – The Innovation Brokers www.clustre.net

© 2018 Clustre, The Innovation Brokers All rights reserved.