“It is a capital mistake to theorise before one has data”: Arthur Conan Doyle, creator of Sherlock Holmes.
The Challenge of Compiling Data
This elementary truism – penned by the world-famous author – hails back to the Victorian age, when data was tangible yet often elusive material to be laboriously gathered and sifted through by hand. Compiling data and examining it could only be done using the human brain. It was in 1898 that what could be called the first technological solution to controlling data was invented – the vertical filing cabinet. Its inventor, Edwin G. Seibels, was working in his father’s insurance company when he saw the need to record, sort and file the data his father needed in an efficient and accessible manner.
This ingenious invention soon replaced the filling of large, almost inaccessible ledgers. Seibels’ idea was to change the handling of information within offices forever. Of course – like all tools – a filing cabinet is only effective if it is used effectively. If a user merely dumps a heap of papers, photographs, and files into the various drawers in an unstructured manner, it defeats the purpose of the invention. And most of us have come across such sloppy, ineffective filing systems at one time or another in our careers.
Filing Cabinets & PCs
Now readers of this blog might be scratching their heads trying to understand the connection between Big Data and filing cabinets. Well, those familiar with how computers work know that the whole concept of the PC is based upon the workings of an office: we have a desktop, we have folders, we have documents, and we have hard drives (i.e., filing cabinets) that allow us to structure our data so that it can be stored, searched, parsed, retrieved, and interrogated with ease. That is the connection between the two. The difference today is the invention of the Cloud (that giant Filing Cabinet in the Sky), an amorphous entity that now holds an inconceivable amount of data. Here are a few figures to help you understand the extent of Big Data:
The numbers outlined in the chart above are the quantity of Zettabytes estimated to be in storage and the growth of that number over a decade or more. Now, if you are like me and have no idea what constitutes a Zettabyte, let me clarify: a Zettabyte is a trillion gigabytes of space. And remember, a gigabyte itself is a billion bytes. I hope that clarifies things, and good luck with doing the maths on those numbers. That data is made up of text files, photographs, images, customer transactions, social media entries, financial data and more. In short, everything that is done online these days goes into that immense filing cabinet in the sky. That being so, it is easy to understand why there has been such phenomenal growth in the quantities stored.
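For anyone who would rather let a computer do the maths, here is a minimal Python sketch of the arithmetic above. It assumes decimal (SI) units, where a gigabyte is a billion bytes; the constant and function names are purely illustrative.

```python
# Decimal (SI) units, as used in the text above.
BYTES_PER_GIGABYTE = 10**9     # a billion bytes
BYTES_PER_ZETTABYTE = 10**21   # a billion bytes, a trillion times over


def zettabytes_to_gigabytes(zettabytes: int) -> int:
    """Convert a whole number of zettabytes to gigabytes."""
    return zettabytes * BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE


gigabytes_in_one_zettabyte = zettabytes_to_gigabytes(1)
print(f"{gigabytes_in_one_zettabyte:,}")  # 1,000,000,000,000 (a trillion)
```

Using whole-number arithmetic avoids any rounding surprises at this scale.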
What is Bad Data?
Of course, as with Seibels’ vertical filing cabinet, data stored in the Cloud is useless to the end-user (i.e., big companies) if it is not sorted and filed in an advantageous way. Companies look for the following qualities when measuring how useful their data is:
- Accuracy
- Completeness
- Reliability
- Relevancy
- Timeliness
If the data does not possess these qualities, it is defined as Bad Data. And the existence of poorly constructed data does have an enormous impact on modern-day companies. IBM, for example, estimates that in the US alone Bad Data costs businesses $3.1 trillion annually. Optimising Processing Performance, understanding Variance and Bias, identifying Dirty and Noisy Data, preventing Concept Drift, exercising care around Data Provenance, and removing Uncertainty are all key to ensuring a company is leveraging the cleanest, most efficient, and most relevant data available.
The good news is that Bad Data can be rescued. The data might be outdated, unstructured, unformulated, irrelevant, or corrupted, but that is often due to the process it has gone through or the way people compiled it. Looking for information in a pile of Bad Data is akin to sticking your hand blindly into a bin full of Post-Its hoping to find something relevant and then basing an important business decision on your findings. Companies can and do rescue their data through a focused, structured, and professional engagement with the challenge.
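To make two of those qualities a little more concrete, here is a rough Python sketch of how Completeness and Timeliness might be scored for a batch of records. Everything in it is an assumption made for illustration: the field names (`customer_id`, `email`, `updated_at`), the one-year freshness window, and the sample records. Real quality rules depend entirely on the business in question.

```python
from datetime import date, timedelta

# Hypothetical schema: the fields every customer record is expected to carry.
REQUIRED_FIELDS = ("customer_id", "email", "updated_at")


def completeness(records):
    """Share of records in which every required field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(record.get(field) not in (None, "") for field in REQUIRED_FIELDS)
        for record in records
    )
    return complete / len(records)


def timeliness(records, today, max_age_days=365):
    """Share of records last updated within max_age_days of today."""
    if not records:
        return 0.0
    cutoff = today - timedelta(days=max_age_days)
    fresh = sum(
        1
        for record in records
        if record.get("updated_at") is not None and record["updated_at"] >= cutoff
    )
    return fresh / len(records)


# Made-up sample data, for illustration only.
sample = [
    {"customer_id": 1, "email": "a@example.com", "updated_at": date(2024, 5, 1)},
    {"customer_id": 2, "email": "", "updated_at": date(2024, 4, 12)},
    {"customer_id": 3, "email": "c@example.com", "updated_at": date(2019, 1, 3)},
    {"customer_id": 4, "email": "d@example.com", "updated_at": None},
]
today = date(2024, 6, 1)

print(completeness(sample))       # 0.5: records 2 and 4 have gaps
print(timeliness(sample, today))  # 0.5: record 3 is stale, record 4 has no date
```

Scores like these only flag where to look. Deciding what counts as "complete" or "fresh" enough remains a business decision.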
Business Analytics, that great driver of modern-day decision making, can be stymied by Bad Data. This can cause companies to make ill-informed and costly decisions. It can lead companies to miss potentially significant market trends until it is too late. The reliance on Bad Data leads to wasted investment and can cause companies to stagnate. The use of Bad Data results in the loss of time, the misuse of resources, the failure to identify opportunities, and the loss of revenue. Undoubtedly, the adage of “garbage in – garbage out” could have been written to describe the dangers of misusing data.
How to Fix It?
As with most problems in business, a structured, reasoned, and well-planned approach will avoid the pitfalls described above. The following steps are elementary and yet essential when tackling Big Data and exploiting that most valuable of company assets.
- Design a tightly controlled dataflow pipeline from the ingestion of data to the interrogation of it for business intelligence.
- Map your data to meet your business needs.
- Regularly test the integrity of your system and content.
- Employ professional data engineers and data analysts. It is not a part-time post for some intern.
- Feed reliable and tested data into a process if you want to ensure quality data coming out.
- Employ the best business intelligence tools that will allow you to optimise your data sources.
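As one possible illustration of the "reliable and tested data in" step, the Python sketch below puts a simple validation gate at the point of ingestion. The record layout (`order_id`, `amount`) and the rules themselves are hypothetical. In practice the checks would come out of mapping the data to your own business needs.

```python
def find_problems(record):
    """Return a list of problems with a record; an empty list means it passed."""
    problems = []
    if not isinstance(record.get("order_id"), int):
        problems.append("order_id missing or not an integer")
    amount = record.get("amount")
    if (
        not isinstance(amount, (int, float))
        or isinstance(amount, bool)
        or amount < 0
    ):
        problems.append("amount missing, non-numeric, or negative")
    return problems


def ingest(records):
    """Split incoming records into those fit for analysis and those set aside."""
    accepted, rejected = [], []
    for record in records:
        problems = find_problems(record)
        if problems:
            # Keep the reasons with the record so it can be repaired later.
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


# Made-up sample data, for illustration only.
incoming = [
    {"order_id": 101, "amount": 25.0},
    {"order_id": None, "amount": 10.0},
    {"order_id": 103, "amount": -5.0},
]
accepted, rejected = ingest(incoming)
print(len(accepted), len(rejected))  # 1 2
```

Rejected records are kept alongside the reasons for rejection rather than silently dropped, so they can be rescued instead of lost.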
Data is an extremely valuable company resource. However, companies would be better advised to use no data than to rely on Bad Data to drive, guide and grow a business. If a company loses faith in the data provided by the system, that data simply becomes a millstone dragging the company down. In short, treat data with respect. Build a team armed with the best tools and systems to manage and exploit your data assets. If as a company you believe you will not need to embrace the science of Big Data, the following quotation might change your mind:
“Every company has Big Data in its future, and every company will eventually be in the data business.” – Thomas H. Davenport, Co-founder, International Institute for Analytics
You have been warned. Do not employ half measures when dealing with Big Data. If you do, prepare to be a loser.
Big Data image used above was sourced at: https://towardsdatascience.com/
Aidan Collins, Marketing Manager