How Data Quality Can Hurt Your Data Science Project… If You’re Not Careful
If “Data Scientist is the sexiest job in the 21st Century”, then data quality is the least sexy aspect, but that doesn’t take away from its critical importance.
With growing investment around Artificial Intelligence and Machine Learning, and near-daily success stories in the news, organisations in traditional industries are investing in data science like never before.
However, many of these companies must deal with legacy systems, a lack of core data skills across their employees, and poor data quality.
The consequences of poor data quality can be enormous. A research study published in MIT Sloan Management Review estimates that companies lose around 15% to 25% of their revenue to poor data quality.¹
Poor data quality has even been cited as a factor in disasters including the explosion of the space shuttle Challenger and the shooting down of an Iranian Airbus by the USS Vincennes.²
One consequence of poor data quality is that knowledge workers waste up to 50% of their time dealing with mundane data quality issues. For data scientists, this number may go as high as 80%.
A strategy of hiring data scientists and pointing them at business problems is unlikely to produce the expected results if strong data foundations are not in place. A common refrain is “garbage-in, garbage-out” (GIGO): if the data used to train a model is of very poor quality, the model is unlikely to produce accurate results.
A Kaggle survey in 2017 of professionals in the Data Science domain³ showed that Dirty Data was their number one challenge (see Figure 1).
Figure 1 — Kaggle 2017 ML & DS Survey — What barriers are faced at work?
Although there are several ways data scientists can tackle data quality issues, data quality should be an organisation-wide priority.
Beyond the importance of quality data in training machine learning models, data quality affects Business Intelligence, Management Information, and every process or decision that relies on organisational data being correct.
What do we mean by Data Quality?
Data Quality is a very broad term but can be considered across six key dimensions⁴:
Completeness Are all data sets and data items recorded?
Example — For an e-commerce website, is there a record for every customer who has created an account on the platform?
Consistency Can we match the data set across data sources?
Example — For an airline, is the passport number on a passenger’s boarding card the same as on their passport?
Uniqueness Is there a single view of unique data attributes?
Example — In the case of a manufacturing company, is there a single entry for each supply chain vendor in the vendor master data?
Validity Does the data match defined rules?
Example — Does the customer nationality fall into a defined set of nationalities?
Accuracy Does the data reflect the real value?
Example — A customer enters an incorrect post code when ordering an item to be delivered.
Timeliness Is the data available when required after it was entered or gathered?
Example — Are financial transactions available in time for a fraud detection model to detect fraudulent transactions, before more such transactions are carried out?
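Several of these dimensions can be measured with very little code. The following is a minimal sketch, not a full profiling tool: the record layout, the field names and the allowed nationality set are all hypothetical, and each function simply returns the share of rows that pass. Consistency and accuracy are left out because they need a second data source or a ground truth to compare against.

```python
from datetime import datetime, timedelta

# Hypothetical customer records; field names and values are illustrative only.
customers = [
    {"id": 1, "email": "a@example.com", "nationality": "GB",
     "updated": datetime(2020, 1, 10, 9, 0)},
    {"id": 2, "email": None, "nationality": "FR",
     "updated": datetime(2020, 1, 10, 9, 5)},
    {"id": 2, "email": "c@example.com", "nationality": "XX",
     "updated": datetime(2020, 1, 8, 17, 0)},
]
ALLOWED_NATIONALITIES = {"GB", "FR", "DE"}  # assumed reference set


def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, key):
    """Ratio of distinct key values to rows (1.0 means no duplicates)."""
    return len({r[key] for r in rows}) / len(rows)


def validity(rows, field, allowed):
    """Share of rows whose value falls inside the defined set."""
    return sum(r[field] in allowed for r in rows) / len(rows)


def timeliness(rows, field, as_of, max_age):
    """Share of rows refreshed within max_age of the as_of timestamp."""
    return sum(as_of - r[field] <= max_age for r in rows) / len(rows)


report = {
    "completeness(email)": completeness(customers, "email"),
    "uniqueness(id)": uniqueness(customers, "id"),
    "validity(nationality)": validity(customers, "nationality", ALLOWED_NATIONALITIES),
    "timeliness(updated)": timeliness(
        customers, "updated", datetime(2020, 1, 10, 10, 0), timedelta(days=1)
    ),
}
for name, score in report.items():
    print(f"{name}: {score:.2f}")
```

Tracking scores like these over time, rather than as a one-off, is what turns them into a useful signal.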
Poor quality data can be introduced into a database in many ways, including:
Data input errors, due to lack of validation at the point of data entry, for example a front-end web application.
Migration or integration of source systems, causing inconsistencies between the data coming from disparate systems.
Complex data pipelines, whose transformations can introduce inconsistent and untimely data.
Lack of organisational Master Data, representing the most valuable information agreed upon across an organisation, and the management of such data through Master Data Management (MDM).
How good is good enough?
In any organisation, it is a certainty that data will never be 100% perfect. There will always be inconsistencies through human error, machine error or through sheer complexity due to the growing volume, velocity and variety of data companies now handle.
So that leads to the question, how good is good enough for the purposes of data science?
This depends on how accurate the business needs a model to be for the problem at hand. Terms such as precision and recall need to be explained to non-technical stakeholders so that they can judge whether data quality and model performance are high enough to meet the demands of the business.
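As a small worked example of those two terms, the sketch below computes precision and recall from the counts of true positives, false positives and false negatives. The fraud-detection figures are made up for illustration.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts.

    precision = tp / (tp + fp): of the cases the model flagged, how many were right.
    recall    = tp / (tp + fn): of the real cases, how many the model found.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical fraud model: 80 frauds caught, 20 false alarms, 40 frauds missed.
p, r = precision_recall(tp=80, fp=20, fn=40)
print(f"precision={p:.2f} recall={r:.2f}")
```

Framed this way, the conversation with stakeholders becomes concrete: is one false alarm in five acceptable, and is missing one fraud in three acceptable?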
Also, there are many techniques that data scientists can use during data processing, when developing a machine learning model, to address data quality issues:⁵
Imputation of missing values
Data standardisation and de-duplication
Handling of different data quantities
Analytical transformation of input variables
Selection of variables for predictive modelling
Assessment of model quality
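The first two techniques in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the vendor records and field names are hypothetical, median imputation is only one of many imputation strategies, and the standardisation rule (lower-case, strip full stops, collapse whitespace) is deliberately simplistic.

```python
from statistics import median


def impute_median(rows, field):
    """Replace missing values of a numeric field with the median of observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


def standardise(name):
    """Lower-case, drop full stops and collapse repeated whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def deduplicate(rows, field):
    """Keep the first row seen for each standardised value of the field."""
    seen = set()
    unique = []
    for r in rows:
        key = standardise(r[field])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


# Hypothetical vendor master data with a duplicate and a missing value.
vendors = [
    {"name": "Acme Ltd.", "lead_time_days": 5},
    {"name": "acme  ltd", "lead_time_days": None},
    {"name": "Bolt Supplies", "lead_time_days": 9},
]

cleaned = deduplicate(impute_median(vendors, "lead_time_days"), "name")
for row in cleaned:
    print(row)
```

Exact matching on a standardised key only catches the easy duplicates; fuzzier cases call for the record linkage techniques discussed later in this article.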
However, these approaches can only get a data science team so far, and it is everyone’s responsibility to improve the quality of data in an organisation.
How can Data Quality issues be addressed by an organisation?
Data Quality Culture
Establishing a strong culture of data quality is paramount and must be initiated at the top of the organisation. There is a nine-step guide for organisations that wish to improve data quality;⁶ its steps include:
Drive process reengineering at the executive level
Spend money to improve the data entry environment
Spend money to improve application integration
Spend money to change how processes work
Promote end-to-end team awareness
Promote interdepartmental cooperation
Publicly celebrate data quality excellence
Continuously measure and improve data quality
Data Quality Processes
Good data quality begins at the point of entry. Validate data as far upstream in a data pipeline as possible, as this reduces the need for downstream applications to duplicate effort in cleaning data. Take for example an application that takes customer delivery orders. If the front-end application validates email addresses, postal addresses and payment details, downstream processing of the data will not have to correct as many data quality issues.
Automating data profiling and validation at every stage of the data pipeline can help identify issues early on and save the time otherwise spent manually identifying them further down the line. Simple scripts, written by data engineers, that check for matching row counts, intact table relationships and expected data types are well worth the small amount of upfront development time.
Don’t reinvent the wheel. Leverage international standards (for example ISO 3166-1 country codes and ISO 4217 currency codes) or country-specific standards for reference data (for example the Royal Mail’s Postcode Address File (PAF)).
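The kind of simple pipeline checks described above might look like the sketch below. It assumes rows are plain dictionaries; the table names, field names and schema are hypothetical, and the currency set is only a small subset of ISO 4217 used for illustration.

```python
def check_row_counts(source_rows, target_rows):
    """True when no rows were lost or duplicated between two pipeline stages."""
    return len(source_rows) == len(target_rows)


def check_types(rows, schema):
    """Return (row index, field) pairs whose value is not of the expected type."""
    return [
        (i, field)
        for i, row in enumerate(rows)
        for field, expected in schema.items()
        if not isinstance(row.get(field), expected)
    ]


def check_foreign_keys(child_rows, fk, parent_rows, pk):
    """Return foreign-key values in the child table with no matching parent row."""
    parent_keys = {p[pk] for p in parent_rows}
    return [c[fk] for c in child_rows if c[fk] not in parent_keys]


def check_reference_codes(rows, field, allowed):
    """Return values that are not in the agreed reference data set."""
    return [r[field] for r in rows if r[field] not in allowed]


# Hypothetical tables at two stages of a pipeline.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders_source = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0, "currency": "GBP"},
    {"order_id": 11, "customer_id": 3, "amount": "12.50", "currency": "EURO"},
]
orders_loaded = list(orders_source)

ORDER_SCHEMA = {"order_id": int, "customer_id": int, "amount": float, "currency": str}
CURRENCIES = {"GBP", "EUR", "USD"}  # illustrative subset of ISO 4217

print("row counts match:", check_row_counts(orders_source, orders_loaded))
print("type errors:", check_types(orders_loaded, ORDER_SCHEMA))
print("orphaned customer ids:",
      check_foreign_keys(orders_loaded, "customer_id", customers, "customer_id"))
print("unknown currencies:", check_reference_codes(orders_loaded, "currency", CURRENCIES))
```

Each check returns the offending values rather than raising, so a pipeline can decide for itself whether to fail, quarantine the rows or just log a warning.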
Data Quality Software
There exists a plethora of software solutions to manage and improve data quality, such as Talend and Informatica, that include a range of critical functions, such as profiling, parsing, standardisation, cleansing, matching, enrichment and monitoring. A 2019 Gartner report (see Figure 2) evaluates 15 vendors for data quality tools based on their Ability to Execute and Completeness of Vision.
Figure 2 — Gartner 2019 Magic Quadrant for Data Quality Tools
Some of these vendors, such as Informatica with CLAIRE,⁷ have recently incorporated machine learning and artificial intelligence into their product offering.
How can Machine Learning and Artificial Intelligence help improve Data Quality?
Machine learning and artificial intelligence are reliant on good quality data, but a growing number of early adopters are also turning to them to automate the process of cleaning data. Some of the applications of ML & AI to data quality are:
Named Entity Recognition — Named Entity Recognition (NER) is a Natural Language Processing technique that automates the retrieval of important entities, such as persons, organisations and locations, from unstructured data. Say for example you had an unstructured address field that contained the city; NER could be used to extract this useful information.
Record Linkage (Matching) — Probabilistic record linkage has been used for many years in a variety of industries, including medical, government, private sector and research groups. While this method can produce useful results, it is now possible to improve accuracy by using machine learning or neural network algorithms.⁸
Text Classification — Another Natural Language Processing technique, Text Classification can be used to automate the classification of unstructured text. Take for example a database containing unstructured customer complaints: text classification could be used to classify the complaints by issue (e.g. late delivery, damaged packaging, inaccurate product description).
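To make the record linkage item above more concrete, here is a minimal sketch of classical probabilistic matching in the Fellegi-Sunter style, the baseline that machine learning approaches aim to improve on. It is not the method of the cited work. The fields, the m and u probabilities and the decision thresholds are all assumed values chosen for illustration; in practice they would be estimated from data.

```python
import math

# Assumed per-field probabilities (illustrative, not estimated from data):
#   m = P(field agrees | the two records refer to the same entity)
#   u = P(field agrees | the two records refer to different entities)
FIELD_PARAMS = {
    "surname": (0.95, 0.01),
    "postcode": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def normalise(value):
    """Compare values case-insensitively and ignoring surrounding whitespace."""
    return str(value).strip().lower()


def match_weight(a, b, params=FIELD_PARAMS):
    """Sum log2 likelihood ratios: agreement adds weight, disagreement subtracts it."""
    total = 0.0
    for field, (m, u) in params.items():
        if normalise(a[field]) == normalise(b[field]):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, upper=10.0, lower=0.0):
    """Map a weight to a decision using two assumed thresholds."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


# Hypothetical records from two source systems.
crm = {"surname": "Okafor", "postcode": "SW1A 1AA", "birth_year": 1984}
billing = {"surname": "OKAFOR ", "postcode": "SW1A 2AA", "birth_year": 1984}

weight = match_weight(crm, billing)
print(f"weight={weight:.2f} decision={classify(weight)}")
```

Pairs that land between the two thresholds are the ones usually sent for manual review, which is where a trained model has the most room to reduce effort.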
The Future of Data Quality in Machine Learning and Artificial Intelligence
As we move through the new decade, more companies will be deploying machine learning models to make business critical decisions and will be using a greater volume of data from a wider variety of sources. We will likely see some high-profile incidents where decisions have been made by models trained on poor quality data, leading to regulatory fines and expensive rectification processes.
Organisations that have done the hard work by introducing a culture of Data Quality and started to automate the management of their data using ML & AI will see the benefits and have a far higher success rate in running data science initiatives.
1. Redman T. Seizing Opportunity in Data Quality (2017)