Using ML for Data Ingestion and Cleansing Tasks
In our experience, identifying, cleansing, and normalizing data has typically been the greatest barrier to effective data analytics. According to a recent survey published by InfoWorld, data scientists report spending as much as 80% of their time locating data, collating it from multiple sources, and cleansing it before they can even begin their analytics work. As a result, organizations are looking for better ways to complete this data unification process. Over the last twenty-five years or so, there have been several key advancements in the domain, most notably the application of artificial intelligence (AI) to the ubiquitous data unification challenge. Michael Stonebraker, Professor Emeritus at UC Berkeley and an adjunct professor at MIT, describes technology adoption in this domain as a series of generational advancements:
Generation 1 occurred in the mid-to-late 90s, when enterprises began adopting data warehouses and needed a way to load data from disparate sources into them. A generation of Extract, Transform, and Load (ETL) tools came into existence to help organizations with this problem. These early ETL tools offered enterprises do-it-yourself scripting languages for creating static rules that specified transformation tasks. Users could program ingestion tasks (e.g., data cleaning, preparation, and consolidation) using the scripting engines. Many early ETL tools, however, had significant limitations. For example, some had no dedicated features for data cleaning, making it difficult to detect and remove the errors and inconsistencies that degrade data quality. ETL scripting tools were a good first step toward more effective data unification. However, as the volume and variety of data continued to expand, time-consuming manual scripting began to impede organizations’ ability to quickly derive critical insights and drive business value. By the time data was ready for analysis, the business had often moved on to new issues that needed tackling.
To hasten the delivery of critical business (or mission) data analytics, many enterprises adopted ETL products that included more sophisticated rules. These took two different forms:
Data cleansing tools were included in ETL suites. These were usually rule-based systems, which were the preferred AI technology of the time. Still, some (if not most) data cleansing was performed using hard-coded data lookup tables.
The second application of rule-based systems was known as Master Data Management (MDM). MDM was primarily used to remove duplicates. De-duplication is necessary to improve the accuracy of the resulting analytics, but apparent duplicate entities are rarely exact duplicates. As such, organizations had to decide on a single representation, or “golden record”, for each entity. In this sense, what were really entity resolution challenges were often presented as duplicate data problems.
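The two rule-based forms above can be illustrated with a minimal sketch. It is not taken from any particular ETL or MDM product; the lookup table, the field names (name, state, phone, updated), the matching rule, and the "newest non-empty value wins" survivorship rule are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical hard-coded lookup table of the kind described above.
STATE_LOOKUP = {
    "mass.": "MA", "massachusetts": "MA", "ma": "MA",
    "calif.": "CA", "california": "CA", "ca": "CA",
}

def cleanse(record):
    """Apply static cleansing rules to one record and return a new dict."""
    cleaned = dict(record)
    # Collapse repeated whitespace and normalize capitalization.
    cleaned["name"] = " ".join(record["name"].split()).title()
    state = record["state"].strip().lower()
    cleaned["state"] = STATE_LOOKUP.get(state, record["state"].strip().upper())
    return cleaned

def match_key(record):
    """Deterministic matching rule (assumed): same cleaned name and state."""
    return (record["name"].lower(), record["state"])

def golden_record(group):
    """Survivorship rule (assumed): newest non-empty value wins per field."""
    newest_first = sorted(group, key=lambda r: r["updated"], reverse=True)
    fields = ["name", "state", "phone"]
    return {f: next((r[f] for r in newest_first if r[f]), "") for f in fields}

def unify(records):
    """Cleanse records, group apparent duplicates, build one golden record each."""
    groups = defaultdict(list)
    for rec in map(cleanse, records):
        groups[match_key(rec)].append(rec)
    return [golden_record(g) for g in groups.values()]

records = [
    {"name": "acme  corp", "state": "Mass.", "phone": "", "updated": "2021-03-01"},
    {"name": "Acme Corp", "state": "MA", "phone": "617-555-0100", "updated": "2019-07-15"},
    {"name": "Globex Inc", "state": "california", "phone": "415-555-0199", "updated": "2020-01-10"},
]
golden = unify(records)
```

Here the two Acme records collapse into one golden record because the cleansing rules happen to make them identical. A record named "Acme Corporation" would land in its own group until someone writes another rule for it, which is the kind of rule growth that limits this approach as data variety increases.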
In Generation 2, organizations used AI-based rule technology to solve a variety of data unification problems. Several vendors still provide MDM tools today, but these are better understood as an advancement on first-generation AI: at their core they remain human-generated deterministic rules, which cannot scale to meet the data variety of today’s enterprise data unification requirements. Although the number of rules a human can comprehend varies with the application and its complexity, most people cannot understand more than 500 rules.
When solving problems involving significant data variety, rule-based systems will always fall short. To overcome these limitations, Stonebraker recommends using machine learning (ML), the basis of Generation 3. Why not use the 500 manually created rules to generate training data for a classification model? Such a model would then be employed to classify the remaining transactions. This approach uses ML to solve all the problems originally addressed by the 2nd generation rule-based systems. In other words, a machine learning-based approach employs an ML model for schema integration, classification, entity consolidation, and golden record construction. As long as the training data is reasonably accurate and covers the data set well, an ML model can be expected to achieve solid results. However, human intervention is still required to sample the ML output for accuracy. When errors are discovered in this QA process, they can be corrected and the corrected data added to the training set, which improves the accuracy of the model through active learning. If your enterprise has a “small” data problem, then you can safely use a 1st or 2nd generation ingestion system. However, if scalability is required (either now or in the future), then a 3rd generation system is a necessity. The innovations presented by 3rd generation data unification capabilities will enable your organization to truly harness big data.
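The rules-as-training-data idea and the review loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not Stonebraker's implementation or any vendor's: the keyword rules, the spend categories, the sample transactions, and the reviewer corrections are hypothetical, and a small hand-written naive Bayes classifier stands in for whatever model a real system would use.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-written rules: keyword -> spend category.
RULES = {"airlines": "travel", "hotel": "travel", "staples": "office", "toner": "office"}

def apply_rules(description):
    """Return the category of the first rule that fires, or None."""
    for token in description.lower().split():
        if token in RULES:
            return RULES[token]
    return None

class NaiveBayes:
    """Minimal multinomial naive Bayes over whitespace tokens."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.token_counts[label].update(text.lower().split())
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        tokens = [t for t in text.lower().split() if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus Laplace-smoothed log likelihood of known tokens.
            score = math.log(n / total)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def train_from_rules(transactions):
    """Use rule hits as training labels; return model, training set, and the rest."""
    labeled = [(t, apply_rules(t)) for t in transactions]
    train = [(t, y) for t, y in labeled if y is not None]
    rest = [t for t, y in labeled if y is None]
    return NaiveBayes().fit(*zip(*train)), train, rest

def review_and_retrain(model, train, predictions, corrections):
    """Fold reviewer corrections into the training set and refit the model."""
    fixed = [(t, corrections[t]) for t, y in predictions
             if t in corrections and corrections[t] != y]
    if fixed:
        train = train + fixed
        model = NaiveBayes().fit(*zip(*train))
    return model, train

transactions = [
    "Delta Airlines ticket Boston",
    "Marriott hotel Boston two nights",
    "Staples paper and pens",
    "HP toner cartridge",
    "United ticket Chicago",         # no rule fires
    "Printer paper ream",            # no rule fires
    "Boston office chair delivery",  # no rule fires
]
model, train, rest = train_from_rules(transactions)
predictions = [(t, model.predict(t)) for t in rest]

# Hypothetical reviewer feedback gathered by sampling the model output.
corrections = {"Boston office chair delivery": "office"}
model, train = review_and_retrain(model, train, predictions, corrections)
```

In this toy data set the rules label four of the seven transactions, and the model trained on those four classifies the other three. It mislabels the office chair purchase as travel because "Boston" only appears in travel examples; once the reviewer's correction is added and the model is refit, the same transaction is classified as office. A production system would sample outputs for review systematically rather than rely on a hand-picked correction.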