Hadoop is a great place to offload data for analytics processing or to model volumes of a single data source that are too large for existing systems to handle. However, as companies bring data from many sources into Hadoop, there is increasing demand for analysis across those sources, which can be extremely difficult to achieve.
This post is the first in a three-part series that explains the issues organizations face as they attempt to analyze different data sources and types within Hadoop, and how to resolve these challenges. Today’s post focuses on the problems that occur when combining multiple internal sources. The next two posts explain why these problems increase in complexity as external data sources are added and how new approaches help to solve them.
Data From Different Sources Hard to Connect and Map
Data from diverse sources have different structures that make it difficult to connect and map data types together, even data from internal sources. Combining data can be especially hard if customers have multiple account numbers or an organization has acquired or merged with other companies.
Over the past few years, some organizations have attempted to use data discovery or data science applications to analyze data from multiple sources stored in Hadoop. This approach is problematic because it involves a lot of guesswork: users have to decide which foreign keys to use to connect various data sources and make assumptions when creating data model overlays. These guesses are hard to test and often incorrect when applied at scale, which leads to faulty data analysis and mistrust of the sources.
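To make the guesswork concrete, here is a minimal sketch of the kind of sanity check a user might run on a candidate foreign key before trusting it. The function `key_report`, the field name `cust_id`, and the sample records are all hypothetical, and a check like this on a small sample says little about how the key behaves at scale.

```python
from collections import Counter

def key_report(left, right, key):
    """Summarize how well `key` links two lists of dict records."""
    left_keys = [r.get(key) for r in left if r.get(key) is not None]
    right_counts = Counter(r.get(key) for r in right if r.get(key) is not None)
    # Left records whose key value appears at least once on the right.
    matched = sum(1 for k in left_keys if k in right_counts)
    # Left records whose key value appears more than once on the right.
    ambiguous = sum(1 for k in left_keys if right_counts.get(k, 0) > 1)
    return {
        "match_rate": matched / len(left_keys) if left_keys else 0.0,
        "ambiguous": ambiguous,
    }

# Hypothetical sample data: two sources that share a guessed key.
orders = [{"cust_id": 1}, {"cust_id": 2}, {"cust_id": 3}, {"cust_id": 4}]
customers = [{"cust_id": 1}, {"cust_id": 2}, {"cust_id": 2}]
report = key_report(orders, customers, "cust_id")
```

In this sample only half of the orders find a customer, and one of those matches is ambiguous because the key is duplicated on the customer side. Both are signs that the guessed key may not be a reliable link.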
Hadoop Experts Attempt to Merge Data Together
As a result, organizations that want to analyze data across sources have resorted to hiring Hadoop experts to write custom, source-specific scripts that merge data sets together. These Hadoop experts are usually not data integration or entity resolution experts, but they do the best they can to meet the immediate needs of the organization.
These experts typically use Pig or Java to write hard-and-fast rules that determine how to combine structured data from specific sources, e.g., matching records based on an account number. Once a script for two sources has been written, adding a third source means throwing the first script away and designing a new one to combine all three. The same thing happens each time another source is added. Not only is this approach inefficient, but it also fails when applied at scale, handles edge cases poorly, can produce a large number of duplicate records, and often merges records that should not be combined.
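As an illustration only, the sketch below shows what such a hard-coded rule looks like. It is written in Python rather than Pig or Java for brevity. The source names, the `account_no` field, and the sample records are assumptions made up for this example.

```python
# Hypothetical records from two internal sources; field names are assumptions.
crm = [
    {"account_no": "A100", "name": "Jane Smith"},
    {"account_no": "A200", "name": "Raj Patel"},
]
billing = [
    {"account_no": "A100", "name": "Jane Smith"},
    {"account_no": "B900", "name": "Raj Patel"},  # same customer, second account number
]

def merge_on_account(left, right):
    """Hard-coded rule: two records match only if account_no is identical."""
    by_account = {rec["account_no"]: rec for rec in right}
    merged, unmatched = [], []
    for rec in left:
        other = by_account.get(rec["account_no"])
        if other is not None:
            merged.append({**other, **rec})
        else:
            unmatched.append(rec)
    return merged, unmatched

merged, unmatched = merge_on_account(crm, billing)
```

The rule links Jane Smith's records but leaves Raj Patel unmatched, because his billing record carries a different account number. The function is also written for exactly two sources, so a third source would require a new signature and new matching logic.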
Source-Agnostic Methods Better For Combining Data
A better approach is to combine internal data sources using a source-agnostic method built on a flexible entity resolution model, which allows new sources to be added easily through a statistically sound, repeatable process.
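To give a rough feel for what source-agnostic means, here is a toy sketch in Python. It is not a real entity resolution model: the averaged string similarity from the standard library's `difflib` and the fixed `threshold` are simple stand-ins for a statistically sound scoring method, and the `name` and `email` fields, the `resolve` function, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Average string similarity over name and email (hypothetical fields)."""
    fields = ("name", "email")
    scores = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)

def resolve(records, threshold=0.85):
    """Cluster records from any number of sources into entities (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair of records, regardless of which source each came from.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec)
    return list(clusters.values())

records = [
    {"source": "crm", "name": "Jane Smith", "email": "jane.smith@example.com"},
    {"source": "billing", "name": "Jane  Smith", "email": "jane.smith@example.com"},
    {"source": "support", "name": "Raj Patel", "email": "raj@example.com"},
]
entities = resolve(records)
```

Nothing in `resolve` depends on how many sources there are or which ones they are, so adding a source means appending its records rather than rewriting the matching logic. Comparing every pair of records would not scale to Hadoop-sized data as written. It is used here only to keep the example short.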
Stay tuned for my next post, where I discuss why adding external data to Hadoop further increases analytical challenges, especially when data sources are semi-structured, unstructured, fragmented or dirty.