Overview
Many of our customers have enormous amounts of data scattered across their organizations. They know valuable information is locked away, but they need ways to integrate that information and bring it to bear on their missions. One approach to data management is to bring all your data to one place by implementing a data lake, which centralizes structured and unstructured information from disparate datasets in a single store. A data lake offers insight across your entire data holdings, enriching the full picture and helping uncover previously unknown links. Storing raw data alongside the results of derived analytics and machine learning models in the same data lake creates the greatest opportunity for end users to match their data to their mission.
Our customers’ datasets contain entities (people, places, organizations, devices, social media identities, etc.) without a shared primary key that can link them together. Novetta has many years of expertise in entity resolution, the process of algorithmically linking and disambiguating records from datasets with billions of records. We package that expertise in our Novetta Entity Analytics product, available in the Amazon Web Services Marketplace. Providing an entity-resolved data layer as part of a data lake gives business analysts, data scientists, and data engineers a foundation to generate rich dashboards, machine learning models, and downstream analytics.
Building a Data Lake
The task of building an effective data lake can be challenging, but Novetta utilizes a variety of AWS services that address the key components. These AWS services offer solutions for affordable redundant storage, massively parallel access to the data, a data catalog, and the ability to scale with the growth of your organization’s data.
Amazon Simple Storage Service (S3), a highly durable object store, is inexpensive, redundant, and can be accessed in a highly parallel fashion. It forms the foundation for any data lake deployed on AWS. AWS Glue, a DynamoDB table, or a Relational Database Service (RDS) instance can each serve as a data catalog for your organization. AWS Glue is a fully managed option with features such as schema crawlers and the ability to run managed extract, transform, and load (ETL) jobs. Novetta’s expertise in ETL and data management helps our customers navigate modern approaches to this problem.
AWS announced Lake Formation at re:Invent 2018, and the service has been in preview since. Lake Formation provides a guided process for setting up a data lake on S3 and AWS Glue. It also lets you query the information in your data lake through Redshift Spectrum and Athena.
Enriching Entities
Once a data lake is formed, Novetta Entity Analytics integrates with it through a Hive metastore that describes the data stored in your data lake.
The example below assumes your data is stored in the bucket org-data-sets under the base path dataset1, with subfolders for the partitions year, month, and day (for example, s3://org-data-sets/dataset1/year=2019/month=6/day=3/files).
CREATE EXTERNAL TABLE dataset1 (
  name string,
  dob string,
  address_street string,
  address_city string,
  address_state string)
PARTITIONED BY (year int, month int, day int)
STORED AS parquet
LOCATION 's3://org-data-sets/dataset1/';
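One practical detail: for a partitioned external table, Hive does not pick up the year=/month=/day= folders on its own. They have to be registered in the metastore before queries return any rows. The statements below are a minimal sketch of how that might look for the layout assumed above; the specific date values are illustrative only, and whether a bulk repair or explicit per-partition registration fits better depends on how your data arrives.

```sql
-- Option 1: scan the table location and register every
-- year=/month=/day= directory found under it.
MSCK REPAIR TABLE dataset1;

-- Option 2: register a single partition explicitly, which avoids
-- listing the whole prefix when only one day of data has landed.
ALTER TABLE dataset1 ADD IF NOT EXISTS
  PARTITION (year = 2019, month = 6, day = 3)
  LOCATION 's3://org-data-sets/dataset1/year=2019/month=6/day=3/';

-- Sanity check: filtering on the partition columns limits the scan
-- to the matching S3 prefix instead of the full dataset.
SELECT name, dob, address_city, address_state
FROM dataset1
WHERE year = 2019 AND month = 6 AND day = 3
LIMIT 10;
```

If the sanity-check query returns no rows even though files exist under the prefix, a missing partition registration is a likely cause.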
The Hive external table mapping shown above enables Novetta Entity Analytics to profile and use this dataset as part of an entity resolution job. Novetta Entity Analytics uses this configuration to determine how to examine relationships across datasets and generate entity mappings. The resulting entity map can be written back to the data lake as a new dataset or joined together to form a dataset of merged entities. Merging these entities across datasets uncovers hidden relationships for investigators to explore, new identifiers for analysts to query, and a more complete view of entities across your enterprise.