When extracting value from data, organizations often encounter challenges such as datasets with varying schemas, large volumes of data, and the absence of unique identifiers that would enable entity linking. Novetta Entity Analytics (NEA) is an analytics engine capable of scaling to tens of billions of records across hundreds of datasets, generating well-resolved entities and deriving previously unknown links.
Entity Graph Generation through Novetta Entity Analytics
NEA helps users discover relationships by grouping and linking results based on customized strategies. Grouping, also known as entity resolution, is a strategy-based process of identifying records from data sources as belonging to distinct groups. Linking is a strategy-based process of identifying relationships between records or groups, potentially of different types.
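To make the idea of a grouping strategy concrete, here is a minimal sketch in Python. It is not NEA's implementation: the record fields, the `same_person` rule, and the use of union-find to merge matching records are all assumptions chosen for illustration.

```python
from itertools import combinations

# Hypothetical source records from two datasets; field names are illustrative.
records = [
    {"id": "crm-1", "name": "Jack Smith", "dob": "1980-01-02"},
    {"id": "web-7", "name": "jack smith ", "dob": "1980-01-02"},
    {"id": "crm-2", "name": "Jill Jones", "dob": "1979-05-06"},
]


def normalize(text):
    """Lowercase and collapse whitespace so trivially different names compare equal."""
    return " ".join(text.lower().split())


def same_person(a, b):
    """Toy grouping strategy: normalized name and date of birth must both match."""
    return normalize(a["name"]) == normalize(b["name"]) and a["dob"] == b["dob"]


def group(records, strategy):
    """Cluster records with union-find: any pair the strategy accepts is merged."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if strategy(a, b):
            parent[find(a["id"])] = find(b["id"])

    groups = {}
    for r in records:
        groups.setdefault(find(r["id"]), []).append(r["id"])
    return sorted(sorted(g) for g in groups.values())


entities = group(records, same_person)
print(entities)
```

Comparing every pair of records is quadratic, so this only works for small inputs; it is meant to show what a strategy decides, not how a system would evaluate it at the scale of billions of records.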
NEA outputs can be conceptualized as a graph of records (edges and vertices). In graph terms, edges are derived from NEA grouping and linking workflows while vertices represent entity and source records. Using this model, NEA produces a large entity graph suitable for exploration and analysis when loaded into a graph database.
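The mapping from workflow outputs to a graph can be sketched as follows. The labels `entity`, `record`, `has_record`, and `lives_with`, as well as the shape of the `groups` and `links` inputs, are hypothetical; the actual structure depends on how the NEA workflows are configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertex:
    id: str
    label: str  # "entity" or "record" (hypothetical labels)


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    label: str


def build_graph(groups, links):
    """Turn grouping and linking results into vertices and edges.

    groups: entity id -> list of source record ids (assumed grouping output)
    links:  (entity id, entity id, link type) tuples (assumed linking output)
    """
    vertices, edges = set(), set()
    for entity_id, record_ids in groups.items():
        vertices.add(Vertex(entity_id, "entity"))
        for record_id in record_ids:
            vertices.add(Vertex(record_id, "record"))
            # Grouping yields an edge from each entity to its source records.
            edges.add(Edge(entity_id, record_id, "has_record"))
    for src, dst, link_type in links:
        # Linking yields edges between entities.
        edges.add(Edge(src, dst, link_type))
    return vertices, edges


vertices, edges = build_graph(
    groups={"e1": ["crm-1", "web-7"], "e2": ["crm-2"]},
    links=[("e1", "e2", "lives_with")],
)
print(len(vertices), "vertices,", len(edges), "edges")
```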
It’s easy to get started with NEA by launching an NEA AMI from the AWS Marketplace.
Leveraging Amazon Neptune
Graph databases are designed to efficiently store and query highly connected data. Amazon Neptune is a managed graph database service supporting configurable clusters and two popular graph models: RDF, queried with SPARQL, and the property graph, queried with Apache TinkerPop Gremlin. For this use case, we use the property graph model with Gremlin.
Amazon Neptune supports bulk ingest of property graph data in a CSV-based Gremlin load format. The loader processes vertex and edge files from S3 as an asynchronous job. The NEA publish feature generates output to Hive tables, which can then be exported as CSV for loading.
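As a rough illustration of the CSV layout the loader expects, the sketch below writes one vertex file and one edge file in memory. The system columns (`~id`, `~label`, `~from`, `~to`) and the `name:String` typed property header follow Neptune's Gremlin load format; the input dictionaries, the `entity` label, and the edge ID scheme are assumptions, and a real export would take its columns from the published Hive tables.

```python
import csv
import io


def vertices_csv(entities):
    """Render entity vertices as a Neptune Gremlin-format vertex file."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # System columns start with "~"; property columns carry a type suffix.
    writer.writerow(["~id", "~label", "name:String"])
    for entity in entities:
        writer.writerow([entity["id"], "entity", entity["name"]])
    return out.getvalue()


def edges_csv(links):
    """Render entity links as a Neptune Gremlin-format edge file."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label"])
    for link in links:
        # Hypothetical ID scheme: any value unique per edge would do.
        edge_id = f'{link["from"]}-{link["type"]}-{link["to"]}'
        writer.writerow([edge_id, link["from"], link["to"], link["type"]])
    return out.getvalue()


vertex_text = vertices_csv([{"id": "e1", "name": "Jack"}, {"id": "e2", "name": "Jill"}])
edge_text = edges_csv([{"from": "e1", "to": "e2", "type": "lives_with"}])
print(vertex_text)
print(edge_text)
```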
Loading Novetta Entity Analytics Entities and Links into Amazon Neptune
The following steps can be used to ingest entities as vertices and NEA entity links as edges.
Create an NEA grouping workflow.
Execute the NEA grouping workflow and publish the results to a table in Hive.
Using the NEA single action Publish to File feature, export the Hive table as CSV to a predefined S3 bucket.
Create and execute an NEA linking workflow which contains the same or overlapping datasets from the grouping workflow.
Using the NEA single action Publish to File feature, export the resulting Hive table as CSV to a predefined S3 bucket.
Use the Neptune loader endpoint to ingest the content from S3.
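The final step can be sketched as an HTTP POST to the cluster's `/loader` endpoint. The code below only builds the request; the endpoint host name, bucket path, IAM role ARN, and region are placeholders, and port 8182 is assumed to be the cluster's port (it is Neptune's default).

```python
import json
import urllib.request


def build_load_request(endpoint, s3_uri, iam_role_arn, region):
    """Build (but do not send) a bulk-load request for the Neptune loader."""
    body = {
        "source": s3_uri,          # S3 prefix holding the vertex and edge CSV files
        "format": "csv",           # Gremlin CSV load format
        "iamRoleArn": iam_role_arn,
        "region": region,
        "failOnError": "FALSE",    # keep loading past individual bad rows
    }
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/loader",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# All values below are placeholders, not real resources.
request = build_load_request(
    endpoint="neptune.example.internal",
    s3_uri="s3://example-bucket/nea-export/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleNeptuneLoadRole",
    region="us-east-1",
)
print(request.get_method(), request.full_url)
```

Sending the request with `urllib.request.urlopen(request)` would need network access to the cluster, which normally means running inside its VPC. Because the load is asynchronous, the response carries a load ID that can be polled with a GET on the same `/loader` path to check progress.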
Exploiting the Graph
The Gremlin query language enables users to express complex traversals of property graphs. Gremlin has built-in support for graph algorithms including shortest path, PageRank, and connected components.
Data loaded into Amazon Neptune can be queried with the Gremlin query language, for example to return all entities named Jack and Jill who live together.
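One way such a query might look is sketched below, wrapped in a request to Neptune's Gremlin HTTP endpoint. The schema is assumed: `entity` vertices with a `name` property, connected by `lives_with` edges. The actual labels and property names depend on how the NEA output was exported, and the endpoint host is a placeholder.

```python
import json
import urllib.request

# Hypothetical schema: 'entity' vertices with a 'name' property, joined by
# 'lives_with' edges. both() is used because the edge direction is not assumed.
JACK_AND_JILL = (
    "g.V().has('entity', 'name', 'Jack').as('jack')"
    ".both('lives_with').has('entity', 'name', 'Jill').as('jill')"
    ".select('jack', 'jill')"
)


def build_gremlin_request(endpoint, query):
    """Build (but do not send) a query request for the Gremlin HTTP endpoint."""
    return urllib.request.Request(
        url=f"https://{endpoint}:8182/gremlin",
        data=json.dumps({"gremlin": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


query_request = build_gremlin_request("neptune.example.internal", JACK_AND_JILL)
print(JACK_AND_JILL)
```

Each result pairs a Jack vertex with a Jill vertex reachable over a single `lives_with` edge, so the two names are matched only when the link between them exists in the graph.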
Graphs can also be enhanced with explicit links that exist within data such as financial transactions or purchases. This data can be maintained outside of NEA workflows and loaded from S3 into Neptune.
As more relationships are added to an entity graph, more questions can be answered. NEA entity resolution generates rich entities, spanning many datasets, from which previously unknown links can be derived. Loading NEA entities and links into a graph database lets data consumers explore those relationships through rich, context-aware queries.