This is the first post in my two-part blog series about the types of intelligence Novetta has built into Novetta Entity Analytics.
The intelligence we have included in our software leverages the knowledge our team of computer science and data integration experts has gained over the past 14 years consulting on many large data integration and entity analytics projects.
Since 2001, Novetta’s team has focused on resolving and analyzing entities across many data sources. Our work began around the time of 9/11 to help our government customers better respond to increased threats. Ever since, we have been perfecting our data integration and entity resolution processes and rules on many different data types and sources: public, private, open-source, structured, semi-structured and unstructured. One of our largest projects thus far required us to maintain minimum accuracy rates of 99.5% when combining and resolving more than 2,000 diverse data sources.
Our goal for Novetta Entity Analytics is to provide customers the ability to automatically combine diverse data sets and resolve entities within them, without having to become data experts themselves or hire a team of data scientists and data integration resources to get the job done. We are continually expanding and enhancing our software’s built-in intelligence as we develop new best practices from our work with customers in the field and in our lab.
Described below are three types of intelligence included in Novetta Entity Analytics, with examples of their use and benefits. They cover automatically extracting data from unstructured and semi-structured sources, improving data cleansing results, and identifying data values that are inappropriate for use in resolving entities.
1. Automated extraction of data from unstructured and semi-structured sources
Novetta created and continually updates curated lists and grammars that allow Novetta Entity Analytics to automatically detect and extract entity features from unstructured and semi-structured data sources. This includes lists and grammars for names, addresses, phone numbers, passport numbers, credit card numbers, etc.
EXAMPLE: For extracting people's names, Novetta has curated a list of millions of names from multiple countries and languages and developed grammars that automatically identify the various parts of a name, such as first names, surnames, suffixes, and titles or prefixes.
Novetta Entity Analytics uses these curated lists and grammars to extract data useful for entity resolution from unstructured or semi-structured data sources, including text contained within tables, where natural language processing approaches typically fail.
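To make the list-plus-grammar idea concrete, here is a minimal Python sketch. It is not Novetta's implementation: the four tiny word lists, the tag names and the single "optional title, first name, surname, optional suffix" rule are all hypothetical stand-ins for the much larger curated lists and grammars described above.

```python
import re

# Hypothetical, tiny stand-ins for curated name lists (assumptions for illustration).
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
SUFFIXES = {"jr", "sr", "ii", "iii"}
FIRST_NAMES = {"john", "maria", "wei", "ahmed"}
SURNAMES = {"smith", "garcia", "chen", "hassan"}


def tag_token(token):
    """Label a token by looking it up in the lists (case-insensitive, ignoring a trailing period)."""
    key = token.lower().rstrip(".")
    if key in TITLES:
        return "TITLE"
    if key in SUFFIXES:
        return "SUFFIX"
    if key in FIRST_NAMES:
        return "FIRST"
    if key in SURNAMES:
        return "SURNAME"
    return "OTHER"


def extract_names(text):
    """Apply a toy grammar: [TITLE] FIRST SURNAME [SUFFIX]."""
    tokens = re.findall(r"[A-Za-z]+\.?", text)  # alphabetic tokens only
    tags = [tag_token(t) for t in tokens]
    names = []
    i = 0
    while i < len(tokens):
        start = i
        parts = {}
        if tags[i] == "TITLE":
            parts["title"] = tokens[i]
            i += 1
        if i + 1 < len(tokens) and tags[i] == "FIRST" and tags[i + 1] == "SURNAME":
            parts["first"] = tokens[i]
            parts["surname"] = tokens[i + 1]
            i += 2
            if i < len(tokens) and tags[i] == "SUFFIX":
                parts["suffix"] = tokens[i]
                i += 1
            names.append(parts)
        else:
            i = start + 1  # no name starts here; move on one token
    return names


if __name__ == "__main__":
    # A table-like row with no sentence structure to lean on.
    row = "Invoice | Dr. Maria Garcia | 703.555.1212 | John Smith Jr."
    for name in extract_names(row):
        print(name)
```

Because the lookup works token by token, it does not depend on sentence structure, which is why this style of approach can cope with table cells and other fragments of text.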
2. Characterization of data structures improves data cleansing
From our work combining thousands of data sources, Novetta’s team has devised highly refined processes for automatically characterizing data structures to enhance cleansing results. These processes, built into Novetta Entity Analytics, automatically generalize data values into classes of letters, numbers and punctuation, create patterns from the classes, and generate histograms from the patterns to quickly identify whether values conform to expected norms. The results of data structure characterization are more uniform data and more accurate matching for person, organization or product names, as well as addresses, postal codes, phone numbers, dates of birth, social security numbers and other attributes that identify entities.
EXAMPLE: Structure characterization processes are applied to a data field containing state values, and each letter in each value is replaced with the letter L. Novetta Entity Analytics produces the pattern counts and histogram below from the data. The counts show that the vast majority of records contain two-character values that are most likely state-name abbreviations, but some records contain longer strings. Upon further investigation, users identify the longer strings as full state names and determine that state name-to-abbreviation mapping should be performed prior to matching to ensure the data is uniform.
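The pattern-and-histogram step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the letter L for letters follows the example above, while N for digits and P for punctuation are class symbols chosen here, and the sample state values are made up for the demonstration.

```python
from collections import Counter


def to_pattern(value):
    """Generalize a value into character classes.

    L = letter (as in the example above); N = digit and P = punctuation
    are symbols assumed for this sketch. Whitespace is kept as a space.
    """
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("L")
        elif ch.isdigit():
            out.append("N")
        elif ch.isspace():
            out.append(" ")
        else:
            out.append("P")
    return "".join(out)


def pattern_histogram(values):
    """Count how many values share each pattern."""
    return Counter(to_pattern(v) for v in values)


def print_histogram(counts):
    """Print patterns from most to least common with a simple text bar."""
    for pattern, count in counts.most_common():
        print(f"{pattern!r:>12} {count:>4} {'#' * count}")


if __name__ == "__main__":
    # Illustrative values only, not real customer data.
    states = ["VA", "MD", "DC", "VA", "Virginia", "NY", "Maryland"]
    print_histogram(pattern_histogram(states))
```

Rare patterns at the bottom of the printout are the ones worth inspecting. Here the eight-letter pattern would point to full state names that need mapping to abbreviations before matching.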
3. Characterization of data values identifies which records to use for resolution
Novetta’s team has become expert at identifying which data values, within a wide range of data attributes, should be used for entity resolution. This knowledge is built into the automated data value characterization processes in Novetta Entity Analytics to increase the accuracy of resolved entities. The software applies these processes to detect which specific values within a data field are appropriate or inappropriate for use in resolving entities.
EXAMPLE: A phone number is usually associated with one individual or a small number of individuals, but people who are trying to obscure their information may enter invalid or incorrect data, such as a directory assistance number (703.555.1212) or an invalid phone number (111.111.1111). Novetta Entity Analytics identifies inappropriate numbers by automatically counting the number of unique names associated with each unique phone number within a field and assigning a lower uniqueness value to numbers that have more names associated with them.
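The counting idea can be illustrated with a short Python sketch. The scoring formula here (one divided by the number of distinct names) is an assumption made for the example, since the actual uniqueness calculation in Novetta Entity Analytics is not described; the records and the `phone_uniqueness` function name are likewise hypothetical.

```python
from collections import defaultdict


def phone_uniqueness(records):
    """Return a uniqueness score per phone number.

    records: iterable of (name, phone) pairs.
    Score = 1 / (number of distinct names seen with that phone), an
    assumed formula: the more names share a number, the lower the score.
    """
    names_by_phone = defaultdict(set)
    for name, phone in records:
        # Light normalization so "John Smith" and "john smith " count once.
        names_by_phone[phone].add(name.strip().lower())
    return {phone: 1.0 / len(names) for phone, names in names_by_phone.items()}


if __name__ == "__main__":
    # Illustrative records only.
    records = [
        ("John Smith", "703.555.1212"),
        ("Maria Garcia", "703.555.1212"),
        ("Wei Chen", "703.555.1212"),
        ("Ahmed Hassan", "703.555.1212"),
        ("John Smith", "202.555.0143"),
        ("john smith ", "202.555.0143"),
    ]
    for phone, score in sorted(phone_uniqueness(records).items()):
        print(phone, round(score, 2))
```

A resolution step could then ignore, or give less weight to, phone numbers whose score falls below a chosen threshold, so that a shared directory assistance number does not merge unrelated people into one entity.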
Delivering automated data extraction and integration intelligence
The three examples above illustrate just some of the data integration and entity analytics expertise we have built into Novetta Entity Analytics to optimize entity resolution.
Stay tuned for part two of this blog series, where I will discuss the transformation, resolution strategy and conflict resolution knowledge we have also included.