EXAMPLE: For extracting people's names, Novetta has curated a list of millions of names from multiple countries and languages and developed grammars that automatically identify various name parts, such as first names, surnames, suffixes, titles or prefixes. Novetta Entity Analytics uses these curated lists and grammars to extract data useful for entity resolution from unstructured or semi-structured data sources, including text contained within tables, where natural language processing approaches typically fail.

2. Characterization of data structures improves data cleansing

From our work combining thousands of data sources, Novetta’s team has devised highly refined processes for automatically characterizing data structures to enhance cleansing results. These processes, built into Novetta Entity Analytics, automatically generalize data values into classes of letters, numbers and punctuation, create patterns from the classes, and generate histograms from the patterns to quickly identify whether values conform to expected norms. The result is more uniform data and more accurate matching for person, organization or product names, addresses, postal codes, phone numbers, dates of birth, social security numbers and other attributes that identify entities.
EXAMPLE: Structure characterization processes are applied to a data field containing state values, and the characters within each record are replaced with the letter L. Novetta Entity Analytics produces the pattern counts and histogram below from the data. The counts show that the vast majority of records contain two-character values that are most likely state-name abbreviations, but some records contain longer strings. Upon further investigation, users identify the longer strings as full state names and determine that state name-to-abbreviation mapping should be performed before matching so that the data is uniform.
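The state-field example can be approximated in a few lines of Python. This is a minimal sketch of the general technique, not Novetta Entity Analytics' implementation: the function names `generalize` and `pattern_histogram`, the use of `N` for digits, and the choice to keep spaces and punctuation as literal characters are all assumptions made for illustration.

```python
from collections import Counter


def generalize(value: str) -> str:
    """Replace each letter with 'L' and each digit with 'N'.

    Spaces and punctuation are kept as-is (an assumption of this sketch).
    """
    classes = []
    for ch in value:
        if ch.isalpha():
            classes.append("L")
        elif ch.isdigit():
            classes.append("N")
        else:
            classes.append(ch)
    return "".join(classes)


def pattern_histogram(values):
    """Count how many values share each generalized pattern."""
    return Counter(generalize(v) for v in values)


# Hypothetical sample of a "state" field with mixed conventions.
states = ["VA", "MD", "Virginia", "DC", "New York", "TX"]
hist = pattern_histogram(states)

# Print patterns from most to least common with a simple text histogram.
for pattern, count in hist.most_common():
    print(f"{pattern:<10} {count:>3} {'#' * count}")
```

Patterns that appear only a handful of times, such as the long runs of `L` produced by full state names, are the ones worth inspecting before matching.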
EXAMPLE: Phone numbers are usually associated with one or a small number of individuals, but when individuals are trying to obscure their information, they may enter invalid or incorrect data, such as a directory assistance number (703.555.1212) or an invalid phone number (111.111.1111). Novetta Entity Analytics identifies inappropriate numbers by automatically counting the number of unique names associated with each unique phone number within a field and assigning a lower uniqueness value to numbers that have more names associated with them.

Delivering automated data extraction and integration intelligence

The three examples above are just a sample of the data integration and entity analytics expertise we have built into Novetta Entity Analytics to optimize entity resolution. Stay tuned for part two of this blog series, where I will discuss the transformation, resolution strategy and conflict resolution knowledge we have also included.
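For readers who want to experiment with the phone-number example above, here is a minimal Python sketch of the idea. The scoring formula (one divided by the number of distinct names), the function name `phone_uniqueness`, and the sample records are assumptions for illustration only; the post does not describe how Novetta Entity Analytics actually computes its uniqueness value.

```python
from collections import defaultdict


def phone_uniqueness(records):
    """Return a uniqueness score per phone number.

    records: iterable of (name, phone) pairs.
    Score = 1 / (number of distinct names sharing the phone), an assumed
    formula: the more names share a number, the lower its score.
    """
    names_by_phone = defaultdict(set)
    for name, phone in records:
        # Light normalization so trivial variants count as one name.
        names_by_phone[phone].add(name.strip().lower())
    return {phone: 1.0 / len(names) for phone, names in names_by_phone.items()}


# Hypothetical records: two suspicious numbers and one ordinary one.
records = [
    ("Ann Lee", "703.555.1212"),
    ("Bob Ray", "703.555.1212"),
    ("Cy Dow", "703.555.1212"),
    ("Dee Fox", "111.111.1111"),
    ("Ed Gray", "111.111.1111"),
    ("Fay Hill", "202.555.0143"),
    ("fay hill ", "202.555.0143"),
]

scores = phone_uniqueness(records)
for phone, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{phone}  uniqueness={score:.2f}")
```

Numbers with low scores could then be down-weighted or excluded as matching evidence, since they say little about which individual a record belongs to.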