BACKGROUND
Named Entity Recognition (NER) and Entity Extraction are interchangeable terms that refer to the task of classifying “named entities” into pre-defined categories such as the names of persons, organizations, locations, etc.1
Let’s look at an example of how this actually works. Given a sample input:
John went to the Bank of America in Washington DC
An NER algorithm reads in this sentence and tags the words it recognizes as persons, organizations, locations, etc. The following represents the tagged version of the sentence:
[John]PERSON went to the [Bank of America]ORGANIZATION in [Washington DC]LOCATION
Such tagging is useful because it allows us to take large amounts of raw text (e.g. news articles) and automate the extraction of useful information, helping to organize large volumes of text. NER can also be used for social network analysis. Network graphs are an effective way of visually conveying information and are relevant to Novetta customers; interesting work has already been done analyzing social network graphs for shows like Game of Thrones.2
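For illustration, the following is a minimal sketch of how tags like those above could be produced with spaCy, one of the solutions evaluated below (it assumes the small English model en_core_web_sm has been downloaded; spaCy's own label names differ slightly from PER/ORG/LOC):

import spacy

# Load spaCy's small English model (assumes it has been installed via
# "python -m spacy download en_core_web_sm").
nlp = spacy.load("en_core_web_sm")

doc = nlp("John went to the Bank of America in Washington DC")

# Print each detected entity span with its label. Note that spaCy uses
# its own label set (e.g. GPE for geopolitical locations rather than LOC).
for ent in doc.ents:
    print(ent.text, ent.label_)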
RESEARCH METHODOLOGY
NER Solution Selection Criteria
For the purposes of this project, five open source NER solutions were selected based on the following criteria:
- Reproducible environment configurations
- Recognition of three entity types: person, location, organization
- Written in Python or accessible through a Python wrapper
The five open source Python solutions assessed were NLTK, Stanford NER Tagger, NeuroNER, spaCy, and AllenNLP. One commercial cloud solution, Amazon Comprehend, was also explored. The following table provides information on the open source solutions.
Testing Corpora
Solutions were tested against two corpora: the publicly-available WikiGold Standard Corpus and a set of NMA news articles.
All tokens (words and punctuation) from the WikiGold and NMA corpora were categorized into one of four groups: Person (PER), Organization (ORG), Location (LOC), and Other (a catch-all category for all non-PER, ORG, LOC terms). This analysis did not take IOB9 tagging into consideration (i.e. the location of a particular word within phrase boundaries).
The WikiGold Standard Corpus is an annotated corpus drawn from a small sample of Wikipedia articles (145 documents). The corpus is about 1,400 sentences long and is in CoNLL (IOB) format.10 It contains a total of 39,152 tokens and has the following tagging category breakdown:
The WikiGold corpus also has a “MISC” category for certain word categorizations which we mapped to “Other” for this evaluation.
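A minimal sketch of this collapsing step, assuming CoNLL-style IOB tags as input (the function name is illustrative and not part of any solution's API):

# Collapse CoNLL-style IOB tags (e.g. B-PER, I-ORG, O) into the four
# categories used in this evaluation: PER, ORG, LOC, and Other.
KEPT_TYPES = {"PER", "ORG", "LOC"}

def simplify_tag(iob_tag):
    if iob_tag == "O":
        return "Other"
    # Strip the B-/I- prefix and keep only the entity type.
    entity_type = iob_tag.split("-", 1)[-1]
    # Anything outside PER/ORG/LOC (e.g. WikiGold's MISC) maps to Other.
    return entity_type if entity_type in KEPT_TYPES else "Other"

print(simplify_tag("B-PER"))   # PER
print(simplify_tag("I-MISC"))  # Other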
A set of 10 NMA news articles was tagged manually. It contains a total of 7,963 tokens and has the following tagging category breakdown:
Evaluation Metrics
NER performance was measured in terms of Precision, Recall, F-measure and Time (seconds).11 Precision, Recall, and F-measure were calculated for the entire corpus and for each category (PER, ORG, and LOC). Since the large majority of the tokens in a corpus are tagged as “Other”, results are calculated both with and without the “Other” category to reduce distorting effects.
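A minimal sketch of how these metrics could be computed with scikit-learn (the gold and predicted label lists below are illustrative, not actual corpus data, and this is not any particular solution's scoring code):

from sklearn.metrics import precision_recall_fscore_support

# Illustrative token-level gold and predicted category sequences.
gold = ["PER", "Other", "ORG", "LOC", "Other", "Other"]
pred = ["PER", "Other", "LOC", "LOC", "Other", "PER"]

# Per-category precision, recall, and F-measure for PER, ORG, and LOC.
per_category = precision_recall_fscore_support(
    gold, pred, labels=["PER", "ORG", "LOC"], zero_division=0)

# Corpus-level scores with and without the dominant "Other" category.
with_other = precision_recall_fscore_support(
    gold, pred, labels=["PER", "ORG", "LOC", "Other"], average="micro", zero_division=0)
without_other = precision_recall_fscore_support(
    gold, pred, labels=["PER", "ORG", "LOC"], average="micro", zero_division=0)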
RESULTS
For graphical visualizations below, only the F-measure is represented, since it combines precision and recall.
General Tagging12
The first barometer used was a general “clumping” of the PER, ORG, and LOC categories. This helps us gauge a solution’s ability to correctly detect and identify entities of any type. If a given solution mistagged some words (e.g. tagged a word that is a LOC as an ORG), that specific miscategorization did not influence these scores.
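A minimal sketch of this clumping step (labels are illustrative): every entity tag is collapsed to a single label before scoring, so a LOC/ORG mix-up still counts as a correct detection.

from sklearn.metrics import f1_score

gold = ["PER", "Other", "ORG", "LOC", "Other"]
pred = ["PER", "Other", "LOC", "LOC", "PER"]

# Collapse all entity categories into a single "ENT" label.
def clump(tags):
    return ["ENT" if tag != "Other" else "Other" for tag in tags]

# Binary F-measure: was each token detected as an entity or not?
general_f = f1_score(clump(gold), clump(pred), pos_label="ENT")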
Results show that when entity type is ignored, the best solutions detect and identify ~90% of entities. Solutions here can be separated into two groups: Comprehend, ELMo, Stanford NER, and NeuroNER consistently perform better than spaCy and NLTK. This trend becomes more apparent in subsequent analyses.
Entity-Specific Tagging for WikiGold Corpus
The second barometer used was entity-specific results, measuring the degree to which solutions tagged entities correctly as PER, ORG, or LOC.
One interesting trend here (across all solutions) is that PER is the easiest category to tag and ORG is the hardest. The WikiGold corpus is an amalgam of arbitrary Wikipedia articles covering a wide breadth of topics, so this trend is likely not a coincidence and intuitively makes sense: organization names are easier to miss or miscategorize than person names.
Breaking the analysis down by individual solution, Amazon Comprehend's performance is the most robust. ELMo, Stanford NER, and NeuroNER follow closely behind in each category and thus overall (the arithmetic average of the three categories).
The only tag that deviates slightly from this trend is PER: here ELMo edges out Comprehend, and Stanford NER and NeuroNER also post high F-measure scores.
Entity-Specific Tagging for NMA Articles
NMA results are largely consistent with those of the WikiGold corpus. Stanford NER and ELMo performance fluctuates within individual categories, but overall both perform well, with close to 90% accuracy.
For the NMA corpus, Amazon Comprehend's tagging in the PER category appears to lag that of some other solutions. An analysis of Comprehend's tagging methodology explains this performance discrepancy: Comprehend tags titles such as Pope, Prime Minister, and President as part of a person, but in the source data titles are generally not included as part of a person. As a result, Comprehend's recall was very high: almost every person in the corpus was tagged, so there were very few false negatives. However, Comprehend's precision was low because the extra title tokens created many false positives. Since F-measure is the harmonic mean of precision and recall, the observed results make sense. Because the other solutions do not consider a title to be part of a PER tag, their performance was not affected in this way.
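To illustrate with hypothetical numbers (not Comprehend's actual scores): a solution with recall of 0.95 but precision of 0.60 has an F-measure of 2 × (0.60 × 0.95) / (0.60 + 0.95) ≈ 0.74, so very high recall cannot fully offset weak precision.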
There is also an explanation for Comprehend's performance in the ORG and LOC categories. Comprehend tags many countries and cities (such as Iran, Israel, United States, Tehran, Bushehr, and Russia) as ORG instead of LOC. To enable direct performance comparison, the ground-truth annotations tag all countries and cities as LOC, consistent with the manner in which the five open source solutions have been trained. This discrepancy in tagging explains why Comprehend's LOC metrics are weak (it misses several countries) and why its ORG recall score is very high (there are very few ORG false negatives).
For reference, the following are the detailed Comprehend results:
The NMA results must be taken with a grain of salt, as the NMA corpus is roughly 20% of the size of the WikiGold corpus. Additionally, the NMA articles are more niche in nature than the WikiGold articles, so anomalous tags are more likely to skew performance.
Computational Speed
All solutions were timed to determine how long each took to tag these corpora. Solutions performed within seconds of each other, with the exception of NeuroNER and ELMo. It is hypothesized that ELMo is slow because it downloads its models from AWS every time it runs. NeuroNER computes significant additional output that was unnecessary for our testing purposes but may be of value for other use cases.
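The timings were straightforward wall-clock measurements; a minimal sketch of the approach, where tag_document stands in for whichever solution is being timed:

import time

def time_solution(tag_document, documents):
    # Wall-clock time for a solution to tag an entire corpus.
    start = time.perf_counter()
    for document in documents:
        tag_document(document)
    return time.perf_counter() - start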
ADDITIONAL CONSIDERATIONS
We can build on these results to consider additional factors that may drive implementation decisions.
Amazon Comprehend is a robust solution for NER when utilization of cloud services is possible. Comprehend requires minimal configuration (it was the easiest solution to set up and use) and can be used via a GUI or a simple API call. Additionally, its models are maintained and continually trained by AWS.
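A minimal sketch of that API call using boto3 (assuming AWS credentials and a default region are already configured):

import boto3

# Comprehend's DetectEntities call returns each entity's text, type
# (PERSON, ORGANIZATION, LOCATION, etc.), offsets, and a confidence score.
comprehend = boto3.client("comprehend")
response = comprehend.detect_entities(
    Text="John went to the Bank of America in Washington DC",
    LanguageCode="en")

for entity in response["Entities"]:
    print(entity["Text"], entity["Type"], round(entity["Score"], 2))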
If cloud services are not available in a particular customer environment, AllenNLP ELMo is a good open source option for NER. Though Stanford NER performed better than ELMo in some instances, ELMo has fewer licensing restrictions.
CONCLUSIONS
NER is a dynamic field with a range of high-performing solutions available for system developers. Our results show that Amazon Comprehend, while a relative newcomer to the space, delivers robust capabilities. The use of NER should reduce reliance on manual, analyst-driven dataset tagging, freeing up resources for higher-value services.
Related blog:
Named Entity Recognition and Graph Visualization
1 https://en.wikipedia.org/wiki/Named-entity_recognition
2 https://networkofthrones.wordpress.com/
3 https://www.nltk.org/book/ch07.html
4 https://nlp.stanford.edu/software/CRF-NER.html
5 http://neuroner.com
6 https://spacy.io/models/en#en_core_web_sm
7 https://allennlp.org/elmo
8 http://www.statmt.org/lm-benchmark/
9 IOB or Inside-outside-beginning (tagging) is a method used to categorize words in a chunking task. https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
10 http://www.aclweb.org/anthology/W16-2703
11 https://en.wikipedia.org/wiki/Precision_and_recall
12 General Tagging refers to a binary measure where every word in a corpus is either tagged (as PER, ORG, or LOC) or categorized as Other. This metric is used to gauge a solution’s overall ability to detect entities in a text (not necessarily to classify them correctly). This is different from the “Overall” category seen in “Entity-Specific Tagging for WikiGold Corpus” and “Entity-Specific Tagging for NMA Articles.” In those instances, the “Overall” category is simply the average of the PER, ORG, and LOC F-measures.