PHASE I RECAP
We are exploring the problem space of Named Entity Recognition (NER): processing unannotated text and extracting people, locations, and organizations. A previous post described our comparative performance evaluation of several open source and commercial NER libraries. Based upon this evaluation we determined that Amazon Comprehend (an AWS solution) performed robustly against test corpora, while the top open source solutions for the NER task were AllenNLP ELMo, Stanford NER Tagger, and NeuroNER.
This blog post picks up where the previous post left off, building on evaluation results to explore a practical application of NER.
To facilitate this exploration, we built an Entity Extraction Visualization Tool capable of (1) processing raw text or a news article URL and (2) outputting a network graph visualization of the data.
Based on the results from the evaluation phase of this project, we decided to work with Amazon Comprehend as the NER engine – its models are continually maintained and trained by Amazon and it requires minimal configuration.
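As a rough illustration of what calling Comprehend from Python can look like, the sketch below uses the boto3 `detect_entities` call and keeps only the three entity types this project focuses on. The helper names (`fetch_entities`, `keep_target_entities`), the region, and the 0.8 score cutoff are assumptions made for the example, not details of the tool itself. Running `fetch_entities` requires boto3 and AWS credentials.

```python
# Entity types this project cares about (Comprehend's own type labels).
TARGET_TYPES = {"PERSON", "LOCATION", "ORGANIZATION"}


def fetch_entities(text, region_name="us-east-1"):
    """Call Amazon Comprehend; needs boto3 and AWS credentials (region is an assumption)."""
    import boto3  # imported lazily so the filtering helper works without AWS access

    client = boto3.client("comprehend", region_name=region_name)
    response = client.detect_entities(Text=text, LanguageCode="en")
    return response["Entities"]


def keep_target_entities(entities, min_score=0.8):
    """Keep people, locations and organizations above a hypothetical score cutoff."""
    return [
        {"text": e["Text"], "type": e["Type"], "start": e["BeginOffset"]}
        for e in entities
        if e["Type"] in TARGET_TYPES and e["Score"] >= min_score
    ]


# Hand-written sample in the shape of a detect_entities response (not real output).
sample = [
    {"Text": "Trump", "Type": "PERSON", "Score": 0.99, "BeginOffset": 0, "EndOffset": 5},
    {"Text": "Wednesday", "Type": "DATE", "Score": 0.98, "BeginOffset": 10, "EndOffset": 19},
    {"Text": "E.U.", "Type": "ORGANIZATION", "Score": 0.55, "BeginOffset": 30, "EndOffset": 34},
]
filtered = keep_target_entities(sample)
```

In this made-up sample, only the high-confidence person survives the filter: the date is outside the target types, and the organization falls below the cutoff.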
Because the Phase I analysis was written in Python, we continued in that language. We created a web app with a frontend server built on Flask (a micro web framework written in Python). In addition, we used Cytoscape to render the graph displays on the web page.
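To give a feel for how a Python backend can hand graph data to Cytoscape, here is a minimal sketch that shapes entities and edges into the `nodes`/`edges` elements structure that Cytoscape.js accepts. The function name `to_cytoscape_elements`, the input shapes, and the `type` and `weight` data fields are assumptions for illustration rather than the tool's actual code.

```python
import json


def to_cytoscape_elements(entities, edges):
    """Build the {"nodes": [...], "edges": [...]} structure Cytoscape.js accepts.

    entities: dict mapping entity name -> entity type (e.g. "PERSON")
    edges:    dict mapping (name_a, name_b) -> edge weight
    """
    nodes = [
        {"data": {"id": name, "label": name, "type": etype}}
        for name, etype in sorted(entities.items())
    ]
    links = [
        {"data": {"id": f"{a}--{b}", "source": a, "target": b, "weight": w}}
        for (a, b), w in sorted(edges.items())
    ]
    return {"nodes": nodes, "edges": links}


# Made-up sample data for illustration only.
elements = to_cytoscape_elements(
    {"Trump": "PERSON", "White House": "ORGANIZATION"},
    {("Trump", "White House"): 4},
)
payload = json.dumps(elements)  # JSON string a Flask view could return to the page
```

Keeping the entity type and edge weight in each element's `data` lets the page-side stylesheet decide how to draw them, so the backend stays free of presentation details.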
ENTITY GRAPH ELEMENTS
Raw text can be visualized using network graphs. In these graphs, each node represents an entity detected in the text and is colored according to its categorization:
- Red = Person
- Blue = Location
- Green = Organization
The size of each node is proportional to its degree, that is, the number of connections it has to other nodes. The weight of an edge reflects how often two entities appear within a set distance of each other in the text. Based upon prior research and our own analyses, this distance should usually be between 10 and 15 words for best results.
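The edge-weight and node-size rules above can be sketched in a few lines of Python. This is a simplified illustration, not the tool's code: the `(word_index, entity_name)` input format, the function names, and the window of 12 words (picked from the 10 to 15 range) are all assumptions, and the pairwise loop is quadratic in the number of mentions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(mentions, window=12):
    """Count entity pairs whose mentions fall within `window` words of each other.

    mentions: list of (word_index, entity_name) tuples, one per detected mention
    returns:  Counter mapping a sorted (name_a, name_b) pair -> co-occurrence count
    """
    weights = Counter()
    for (pos_a, name_a), (pos_b, name_b) in combinations(sorted(mentions), 2):
        if name_a != name_b and abs(pos_a - pos_b) <= window:
            weights[tuple(sorted((name_a, name_b)))] += 1
    return weights


def node_degrees(weights):
    """Degree of a node = number of distinct entities it shares an edge with."""
    degrees = Counter()
    for name_a, name_b in weights:
        degrees[name_a] += 1
        degrees[name_b] += 1
    return degrees


# Made-up mention positions (word offsets) for illustration only.
mentions = [
    (3, "Trump"), (8, "White House"), (40, "Trump"),
    (45, "White House"), (50, "China"), (120, "Juncker"),
]
weights = cooccurrence_edges(mentions)
degrees = node_degrees(weights)
```

In this toy input, "Trump" and "White House" co-occur twice and so get the heaviest edge, while "Juncker" sits too far from every other mention to gain any edge at all.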
EXPLORING REAL-WORLD USE CASES
Uses of NER-based graphs can be illustrated through a real-world example. The Washington Post published an article on 25 July 2018 about President Trump’s announcement on trade and tariff tensions with the E.U.1 It mentions a meeting between Trump and European Commission President Jean-Claude Juncker, indicating that the two had agreed to hold off on proposed car tariffs, work to resolve their dispute on steel and aluminum tariffs, and pursue a bilateral trade deal.
Running this article through the Entity Extraction Visualization Tool produces the graph below.
Understand Key Entities
At first glance, this visualization conveys an overwhelming amount of information. But that also means there’s a lot of good information to extract and interpret.
Seeking out the largest nodes and thickest edges in Figure 1.1 makes the most important players and connections in an article or body of text pop out. Here, “Trump” is clearly the largest node, indicating that he is the best-connected entity. Other nodes that stand out include the White House, the E.U., and the U.S.
Interestingly, while many people were mentioned (shown as red nodes), the sentences they appeared in often lacked other entities. As a result, red nodes are less connected. In contrast, countries are frequently mentioned in lists. Because the listed countries fall within the 10-15 word distance of one another, they appear as larger, interconnected nodes.
Aside from node size, information can be gleaned from edge weights and the nodes they are connecting.
It’s evident (and intuitive) that Trump and the White House have the strongest connection, as shown by the thick edge connecting the two nodes. Another notable connection is between Trump and China. Though the article is about U.S. and E.U. trade relations, the graph shows that China also plays a significant role in steel and aluminum trade agreements. Additionally, the links between Trump and Stephen Moore (his top economic adviser during the 2016 campaign) and between Juncker and the U.S. are prominent.
Another dimension reflected in this type of visualization is the nature of communities that entities form. In network analysis, a community is a subset of the network that forms a self-contained and coherent sub-network.2
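For readers who want to surface candidate communities programmatically, one deliberately crude approach is to drop weak edges and then take connected components. To be clear, this is a stand-in we are using for illustration, not a proper community detection algorithm and not necessarily how the groups in our figures were identified. The function name, the `min_weight` threshold, and the sample weights are all hypothetical.

```python
from collections import defaultdict


def components(edges, min_weight=1):
    """Group nodes into connected components, ignoring edges lighter than min_weight."""
    neighbors = defaultdict(set)
    for (a, b), weight in edges.items():
        if weight >= min_weight:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk over everything reachable from start
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbors[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Made-up edge weights for illustration only.
edges = {
    ("Nissan", "Toyota"): 3,
    ("Toyota", "Honda"): 2,
    ("Mercedes", "BMW"): 2,
    ("Honda", "BMW"): 1,
}
groups = components(edges, min_weight=2)
```

With the weak Honda-BMW edge filtered out, the toy graph splits into two groups: one containing BMW and Mercedes, the other containing Honda, Nissan, and Toyota.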
Looking at the community of organizations in Figure 1.3 (and based on our contextual understanding), we can derive first-order insights. Nissan, Toyota, Subaru, and Honda are Japanese automobile manufacturers with manufacturing locations in the United States. The same applies to the relationship between Mercedes, BMW, and the “European” entity. These communities and their “locations” add a layer of complexity to the tariff discussion – a phenomenon that we can now grasp without having read the article.
Based on the structure and syntax of this type of article, fewer insights can be drawn from communities of individuals.
We can see that AshLee Strong (a spokeswoman for Paul Ryan) was likely quoted speaking on Paul Ryan’s behalf. We can also see that Paul Ryan is the Speaker of the House and is from Wisconsin.
Assess Large Texts
Larger texts can also be analyzed using the Entity Extraction Visualization Tool. Take for example a New Yorker piece about the Saudi Arabian Crown Prince Mohammed bin Salman (widely known as M.B.S.) and his response to a tweet published by Canada’s Foreign Minister, Chrystia Freeland.3 The article, which is roughly 15,000 words long, explores human rights policies in Saudi Arabia and other interactions M.B.S. has had with the individuals who lead those nations.
Here, the primary entities are clear: Saudi Arabia, M.B.S., Yemen, and Canada. However, others in the text also make an appearance, including the U.N., President Trump, and Raif Badawi. Thus, we can determine the piece’s primary subjects and emphasis without ever reading it.
To improve the fidelity and relevance of graph outputs, future iterations of the Entity Extraction Visualization Tool will benefit from the following enhancements.
Entity Resolution is the task of disambiguating manifestations of real world entities across records by linking and grouping.4 For example, if separate nodes are generated for Barack Obama, President Obama, and Obama, those should all resolve to one node, because they all represent the same person. For the graph considered in this post, entity resolution can improve the handling of country names and individuals.
As seen above in Figure 2.1, the U.S. and United States are considered different entities.
This issue is slightly complicated by the fact that Amazon Comprehend tags a country as a location or an organization depending on the context. In the sentence “Trump went to the U.S.”, the U.S. would be considered a location. In the sentence “The U.S. fought alongside allies in the war,” the U.S. is considered an organization. In such cases, some analytical judgment may be needed to decide how the entity should be treated.
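A naive first pass at entity resolution can be sketched with a hand-maintained alias table. This is only an illustration of the idea under simple assumptions, not a planned design: the `ALIASES` table, the function names, and the sample counts are hypothetical, and real entity resolution usually needs far more than string lookups. The entity type is intentionally ignored when merging, so that location and organization tags for the same country collapse into one node.

```python
from collections import defaultdict

# Hypothetical hand-maintained alias table; a real system would need a far larger one.
ALIASES = {
    "u.s.": "United States",
    "us": "United States",
    "united states": "United States",
    "president obama": "Barack Obama",
    "obama": "Barack Obama",
}


def canonical_name(name):
    """Map a surface form to its canonical name, falling back to the name itself."""
    key = " ".join(name.lower().split())
    return ALIASES.get(key, name)


def merge_mentions(mention_counts):
    """Merge counts for surface forms that resolve to the same canonical entity.

    mention_counts: dict mapping (surface_form, entity_type) -> count
    The entity type is dropped on purpose so LOCATION/ORGANIZATION tags merge.
    """
    merged = defaultdict(int)
    for (surface, _etype), count in mention_counts.items():
        merged[canonical_name(surface)] += count
    return dict(merged)


# Made-up counts for illustration only.
counts = {
    ("U.S.", "LOCATION"): 4,
    ("United States", "ORGANIZATION"): 2,
    ("Obama", "PERSON"): 3,
    ("Barack Obama", "PERSON"): 1,
    ("Juncker", "PERSON"): 5,
}
resolved = merge_mentions(counts)
```

Here the two spellings of the United States merge into a single entry, as do the two forms of Obama, while names missing from the table pass through unchanged.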
Currently, degree centrality is the only centrality measure taken into consideration. It is depicted as node size, which is proportional to a node’s degree. Other commonly assessed measures include weighted degree, eigenvector centrality, PageRank, and betweenness. In addition to showing how well connected a graph is, these measures can represent power within a network and the degree to which connected nodes take advantage of each other.
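To make two of these measures concrete, the sketch below computes weighted degree and a weighted PageRank by power iteration over an undirected co-occurrence graph. It is a minimal illustration under stated assumptions (positive edge weights, every node has at least one edge, a damping factor of 0.85, a fixed 50 iterations), and the function names and sample weights are made up for the example. A graph library would be the more practical choice in a real tool.

```python
from collections import defaultdict


def weighted_degree(edges):
    """Sum of the weights of all edges touching each node."""
    totals = defaultdict(float)
    for (a, b), w in edges.items():
        totals[a] += w
        totals[b] += w
    return dict(totals)


def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration, treating each undirected edge as two-way."""
    out = defaultdict(dict)
    for (a, b), w in edges.items():
        out[a][b] = w
        out[b][a] = w
    nodes = sorted(out)
    strength = {n: sum(out[n].values()) for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share of rank.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # A node passes its rank to neighbors in proportion to edge weight.
            for neighbor, w in out[n].items():
                new_rank[neighbor] += damping * rank[n] * w / strength[n]
        rank = new_rank
    return rank


# Made-up edge weights for illustration only.
edges = {
    ("Trump", "White House"): 4,
    ("Trump", "China"): 2,
    ("Trump", "Juncker"): 1,
    ("Juncker", "U.S."): 1,
}
strengths = weighted_degree(edges)
ranks = pagerank(edges)
top_node = max(ranks, key=ranks.get)
```

Unlike plain degree, both measures take edge weights into account, so a node with a few very strong connections can outrank one with many weak ones.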
It’s clear that network graph visualizations can serve many purposes when trying to understand raw text. They can automate parts of workflows that require understanding the key people, locations, and organizations. They can also provide insights into relationships between entities and overarching topics that may not be intuitive based on a cursory read of an article.