This post describes a Novetta Machine Learning Center of Excellence effort to develop a deep learning model that predicts customer-relevant metadata tags for unstructured text.
BACKGROUND: ENRICHMENT OF UNSTRUCTURED TEXT DATA
The rise of the internet has led to a faster flow of information, where news posted to a relatively obscure blog can be shared on social media and then reach national publications within hours. The volume is such that humans alone cannot filter out noise, identify important new information, and determine how messaging trends are changing over time.
To address this challenge, Novetta Mission Analytics (NMA) helps customers analyze open source and social media information by enriching unstructured text with metadata such as topic and sentiment. We have historically relied on trained human scorers to enrich (i.e. “tag”) this data to ensure quality and accuracy. The NMA enrichment process includes identifying primary messages, more granular submessages, and sentiment for quotes, which are attributable strings of text pulled from articles.
NMA analysts work with customers to define a taxonomy for primary messages and submessages for a given dataset, as described in the table below.
| Tag | Description | # of Possible Labels | Examples |
| --- | --- | --- | --- |
| Primary | Broad category, specific to customer’s mission | 6 – 20 | International Security, Russia Domestic Security, Regional Security |
| Sub | Fine-grained category | 30 – 100+ | Military Exercises, NATO Military Deployment, Russia Military Deployment |
To illustrate how enrichment works, let’s look at the following quote:
“North Korea has moved at least one missile with ‘considerable range’ to its east coast, possibly the untested Musudan missile, believed to have a range of 1,800 miles.”
This quote was human-tagged with a primary message of Proliferation and a submessage of State Level Nuclear Programs.
As you might imagine, this enrichment process – while invaluable to NMA customers – is laborious and time-consuming, with an ever-growing corpus of nearly 10 million quotes. The ability to complement human experts with machine-classified NMA content would be incredibly powerful, allowing experts to focus on nuanced analysis and insight generation.
ULMFiT AND FAST.AI
When the text classification improvements described in Universal Language Model Fine-Tuning for Text Classification (ULMFiT)1 were announced, we began exploring whether a model trained with fastai could automate the NMA enrichment process described above.
fastai is a state-of-the-art deep learning framework which allows users to quickly build models for a range of tasks, from object detection to text classification. This blog post will focus on the text classification capabilities of fastai, specifically the fastai.text module, which allows users to implement ULMFiT on their own text.
While only released in May 2018, ULMFiT has started to gain traction in the machine learning community. We agree with Jeremy Howard’s position that this approach, also dubbed Transfer Learning for NLP, will be a pivotal moment in the application of machine learning to language-based tasks – much as transfer learning has accelerated development in the image domain. As shown in the ULMFiT paper, fast.ai’s pre-trained language model approach greatly surpasses previous approaches based on word vectors.
Inspired by this, in July 2018 Novetta’s Machine Learning Center of Excellence launched a new project, NLP Transfer Learning on SageMaker. The objective of this effort was to develop a model that predicts primary and submessage tags for a quote.
Classification performance of our model was evaluated on a representative sample NMA dataset, referred to as Europe.
The Europe dataset consists of 266,000 quotes from 55,000 articles. The dataset has 8 primary message labels and 121 submessage labels (not evenly distributed).
This real-world data presents a much more challenging text classification task than typically found in the academic literature. For example, in the ULMFiT paper, the IMDB classification task has 2 possible labels and the Yelp dataset has 5. We are attempting to develop models with over an order of magnitude more possible labels from which to choose.
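The uneven label distribution mentioned above is easy to inspect with pandas. A minimal sketch, using a small hypothetical quotes table (the real NMA column names and labels may differ):

```python
import pandas as pd

# Hypothetical quotes table; real NMA columns and labels may differ.
quotes = pd.DataFrame({
    "primary": (["International Security"] * 5
                + ["Regional Security"] * 2
                + ["Proliferation"] * 1),
})

# Count how many quotes carry each primary message label,
# sorted from most to least frequent.
counts = quotes["primary"].value_counts()
print(counts)
```

With over a hundred submessage labels, a skew like this means some labels have very few training examples, which is part of what makes the task harder than the balanced academic benchmarks.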
TOOLS USED IN MODEL DEVELOPMENT
As an AWS Machine Learning Competency partner, Novetta has found a wide range of applications for Amazon SageMaker. SageMaker allows users to quickly get started with a self-contained, cloud-based environment. It provides the ability to switch environments to suit cost and computation needs, and the familiar Jupyter notebook serves as its UI. Coincidentally, just before the ULMFiT paper was released, AWS authored a blog post detailing how to run fastai on SageMaker.
fastai allows users to achieve state-of-the-art results on English-language text in a relatively short amount of time. The baseline language model, provided by Jeremy Howard through the fast.ai site, is a great starting point. Additionally, there is sufficient documentation and an active community for support. The ULMFiT model implemented in fastai has outperformed previous state-of-the-art2 methods on text classification, reducing error by 18-24%.
Because fastai is well-designed and efficient, we only needed to use four Python libraries to aid with building the model:
- pandas for data reading, writing, and management
- spaCy for word vectorization
- boto3 for Amazon Web Services interfacing, and
- fastai.text for the bulk of the model.
The NLP Transfer Learning on SageMaker technical backbone is adapted from the ULMFiT paper, whose universal language model was pretrained on the Wikipedia corpus. We fine-tuned this model to capture the syntactic nuances within the NMA datasets. The fine-tuned language model is then used to train the metadata tag classifier, as shown in the figure below.
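The two-stage workflow can be sketched with the fastai v1 text API. This is an illustrative outline rather than our production code: the file name, column names, and number of training epochs are hypothetical, and the fastai API has changed across versions.

```python
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

# Stage 1: fine-tune the Wikipedia-pretrained language model on the quotes
# so it learns the syntactic nuances of the NMA data.
data_lm = TextLMDataBunch.from_csv(path, "quotes.csv", text_cols="quote")
lm = language_model_learner(data_lm, AWD_LSTM, pretrained=True)
lm.fit_one_cycle(1)
lm.save_encoder("fine_tuned_encoder")

# Stage 2: train the metadata tag classifier on top of the fine-tuned encoder,
# reusing the language model's vocabulary.
data_clas = TextClasDataBunch.from_csv(path, "quotes.csv",
                                       text_cols="quote", label_cols="primary",
                                       vocab=data_lm.vocab)
clf = text_classifier_learner(data_clas, AWD_LSTM)
clf.load_encoder("fine_tuned_encoder")
clf.fit_one_cycle(1)
```

The key design point is that the encoder weights learned in stage 1 are carried into stage 2, so the classifier starts from a representation already adapted to the domain rather than from scratch.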
The data for training and validation was split based on time to reflect how our model will be deployed in production. Models will be trained on previously tagged quotes and then used to predict tags on future quotes. We also thought that quotes from an article may be related, so quotes from a single article were kept entirely within one training or validation set.
The data was split chronologically: the first 80% was used for language model training and the next 17% for language model validation. All of this language model data (97% of the total) was then used as training data for the classifier, while the remaining 3%, representing the last 2 weeks of articles, was held out as validation for the classifier.
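The chronological, article-grouped split described above can be sketched with pandas. Column names and the toy data below are hypothetical; the idea is to order articles by time and assign each article wholly to one partition, so related quotes never straddle the train/validation boundary.

```python
import pandas as pd

# Hypothetical quotes table: each quote belongs to an article with a publish date.
quotes = pd.DataFrame({
    "article_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "date": pd.to_datetime(
        ["2018-01-01"] * 2 + ["2018-02-01"] * 2 + ["2018-03-01"] * 2
        + ["2018-04-01"] * 2 + ["2018-05-01"] * 2),
    "quote": [f"quote {i}" for i in range(10)],
})

# Order articles by time, then split on whole articles (80/20 here),
# so the validation set is strictly later than the training set.
articles = quotes.drop_duplicates("article_id").sort_values("date")["article_id"]
cutoff = int(len(articles) * 0.8)
train_ids = set(articles.iloc[:cutoff])

train = quotes[quotes["article_id"].isin(train_ids)]
valid = quotes[~quotes["article_id"].isin(train_ids)]
```

Splitting on article IDs rather than individual quotes prevents leakage between related quotes, and the time ordering mirrors the production setting of training on past data and predicting on future data.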
Metadata Tagging performance is measured in terms of classification accuracy within the top k specified primary messages or submessages. We chose Top 1 and Top 3 (cumulative) accuracies, though one could choose any Top N that makes sense for their application. Top 3 accuracy means that the correct label appeared within the 3 strongest predictions from the model. This reflects the use case of NMA, in which human coders would expect to be presented with the 3 highest-scoring options to increase tagging efficiency.
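Top-k accuracy as described can be computed with a short helper. A generic sketch; the label scores below are hypothetical model outputs, not real NMA predictions:

```python
def top_k_accuracy(scores, labels, k=3):
    """Fraction of examples whose true label is among the k highest-scoring
    predictions.

    scores: list of {label: score} dicts, one per example.
    labels: list of true labels, aligned with scores.
    """
    hits = 0
    for example_scores, true_label in zip(scores, labels):
        # Labels ranked by model score, strongest first.
        top_k = sorted(example_scores, key=example_scores.get, reverse=True)[:k]
        if true_label in top_k:
            hits += 1
    return hits / len(labels)

# Two examples: the first is a Top 1 hit, the second only a Top 3 hit.
scores = [
    {"Proliferation": 0.7, "Regional Security": 0.2, "Russia Domestic Security": 0.1},
    {"Proliferation": 0.5, "Regional Security": 0.3, "Russia Domestic Security": 0.2},
]
labels = ["Proliferation", "Russia Domestic Security"]
```

Here `top_k_accuracy(scores, labels, k=1)` is 0.5 while `k=3` gives 1.0, matching the intuition that Top 3 accuracy is always at least as high as Top 1.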
Results are calculated for the validation set of quotes from the last 2 weeks of articles. Considering the difficulty of the test data, initial NLP Transfer Learning on SageMaker results are quite promising. The Top 1 accuracy for primary message was 77% and the Top 3 accuracy for submessage was 75%.
| Dataset | # of Primary Labels | Primary Accuracy (Top 1 / Top 3) | # of Sub Labels | Sub Accuracy (Top 1 / Top 3) |
| --- | --- | --- | --- | --- |
| Europe | 8 | 77% / 95% | 121 | 51% / 75% |
This performance approximates human performance on the data, but at improved speed and scale with much lower cost. As expected, accuracy is much higher when the model has fewer labels to choose from.
While we did not set out to optimize computational speed, our use of the SageMaker environment allows us to easily swap out the backend instance type to increase processing power and memory. The available instances range from the standard t2 to the powerful p3 GPU instances. This flexibility allows users to set up and test on smaller instances and then run a production model build on a larger instance.
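Besides the SageMaker console, the instance swap can be scripted with boto3. A sketch, assuming a hypothetical notebook instance name; the instance must be stopped before its type can be changed, and AWS credentials with SageMaker permissions are required:

```python
import boto3

sm = boto3.client("sagemaker")

# The notebook instance must reach the 'Stopped' state before updating.
sm.stop_notebook_instance(NotebookInstanceName="nma-ulmfit")
sm.get_waiter("notebook_instance_stopped").wait(NotebookInstanceName="nma-ulmfit")

# Swap from a small dev instance to a GPU instance for the production build.
sm.update_notebook_instance(NotebookInstanceName="nma-ulmfit",
                            InstanceType="ml.p3.2xlarge")
sm.start_notebook_instance(NotebookInstanceName="nma-ulmfit")
```

Reversing the swap afterwards (back to a small instance) keeps costs down once the heavy training run is finished.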
NEXT STEPS: MIGRATING TO PRODUCTION, DEVELOPING FOREIGN LANGUAGE MODELS
Next up for the team will be deploying our custom models in a live, production environment using SageMaker to make inferences as new text comes in. We will explore both the real-time and Batch Transform options to see what meets the needs of our customers. Like the fast.ai team, we are excited by the promise of ULMFiT and look forward to trying it out in new subject areas.
We also plan to explore the development of non-English language models. ULMFiT (on which our model was based) was initially released only with a pretrained language model for English-language text. While pretrained models are not widely available for other languages, the ULMFiT approach can be applied to any language and topic – this is already happening in the fast.ai community. Given that NMA tags data from a range of languages, including Russian, Arabic, and Spanish, our NMA analysts are interested in this capability.
Using fastai on Amazon SageMaker, we achieved state-of-the-art results on a real-world unstructured text classification task. In addition to the potential benefits that NLP Transfer Learning on SageMaker presents to our analysts and customers, we are optimistic about the applicability of similar approaches for companies, government agencies, and researchers interested in adding structure to large quantities of unstructured text.
Read more about Novetta Machine Learning on Amazon SageMaker.
1 Howard, Jeremy, and Sebastian Ruder. “Universal Language Model Fine-Tuning for Text Classification.” 2018, http://arxiv.org/pdf/1801.06146.pdf
2 Merity, Stephen, et al. “Regularizing and Optimizing LSTM Language Models.” 2017, http://arxiv.org/pdf/1708.02182.pdf