Natural Language Processing (NLP) has advanced significantly since 2018, when ULMFiT and Google’s BERT language model approached human-level performance on a range of tasks. Since then, several models, such as XLM, GPT-2, XLNet, and ALBERT, have been released in quick succession, each improving on its predecessors. While these state-of-the-art models can perform at a human level on certain language tasks over large volumes of unstructured text, knowing which model to use, when to use it, and how to use it can be a challenge.
At Novetta, we explored what it would take to streamline the implementation of state-of-the-art models for different NLP tasks, allowing for quick use by practitioners.
We developed an open-source framework, AdaptNLP, that lowers the barrier to entry for practitioners to use these advanced capabilities. AdaptNLP is built atop two open-source libraries: Transformers (from Hugging Face) and Flair (from Zalando Research). AdaptNLP enables users to fine-tune language models for text classification, question answering, entity extraction, and part-of-speech tagging.
To show how AdaptNLP can be put to use, we will address a Question Answering (QA) task using BERT. This task automates the answering of questions, posed by humans, against a corpus of text.
Using AdaptNLP starts with a Python pip install.
pip install adaptnlp
First, we import EasyQuestionAnswering, which abstracts transformer-based QA tasks into their most basic components.
from adaptnlp import EasyQuestionAnswering
We can now frame our question as a simple string. The context variable holds the source text that we want to search through for an answer. Because a question may have multiple valid answers, we specify how many results to return using top_n.
## Example Query and Context
query = "What does Novetta do?"
context = "Novetta pioneers disruptive technologies in data analytics, full-spectrum cyber, media analytics, and multi-INT fusion. Novetta brings actionable insights to your most complex data challenges. We enable customers to find clarity from the noisy complexity of big data at the speed and scale of the most intensive national security missions."
top_n = 5
We now call predict_qa(), which defaults to a pre-trained BERT-based QA model, passing it the question, the context, and the number of answers we would like to see. Each result contains the text that the model believes to be the answer, a probability score, and the start and end positions of that answer within the original context.
## Load the QA module and run inference on results
qa = EasyQuestionAnswering()
best_answer, best_n_answers = qa.predict_qa(query=query, context=context, n_best_size=top_n)
We can now look at best_answer to see the most relevant result, or at best_n_answers to see the top-ranked answers, up to the number that we previously specified.
## Output top answer as well as top 5 answers
print("Best Answer:\n", best_answer)
print("Best n Answers:\n", best_n_answers)

Best Answer:
'brings actionable insights to your most complex data challenges'
Best n Answers:
[OrderedDict([('text', 'brings actionable insights to your most complex data challenges'), ('probability', 0.5482685518182449), ('start_index', 15), ('end_index', 23)]),
 OrderedDict([('text', 'pioneers disruptive technologies'), ('probability', 0.11097169321630729), ('start_index', 1), ('end_index', 3)]),
 OrderedDict([('text', 'brings actionable insights to your most complex data challenges. We enable customers to find clarity from the noisy complexity of big data'), ('probability', 0.07600267159691482), ('start_index', 15), ('end_index', 36)]),
 OrderedDict([('text', ''), ('probability', 2.8758867595942603e-08), ('start_index', 0), ('end_index', 0)])]
Note: We have limited the example output to three results for brevity and to demonstrate variety.
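In the example output above, start_index and end_index appear to be inclusive positions of whitespace-separated words in the context rather than character offsets. This is an inference from the printed results, not documented behavior, so verify it against your installed version. The helper name span_from_word_indices below is hypothetical; it is a minimal sketch of how such indices could be mapped back to the context.

```python
def span_from_word_indices(context: str, start_index: int, end_index: int) -> str:
    """Return the whitespace-separated words from start_index to end_index, inclusive.

    Assumption: indices count whitespace-separated words, as the example
    output suggests. This is not a confirmed AdaptNLP guarantee.
    """
    words = context.split()
    return " ".join(words[start_index:end_index + 1])


context = (
    "Novetta pioneers disruptive technologies in data analytics, "
    "full-spectrum cyber, media analytics, and multi-INT fusion. "
    "Novetta brings actionable insights to your most complex data challenges. "
    "We enable customers to find clarity from the noisy complexity of big data "
    "at the speed and scale of the most intensive national security missions."
)

# Index pairs taken from the example output above
print(span_from_word_indices(context, 1, 3))    # pioneers disruptive technologies
print(span_from_word_indices(context, 15, 36))
```

A span recovered this way keeps any punctuation attached to its last word (for example, "challenges." for indices 15 to 23), whereas the answer text returned in the example omits it.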
These outputs can easily be integrated into user-built systems because they provide text-based metadata such as the extracted answer text, start/end indices, and confidence scores. By standardizing the input and output data and function calls, developers can use NLP algorithms in the same way regardless of which model is used in the back end. Before AdaptNLP, we would individually integrate each newly released model and its pre-trained weights, and then rebuild the NLP task pipeline around it. Because new NLP models were being released so rapidly, this process was both time-consuming and repetitive. To overcome this, AdaptNLP provides a streamlined process that can leverage new models in existing workflows without having to overhaul code.
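As a sketch of what such an integration might look like, the snippet below filters a list of answers shaped like the example output, dropping empty answers and those below a confidence threshold. The function name select_answers and the 0.1 threshold are assumptions made for illustration; they are not part of AdaptNLP.

```python
from collections import OrderedDict
from typing import Dict, List


def select_answers(answers: List[Dict], min_probability: float = 0.1) -> List[Dict]:
    """Keep non-empty answers whose probability meets the threshold.

    Hypothetical helper: the 0.1 default is an arbitrary illustrative cutoff.
    """
    return [
        dict(answer)
        for answer in answers
        if answer["text"] and answer["probability"] >= min_probability
    ]


# Values copied from the example output shown earlier
best_n_answers = [
    OrderedDict([("text", "brings actionable insights to your most complex data challenges"),
                 ("probability", 0.5482685518182449), ("start_index", 15), ("end_index", 23)]),
    OrderedDict([("text", "pioneers disruptive technologies"),
                 ("probability", 0.11097169321630729), ("start_index", 1), ("end_index", 3)]),
    OrderedDict([("text", "brings actionable insights to your most complex data challenges. "
                          "We enable customers to find clarity from the noisy complexity of big data"),
                 ("probability", 0.07600267159691482), ("start_index", 15), ("end_index", 36)]),
    OrderedDict([("text", ""), ("probability", 2.8758867595942603e-08),
                 ("start_index", 0), ("end_index", 0)]),
]

selected = select_answers(best_n_answers)
for answer in selected:
    print(f"{answer['probability']:.3f}  {answer['text']}")
```

Because the filter relies only on the text and probability keys, the same downstream code should keep working if a different model is swapped in behind predict_qa(), provided the output keeps this shape.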
Using the latest transformer embeddings, AdaptNLP makes it easy to fine-tune and train state-of-the-art token classification (NER, POS, Chunk, Frame Tagging), sentiment classification, and question-answering models. We delivered a hands-on workshop on using AdaptNLP with state-of-the-art models at the ODSC East Virtual Conference 2020.