Welcome to the February 2021 BASELINE, Novetta’s Machine Learning Newsletter, where we share thoughts on important advances in machine learning technologies likely to impact our customers. This month we cover the following topics:
- Using natural language processing to predict viral mutations
- Evaluating natural language generation
- A trillion-parameter AI language model from Google
- Estimating subtle human emotion from naturalistic images
Learning the Language of Viral Evolution and Escape
Identifying new viral mutations and accurately predicting whether a mutation can evade antibodies is important because such mutations can hinder vaccine development and efficacy. Researchers at MIT use similarities between natural language and the way viruses interact with the human immune system to train a bidirectional LSTM to read viral sequences as if they were text. In this framing, a mutation is treated like a word change in a sentence: the sentence stays grammatically correct, but its meaning may change. They trained their language model on the gene sequences of influenza, HIV, and SARS-CoV-2. The trained model ranks mutations by two quantities: the probability that the mutated sequence is still viable (the analogue of grammaticality) and the size of the semantic change the mutation causes. The model successfully predicted viral variants that were capable of evading antibodies. This crossover between natural language processing and the biomedical sciences lays a foundation for future modeling of sequence dynamics and demonstrates the promise of interdisciplinary approaches.
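To make the ranking idea concrete, here is a minimal sketch of how two per-mutation scores could be combined into one priority list. It assumes a simple rank-sum combination, and the mutation names and score values are hypothetical toy numbers, not output from the researchers' model.

```python
def rank_mutations(candidates):
    """Order candidate mutations so likely escape mutations come first.

    candidates maps a mutation name to (probability, semantic_change).
    Assumption: the two scores are combined by summing their ranks,
    so a mutation must do well on both to reach the top.
    """
    names = list(candidates)
    # Rank 0 is the lowest score; higher rank means a higher score.
    by_prob = sorted(names, key=lambda n: candidates[n][0])
    by_change = sorted(names, key=lambda n: candidates[n][1])
    prob_rank = {n: i for i, n in enumerate(by_prob)}
    change_rank = {n: i for i, n in enumerate(by_change)}
    return sorted(names, key=lambda n: prob_rank[n] + change_rank[n], reverse=True)


# Hypothetical scores: (probability of remaining viable, semantic change).
toy_scores = {
    "A": (0.30, 0.90),  # plausible and very different in meaning
    "B": (0.25, 0.20),  # plausible but little change
    "C": (0.02, 0.95),  # large change but unlikely to be viable
    "D": (0.01, 0.10),  # neither
}
ranking = rank_mutations(toy_scores)
print(ranking)
```

Summing ranks rather than raw scores avoids having to put the two quantities on a common scale, which is why it is used in this sketch.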
Evaluating Natural Language Generation
With the recent success of natural language generation (NLG) models like GPT-3, there is a growing demand for ways to evaluate the quality of synthetically generated text. Successful NLG systems aim to be multilingual, capable of producing diverse outputs, and robust to shifts in input data distribution. These goals make evaluation difficult and reveal the need for a living benchmark that evolves over time. A community of researchers from around the world developed GEM to fill this need. GEM is a multi-task benchmark containing 11 test datasets and tasks (at the time of writing) that “measure specific generation challenges, such as content selection and planning, surface realization, paraphrasing, simplification, and others.” The benefit of a living benchmark is that it continuously evolves to include a wider, more diverse set of corpora and more challenging evaluation tasks. This allows the benchmark to grow alongside advances in NLG and is an important step toward evaluating, understanding, developing, and broadening the application of NLG capabilities.
Switch Transformer and the Trillion-Parameter Language Model
The performance of natural language processing (NLP) models has been shown empirically to improve as the number of parameters grows. Traditionally, the downside of adding parameters is that models become more computationally demanding. To address the computational limitations of large models, researchers at Google Brain have open-sourced the Switch Transformer, a language model that scales up to 1.6 trillion parameters while also training up to 7x faster than the T5 language model. Even with the shorter training time, the model still achieves accuracy comparable to T5. The key to their breakthrough is the mixture-of-experts (MoE) paradigm. Instead of every input using the same set of parameters, the Switch Transformer selects a subset of the model's parameters for each input, creating a sparsely activated model. Their model mitigates some of the usual drawbacks of MoE, such as training instability and complexity, making this architecture a viable alternative to dense models. They have open-sourced the architecture but have not yet released the pre-trained weights of the trillion-parameter model, so researchers who want to use it will need to train from scratch on their own datasets. The Switch Transformer architecture was also shown to be a useful alternative for small-scale models that don’t require a supercomputer, which means that adopting it can help speed up training for machine learning practitioners building their own language models at any scale.
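The sparse activation described above can be sketched in a few lines. This is a minimal illustration, not the released implementation: it assumes a router that scores every expert with a linear map, sends each token to the single highest-scoring expert, and scales that expert's output by the router probability. Each expert is reduced to one small matrix here, and the names `switch_layer`, `router`, and `experts` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def switch_layer(tokens, router, experts):
    """Route each token to exactly one expert.

    tokens:  list of d-dimensional vectors
    router:  one d-dimensional weight row per expert
    experts: one d x d matrix per expert (a stand-in for a real sub-network)
    Returns the per-token outputs and the index of the expert each token used.
    """
    outputs, chosen = [], []
    for token in tokens:
        probs = softmax(matvec(router, token))
        best = max(range(len(probs)), key=probs.__getitem__)
        # Only the selected expert runs, so compute per token stays roughly
        # constant no matter how many experts (parameters) exist.
        expert_out = matvec(experts[best], token)
        outputs.append([probs[best] * y for y in expert_out])
        chosen.append(best)
    return outputs, chosen


router = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    [[2.0, 0.0], [0.0, 2.0]],  # expert 0 doubles its input
    [[3.0, 0.0], [0.0, 3.0]],  # expert 1 triples its input
]
tokens = [[5.0, 0.0], [0.0, 5.0]]
outputs, chosen = switch_layer(tokens, router, experts)
print(chosen)
```

Adding more experts to this sketch increases the parameter count without increasing the work done per token, which is the trade-off the paragraph above describes.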
Estimation of Continuous Valence and Arousal Levels from Faces in Naturalistic Conditions
Being able to recognize the subtleties of human emotional expression is an important part of human-computer interaction research. Leading computer vision methods have largely been limited to predicting discrete prototypical emotions such as happiness, anger, and sadness. This lags behind current psychological models of human emotion, which are based on valence (a positivity scale) and arousal (an excitement scale). Researchers at Samsung AI and Imperial College London have developed a convolutional neural network that estimates valence and arousal from static, naturalistic images with high accuracy and in real time, outperforming the state of the art. They accomplish this by combining previously disjoint steps into a single network that estimates facial landmarks, discrete emotions, and continuous emotions in a single pass. They also include an attention mechanism, which helps the model focus on relevant facial regions, and a student-teacher training framework, which helps to smooth labels. This breakthrough in facial affect recognition will allow human-interacting software to respond better to users’ needs, and the model's lightweight design and real-time response will aid the widespread adoption of this technology. Each of the three datasets tested during the study is available from the original authors (AFEW-VA, AffectNet, SEWA). The pretrained network, testing code, and annotated datasets are available here.
Below are a few examples of facial affect recognition in action, taken from the authors’ video demonstration: