Welcome to the February 2020 installment of BASELINE, Novetta’s Machine Learning Newsletter, where we discuss important advances in machine learning technologies and their potential impact on our IC / Defense customers. This month’s topics include:
- New breakthroughs in language models, including conversational AI
- Concise APIs that simplify model implementation
- Promising new approaches in computer vision and graph neural networks
Meet Google’s Meena
Google’s Meena has raised the bar for open-domain chatbots, achieving near-human conversational quality. Meena is a conversational model designed to chat sensibly about nearly any topic while taking into account the context of the conversation. Alongside Meena, Google introduced a new benchmark, the Sensibleness and Specificity Average (SSA), to grade the subtleties of open-domain chat. SSA measures the degree to which responses make sense in the context of the conversation (sensibleness) and are specific to that conversation rather than vague or generic (specificity). “Sensible chatbots” like Meena can be used to improve automated customer service, but could lead to a wave of fake online personalities deployed for malicious purposes as this type of technology progresses.
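SSA reduces to a simple average of two per-response rates. The sketch below assumes human raters have already labeled each response with two binary judgments; the data format is our own illustration, not Google's evaluation pipeline.

```python
# Minimal sketch of the Sensibleness and Specificity Average (SSA).
# Assumes each chatbot response carries two binary human judgments:
# (sensible, specific). This format is our own illustration.

def ssa(labels):
    """labels: list of (sensible, specific) 0/1 tuples, one per response."""
    if not labels:
        raise ValueError("need at least one labeled response")
    sensibleness = sum(s for s, _ in labels) / len(labels)
    specificity = sum(sp for _, sp in labels) / len(labels)
    # SSA is the simple average of the two per-corpus rates.
    return (sensibleness + specificity) / 2

# Example: 3 of 4 responses sensible, 2 of 4 specific.
example = [(1, 1), (1, 0), (1, 1), (0, 0)]
print(ssa(example))  # 0.625
```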
Thinking Quicker with Thinc
Explosion, the creator of spaCy and Prodigy, has introduced a lightweight deep learning library called Thinc. Thinc allows you to create multiple models and configurations using wrappers around deep learning libraries such as TensorFlow, PyTorch, and MXNet. Model parameters can be swapped as needed by loading a new configuration file into the architecture. Thinc is also compatible with JIT-compiler-backed libraries such as JAX. Porting models to different infrastructures can be a painstaking task for ML engineers, and Thinc will hopefully lessen the burden of this process.
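The core pattern here is that hyperparameters live in a config file rather than in code, so swapping a model variant means loading a different config. The registry and builder below are our own minimal illustration of that pattern in plain Python, not Thinc's actual API.

```python
# Hypothetical sketch of config-driven model construction: a registry
# maps architecture names to builder functions, and a config file picks
# the builder and its parameters. Not Thinc's real API.
import configparser

REGISTRY = {}

def register(name):
    def wrap(fn):
        REGISTRY[name] = fn
        return fn
    return wrap

@register("mlp")
def build_mlp(hidden_size, dropout):
    # Stand-in for constructing a real model object.
    return {"arch": "mlp", "hidden_size": hidden_size, "dropout": dropout}

def model_from_config(text):
    cfg = configparser.ConfigParser()
    cfg.read_string(text)
    section = cfg["model"]
    builder = REGISTRY[section["name"]]
    return builder(hidden_size=section.getint("hidden_size"),
                   dropout=section.getfloat("dropout"))

# Swapping parameters = swapping configs, with no code changes.
small = model_from_config("[model]\nname = mlp\nhidden_size = 32\ndropout = 0.2\n")
large = model_from_config("[model]\nname = mlp\nhidden_size = 512\ndropout = 0.5\n")
```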
How Microsoft Created the World’s Largest Language Model (For Now)
Training large language models has been an effective approach to achieving state-of-the-art results in NLP applications. Microsoft unveiled Turing Natural Language Generation (T-NLG), a 17-billion-parameter model nearly twice the size of the next-largest language model. Microsoft claims T-NLG outperforms current state-of-the-art approaches on many common NLP tasks. What makes T-NLG so interesting is the way it was created. Due to its size, T-NLG training had to be parallelized across multiple GPUs using a new Microsoft library called DeepSpeed. Large models are expensive and time-consuming to train; DeepSpeed's ZeRO optimizer eliminates memory redundancies by partitioning model states (such as optimizer states and gradients) across distributed devices during training, rather than replicating them on every GPU. Microsoft claims that DeepSpeed can train 100-billion-parameter models, which will likely lead to further advances in state-of-the-art performance.
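The arithmetic behind ZeRO's savings is straightforward: replicated training stores the full model state on every GPU, while partitioned training stores only a shard per GPU. The sketch below is our own back-of-the-envelope illustration, not the DeepSpeed API, and the 16-bytes-per-parameter figure is an assumed rough cost for mixed-precision training with Adam-style optimizer state.

```python
# Illustrative memory arithmetic for ZeRO-style state partitioning.
# Assumption: ~16 bytes per parameter for weights + gradients +
# optimizer state under mixed precision (a common rough estimate).

def replicated_bytes_per_device(n_params, bytes_per_param=16):
    # Conventional data parallelism: every GPU holds the full state.
    return n_params * bytes_per_param

def sharded_bytes_per_device(n_params, n_devices, bytes_per_param=16):
    # ZeRO idea: partition the state, so per-device memory shrinks
    # roughly linearly with the number of devices.
    return n_params * bytes_per_param // n_devices

n_params = 17_000_000_000  # T-NLG scale
full = replicated_bytes_per_device(n_params)          # 272 GB per GPU
shard = sharded_bytes_per_device(n_params, 256)       # ~1.06 GB per GPU
```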
A Noisy Student is a Good Student
Most state-of-the-art computer vision models are trained in a purely supervised setting, where every data point has a label from which the model can learn to identify patterns. Data labeling is typically time-consuming and expensive, so any approach that reduces dependence on labeled data is of great practical value. Semi-supervised learning, which uses unlabeled data to improve the accuracy of a model trained on a relatively small set of labeled data, is one such method. Noisy Student is a variation of the semi-supervised approach: a teacher model pseudo-labels unlabeled data, and a student model is trained on both labeled and pseudo-labeled data with noise injected during training via dropout, stochastic depth, and data augmentation. This approach improved accuracy on the ImageNet benchmark by approximately 2% over the previous state of the art. We are interested in exploring how this approach can better our computer vision models and potentially be extended to text-based models.
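The control flow of the teacher–student loop can be sketched compactly. The helpers below are toy stand-ins (a "model" that memorizes a majority label, Gaussian jitter in place of dropout/stochastic depth/augmentation) so the loop is runnable; they are not the paper's actual training stack.

```python
# Toy sketch of the Noisy Student self-training loop:
# 1) train a teacher on labeled data, 2) pseudo-label unlabeled data,
# 3) train a noised student on both, 4) student becomes the new teacher.
import random

random.seed(0)

def train(examples):
    # Toy "model": memorize the majority label of the training set.
    labels = [y for _, y in examples]
    return max(set(labels), key=labels.count)

def predict(model, x):
    return model  # the toy model always predicts its majority label

def add_noise(x):
    # Stand-in for dropout / stochastic depth / data augmentation.
    return x + random.gauss(0, 0.1)

labeled = [(0.1, "cat"), (0.2, "cat"), (0.9, "dog")]
unlabeled = [0.15, 0.3, 0.8]

teacher = train(labeled)
for _ in range(3):  # iterate: each student becomes the next teacher
    pseudo = [(add_noise(x), predict(teacher, x)) for x in unlabeled]
    student = train(labeled + pseudo)  # student sees noised inputs
    teacher = student
```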
Using the Attention Mechanism for Graph Representations
Graph neural networks (GNNs) apply deep learning methods to graph data. GRAPH-BERT is a new attention-based architecture that its authors report outperforms state-of-the-art GNNs in both efficiency and accuracy. Instead of propagating information over edge links, GRAPH-BERT trains on sampled link-free subgraphs of the entire graph. This allows for easier parallelization, which in turn leads to increased training efficiency and the ability to apply deep learning to larger graphs.
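The key sampling step can be sketched simply: for each target node, keep only the top-k context nodes ranked by a precomputed closeness ("intimacy") score, and discard the edges entirely. The scores below are made-up placeholders for what the paper derives from measures such as personalized PageRank; this is an illustration of the sampling idea, not the paper's implementation.

```python
# Sketch of link-free subgraph sampling, GRAPH-BERT style: each batch
# item is just a node list (target node + its top-k most intimate
# context nodes), with no edges retained. Intimacy scores are assumed
# to be precomputed; the values here are placeholders.

def sample_subgraph(node, intimacy, k):
    """Return the target node plus its k most intimate context nodes."""
    scores = intimacy[node]
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [node] + ranked[:k]

intimacy = {
    "a": {"b": 0.9, "c": 0.5, "d": 0.1},
    "b": {"a": 0.8, "d": 0.6, "c": 0.2},
}

# Each subgraph is an independent node list, so batches parallelize
# trivially -- no cross-subgraph edges to synchronize over.
batch = [sample_subgraph(n, intimacy, k=2) for n in intimacy]
```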