Welcome to the January 2020 installment of BASELINE, Novetta’s Machine Learning Newsletter, where we share thoughts on important advances in machine learning technologies likely to impact our customers. This month’s topics include:
- New software tools that improve the process of developing machine learning models
- Improvements in language models, with some research showing that synthetically generated text can fool human reviewers
The “What-If Tool” and Manifold
While the primary goal for most machine learning models is high accuracy, we also need to understand why the model is making certain predictions. We’ve recently been working with the What-If Tool from the Google Research Brain Team to conduct sensitivity analysis, exploring how models react to different combinations of attributes. This enables us to better understand model performance and where models might fail. We are also exploring Manifold from Uber, another state-of-the-art model explainability tool.
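To make the idea of sensitivity analysis concrete, here is a minimal sketch of sweeping a model across combinations of attribute values and recording its output for each. It does not use the What-If Tool or Manifold; `toy_model`, its coefficients, and the attribute names are hypothetical stand-ins for a trained model.

```python
from itertools import product


def toy_model(age, income, has_debt):
    """Hypothetical stand-in for a trained classifier; returns a score in [0, 1]."""
    score = 0.3 + 0.005 * (age - 18) + 0.000004 * income - 0.25 * has_debt
    return max(0.0, min(1.0, score))


def sensitivity_grid(model, grid):
    """Evaluate the model on every combination of attribute values in `grid`."""
    names = list(grid)
    rows = []
    for values in product(*(grid[name] for name in names)):
        inputs = dict(zip(names, values))
        rows.append((inputs, model(**inputs)))
    return rows


# Illustrative attribute values; a real analysis would draw them from the dataset.
grid = {"age": [25, 45, 65], "income": [30_000, 90_000], "has_debt": [0, 1]}
results = sensitivity_grid(toy_model, grid)
for inputs, score in results:
    print(inputs, round(score, 3))
```

Comparing rows that differ in a single attribute shows how strongly that attribute moves the prediction, which is the kind of question these tools let you explore interactively.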
nbdev – A New Python Programming Environment Inside Jupyter
Jupyter Notebooks are a popular tool for data scientists and machine learning researchers to test and compare models. However, moving code from Jupyter Notebooks into production environments is often challenging. The team responsible for fastai, a powerful deep learning library, has released nbdev, a custom Python programming environment. nbdev simplifies extracting the specific Python code to be migrated to production, while also simplifying documentation creation and testing.
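The general idea of pulling production code out of a notebook can be sketched with the standard library alone. This is not nbdev's implementation: the `#export` marker and the `extract_exported_code` helper are assumptions made for illustration. The sketch relies only on the notebook file being JSON with a list of cells.

```python
import json

EXPORT_MARKER = "#export"  # assumed marker for this sketch


def extract_exported_code(notebook_json):
    """Return the source of every code cell whose first line is the export marker."""
    notebook = json.loads(notebook_json)
    exported = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # notebooks may store source as a list of lines
            source = "".join(source)
        lines = source.splitlines()
        if lines and lines[0].strip() == EXPORT_MARKER:
            exported.append("\n".join(lines[1:]))
    return "\n\n".join(exported) + "\n"


# A tiny in-memory notebook: one markdown cell, one exported cell, one scratch cell.
sample_notebook = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Exploration notes"]},
        {"cell_type": "code", "source": ["#export\n", "def add(a, b):\n", "    return a + b"]},
        {"cell_type": "code", "source": ["add(2, 3)  # scratch cell, not exported"]},
    ]
})
module_source = extract_exported_code(sample_notebook)
print(module_source)
```

Only the marked cell ends up in the extracted module text, so exploratory cells stay in the notebook.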
Deepfake Text Generation
Recent research suggests that text comments generated by modern NLP algorithms are effectively indistinguishable from human-generated comments. A bot that generates deepfake text submitted half of the total comments received on a federal website that collects public input to help shape policy decisions. When humans were asked to identify which comments had been generated by a bot, they did no better than random guessing (you can try for yourself here). This research demonstrates the challenge that online communities and forums will face in combating deepfake text generation: posts are often short, which makes it harder to determine whether they were synthetically generated.
Hugging Face Releases Ultra-Fast and Versatile Tokenization Library
In NLP, tokenization is the task of splitting text into meaningful segments (e.g., words, punctuation, numbers) known as tokens. As models grow more efficient, tokenization can become the bottleneck when pre-processing large volumes of text. Hugging Face, the NLP research company behind the state-of-the-art Transformers library, has released a new open-source library, Tokenizers. Tokenizers is designed for speed and ease of use when tokenizing text for deep learning models, both in research and in production. Hugging Face reports that tokenizing 1 GB of text takes less than 20 seconds on a server-grade CPU, which translates to a 9x tokenization speedup.
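As a small illustration of what tokenization means, the sketch below splits text into word, number, and punctuation tokens using a regular expression from the Python standard library. It is a deliberately simple stand-in and does not use or reproduce the Tokenizers library; `TOKEN_PATTERN` and `tokenize` are names chosen for this example.

```python
import re

# Numbers (with an optional decimal part), then runs of word characters,
# then any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, number, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Tokenizing 1 GB of text takes under 20 seconds!")
print(tokens)
```

Production tokenizers for deep learning models typically go further, splitting words into subword units from a learned vocabulary, which is where a fast dedicated library matters.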
Meet ALBERT, the Newest NLP Language Model
Another month, another impressive language model: Google has released ALBERT, the next version of BERT (which we wrote about in November). ALBERT shows improved performance with a smaller model footprint. This translates to more portable models with a better understanding of natural language, leading to better performance on downstream tasks. Google accomplished this by reusing parameters across parts of the model that perform similar operations, enabling them to drastically reduce model size while maintaining similar accuracy. ALBERT has been open sourced and yes, it is already included in the Hugging Face Transformers library.
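The effect of parameter reuse on model size can be shown with a toy calculation. The sketch below is not ALBERT's architecture: it assumes each layer is a single square weight matrix plus a bias, and the layer count (12) and hidden size (768) are illustrative values. It compares a stack where every layer has its own parameters with one where all layers share a single set.

```python
def layer_param_count(hidden_size):
    """Parameters in one toy layer: a square weight matrix plus a bias vector."""
    return hidden_size * hidden_size + hidden_size


def model_param_count(num_layers, hidden_size, share_across_layers):
    """Total parameters for a stack of toy layers, with or without sharing."""
    per_layer = layer_param_count(hidden_size)
    return per_layer if share_across_layers else num_layers * per_layer


def apply_layer(weights, bias, vector):
    """One toy layer: matrix-vector product, bias, then ReLU."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weights, bias)
    ]


def forward_shared(weights, bias, vector, num_layers):
    """Apply the same layer parameters repeatedly instead of storing one set per layer."""
    for _ in range(num_layers):
        vector = apply_layer(weights, bias, vector)
    return vector


unshared = model_param_count(12, 768, share_across_layers=False)
shared = model_param_count(12, 768, share_across_layers=True)
print(unshared, shared, unshared // shared)

output = forward_shared([[0.5, 0.0], [0.0, 0.5]], [0.0, 0.0], [4.0, 8.0], num_layers=2)
print(output)
```

In this toy setting the shared stack keeps the same depth of computation while storing one layer's worth of parameters, which is the intuition behind the smaller footprint.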
Lyft Open-Sources Flyte, a Data Processing Platform
In December’s edition of BASELINE we covered Netflix’s Metaflow, an open-source library for building and managing data science projects. This month Lyft followed suit by open sourcing Flyte, a machine learning and data processing platform. Metaflow and Flyte are both Python-based and represent workflows as directed acyclic graphs (DAGs), a series of algorithmic steps linked by their dependencies. Flyte’s native Kubernetes integration allows it to be deployed on AWS, Azure, and Google Cloud.
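To show what representing a workflow as a DAG looks like in practice, here is a minimal sketch using `graphlib` from the Python standard library (Python 3.9+). It does not use the Flyte or Metaflow APIs; the step names, the `run_workflow` helper, and the toy step functions are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(steps, dependencies):
    """Run each step after its dependencies, passing their results as arguments."""
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        inputs = [results[dep] for dep in dependencies.get(name, ())]
        results[name] = steps[name](*inputs)
    return order, results


# Hypothetical steps for a tiny pipeline.
steps = {
    "load": lambda: [3, 1, 2],
    "clean": lambda rows: sorted(rows),
    "train": lambda rows: sum(rows) / len(rows),
    "report": lambda rows, model: f"{len(rows)} rows, mean {model}",
}

# Each step maps to the list of steps it depends on.
dependencies = {
    "load": [],
    "clean": ["load"],
    "train": ["clean"],
    "report": ["clean", "train"],
}

order, results = run_workflow(steps, dependencies)
print(order)
print(results["report"])
```

Because the graph is acyclic, a valid execution order always exists, and steps without a dependency between them could in principle run in parallel.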