The role of qualitative research in predictive analytics
Novetta’s portfolio of work stretches across the public and private sectors. One common thread across many of our projects is our clients’ need to quantify risk and make decisions in the presence of uncertainty. Whether it is identifying travelers at risk of human smuggling or protecting critical technologies, our clients face myriad challenges. They must make decisions about these threats, and the cost of a wrong decision often goes well beyond lost resources. They’re also surrounded by data. It comes in all forms and sizes, and is usually very (very, very) big.
In this blog series, I’ll lay out the pertinent questions and suggest some techniques for answering them. I’ll refer to our method of conducting predictive analytics projects, known as Theory Driven Modeling, and propose some best practices. First, let’s define some of the terms we’ll be using.
Predictive Analytics vs Descriptive Analytics
Descriptive Analytics is, as the name implies, the process of describing the data at hand. Any inferences are left to the individual analyst. There is great value in describing a data set, and many powerful tools are available to do so. I would argue that in order to be termed advanced, however, an analytical effort must cross into predictive analytics.
Predictive Analytics is an analytical endeavor that attempts to draw inferences and predict future states based on observed data. The Institute for Operations Research and Management Science (INFORMS) Analytics Section describes this as “analytics that predicts future probabilities and trends, and finds relationships in data that may not be readily apparent with descriptive analysis”.
Business Intelligence vs Artificial Intelligence
Business Intelligence (BI) and Artificial Intelligence (AI) are implementations of Descriptive and Predictive Analytics, respectively. There are many effective and important BI tools, and, depending on the use case, they can be very useful. AI, as an implementation of Predictive Analytics (and specifically of Machine Learning), is a burgeoning capability that employs cutting-edge computer science and statistical learning. Very few AI platforms exist, and they must be tailored to the specific application.
Machine Learning (ML), be it supervised or unsupervised, uses algorithms to create models that attempt to draw the predictive insights characterized above, or discover complex relationships within the data. The algorithms used in ML are fully customizable, and increasingly designed with large data sets in mind. In the supervised case, the resulting model has been built to find signal in observed data that predicts or indicates some variable of interest. Using previously observed data, the algorithm trains a model to perform in this way.
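To make the supervised case concrete, here is a minimal sketch in pure Python. The "algorithm" is ordinary least squares, the "model" it produces is a slope and intercept, and the previously observed data and numbers are illustrative only, not drawn from any Novetta project:

```python
# Toy illustration of supervised learning: an algorithm (ordinary least
# squares) trains a model (slope, intercept) on observed (x, y) pairs;
# the model then predicts the variable of interest for unseen inputs.

def train(xs, ys):
    """Fit y = a*x + b by ordinary least squares; the model is (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

def predict(model, x):
    a, b = model
    return a * x + b

# Previously observed (labeled) data; the underlying signal is y ≈ 2x + 1.
observed_x = [1, 2, 3, 4, 5]
observed_y = [3.1, 4.9, 7.2, 9.0, 11.1]

model = train(observed_x, observed_y)
print(predict(model, 6))  # close to 13 for this data
```

Real projects swap in far richer algorithms and features, but the shape is the same: observed data in, trained model out, predictions from the model.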
AI implements a machine learned model, and possesses the ability to dynamically learn and adjust its results based on new data. Machine Learning is not necessarily AI, but AI does involve some form of Machine Learning.
That can be some pretty heady stuff, so don’t worry if the landscape isn’t completely clear. If you get that algorithms produce models, and machine learning is the process of moving from data to model, then we’re in good shape.
Data vs Big Data
Big Data is an often misused term that, unfortunately, has come to encompass any attempt to process an inconvenient amount of data. As you can tell from my word choice, I’m not a fan of industry turning the term into a buzzword. The proper definition emphasizes other aspects of the data, not just its size. The 2001 Meta/Gartner white paper on the subject still hits the mark: it focuses on the Variety, Velocity, and Volume of a data set under consideration. Big Data Analytics brings large, disparate datasets together for analysis under a common objective. When that objective involves predicting future probabilities and trends, and finding relationships beyond descriptive analysis, our Big Data Analytics becomes predictive.
The problems faced by the Intelligence Community are marked by large data sets that require fast, consistent analysis. Nowhere is Big Data Analytics more challenging or more important.
The first step in Theory Driven Modeling is developing a theory of the phenomenon based on qualitative research. Knowledge of the techniques and methods associated with Big Data is important. Data isn’t THE answer, however. It can be part of the answer, but bereft of context in our domain, data can be wasteful and dangerous. Sound qualitative research is critical to the process of implementing a Big Data Predictive Analytics solution.
At Novetta, we start each project with a solid understanding of the phenomenon that produces the data in question. That understanding allows us to use the data to build evidence for (or against) valid, relevant hypotheses. The research must go beyond a mere literature review, and often includes site visits and extensive interviews with stakeholders. Forming valid hypotheses can be thought of as asking questions of the data. For example:
- What are the relationships within the data?
- Are there any exogenous or hidden factors at play?
- What non-native features, derivable from the data, might influence the phenomenon?
- Can the data available be augmented?
- Are there other important problems that can be addressed with this data?
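The first and third questions above can be turned into concrete checks once data is in hand. The sketch below does this against a small, entirely made-up set of traveler-style records; the field names (`trip_length_days`, `ticket_cost`, `cost_per_day`) are hypothetical stand-ins, not fields from any real dataset:

```python
# Hypothetical sketch: probing relationships in the data and deriving a
# non-native feature. All records and field names are illustrative only.

records = [
    {"trip_length_days": 2,  "ticket_cost": 150},
    {"trip_length_days": 14, "ticket_cost": 900},
    {"trip_length_days": 3,  "ticket_cost": 200},
    {"trip_length_days": 10, "ticket_cost": 750},
]

def correlation(xs, ys):
    """Pearson correlation: one way to ask 'what relationships exist?'"""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A derived, "non-native" feature: cost per day of travel.
for r in records:
    r["cost_per_day"] = r["ticket_cost"] / r["trip_length_days"]

trip = [r["trip_length_days"] for r in records]
cost = [r["ticket_cost"] for r in records]
print(correlation(trip, cost))  # strongly positive for this toy data
```

The point isn’t the arithmetic; it’s that each hypothesis from the qualitative research becomes a testable question the data can answer.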
This kind of research and hypothesizing will only go so far. Partnering with the professionals who face the problem or phenomenon is a key aspect of our projects.
Subject Matter Expertise
Incorporating subject matter experts early in a project adds depth and perspective to the qualitative research performed by our analysts. Not only does the project benefit from the insight gained; early involvement also ensures that stakeholders buy in, making the results and recommendations phase of the project much smoother.
Additionally, I recommend that the project team embrace healthy skepticism from subject matter experts. A healthy skeptic is one who wants a reasonable demonstration of the potential for a predictive analytics solution in their domain. Such a demonstration only serves to strengthen the underpinning of the project.
Transitioning to the Data
In the next post I’ll describe the process that we use to prepare our data, derive features, and test hypotheses. I’ll use a relatively small dataset to characterize the process and illuminate some of the challenges. Thanks for reading and stay tuned.