by Matt Johannessen
Novetta’s business covers a diverse mix of public and private sector customers. While mission objectives vary, we are seeing an increasing need to extract meaning from high-velocity open source (OSINT) information. Historically, deriving intelligence from open source digital content has been complex and time-consuming, particularly when high-accuracy results are required. Generating meaningful signals from data sources such as web content and open source video has required substantial labor on the part of analysts.
We are working to significantly reduce the effort required to gather and exploit OSINT through a combination of machine learning, graph database technologies, and Amazon Web Services (AWS). Using the latest API-driven ML services from AWS, and with the recently announced AWS Media Analysis Solution as a guide, we have developed a proof-of-concept OSINT pipeline that extracts metadata from open source video and news articles. The pipeline works as follows:
- Collect video and store in AWS S3
- Use AWS Lambda to monitor S3 for new videos
- Process new videos with AWS Elemental MediaConvert
- Use AWS Transcribe to convert each video's audio track to text and save the transcript to AWS S3
- Use AWS Comprehend to extract entities and their types (e.g. location, organization) from transcribed text and news article text
- Load high-quality (high-confidence) results into AWS Neptune
- Leverage AWS API Gateway and AWS Lambda to provide secure microservices that return information stored in AWS Neptune
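The hand-off from S3 to Transcribe in the steps above can be sketched roughly as follows. This is a minimal illustration, not our production code: the function names, the `OUTPUT_BUCKET` environment variable, and the assumption that the triggering object is an already-converted MP4 are all hypothetical. The `boto3` call runs only inside the handler, so the helper functions can be exercised without AWS credentials.

```python
import os
import re
import urllib.parse

# Hypothetical bucket for transcripts; set as a Lambda environment variable.
OUTPUT_BUCKET = os.environ.get("OUTPUT_BUCKET", "example-transcripts")


def video_objects(event: dict) -> list:
    """Return (bucket, key) pairs for each record in an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(s3["object"]["key"])
        pairs.append((s3["bucket"]["name"], key))
    return pairs


def transcription_job_params(bucket: str, key: str, output_bucket: str,
                             language_code: str = "en-US") -> dict:
    """Build keyword arguments for Transcribe's start_transcription_job."""
    # Job names allow only letters, digits, '.', '_' and '-', up to 200 chars.
    job_name = re.sub(r"[^0-9a-zA-Z._-]", "-", key)[:200]
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": f"s3://{bucket}/{key}"},
        "MediaFormat": "mp4",  # assumption: the converted output is MP4
        "LanguageCode": language_code,
        "OutputBucketName": output_bucket,
    }


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above run without boto3

    transcribe = boto3.client("transcribe")
    for bucket, key in video_objects(event):
        params = transcription_job_params(bucket, key, OUTPUT_BUCKET)
        transcribe.start_transcription_job(**params)
```

Job names must be unique within an account, so a real deployment would likely append a timestamp or object version to the name derived here.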
Going beyond basic consolidated search, our goal is to look at connections between extracted entities, source content, and other relevant metadata. These complex relationships naturally lend themselves to graph analytics, specifically scalable/flexible property graphs.
AWS Neptune, a scalable and fully managed graph database, offers reliable, distributed, and fault-tolerant storage for highly interconnected data. Based on the Apache TinkerPop™ framework, AWS Neptune is the logical storage solution for our analyst-friendly OSINT graph.
We configured a collection of AWS Lambda functions to interact with AWS Neptune (via its Gremlin REST API), which resides in a secure virtual private cloud (VPC). To complete our proof of concept, we used AWS API Gateway to expose secure web services that a prototype web application calls to build a graph visualization. The end-to-end pipeline is shown in the diagram below.
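As a rough sketch of what one such Lambda function might look like, the example below posts a Gremlin query to Neptune's HTTP endpoint and returns the result in the shape API Gateway's Lambda proxy integration expects. The vertex label `entity`, the property `text`, the `entity` query parameter, and the `NEPTUNE_URL` value are assumptions made for illustration. It also assumes the function runs inside the same VPC and that IAM database authentication is disabled, since request signing is not shown.

```python
import json
import os
import urllib.request

# Hypothetical endpoint; Neptune serves Gremlin over HTTPS at /gremlin on port 8182.
NEPTUNE_URL = os.environ.get(
    "NEPTUNE_URL", "https://neptune.example.internal:8182/gremlin"
)


def escape_gremlin_string(value: str) -> str:
    """Escape backslashes and single quotes for a single-quoted Gremlin string."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


def neighbors_query(entity_text: str, limit: int = 25) -> str:
    """Build a query for vertices adjacent to the named entity (assumed schema)."""
    safe = escape_gremlin_string(entity_text)
    return (
        f"g.V().has('entity', 'text', '{safe}')"
        f".both().dedup().limit({int(limit)}).valueMap()"
    )


def build_request(query: str, url: str = NEPTUNE_URL) -> urllib.request.Request:
    """Wrap the query in the JSON body that the Gremlin HTTP endpoint accepts."""
    body = json.dumps({"gremlin": query}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def lambda_handler(event, context):
    # With proxy integration, queryStringParameters is None when no parameters are sent.
    params = event.get("queryStringParameters") or {}
    text = params.get("entity")
    if not text:
        return {"statusCode": 400,
                "body": json.dumps({"error": "missing 'entity' parameter"})}
    with urllib.request.urlopen(build_request(neighbors_query(text)), timeout=10) as resp:
        result = json.load(resp)
    return {"statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(result)}
```

Building the query by string interpolation keeps the sketch dependency-free; a fuller implementation might use a Gremlin client library instead of hand-escaping input.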
In the diagram below, results of a basic pattern match query are displayed in an interactive, browser-based graph visualization. Metadata from the original news article is used to construct relationships between nationality, publisher, and the source of the article. The content of the article and the text from each video transcription are processed through Amazon Comprehend to identify the entities mentioned in the text.
Detected entities and the associated types are de-duplicated and added to the property graph in AWS Neptune. As a next step, to enhance our results, Novetta Entity Analytics can be used to improve entity resolution, whereby organizational acronyms and proper names can be resolved into a single entity.
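To make the filtering and de-duplication step concrete, here is a minimal sketch that operates on the `Entities` list returned by Comprehend's entity detection, where each item carries `Text`, `Type`, and `Score` fields. The 0.9 threshold, the function names, and the normalization rule (case-folding and collapsing whitespace) are illustrative assumptions, not the exact logic of our prototype.

```python
def high_confidence_entities(response: dict, min_score: float = 0.9) -> list:
    """Keep only entities whose confidence score meets the threshold."""
    return [e for e in response.get("Entities", []) if e["Score"] >= min_score]


def deduplicate(entities: list) -> list:
    """Collapse repeated mentions into one record per (normalized text, type)."""
    merged = {}
    for e in entities:
        # Normalize whitespace and case so trivially different mentions match.
        key = (" ".join(e["Text"].split()).casefold(), e["Type"])
        record = merged.setdefault(
            key,
            {"text": e["Text"], "type": e["Type"], "mentions": 0, "max_score": 0.0},
        )
        record["mentions"] += 1
        record["max_score"] = max(record["max_score"], e["Score"])
    return list(merged.values())


# Sample data shaped like a Comprehend response (values are made up).
sample = {"Entities": [
    {"Text": "United Nations", "Type": "ORGANIZATION", "Score": 0.99},
    {"Text": "united  nations", "Type": "ORGANIZATION", "Score": 0.95},
    {"Text": "Geneva", "Type": "LOCATION", "Score": 0.97},
    {"Text": "UN", "Type": "ORGANIZATION", "Score": 0.55},
]}
unique = deduplicate(high_confidence_entities(sample))
```

Note that string normalization alone would never merge "UN" with "United Nations", even if both passed the threshold. That gap is what dedicated entity resolution is meant to close.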
This rapid prototyping effort illustrates the ease with which the AWS platform can be adopted for challenging, data-intensive analytical tasks. Accelerated by AWS API-driven ML services and backed by AWS Neptune, our proof of concept derives insight from noisy open source information with a modern, scalable, and serverless cloud architecture.