Maintaining software security as the volume of new code in a project grows is a pressing problem in big data. Once a vulnerability is identified in a piece of software, it may be essential to locate other, similar weaknesses, but this type of search is time-consuming and challenging. Current tools that aid in such searches can find only exact matches, yet similar vulnerabilities may be expressed with different syntax.
To address this problem, we developed an approach to efficiently find code with similar functionality in large repositories of executables: Syntax-Agnostic Code Similarity (SACS).
We started with locality-sensitive hashing (LSH) to find near-duplicate assembly code segments in disassembled executables. LSH is an approximate nearest-neighbor technique that finds near-duplicate documents at scale. It is also proven in applications such as audio, video, and image search, entity resolution, and fingerprint comparison. LSH searches are fast because the technique compresses code segments into hashes and groups similarly hashed segments together, which avoids comparing every pair of segments.
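To make the hashing-and-grouping idea concrete, here is a minimal sketch of one common LSH scheme, MinHash with banding. The source does not say which LSH family SACS uses, so MinHash is an assumption, and the names `shingles`, `minhash_signature`, `lsh_buckets`, and `candidate_pairs` are hypothetical helpers written for this illustration.

```python
import hashlib
from collections import defaultdict

def shingles(tokens, k=3):
    """Return the set of k-token shingles for a token sequence."""
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def _hash(seed, text):
    """64-bit hash of text; the seed selects one member of a hash family."""
    digest = hashlib.blake2b(
        text.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(shingle_set, num_hashes=32):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return tuple(min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes))

def lsh_buckets(signatures, bands=8):
    """Place each item in one bucket per band of its signature."""
    buckets = defaultdict(set)
    for item_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(item_id)
    return buckets

def candidate_pairs(buckets):
    """Items that share at least one bucket become candidate near-duplicates."""
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs

# Toy token streams standing in for code segments (assumed data).
segments = {
    "a": "mov eax ebx add eax 1 ret".split(),
    "b": "mov eax ebx add eax 1 ret".split(),
    "c": "push rbp call foo pop rbp".split(),
}
signatures = {name: minhash_signature(shingles(toks)) for name, toks in segments.items()}
print(candidate_pairs(lsh_buckets(signatures)))
```

Only segments that land in a shared bucket are ever compared further, which is where the speedup over exhaustive pairwise comparison comes from.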
A limitation of LSH is that it finds only textually similar documents. In the case of disassembled executables, assembly code with the same functionality can look syntactically different because of compilation optimizations or the underlying CPU architecture. To account for these differences, a preprocessing step is needed to capture the semantic meaning of code segments before they are hashed.
To hash on the semantic meaning of the code segments, we need to abstract away any compilation optimizations, hardware dependencies, or architectural differences. We translate the code segment into its intermediate representation (IR) to accomplish this. IR, originally intended to manage and modularize the compilation process, is a machine-independent representation of code that retains the code’s original functionality.
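The effect of abstracting away surface differences can be illustrated with a much weaker stand-in for IR lifting: operand normalization. The sketch below is not an IR translation and is not the SACS method. It only renames registers in order of first use and masks immediate values, so that two listings differing in register allocation and constants produce the same token stream. The function `normalize` and the small x86 register pattern are assumptions made for this example.

```python
import re

# Small, incomplete x86 register pattern (assumption for illustration only).
REGISTER = re.compile(r"^(?:[re]?[abcd]x|[re]?[sd]i|[re]?[sb]p|r(?:[89]|1[0-5])[dwb]?)$")
IMMEDIATE = re.compile(r"^(?:0x[0-9a-f]+|\d+)$")

def normalize(listing):
    """Rename registers in order of first use (R0, R1, ...) and mask immediates."""
    names = {}
    out = []
    for line in listing.strip().splitlines():
        mnemonic, _, rest = line.strip().lower().partition(" ")
        operands = [op.strip() for op in rest.split(",") if op.strip()]
        normalized = []
        for op in operands:
            if REGISTER.match(op):
                # Reuse the abstract name if this register was seen before.
                normalized.append(names.setdefault(op, f"R{len(names)}"))
            elif IMMEDIATE.match(op):
                normalized.append("IMM")
            else:
                normalized.append(op)
        out.append(" ".join([mnemonic] + normalized))
    return out

first = "mov eax, 5\nadd eax, ebx\nret"
second = "mov ecx, 7\nadd ecx, edx\nret"
print(normalize(first))   # ['mov R0 IMM', 'add R0 R1', 'ret']
print(normalize(first) == normalize(second))  # True
```

A real IR goes much further, since it also removes differences in instruction selection and CPU architecture, but the principle is the same: hash a representation in which equivalent code looks alike.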
2. Split by Function or Block
Split assembly code at either the function or block level (this decision needs to be made before the ingestion process begins).
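As a rough illustration of block-level splitting, the sketch below cuts a labeled listing into basic blocks: a new block starts at every label and after every jump or return. The listing format and the function `split_blocks` are assumptions for this example; a production disassembler would derive block boundaries from its control-flow analysis rather than from text.

```python
def split_blocks(listing):
    """Split a labeled assembly listing into basic blocks.

    A new block starts at every label and after every jump or return.
    """
    blocks, current = [], []
    for raw in listing.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A label is a jump target, so it begins a new block.
            if current:
                blocks.append(current)
            current = []
            continue
        current.append(line)
        mnemonic = line.split()[0].lower()
        if mnemonic == "ret" or mnemonic.startswith("j"):
            # Control leaves the block after a jump or return.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

listing = """
start:
  cmp eax, 0
  je done
  add eax, 1
  jmp start
done:
  ret
"""
for block in split_blocks(listing):
    print(block)
```

Block-level splitting yields more, smaller segments than function-level splitting, which affects both index size and how fine-grained the matches are.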
3. Extract Semantic Meaning
Convert functions or blocks to IR.
4. Calibrate and Apply LSH
Parameters for LSH can be tuned to minimize the false-positive rate, adjust sensitivity, and define the similarity threshold. Once those parameters have been set, LSH can be applied to the IR to produce hashes.
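To show how such tuning behaves, assume a banded MinHash scheme (an assumption; the source does not name the SACS parameters). With b bands of r rows each, two segments with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b, and the similarity at which this curve rises most steeply is roughly (1/b)^(1/r). The helper names below are hypothetical.

```python
def candidate_probability(s, bands, rows):
    """Probability that two items with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

def approx_threshold(bands, rows):
    """Approximate similarity at which the candidate probability rises steeply."""
    return (1.0 / bands) ** (1.0 / rows)

# Compare a few (bands, rows) settings that all use 32 hash functions.
for bands, rows in [(16, 2), (8, 4), (4, 8)]:
    threshold = approx_threshold(bands, rows)
    low = candidate_probability(0.3, bands, rows)
    high = candidate_probability(0.9, bands, rows)
    print(f"bands={bands:2d} rows={rows} threshold~{threshold:.2f} "
          f"P(s=0.3)={low:.3f} P(s=0.9)={high:.3f}")
```

More rows per band raise the threshold and cut false positives; more bands lower the threshold and increase sensitivity.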
5. Getting Results
SACS results can be explored in two ways. In both, each code segment is identified by the address at which its first line of code begins. In the Cluster Hashes approach (5a), SACS clusters the hashes based on a predetermined similarity threshold. This lets the analyst explore the clusters without needing an example code segment as a query. In the Semantic Similarity Search approach (5b), the analyst searches with a specific code segment of interest to find results ranked by semantic similarity.
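The two exploration modes can be sketched on toy data as follows. This is an illustration under stated assumptions, not the SACS implementation: segments are keyed by made-up start addresses, similarity is exact Jaccard over invented IR-like statements rather than a hash-based estimate, and `cluster` and `search` are hypothetical helpers. For brevity the clustering compares all pairs, whereas an LSH-backed system would compare only candidate pairs.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster(segments, threshold):
    """5a: single-link clusters of addresses whose similarity is >= threshold."""
    parent = {addr: addr for addr in segments}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    addrs = sorted(segments)
    for i, a in enumerate(addrs):
        for b in addrs[i + 1:]:
            if jaccard(segments[a], segments[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for addr in addrs:
        groups.setdefault(find(addr), []).append(addr)
    return sorted(groups.values())

def search(query, segments, threshold):
    """5b: (similarity, address) hits at or above threshold, best first."""
    hits = [(jaccard(query, feats), addr) for addr, feats in segments.items()]
    return sorted(((s, a) for s, a in hits if s >= threshold), reverse=True)

# Hypothetical segments keyed by the address of their first instruction.
segments = {
    0x401000: ["t0 = load", "t1 = add t0 IMM", "store t1"],
    0x402000: ["t0 = load", "t1 = add t0 IMM", "store t1", "ret"],
    0x403000: ["call", "ret"],
}
print([[hex(a) for a in group] for group in cluster(segments, 0.7)])
print([(s, hex(a)) for s, a in search(segments[0x401000], segments, 0.7)])
```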
As a code similarity detection and search capability, SACS reduces the time, effort, and cost of debugging and maintaining software and helps Novetta stay one step ahead of attackers. SACS can be used as both an investigative and a preventative tool. From an investigative perspective, SACS leads to much faster identification of code in cases of both technical and logical vulnerabilities. From a preventative perspective, SACS encourages the reuse of “repaired” code by making it possible to search for code segments by functionality rather than syntax. Our cyber team anticipates that incorporating SACS into the standard investigative workflow will reduce investigation time from three or four weeks to under one week.