Code property graph
In computer science, a code property graph (CPG) is a computer program representation that captures syntactic structure, control flow, and data dependencies in a property graph. The concept was originally introduced to identify security vulnerabilities in C and C++ system code, but has since been employed to analyze web applications, cloud deployments, and smart contracts. Beyond vulnerability discovery, code property graphs find applications in code clone detection, attack-surface detection, exploit generation, measuring code testability, and backporting of security patches.
Definition
A code property graph of a program is a graph representation of the program obtained by merging its abstract syntax trees (AST), control-flow graphs (CFG) and program dependence graphs (PDG) at statement and predicate nodes. The resulting graph is a property graph, the underlying graph model of graph databases such as Neo4j, JanusGraph and OrientDB, in which data is stored in nodes and edges as key-value pairs. As a result, code property graphs can be stored in graph databases and queried using graph query languages.
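The merge described above can be illustrated with a minimal in-memory property graph. This is only a sketch under stated assumptions: the class `PropertyGraph`, the function `merge`, the node identifiers and the edge labels `AST`, `CFG` and `PDG` are hypothetical names chosen for illustration, and they do not correspond to the schema of any particular tool or graph database.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Hypothetical minimal property graph: nodes and edges carry key-value pairs."""
    nodes: dict = field(default_factory=dict)  # node id -> property dict
    edges: list = field(default_factory=list)  # (src, dst, label, property dict)

    def add_node(self, node_id, **props):
        # Adding an existing id merges properties instead of duplicating the node.
        self.nodes.setdefault(node_id, {}).update(props)

    def add_edge(self, src, dst, label, **props):
        self.edges.append((src, dst, label, props))

    def out(self, node_id, label):
        """Follow outgoing edges with the given label (a very small 'query')."""
        return [dst for src, dst, lbl, _ in self.edges
                if src == node_id and lbl == label]


def merge(*graphs):
    """Union several graphs; nodes that share an id become a single node."""
    merged = PropertyGraph()
    for g in graphs:
        for node_id, props in g.nodes.items():
            merged.add_node(node_id, **props)
        for src, dst, label, props in g.edges:
            merged.add_edge(src, dst, label, **props)
    return merged


# Three views of the fragment "x = 1; y = x", sharing the statement ids n1, n2.
ast = PropertyGraph()
ast.add_node("n1", code="x = 1")
ast.add_node("n1.lhs", code="x")
ast.add_edge("n1", "n1.lhs", "AST")

cfg = PropertyGraph()
cfg.add_node("n1")
cfg.add_node("n2", code="y = x")
cfg.add_edge("n1", "n2", "CFG")

pdg = PropertyGraph()
pdg.add_node("n1")
pdg.add_node("n2")
pdg.add_edge("n1", "n2", "PDG", var="x")

cpg = merge(ast, cfg, pdg)
print(cpg.out("n1", "AST"), cpg.out("n1", "CFG"), cpg.out("n1", "PDG"))
```

Because the three graphs use the same identifiers for statement nodes, the merged graph lets a single traversal move between syntax, control flow and dependence edges starting from the same node.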
Example
Consider the following function of a C program:
void foo() {
    int x = source();
    if (x < MAX) {
        int y = 2 * x;
        sink(y);
    }
}
The code property graph of the function is obtained by merging its abstract syntax tree, control-flow graph, and program dependence graph at statements and predicates as seen in the following figure:
[Figure: code property graph of the function foo]
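The merged graph for foo can also be written out by hand. The following is a simplified sketch, not the output of any tool: the node identifiers (`s1`, `p1`, and so on), the property names (`kind`, `dep`, `var`, `branch`) and the helper functions are assumptions made for illustration, and the syntax tree is truncated at statement level. The query at the end follows only data-dependence edges to check whether the value returned by source() can reach the argument of sink().

```python
# Statement and predicate nodes of foo(), plus hypothetical structural nodes.
NODES = {
    "foo":   {"type": "FUNCTION"},
    "if":    {"type": "IF"},
    "entry": {"type": "ENTRY"},
    "exit":  {"type": "EXIT"},
    "s1":    {"type": "DECL", "code": "int x = source()"},
    "p1":    {"type": "PRED", "code": "x < MAX"},
    "s2":    {"type": "DECL", "code": "int y = 2 * x"},
    "s3":    {"type": "CALL", "code": "sink(y)"},
}

# Syntax tree, truncated at statement level (sub-expressions omitted).
AST = [("foo", "s1", {}), ("foo", "if", {}), ("if", "p1", {}),
       ("if", "s2", {}), ("if", "s3", {})]

# Control flow: the predicate branches to the block or straight to the exit.
CFG = [("entry", "s1", {}), ("s1", "p1", {}),
       ("p1", "s2", {"branch": "true"}), ("p1", "exit", {"branch": "false"}),
       ("s2", "s3", {}), ("s3", "exit", {})]

# Program dependence: data dependencies on x and y, control dependencies on p1.
PDG = [("s1", "p1", {"dep": "data", "var": "x"}),
       ("s1", "s2", {"dep": "data", "var": "x"}),
       ("p1", "s2", {"dep": "control", "branch": "true"}),
       ("p1", "s3", {"dep": "control", "branch": "true"}),
       ("s2", "s3", {"dep": "data", "var": "y"})]


def build_cpg():
    """Merge the three edge sets over the shared statement/predicate nodes."""
    edges = []
    for kind, edge_list in (("AST", AST), ("CFG", CFG), ("PDG", PDG)):
        for src, dst, props in edge_list:
            edges.append((src, dst, {"kind": kind, **props}))
    return edges


def data_flow_paths(edges, start, target_prefix):
    """Depth-first search along data-dependence edges only."""
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        last = path[-1]
        if len(path) > 1 and NODES[last].get("code", "").startswith(target_prefix):
            paths.append(path)
        for src, dst, props in edges:
            if src == last and props.get("dep") == "data" and dst not in path:
                stack.append(path + [dst])
    return paths


cpg = build_cpg()
flows = data_flow_paths(cpg, "s1", "sink(")
for path in flows:
    print(" -> ".join(NODES[n]["code"] for n in path))
```

In this sketch the only data-dependence path from the source to the sink runs through the declaration of y. The control-dependence edges from the predicate additionally record that the sink is only executed when x < MAX holds.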
Implementations
Joern CPG. The original code property graph was implemented for C/C++ in 2013 at the University of Göttingen as part of the open-source code analysis tool Joern. This original version has been discontinued and superseded by the open-source Joern Project, which provides a formal code property graph specification applicable to multiple programming languages. The project provides code property graph generators for C/C++, Java, Java bytecode, Kotlin, Python, JavaScript, TypeScript, LLVM bitcode, and x86 binaries (via the Ghidra disassembler).
Plume CPG. Developed at Stellenbosch University in 2020 and sponsored by Amazon Science, the open-source Plume project provides a code property graph for Java bytecode compatible with the code property graph specification provided by the Joern project. The two projects merged in 2021.
Fraunhofer AISEC CPG. The Fraunhofer Institute for Applied and Integrated Security provides open-source code property graph generators for C/C++, Java, Golang, Python, TypeScript and LLVM-IR. It also includes a formal specification of the graph and its various node types. Furthermore, it provides the Cloud Property Graph, an extension of the code property graph concept that models details of cloud deployments.
Galois’ CPG for LLVM. Galois Inc. provides a code property graph based on the LLVM compiler. The graph represents code at different stages of the compilation and a mapping between these representations. It follows a custom schema that is defined in its documentation.
Machine learning on code property graphs
Code property graphs provide the basis for several machine-learning-based approaches to vulnerability discovery. In particular, graph neural networks (GNN) have been employed to derive vulnerability detectors.