Guest Post: A Graph-Based Approach to Cyber Threat Intelligence

Note from the editor: We don’t usually have guest blogs, but we were amazed by the work of two students, Rocío and Diego, and its application to data from our AIP blocklists. This is a great example of motivated students who are willing to go beyond the simple assignment. Enjoy the read. - Veronica Valeros, Editor

This blog post was authored by Rocío Baggio and Diego Forni on June 5, 2025.

This project started as a university assignment for a graph database course, but it quickly turned into something bigger. What began as a simple academic task soon became a deeper dive into how graph-based models can help make sense of cyber threat intelligence.

In today’s connected world, understanding how attackers operate, what tools they use, how they’re connected, and where they’re coming from is key to staying ahead in cybersecurity. In this blog, we'll walk through a project that uses honeypot data and public threat intelligence to build a graph-based system that helps visualize and analyze cyber threats in a more intuitive way.

Have you ever wondered which countries launch the most cyber attacks and what techniques they use? With a single query, our graph database reveals the top three attacking nations along with their preferred tactics.

This is just one example of the insights our system can provide. Explore the full interactive interface at https://cti-graph-db.vercel.app/ to dive deeper into the data.

Project Objective: Building a Tool to Explore the Connections Between Cyber Threats and Attacker Behavior

The aim of this project was to create a tool that helps cybersecurity experts understand and analyze the relationships between different cyber threats and the behavior of attackers.

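Why graphs? Compared to traditional relational databases, graph models are better suited to exploring how different pieces of data are linked. Relationships are stored as first-class elements, so following a connection is a direct traversal rather than a chain of table joins.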

To make this approach accessible, we also developed a user-friendly interface that allows users to query the graph without needing prior knowledge of Cypher, Neo4j’s native query language.

Here is a picture of what our visual interface looks like. Feel free to explore it further and try out your own queries at https://cti-graph-db.vercel.app

And this is an example of a query result that answers the question: Which techniques are being used against different industries, and how many attacks have they experienced? 

From Raw IPs to Enriched Intelligence

Data Source: StratosphereIPS Honeypots

Our starting point is data from the StratosphereIPS blocklist generation project. From it, we gathered more than 20,000 malicious IP addresses belonging to attackers who interacted with the project's honeypots. On its own, however, this raw data lacked context.

Enriching Data: OTX AlienVault

To derive meaningful insights, we enriched these IPs using the AlienVault Open Threat Exchange (OTX) platform. OTX is a collaborative repository of threat intelligence “pulses”. An OTX pulse consists of one or more indicators of compromise (IOCs) that constitute a threat or define a sequence of actions that could be used to carry out attacks on networks, devices and computers. OTX pulses also provide information on the reliability of threat information, who reported a threat, and other details of threat investigations.

We built a Python script that queries the OTX API for each IP and retrieves data such as the origin and targeted countries, attack techniques, targeted technologies, and industries.
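A minimal sketch of what such an enrichment step could look like is shown here. It is not the authors' script: the endpoint path, the `X-OTX-API-KEY` header, and the response field names (`pulse_info`, `country_name`, `attack_ids`, and so on) are assumptions based on the public OTX v1 API and should be checked against the OTX documentation. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed layout of the OTX v1 indicators endpoint; verify against the OTX docs.
OTX_URL = "https://otx.alienvault.com/api/v1/indicators/IPv4/{ip}/general"


def build_request(ip: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) the lookup request for one IP address."""
    return urllib.request.Request(
        OTX_URL.format(ip=ip),
        headers={"X-OTX-API-KEY": api_key},
    )


def fetch_ip_report(ip: str, api_key: str, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON body. This performs a network call."""
    with urllib.request.urlopen(build_request(ip, api_key), timeout=timeout) as resp:
        return json.load(resp)


def _technique_id(entry):
    """Accept either a plain string or a dict with an 'id' key (assumed shapes)."""
    return entry.get("id") if isinstance(entry, dict) else entry


def summarize_report(ip: str, report: dict) -> dict:
    """Keep only the fields the post mentions; all field names are assumptions."""
    pulses = report.get("pulse_info", {}).get("pulses", [])
    return {
        "address": ip,
        "origin_country": report.get("country_name"),
        "pulses": [
            {
                "id": p.get("id"),
                "tags": p.get("tags", []),
                "targeted_countries": p.get("targeted_countries", []),
                "industries": p.get("industries", []),
                "attack_ids": [_technique_id(a) for a in p.get("attack_ids", [])],
            }
            for p in pulses
        ],
    }
```

Keeping the network call (`fetch_ip_report`) separate from the parsing (`summarize_report`) makes it possible to re-run the parsing on saved JSON responses without querying the API again.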

Modeling Threats: The Graph Structure

We experimented with several iterations before finalizing a practical data model.

This data model revolves around three core entities: Pulse, IP Address, and Country.

  • Pulse nodes represent threat reports aggregated from OTX. Each pulse groups together multiple indicators of compromise (IOCs), such as malicious IPs, botnet identifiers, and attack techniques. Pulses are uniquely identified by their id and often include descriptive tags. 

  • IP Address nodes capture the malicious IPs observed through honeypots. Each has attributes like the IP string itself (address), and the name of the botnet it belongs to (which in many cases is an unknown value). 

  • Country nodes indicate the geolocation from which a given IP appears to originate. These help contextualize where threats are coming from, although it's important to remember this data may reflect proxy usage or spoofed origins.

The graph’s most critical relationship is ATTACKS, which connects an IP node to a Country node. This link isn't just geographical: it serves as the anchor for additional connected data. Through it, we can traverse to related attack techniques, targeted industries, protocols, and targeted technologies, enabling deeper behavioral analysis of the threat.
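To make the model concrete, the following is a small in-memory sketch of the three entities and the ATTACKS relationship. It is an illustration rather than the actual Neo4j schema: storing techniques and industries as fields on the edge is an assumption (they could equally be separate nodes), and `techniques_by_industry` is a hypothetical helper that mimics the kind of question the interface answers.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Country:
    name: str


@dataclass(frozen=True)
class IPAddress:
    address: str
    botnet: str = "unknown"  # many IPs have no known botnet


@dataclass(frozen=True)
class Pulse:
    id: str
    tags: tuple = ()


@dataclass
class Attack:
    """ATTACKS edge from an IP to a Country (the extra fields are assumed)."""
    source: IPAddress
    target: Country
    techniques: list = field(default_factory=list)
    industries: list = field(default_factory=list)


def techniques_by_industry(attacks):
    """Count (industry, technique) pairs across all ATTACKS edges."""
    counts = Counter()
    for attack in attacks:
        for industry in attack.industries:
            for technique in attack.techniques:
                counts[(industry, technique)] += 1
    return counts
```

In a graph database the same question becomes a traversal starting from the ATTACKS relationship instead of a nested loop over records.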

Earlier models (see Figure 2) introduced abstract concepts such as a central 'THREAT' class, but these proved impractical and performed worse.

Data Ingestion Process 

Stage 1: IP Collection

Malicious IPs were collected from the StratosphereIPS Blocklist Generation Project, identifying addresses that attempted to communicate with simulated vulnerable systems.

Stage 2: Contextual Enrichment

Each IP was sent through our Python enrichment script, which queried the OTX API and downloaded structured JSON responses.

Stage 3: Data Structuring and Filtering

The data was processed and filtered to extract relevant attributes and discard redundant or ambiguous tags. Tags can be inconsistently structured, sometimes mixing techniques, tools, actors, or arbitrary labels. To address this, we decided to manually classify the 200 most frequently occurring tags.
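The tag-ranking step can be sketched as follows. This assumes each pulse is a dictionary with a `tags` list; the lowercase and whitespace normalization, the category names, and the three sample labels are all hypothetical stand-ins for the manual classification described above.

```python
from collections import Counter

# Hypothetical manual labels; the real mapping covered the 200 most frequent tags.
MANUAL_LABELS = {
    "ssh": "protocol",
    "brute force": "technique",
    "mirai": "malware",
}


def normalize(tag: str) -> str:
    """Lowercase and collapse whitespace so near-duplicate tags are merged."""
    return " ".join(tag.lower().split())


def top_tags(pulses, limit=200):
    """Return the `limit` most frequent normalized tags with their counts."""
    counts = Counter(normalize(t) for p in pulses for t in p.get("tags", []))
    return counts.most_common(limit)


def classify(tag, labels=MANUAL_LABELS):
    """Look up a tag in the manual mapping; unknown tags stay unclassified."""
    return labels.get(normalize(tag), "unclassified")
```

Tags that fall outside the manually labeled set are kept as "unclassified" rather than guessed, which keeps ambiguous labels out of the graph.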

Stage 4: Loading into Neo4j

The structured data was loaded into the Neo4j graph database using a Python script that created all the nodes and relationships based on our data model.
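One way such a loader could be organized is to generate parameterized Cypher statements from the structured records. The sketch below is an assumption, not the authors' loader: the `IP` and `Country` labels, the property names, and the record keys (`address`, `botnet`, `targeted_countries`) are hypothetical, and only the ATTACKS relationship name comes from the data model described earlier.

```python
# MERGE is used so that re-running the load does not create duplicate
# nodes or relationships. Labels and property names are assumptions.
MERGE_ATTACK = """
MERGE (ip:IP {address: $address})
SET ip.botnet = $botnet
MERGE (c:Country {name: $country})
MERGE (ip)-[:ATTACKS]->(c)
"""


def attack_statements(records):
    """Yield (query, parameters) pairs, one per targeted country of each IP."""
    for record in records:
        for country in record.get("targeted_countries", []):
            yield MERGE_ATTACK, {
                "address": record["address"],
                "botnet": record.get("botnet") or "unknown",
                "country": country,
            }
```

With the official `neo4j` Python driver, each pair could then be passed to `session.run(query, parameters)`. Passing values as parameters instead of formatting them into the query string avoids quoting problems and lets the database reuse the query plan.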

Final Thoughts and Future Directions

By integrating honeypot data, public threat intelligence, and a graph-based architecture, our project offers a scalable and intuitive solution for CTI enrichment and analysis. It enables cybersecurity professionals to visualize complex threat relationships and uncover hidden patterns.

Future directions may include:

  • Expanding enrichment sources beyond OTX

  • Integrating temporality for dynamic threat evolution tracking

  • Improving tag normalization and classification with NLP techniques

  • Creating an interface for the user to query information using natural language

This project demonstrates how graph databases can become powerful allies in the CTI workflow, offering a clearer, more interconnected view of the ever-evolving threat landscape.

REFERENCES 

  • OTX (Open Threat Exchange) by AlienVault: AlienVault OTX is a collaborative threat intelligence platform that allows security researchers to share and access indicators of compromise (IOCs) in real time.
    Website: https://otx.alienvault.com

  • MITRE ATT&CK (Adversarial Tactics, Techniques, and Common Knowledge): MITRE ATT&CK is a globally accessible knowledge base of adversary tactics and techniques based on real-world observations. It is used as a foundation for the development of threat models and methodologies.
    Website: https://attack.mitre.org