Note from the editor: We don’t usually have guest blogs, but we were amazed by the work of two students, Rocío and Diego, and its application to data from our AIP blocklists. This is a great example of motivated students who are willing to go beyond the simple assignment. Enjoy the read. - Veronica Valeros, Editor
This blog post was authored by Rocío Baggio and Diego Forni on June 5, 2025.
This project started as a university assignment for a graph database course, but it quickly turned into something bigger. What began as a simple academic task soon became a deeper dive into how graph-based models can help make sense of cyber threat intelligence.
In today’s connected world, understanding how attackers operate, what tools they use, how they’re connected, and where they’re coming from is key to staying ahead in cybersecurity. In this blog, we'll walk through a project that uses honeypot data and public threat intelligence to build a graph-based system that helps visualize and analyze cyber threats in a more intuitive way.
Have you ever wondered which countries launch the most cyber attacks and what techniques they use? With a single query, our graph database reveals the top three attacking nations along with their preferred tactics.
This is just one example of the insights our system can provide. Explore the full interactive interface at https://cti-graph-db.vercel.app/ to dive deeper into the data.
Project Objective: Building a Tool to Explore the Connections Between Cyber Threats and Attacker Behavior
The aim of this project was to create a tool that helps cybersecurity experts understand and analyze the relationships between different cyber threats and the behavior of the attackers behind them.
Why graphs? In a graph model, links between records are stored directly as relationships rather than reconstructed through joins, as they would be in a traditional relational database. This makes graphs a natural fit for exploring how different pieces of data are connected.
To make this approach accessible, we also developed a user-friendly interface that allows users to query the graph without needing prior knowledge of Cypher, Neo4j’s native query language.
Here is a picture of what our visual interface looks like. Feel free to explore it further and try out your own queries by visiting the link: https://cti-graph-db.vercel.app
And this is an example of a query result that answers the question: Which techniques are being used against different industries, and how many attacks have they experienced?
From Raw IPs to Enriched Intelligence
Data Source: StratosphereIPS Honeypots
Our starting point is data from the StratosphereIPS blocklist generation project. From this, we gathered a list of more than 20,000 malicious IP addresses belonging to attackers who interacted with the project's honeypots. However, this raw data lacked contextual depth.
Enriching Data: OTX AlienVault
To derive meaningful insights, we enriched these IPs using the AlienVault Open Threat Exchange (OTX) platform. OTX is a collaborative repository of threat intelligence “pulses”. An OTX pulse consists of one or more indicators of compromise (IOCs) that constitute a threat or define a sequence of actions that could be used to carry out attacks on networks, devices and computers. OTX pulses also provide information on the reliability of threat information, who reported a threat, and other details of threat investigations.
We built a Python script that queries the OTX API for each IP, retrieving data such as origin and targeted country, attack techniques, attacked technology, industry, etc.
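The enrichment step described above can be sketched roughly as follows. This is a minimal illustration, not the script we actually used: it assumes the OTX `indicators/IPv4/{ip}/general` endpoint with an `X-OTX-API-KEY` header, and it assumes the response exposes `country_name` and a `pulse_info.pulses` list whose items carry `id`, `tags`, `targeted_countries`, `industries`, and `attack_ids`. The function names `fetch_ip_general` and `summarize` are hypothetical.

```python
import json
import urllib.request

OTX_BASE = "https://otx.alienvault.com/api/v1"


def fetch_ip_general(ip: str, api_key: str, timeout: float = 30.0) -> dict:
    """Fetch the 'general' section for one IPv4 indicator from OTX."""
    req = urllib.request.Request(
        f"{OTX_BASE}/indicators/IPv4/{ip}/general",
        headers={"X-OTX-API-KEY": api_key},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def summarize(ip: str, general: dict) -> dict:
    """Flatten the pulse data attached to an IP into a single record.

    The field names read here are assumptions about the OTX response shape.
    """
    pulses = (general.get("pulse_info") or {}).get("pulses") or []
    record = {
        "ip": ip,
        "origin_country": general.get("country_name"),
        "pulse_ids": [p.get("id") for p in pulses],
        "targeted_countries": set(),
        "industries": set(),
        "attack_ids": set(),
        "tags": set(),
    }
    for p in pulses:
        record["targeted_countries"].update(p.get("targeted_countries") or [])
        record["industries"].update(p.get("industries") or [])
        record["tags"].update(p.get("tags") or [])
        # Technique entries may be dicts with an "id" key or plain strings.
        for a in p.get("attack_ids") or []:
            record["attack_ids"].add(a["id"] if isinstance(a, dict) else a)
    return record
```

In practice, a loop over the IP list would call `fetch_ip_general` for each address, store the raw JSON, and pass it to `summarize`. Keeping the network call separate from the parsing makes it possible to re-run the parsing later without querying the API again.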
Modeling Threats: The Graph Structure
We experimented with several iterations before finalizing a practical data model.
This data model revolves around three core entities: Pulse, IP Address, and Country.
Pulse nodes represent threat reports aggregated from OTX. Each pulse groups together multiple indicators of compromise (IOCs), such as malicious IPs, botnet identifiers, and attack techniques. Pulses are uniquely identified by their id and often include descriptive tags.
IP Address nodes capture the malicious IPs observed through honeypots. Each has attributes like the IP string itself (address), and the name of the botnet it belongs to (which in many cases is an unknown value).
Country nodes indicate the geolocation from which a given IP appears to originate. These help contextualize where threats are coming from, although it's important to remember this data may reflect proxy usage or spoofed origins.
The graph’s most critical relationship is ATTACKS, which connects an IP node to a Country node. This link is more than geographical: it serves as the anchor for additional connected data. Through it, we can traverse to related attack techniques, targeted industries, protocols, and targeted technologies, enabling deeper behavioral analysis of the threat.
Earlier models (see Figure 2) introduced abstract concepts such as a central 'THREAT' class, but they proved impractical and less efficient in terms of performance.
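To make the role of the ATTACKS relationship more concrete, here is a small in-memory sketch in Python. It is only an illustration of the idea, not our Neo4j schema: it assumes that each ATTACKS edge points to the targeted country and carries its context (technique and industry) directly. The `Attack` class, its field names, and the sample values are hypothetical.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attack:
    """One ATTACKS edge, (IP)-[:ATTACKS]->(Country), plus assumed context."""
    ip: str
    target_country: str
    technique: str
    industry: str


def techniques_by_country(edges):
    """Count the techniques seen on ATTACKS edges, grouped by target country."""
    counts = defaultdict(Counter)
    for e in edges:
        counts[e.target_country][e.technique] += 1
    return counts


# Made-up sample edges using documentation-range IP addresses.
edges = [
    Attack("203.0.113.7", "Czechia", "Brute Force", "Education"),
    Attack("203.0.113.7", "Czechia", "Network Service Discovery", "Education"),
    Attack("198.51.100.9", "Czechia", "Brute Force", "Finance"),
    Attack("198.51.100.9", "Argentina", "Brute Force", "Finance"),
]
summary = techniques_by_country(edges)
print(summary["Czechia"].most_common(1))  # [('Brute Force', 2)]
```

Grouping by `industry` instead of `target_country` would give the same kind of view for the industry question shown earlier.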
Data Ingestion Process
Stage 1: IP Collection
Malicious IPs were collected from the StratosphereIPS Blocklist Generation Project, identifying addresses that attempted to communicate with simulated vulnerable systems.
Stage 2: Contextual Enrichment
Each IP was sent through our Python enrichment script, which queried the OTX API and downloaded structured JSON responses.
Stage 3: Data Structuring and Filtering
The data was processed and filtered to extract relevant attributes and discard redundant or ambiguous tags. Tags can be inconsistently structured, sometimes mixing techniques, tools, actors, or arbitrary labels. To address this, we decided to manually classify the 200 most frequently occurring tags.
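The tag filtering step can be approximated like this. It is a sketch under assumptions: it expects each enriched record to hold a `tags` list, it normalizes tags by trimming whitespace and lowercasing, and the `MANUAL_LABELS` mapping contains only a few illustrative entries rather than our real classification of 200 tags.

```python
from collections import Counter


def top_tags(records, n=200):
    """Return the n most frequent tags after trimming and lowercasing."""
    counts = Counter(
        tag.strip().lower()
        for rec in records
        for tag in rec.get("tags", [])
        if tag and tag.strip()
    )
    return counts.most_common(n)


# Illustrative hand-made labels; the real mapping was built manually.
MANUAL_LABELS = {
    "bruteforce": "technique",
    "ssh": "protocol",
    "mirai": "botnet",
}


def classify(tag, labels=MANUAL_LABELS):
    """Look up a tag's category; tags outside the mapping stay unclassified."""
    return labels.get(tag.strip().lower(), "unclassified")
```

Ranking tags by frequency first keeps the manual effort bounded: only the most common tags need a human decision, and everything else is marked as unclassified so it can be dropped or reviewed later.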
Stage 4: Loading into Neo4j
The structured data was loaded into the Neo4j graph database using a Python script that created all the nodes and relationships based on our data model.
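A loading script along these lines could look as follows. This is a hedged sketch rather than our actual loader: the node labels (`IP`, `Country`), the `name` and `technique` properties, the record keys, and the batch size are assumptions made for illustration. It relies on the official `neo4j` Python driver, which is imported inside `load` so that the row-building logic can be run without the driver installed.

```python
# Cypher that creates nodes and ATTACKS relationships idempotently.
# Labels and property names are illustrative assumptions.
MERGE_ATTACKS = """
UNWIND $rows AS row
MERGE (ip:IP {address: row.address})
MERGE (c:Country {name: row.country})
MERGE (ip)-[:ATTACKS {technique: row.technique}]->(c)
"""


def build_rows(records):
    """Expand enriched records into one row per (IP, country, technique)."""
    rows = []
    for rec in records:
        for country in sorted(rec["targeted_countries"]):
            for technique in sorted(rec["attack_ids"]):
                rows.append(
                    {"address": rec["ip"], "country": country, "technique": technique}
                )
    return rows


def _write_batch(tx, batch):
    """Run the MERGE query for one batch inside a managed transaction."""
    tx.run(MERGE_ATTACKS, rows=batch).consume()


def load(uri, user, password, records, batch_size=1000):
    """Send all rows to Neo4j in batches."""
    from neo4j import GraphDatabase  # requires the 'neo4j' package

    rows = build_rows(records)
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            for i in range(0, len(rows), batch_size):
                session.execute_write(_write_batch, rows[i:i + batch_size])
```

Using MERGE instead of CREATE means the script can be re-run without duplicating nodes, and passing rows through UNWIND avoids one round trip per IP.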
Final Thoughts and Future Directions
By integrating honeypot data, public threat intelligence, and a graph-based architecture, our project offers a scalable and intuitive solution for CTI enrichment and analysis. It enables cybersecurity professionals to visualize complex threat relationships and uncover hidden patterns.
Future directions may include:
Expanding enrichment sources beyond OTX
Integrating temporality for dynamic threat evolution tracking
Improving tag normalization and classification with NLP techniques
Creating an interface for the user to query information using natural language
This project demonstrates how graph databases can become powerful allies in the CTI workflow, offering a clearer, more interconnected view of the ever-evolving threat landscape.
REFERENCES
OTX (Open Threat Exchange) by AlienVault: AlienVault OTX is a collaborative threat intelligence platform that allows security researchers to share and access indicators of compromise (IoCs) in real time.
Website: https://otx.alienvault.com
MITRE ATT&CK (Adversarial Tactics, Techniques, and Common Knowledge): MITRE ATT&CK is a globally accessible knowledge base of adversary tactics and techniques based on real-world observations. It is used as a foundation for the development of threat models and methodologies.
Website: https://attack.mitre.org