Neo4j Graph RAG in Practice: How to Auto-Extract Entities and Relationships from Articles into a Knowledge Graph

This article explains how a standard article can be automatically transformed into a Neo4j knowledge graph with an LLM. The core workflow covers entity extraction, relationship modeling, Cypher generation, and batch ingestion, solving the challenge of loading unstructured text into a graph database.

The technical specification snapshot clarifies the system design

Target System: Neo4j graph database
Core Languages: Python, Cypher
Processing Paradigm: LLM-based information extraction + Graph RAG
Transport Protocol: Bolt (bolt://localhost:7687)
Input Data: Unstructured articles, news, paragraph text
Output Data: Nodes, relationships, properties, Cypher statements
Core Dependencies: neo4j Python driver, a large language model
Article Source: CNBlogs (original star count not provided)

This pipeline fundamentally converts unstructured text into a graph that supports reasoning

Sending a standard article into a graph database is not a simple tokenization task, and it is not a traditional SQL mapping problem either. The core challenge is that text formats are inconsistent, while graph databases require sufficiently stable node types, relationship types, and property structures.

This is where Graph RAG becomes valuable. It converts textual context into graph structure, so retrieval does not rely only on similarity. It can also expand context along relationships, which makes it well suited for connected knowledge scenarios involving people, organizations, locations, and events.

Article ingestion into a graph database usually follows four steps

  1. Extract entities such as people, companies, locations, and time.
  2. Extract relationships between entities.
  3. Assemble node and relationship structures.
  4. Generate Cypher and write the data into Neo4j.

Raw article -> LLM extraction -> Structured JSON result -> Cypher generation -> Neo4j knowledge graph

This flow represents the shortest practical engineering path from text to graph.

A very short news snippet is enough to explain the entity and relationship extraction mechanism

Consider the following sample text: Zhang San joined ByteDance in 2023 as an algorithm engineer. Li Si is Zhang San’s colleague, and both of them work in Beijing.

In the graph model, the first step is to determine the nodes. Here, there are at least four entities: Zhang San, Li Si, ByteDance, and Beijing. The second step is to identify the relationships, such as employment, colleague, and work location. The final step is to add properties such as job title and joining year.

{
  "nodes": [
    {"label": "Person", "name": "张三", "attributes": {"job": "算法工程师", "join_year": 2023}},
    {"label": "Person", "name": "李四", "attributes": {}},
    {"label": "Company", "name": "字节跳动", "attributes": {}},
    {"label": "Location", "name": "北京", "attributes": {}}
  ],
  "relations": [
    {"from": "张三", "to": "字节跳动", "type": "WORKS_AT"},
    {"from": "张三", "to": "李四", "type": "COLLEAGUE_OF"},
    {"from": "张三", "to": "北京", "type": "WORKS_IN"},
    {"from": "李四", "to": "北京", "type": "WORKS_IN"}
  ]
}

This JSON is the most critical intermediate representation before graph ingestion.
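Because this JSON is the contract between the LLM and the database, it is worth validating it before generating any Cypher. A minimal sketch using only the standard library (the allowlists and the function name are illustrative, not from the original article):

```python
ALLOWED_LABELS = {"Person", "Company", "Location", "Organization"}
ALLOWED_RELATIONS = {"WORKS_AT", "COLLEAGUE_OF", "WORKS_IN", "FOUNDER_OF"}

def validate_extraction(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the JSON is ingestible."""
    errors = []
    names = set()
    for node in data.get("nodes", []):
        if node.get("label") not in ALLOWED_LABELS:
            errors.append(f"unknown label: {node.get('label')}")
        if not node.get("name"):
            errors.append("node without a name")
        else:
            names.add(node["name"])
    for rel in data.get("relations", []):
        if rel.get("type") not in ALLOWED_RELATIONS:
            errors.append(f"unknown relation type: {rel.get('type')}")
        # Every endpoint must refer to an extracted node.
        for end in ("from", "to"):
            if rel.get(end) not in names:
                errors.append(f"dangling endpoint: {rel.get(end)}")
    return errors
```

Running this on the sample JSON above should return an empty list; a hallucinated relationship type or a relation pointing at a node the model forgot to extract shows up as an explicit error instead of a silent bad write.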

Using MERGE instead of CREATE satisfies production-grade write requirements

If you use CREATE, repeated execution will keep inserting duplicate nodes. Graph RAG scenarios usually require continuous incremental writes, so MERGE is the safer option: create the data when it does not exist, and reuse it when it does.

MERGE (p1:Person {name: "张三"})
SET p1.job = "算法工程师", p1.join_year = 2023
MERGE (p2:Person {name: "李四"})
MERGE (c:Company {name: "字节跳动"})
MERGE (l:Location {name: "北京"})

MERGE (p1)-[:WORKS_AT]->(c)
MERGE (p1)-[:COLLEAGUE_OF]->(p2)
MERGE (p1)-[:WORKS_IN]->(l)
MERGE (p2)-[:WORKS_IN]->(l)

This Cypher creates nodes and relationships idempotently and prevents duplicate data from polluting the graph.
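The idempotency of MERGE can be illustrated without a database: merging is a get-or-create keyed on the identifying property, so running it twice changes nothing. A toy in-memory sketch (the dict-based store is purely illustrative, not part of the pipeline):

```python
def merge_node(store: dict, label: str, name: str, **props) -> dict:
    """Get-or-create a node keyed by (label, name), like Cypher's MERGE ... SET."""
    node = store.setdefault((label, name), {"label": label, "name": name})
    node.update(props)  # behaves like SET n += $attrs
    return node

store = {}
merge_node(store, "Person", "张三", job="算法工程师")
merge_node(store, "Person", "张三", job="算法工程师")  # re-run: no duplicate
assert len(store) == 1
```

With CREATE semantics, the second call would append a second node with the same name, which is exactly the duplication problem MERGE avoids.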

LLM prompt design determines whether extraction results are ingestible

Articles are not manually converted into graphs. Instead, an LLM performs the transformation automatically according to constrained prompts. The key point is not whether the model can extract information, but whether it can produce a stable output in the required format. The most common approach is to constrain entity types, relationship types, and the output JSON schema.

You are an expert in knowledge graph extraction. Please extract from the text:
1. Entities (types: Person, Company, Location, Organization)
2. Relationships (only use: WORKS_AT, COLLEAGUE_OF, WORKS_IN, FOUNDER_OF)
Output must be strict JSON.

This kind of prompt narrows an open-ended understanding task into an executable data extraction task.
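Even with a strict prompt, models sometimes wrap the JSON in a markdown fence or add a sentence of preamble, so a tolerant parser in front of the validation step pays off. A minimal sketch using only the standard library (the function name is an illustrative assumption):

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Extract the first JSON object from an LLM reply, tolerating code fences."""
    # Strip a ```json ... ``` fence if the model added one.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        raw = fenced.group(1)
    # Fall back to the outermost braces if there is leading prose.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start:end + 1])
```

If json.loads still fails, the usual recovery strategy is to re-prompt the model with the parse error rather than attempt to repair the string locally.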

Python can automatically write structured results into Neo4j

Once you have the JSON, the program only needs to iterate through the nodes and relationships, assemble Cypher, and execute it through the Neo4j driver. Two important notes for real projects: prefer parameterized queries for values, to avoid the injection and escaping risks of string concatenation, and remember that Cypher cannot parameterize labels or relationship types, so those must be validated against an allowlist before being interpolated into the query text.

from neo4j import GraphDatabase

# Connect to Neo4j
uri = "bolt://localhost:7687"
user = "neo4j"
password = "your-password"
driver = GraphDatabase.driver(uri, auth=(user, password))

data = {
    "nodes": [
        {"label": "Person", "name": "张三", "attributes": {"job": "算法工程师", "join_year": 2023}},
        {"label": "Person", "name": "李四", "attributes": {}},
        {"label": "Company", "name": "字节跳动", "attributes": {}},
        {"label": "Location", "name": "北京", "attributes": {}}
    ],
    "relations": [
        {"from": "张三", "to": "字节跳动", "type": "WORKS_AT"},
        {"from": "张三", "to": "李四", "type": "COLLEAGUE_OF"},
        {"from": "张三", "to": "北京", "type": "WORKS_IN"},
        {"from": "李四", "to": "北京", "type": "WORKS_IN"}
    ]
}

with driver.session() as session:
    for node in data["nodes"]:
        # MERGE on name to avoid duplicate nodes. The label is interpolated
        # into the query because Cypher cannot parameterize labels, so it
        # should be checked against an allowlist first.
        session.run(
            f"MERGE (n:{node['label']} {{name: $name}}) SET n += $attrs",
            name=node["name"],
            attrs=node["attributes"]
        )

    for rel in data["relations"]:
        # MATCH both endpoints by name, then MERGE the relationship so
        # re-running the script never creates duplicate edges. The
        # relationship type needs the same allowlist check as the labels.
        session.run(
            f"""
            MATCH (a {{name: $from_name}}), (b {{name: $to_name}})
            MERGE (a)-[:{rel['type']}]->(b)
            """,
            from_name=rel["from"],
            to_name=rel["to"]
        )

driver.close()

This code automates the full ingestion path from JSON into a Neo4j graph.
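Because labels and relationship types cannot be passed as query parameters, the f-strings above are the one place where the allowlist must be enforced in code. A sketch of that guard (the helper names are illustrative, not from the original article):

```python
ALLOWED_LABELS = {"Person", "Company", "Location", "Organization"}
ALLOWED_RELATIONS = {"WORKS_AT", "COLLEAGUE_OF", "WORKS_IN", "FOUNDER_OF"}

def node_query(label: str) -> str:
    """Build the node MERGE statement, rejecting any label outside the ontology."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label not in allowlist: {label}")
    return f"MERGE (n:{label} {{name: $name}}) SET n += $attrs"

def relation_query(rel_type: str) -> str:
    """Build the relationship MERGE, rejecting unknown relationship types."""
    if rel_type not in ALLOWED_RELATIONS:
        raise ValueError(f"relation type not in allowlist: {rel_type}")
    return ("MATCH (a {name: $from_name}), (b {name: $to_name}) "
            f"MERGE (a)-[:{rel_type}]->(b)")
```

These helpers drop into the loop above as session.run(node_query(node["label"]), ...), turning an injection attempt or an LLM-invented relationship type into a hard error before anything reaches the database.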

The graph visualization directly shows the advantage of Graph RAG

After the data is written, Neo4j displays nodes such as people, companies, and locations, connected by directed relationships. The value here is not just that it looks like a graph. The real advantage is that downstream question-answering systems can expand context and perform semantic reasoning along graph relationships.

[Image] Neo4j graph visualization: nodes appear under different labels, and entities are connected through directed edges. This view makes it easy to verify whether the LLM extraction is complete, whether relationship directions are correct, and whether isolated nodes or duplicate entities exist.

You can also let the large language model generate Cypher directly

A more aggressive approach is to skip the JSON intermediate layer and ask the LLM to output Cypher directly. This shortens the implementation path, but it also weakens controllability because you lose the intermediate validation layer and make format auditing and error correction more difficult.

You are an expert in knowledge graph extraction. Please extract entities and relationships from the text and directly generate Neo4j Cypher statements.

This method is better suited for prototype validation than for production scenarios that require high stability.

This solution is better suited for knowledge-graph-enhanced retrieval than standard full-text search

If your goal is only keyword recall, a full-text index is already enough. But if you want the system to understand who is related to whom, where a person works, or which people are connected through the same location, graph-based modeling can significantly improve retrieval depth.
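For example, the question "which people are connected through the same location" becomes a two-hop graph pattern rather than a keyword search. A sketch against the schema built above:

```
MATCH (a:Person)-[:WORKS_IN]->(l:Location)<-[:WORKS_IN]-(b:Person)
WHERE a <> b
RETURN a.name, b.name, l.name
```

On the sample data, this returns the 张三/李四 pair linked through 北京, a result that no full-text index over the original sentence can produce directly.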

Therefore, the engineering focus of this pipeline is not Neo4j itself, but three other things: prompt constraints, schema design, and idempotent writes. As long as these three layers remain stable, batch article-to-graph ingestion becomes a reusable pipeline.

FAQ

Q: Why is Graph RAG better than standard vector retrieval for people-and-relationship questions?

A: Because Graph RAG does not rely only on semantic similarity. It can also expand context along explicit relationship edges, which makes it especially suitable for organizational structures, event propagation, and entity relationship analysis.

Q: Is it reliable to let an LLM generate Cypher directly in production?

A: Usually, no. A safer approach is to output JSON first, then perform schema validation, field cleaning, and parameterized conversion to reduce the risk of hallucinations and invalid relationship types.

Q: What is the most common pitfall when loading articles into Neo4j?

A: There are three main categories: failed entity deduplication, inconsistent relationship types, and non-idempotent property writes. The standard solution is to unify the ontology, prefer MERGE, and add a validation layer before ingestion.

Core summary: This article reconstructs an engineering pipeline from natural-language articles to a Neo4j knowledge graph: an LLM extracts entities and relationships, generates structured JSON, converts it into Cypher, and ingests it in batches with MERGE. This approach fits Graph RAG, semantic retrieval, and context-aware reasoning scenarios.