By combining knowledge graphs with RAG, GraphRAG addresses common challenges of large language models (LLMs), such as hallucinations, while enriching responses with domain-specific context for better quality and precision than traditional RAG methods. Knowledge graphs provide essential contextual data, enabling LLMs to deliver reliable answers and act as trusted agents in complex tasks. Unlike conventional RAG solutions that focus on fragmented textual data, GraphRAG integrates both structured and semi-structured data into the retrieval process. In this article, we’ll cover GraphRAG with Neo4j and Python.
With the GraphRAG Python package, you can create knowledge graphs and implement advanced retrieval methods, including graph traversals, query generation via text-to-Cypher, vector searches, and full-text searches. The package also includes tools for building full RAG pipelines, enabling seamless integration of GraphRAG with Neo4j into GenAI workflows and applications.
Key Components of the GraphRAG Knowledge Graph Construction Pipeline
The GraphRAG knowledge graph (KG) construction pipeline consists of several components, each essential in transforming raw text into structured data for enhanced Retrieval-Augmented Generation (RAG). These components work together to enable advanced retrieval methods like graph-based searches and context-aware responses. Below are the core components:
- Document Parser: Extracts text from various document formats (e.g., PDFs).
- Document Chunker: Splits the text into smaller pieces that fit within the LLM’s token limit.
- Chunk Embedder (Optional): Computes vector embeddings for each chunk, enabling semantic matching.
- Schema Builder: Defines the structure of the KG, grounding entity extraction and ensuring consistency.
- LexicalGraphBuilder (Optional): Builds a lexical graph connecting documents and chunks.
- Entity and Relation Extractor: Identifies entities (e.g., people, dates) and their relationships.
- Knowledge Graph Writer: Saves the entities and relations to the graph database for retrieval.
- Entity Resolver: Merges duplicate or similar entities into a single node to maintain graph integrity.
These components work together to create a dynamic knowledge graph that powers GraphRAG, enabling more accurate and context-aware responses from LLMs.
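To make the Entity Resolver step more concrete, here is a minimal pure-Python sketch of merging duplicates. It assumes entities are plain dictionaries with `label` and `name` keys and treats two entities as duplicates when both match after trimming and lowercasing. The `resolve_entities` helper and the sample data are hypothetical illustrations, not the package's own resolver.

```python
def resolve_entities(entities):
    """Merge entities that share a label and a case-insensitive name."""
    merged = {}
    for entity in entities:
        # Hypothetical matching rule: same label, same normalized name
        key = (entity["label"], entity["name"].strip().lower())
        if key not in merged:
            merged[key] = {
                "label": entity["label"],
                "name": entity["name"].strip(),
                "sources": [],
            }
        # Keep track of every chunk that mentioned this entity
        merged[key]["sources"].extend(entity.get("sources", []))
    return list(merged.values())


extracted = [
    {"label": "GreenhouseGas", "name": "Carbon Dioxide", "sources": ["chunk-1"]},
    {"label": "GreenhouseGas", "name": "carbon dioxide ", "sources": ["chunk-7"]},
    {"label": "GreenhouseGas", "name": "Methane", "sources": ["chunk-2"]},
]
resolved = resolve_entities(extracted)
for node in resolved:
    print(node["name"], node["sources"])
```

The two spellings of carbon dioxide collapse into a single node that remembers both source chunks, which is the effect the resolver is meant to have on the graph.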
Set Up a Neo4j Database
To begin the RAG workflow, the first step is to set up a database for retrieval. Neo4j AuraDB provides a straightforward way to launch a free graph database. Depending on your requirements, you can opt for AuraDB Free for basic use or try AuraDB Professional (Pro), which offers more memory and better performance for ingestion and retrieval tasks. While the Pro version is ideal for optimal results thanks to its advanced features, for this project I’ll use Neo4j AuraDB’s free tier. AuraDB is a fully managed cloud service that offers a scalable, high-performance graph database solution. With its free tier, users can easily build and explore graph-based applications, leveraging the relationships between data points for insights and analysis.
After logging into Neo4j AuraDB, you can create a free instance. Once the instance is set up, you can download the required credentials, including the username, Neo4j URL, and password, to connect to your database.
Install the Required Libraries
We will install several libraries using pip, including the neo4j-graphrag package (which pulls in Neo4j’s Python driver) and OpenAI, to build GraphRAG with Neo4j and Python. This is an essential step for setting up our environment.
!pip install fsspec openai numpy torch neo4j-graphrag
Set Up Connection Details for Neo4j
NEO4J_URI = ""
username = ""
password = ""
In this section, we define the connection details for Neo4j. Replace the placeholders with your actual Neo4j database credentials:
- NEO4J_URI: URI of your Neo4j instance (e.g., bolt://localhost:7687 for a local instance; AuraDB instances use a neo4j+s:// URI).
- username and password: Your Neo4j authentication credentials.
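Hard-coding credentials in a notebook is easy to leak, so one option is to read them from environment variables instead. The sketch below is only an illustration: the variable names NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD and the `load_neo4j_settings` helper are assumed conventions, not something the package requires, and the demo dictionary stands in for a real environment.

```python
import os


def load_neo4j_settings(env=None):
    """Read connection details from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Assumed variable names; adjust them to your own setup
    uri = env.get("NEO4J_URI", "bolt://localhost:7687")
    user = env.get("NEO4J_USERNAME", "neo4j")
    secret = env.get("NEO4J_PASSWORD", "")
    if not secret:
        raise ValueError("NEO4J_PASSWORD is not set")
    return uri, user, secret


# Demo mapping standing in for real environment variables
demo_env = {
    "NEO4J_URI": "bolt://localhost:7687",
    "NEO4J_USERNAME": "neo4j",
    "NEO4J_PASSWORD": "example-password",
}
NEO4J_URI, username, password = load_neo4j_settings(demo_env)
print(NEO4J_URI, username)
```

Calling `load_neo4j_settings()` with no argument would read from the real environment, and failing early on a missing password avoids a less obvious authentication error later.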
Set OpenAI API Key
import os
os.environ['OPENAI_API_KEY'] = ''
Here, we set the OpenAI API key using os.environ. This allows us to use OpenAI’s models for entity extraction in the knowledge graph.
1. Building and Defining the Knowledge Graph Pipeline
To demonstrate GraphRAG with Neo4j and Python, we’ll transform research papers on the greenhouse effect into a structured knowledge graph and store it in a Neo4j database. Using a collection of PDF documents focused on greenhouse effect studies, we’ll organize the domain-specific data these documents contain into a graph that enhances AI-driven applications. This approach allows for better structuring and retrieval of complex scientific information.
The knowledge graph will include key node types:
- Document: Captures metadata related to the document sources.
- Chunk: Represents text segments from the documents, embedded with vector representations for efficient retrieval.
- Entity: Extracted entities from the text chunks, providing structured context and connections.
To automate the creation of this knowledge graph, we use the SimpleKGPipeline class. This class enables seamless knowledge graph construction by requiring only a few essential inputs:
- A Neo4j driver to connect to the Neo4j database.
- An LLM (Language Model) for entity extraction.
- An embedding model to convert text into vectors, enabling similarity searches.
By combining the document transformation with an automated pipeline, we can build a comprehensive knowledge graph that efficiently organizes and retrieves insights about the greenhouse effect.
Neo4j Driver Initialization
import neo4j
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings
driver = neo4j.GraphDatabase.driver(NEO4J_URI, auth=(username, password))
Here, we initialize the Neo4j database driver using the NEO4J_URI, username, and password set earlier. We also import the components needed for LLM-based entity extraction (OpenAILLM) and embedding (OpenAIEmbeddings).
Initialize LLM and Embedding Model
llm = OpenAILLM(
    model_name="gpt-4o-mini",
    model_params={"response_format": {"type": "json_object"}, "temperature": 0},
)
embedder = OpenAIEmbeddings()
We’ve initialized the LLM (OpenAILLM) for entity extraction and set parameters such as the model name (gpt-4o-mini) and the response format. The embedder is initialized with OpenAIEmbeddings, which will be used to convert text chunks into vectors for similarity search.
Setting Node Labels
Let’s define different categories of nodes based on our use case:
basic_node_labels = ["Object", "Entity", "Group", "Person", "Organization", "Place"]
academic_node_labels = ["ArticleOrPaper", "PublicationOrJournal"]
climate_change_node_labels = ["GreenhouseGas", "TemperatureRise", "ClimateModel", "CarbonFootprint", "EnergySource"]
node_labels = basic_node_labels + academic_node_labels + climate_change_node_labels
Here, we’ve grouped our node labels into:
- Basic node labels: Generic entity types such as “Person”, “Organization”, and so on.
- Academic node labels: Related to academic publications like articles or journals.
- Climate change node labels: Specific to climate change-related entities.
These labels will help categorize entities within your knowledge graph.
Defining Relationship Types
rel_types = ["AFFECTS", "CAUSES", "ASSOCIATED_WITH", "DESCRIBES", "PREDICTS", "IMPACTS"]
We’ve defined the possible relationships between nodes in the graph. These relationships describe how entities interact or are connected.
Creating the Prompt Template
prompt_template = '''
You are a climate researcher tasked with extracting information from research papers and structuring it in a property graph.
Extract the entities (nodes) and specify their type from the following text.
Also extract the relationships between these nodes.
Return the result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "entity type", "properties": {{"name": "entity name"}} }} ],
"relationships": [{{"type": "RELATIONSHIP_TYPE", "start_node_id": "0", "end_node_id": "1", "properties": {{"details": "Relationship details"}} }}] }}
Input text:
{text}
'''
Here, we defined a prompt template for the LLM. The model will be given a text (from a research paper), and it needs to extract:
- Entities (nodes): These are identified by type (e.g., Person, Organization) and their properties (e.g., name).
- Relationships: The LLM will identify how the entities are related (e.g., “CAUSES”, “ASSOCIATED_WITH”).
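The doubled braces in the prompt template are easy to misread, so here is a small sketch of what happens when such a template is filled in. It assumes the placeholder is substituted with Python's str.format-style formatting, in which doubled braces are escapes that come out as single literal braces. The shortened template and the sample sentence are illustrative only.

```python
import json

# Shortened stand-in for the extraction prompt; doubled braces are escapes
template = '''Return the result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "entity type", "properties": {{"name": "entity name"}} }} ]}}
Input text:
{text}
'''

# Only the single-brace placeholder is replaced
rendered = template.format(text="Carbon dioxide traps heat in the atmosphere.")
print(rendered)

# After formatting, the example line is valid JSON with single braces
example_line = rendered.splitlines()[1]
parsed = json.loads(example_line)
print(parsed["nodes"][0]["properties"]["name"])
```

If the JSON example in the template used single braces, str.format would try to treat them as placeholders and fail, which is why the escaping matters.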
Create the Knowledge Graph Pipeline
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
Here, we’re importing the required classes:
- FixedSizeSplitter: This will help split large text (from PDFs) into smaller chunks.
- SimpleKGPipeline: This is the main class for building your knowledge graph.
Building the Knowledge Graph Pipeline
kg_builder_pdf = SimpleKGPipeline(
    llm=llm,
    driver=driver,
    text_splitter=FixedSizeSplitter(chunk_size=500, chunk_overlap=100),
    embedder=embedder,
    entities=node_labels,
    relations=rel_types,
    prompt_template=prompt_template,
    from_pdf=True
)
- llm: Language model used for entity extraction (the OpenAI LLM initialized earlier).
- driver: The Neo4j driver that connects to your Neo4j instance.
- text_splitter: FixedSizeSplitter breaks the text from the PDFs into chunks of 500 characters with an overlap of 100 characters (the splitter counts characters, not tokens).
- embedder: Embedding model used to convert the text chunks into vector embeddings.
- entities: Specifies the node labels that define the entities in your knowledge graph.
- relations: Specifies the relationship types that connect the nodes in the graph.
- prompt_template: The template for instructing the LLM to extract nodes and relationships.
- from_pdf=True: Tells the pipeline to extract data from PDF files.
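To see what a chunk size and overlap do in practice, the following sketch splits a short string into fixed-size, overlapping pieces. It is a simplified character-based illustration; the `fixed_size_chunks` function is hypothetical and the package's splitter may differ in its details.

```python
def fixed_size_chunks(text, chunk_size, chunk_overlap):
    """Split text into character chunks where neighbours share an overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = fixed_size_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one of the two chunks when the overlap is large enough.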
Processing PDFs
Here, we use three different research papers on the greenhouse effect:
pdf_file_paths = ['/home/janvi/Downloads/ToxipediaGreenhouseEffectArchive.pdf',
                  '/home/janvi/Downloads/3.1.pdf',
                  '/home/janvi/Downloads/Shell_Climate_1988.pdf']
for path in pdf_file_paths:
    print(f"Processing: {path}")
    pdf_result = await kg_builder_pdf.run_async(file_path=path)
    print(f"Result: {pdf_result}")
This loop processes the three PDF files one at a time and feeds them into the SimpleKGPipeline. It awaits run_async for each document and prints the result. Note that a top-level await works in a notebook; in a regular script, the loop must run inside an async function.
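If you run the ingestion from a plain Python script rather than a notebook, the awaits need an event loop. The sketch below shows one way to structure that with asyncio.run. The `fake_pipeline_run` coroutine is a hypothetical stand-in for the pipeline's run_async call so that the example runs without a database, and the file names are placeholders.

```python
import asyncio


async def fake_pipeline_run(file_path):
    # Stand-in for the real pipeline call; it only simulates async work
    await asyncio.sleep(0)
    return f"processed {file_path}"


async def process_all(paths):
    results = []
    for path in paths:
        # Each document is awaited in turn, as in the loop above
        results.append(await fake_pipeline_run(path))
    return results


# asyncio.run starts an event loop, runs the coroutine, and closes the loop
results = asyncio.run(process_all(["a.pdf", "b.pdf"]))
print(results)
```

In a real script you would replace the stand-in with the pipeline call and keep the same overall shape.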
Once complete, you can explore the resulting knowledge graph. The Unified Console provides a great interface for this.
Go to the Query tab and enter the query below to see a sample of the graph.
MATCH p=()-->() RETURN p LIMIT 100;
You can see how the Document, Chunk, and __Entity__ nodes are all connected together.
To see the “lexical” portion of the graph containing Document and Chunk nodes, run the following.
MATCH p=(:Chunk)--(:!__Entity__) RETURN p;
Note that these are disconnected components, one for each document we ingested. You can also see the embeddings that have been added to all chunks.
To look at just the domain graph of __Entity__ nodes, you can run the following:
MATCH p=(:!Chunk)-->(:!Chunk) RETURN p;
You will see how different concepts have been extracted and how they connect to one another. This domain graph connects information between the documents.
2. Retrieving Data From Your Knowledge Graph
Once the knowledge graph for greenhouse effect research is built, the next step involves retrieving meaningful information to support analysis. The GraphRAG Python package provides flexible retrieval mechanisms tailored to your needs. These include:
- Vector Retriever: Conducts similarity searches using vector embeddings for efficient data retrieval.
- Vector Cypher Retriever: Combines vector search with Cypher, Neo4j’s graph query language, enabling graph traversal to include related nodes and relationships in the retrieval.
- Hybrid Retriever: Merges vector and full-text search for comprehensive data retrieval.
- Hybrid Cypher Retriever: Combines hybrid search with Cypher queries for advanced graph traversal.
- Text2Cypher: Converts natural language queries into Cypher queries, enabling users to retrieve data directly from Neo4j without manual query writing.
- Weaviate & Pinecone Neo4j Retriever: Integrates vector searches from external systems like Weaviate or Pinecone with Neo4j nodes using external ID properties.
- Custom Retriever: Provides flexibility for implementing tailored retrieval methods for specific needs.
These retrieval mechanisms support diverse retrieval patterns, improving the relevance and accuracy of retrieval-augmented generation (RAG) pipelines.
Vector Retriever and Knowledge Graph Retrieval
For our greenhouse effect research knowledge graph, we use the Vector Retriever, which relies on Approximate Nearest Neighbor (ANN) vector search. It retrieves data by performing similarity searches on the embeddings associated with the text chunks stored in the graph.
Setting Up a Vector Index
To enable vector-based retrieval, we create a vector index in Neo4j. This index operates on the text chunks in the graph, allowing the Vector Retriever to pull back the most relevant passages.
By combining Neo4j’s vector search capabilities with these retrieval methods, we can query the knowledge graph to extract valuable information about the causes, effects, and solutions related to the greenhouse effect.
from neo4j_graphrag.indexes import create_vector_index
create_vector_index(driver, name="text_embeddings", label="Chunk",
                    embedding_property="embedding", dimensions=1536, similarity_fn="cosine")
create_vector_index: This function creates a vector index on the Chunk label in Neo4j. The embeddings (generated from the PDF text) are stored in the embedding property of each Chunk node. The index uses cosine similarity, and the embeddings have 1536 dimensions, which matches the output size of OpenAI embedding models such as text-embedding-ada-002.
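To build intuition for what a cosine-similarity index ranks, here is a small brute-force sketch in pure Python. It uses made-up three-dimensional vectors instead of real 1536-dimensional embeddings, and it scores every chunk exactly, whereas a vector index uses an approximate search to avoid comparing against every node. The `cosine_similarity` and `top_k` helpers are illustrative names, not part of the package.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k):
    """Return the k (score, text) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in chunk_vectors.items()
    ]
    scored.sort(reverse=True)
    return scored[:k]


# Toy embeddings invented for illustration
chunk_vectors = {
    "CO2 traps heat": [1.0, 0.0, 0.0],
    "Methane is a greenhouse gas": [0.8, 0.6, 0.0],
    "Shell published a report": [0.0, 0.0, 1.0],
}
best = top_k([1.0, 0.0, 0.0], chunk_vectors, k=2)
for score, text in best:
    print(round(score, 2), text)
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing text embeddings.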
Using the VectorRetriever
from neo4j_graphrag.retrievers import VectorRetriever
vector_retriever = VectorRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    return_properties=["text"],
)
VectorRetriever: This component queries the Chunk nodes using vector search, which lets us find the most relevant chunks for the input query. The return_properties parameter ensures that the search results include the text of each chunk.
Searching for Information in the Knowledge Graph
import json
vector_res = vector_retriever.get_search_results(
    query_text="What are the main greenhouse gases contributing to the Greenhouse Effect and their impacts as discussed in the documents?",
    top_k=3
)
for i in vector_res.records:
    print("====\n" + json.dumps(i.data(), indent=4))
- get_search_results: This method performs a vector search with the input query (in this case, asking about greenhouse gases and their impacts).
- top_k=3: We’re limiting the number of results to the top 3 most relevant chunks.
- The results are printed as formatted JSON, which includes the relevant text and metadata of the retrieved chunks.
Using the VectorCypherRetriever for Graph Traversal
The VectorCypherRetriever offers a more advanced method of knowledge graph retrieval by combining vector search with Cypher queries. This lets us traverse the graph starting from the chunks that are semantically similar to the query, exploring related entities and their relationships.
Setting Up the VectorCypherRetriever
from neo4j_graphrag.retrievers import VectorCypherRetriever
vc_retriever = VectorCypherRetriever(
    driver,
    index_name="text_embeddings",
    embedder=embedder,
    retrieval_query="""
// 1) Go out 2-3 hops in the entity graph and get relationships
WITH node AS chunk
MATCH (chunk)<-[:FROM_CHUNK]-()-[relList:!FROM_CHUNK]-{1,2}()
UNWIND relList AS rel
// 2) Collect relationships and text chunks
WITH collect(DISTINCT chunk) AS chunks,
  collect(DISTINCT rel) AS rels
// 3) Format and return context
RETURN '=== text ===\n' + apoc.text.join([c in chunks | c.text], '\n---\n') + '\n\n=== kg_rels ===\n' +
  apoc.text.join([r in rels | startNode(r).name + ' - ' + type(r) + '(' + coalesce(r.details, '') + ')' + ' -> ' + endNode(r).name ], '\n---\n') AS info
"""
)
- retrieval_query: This Cypher query defines the logic for traversing the graph. Here, you traverse 2-3 hops away from each chunk and capture the relationships between the entities you encounter.
- Text and Relationship Formatting: The results are formatted to return the chunk text first, followed by the relationships encountered during the traversal.
Running a Query for Relevant Information
vc_res = vc_retriever.get_search_results(
    query_text="What are the causes and consequences of the Greenhouse Effect as discussed in the provided documents?",
    top_k=3
)
- get_search_results: This method performs a vector search based on the input query. It returns the top 3 most relevant chunks and their associated relationships in the knowledge graph.
Extracting and Printing Results
kg_rel_pos = vc_res.records[0]['info'].find('\n\n=== kg_rels ===\n')
# Print the results, separating the text chunk context and the KG context
print("# Text Chunk Context:")
print(vc_res.records[0]['info'][:kg_rel_pos])
print("# KG Context From Relationships:")
print(vc_res.records[0]['info'][kg_rel_pos:])
- kg_rel_pos: This locates where the relationships start in the response.
- The results are then printed, separating the textual context from the relationships found in the knowledge graph.
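The string handling above can be tried without a database. The sketch below builds a sample context string in the same layout (a text section, a relationships section, and '---' separators) and splits it into Python lists. The sample content and the `split_context` helper are made up for illustration.

```python
TEXT_HEADER = "=== text ===\n"
RELS_MARKER = "\n\n=== kg_rels ===\n"
SEPARATOR = "\n---\n"


def split_context(info):
    """Split a formatted context string into text chunks and relationships."""
    pos = info.find(RELS_MARKER)
    if pos == -1:
        # No relationship section found: treat everything as text
        text_part, rels_part = info, ""
    else:
        text_part, rels_part = info[:pos], info[pos + len(RELS_MARKER):]
    if text_part.startswith(TEXT_HEADER):
        text_part = text_part[len(TEXT_HEADER):]
    text_chunks = text_part.split(SEPARATOR) if text_part else []
    relationships = rels_part.split(SEPARATOR) if rels_part else []
    return text_chunks, relationships


# Invented sample in the same layout as the retriever output
sample_info = (
    "=== text ===\n"
    "Chunk one text\n---\nChunk two text"
    "\n\n=== kg_rels ===\n"
    "Carbon dioxide - CAUSES(traps heat) -> Warming\n---\n"
    "Methane - AFFECTS() -> Climate"
)
text_chunks, relationships = split_context(sample_info)
print(text_chunks)
print(relationships)
```

Checking for a missing marker matters because str.find returns -1 when nothing is found, and slicing with -1 would silently drop the last character instead of raising an error.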
3. Constructing a GraphRAG Pipeline
To further enhance the retrieval-augmented generation (RAG) process for our greenhouse effect research, we now integrate both the VectorRetriever and VectorCypherRetriever into a GraphRAG pipeline. This integration lets us retrieve relevant data and use that context to generate responses grounded in the knowledge graph, which improves the accuracy and reliability of the generated answers.
Instantiating and Running GraphRAG
The GraphRAG Python package simplifies instantiating and running RAG pipelines. You can create a GraphRAG pipeline with the GraphRAG class. At its core, the class requires two essential components:
- LLM (Language Model): Responsible for generating natural language responses based on the retrieved context.
- Retriever: Fetches relevant information from the knowledge graph (e.g., using VectorRetriever or VectorCypherRetriever).
Setting Up the GraphRAG Pipeline
from neo4j_graphrag.llm import OpenAILLM as LLM
from neo4j_graphrag.generation import RagTemplate
from neo4j_graphrag.generation.graphrag import GraphRAG
llm = LLM(model_name="gpt-4o", model_params={"temperature": 0.0})
rag_template = RagTemplate(template='''Answer the Question using the following Context. Only respond with information mentioned in the Context. Do not inject any speculative information not mentioned.
# Question:
{query_text}
# Context:
{context}
# Answer:
''', expected_inputs=['query_text', 'context'])
- RagTemplate: The template instructs the LLM to respond only on the basis of the provided context, discouraging speculative answers.
- GraphRAG: The GraphRAG class uses a language model and a retriever to pull in context and answer the query. We create one instance with vector_retriever and another with vc_retriever.
Creating the GraphRAG Pipelines
v_rag = GraphRAG(llm=llm, retriever=vector_retriever, prompt_template=rag_template)
vc_rag = GraphRAG(llm=llm, retriever=vc_retriever, prompt_template=rag_template)
- v_rag: Uses the VectorRetriever to search for relevant text chunks and answer questions.
- vc_rag: Uses the VectorCypherRetriever to both search for relevant text and traverse relationships in the knowledge graph.
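Conceptually, each pipeline retrieves context, fills the prompt template, and passes the prompt to the LLM. The sketch below mimics that flow with stand-ins so it runs offline: `toy_retriever` ranks a tiny in-memory corpus by word overlap and `toy_llm` simply echoes the first context line. Both, along with `rag_search`, are hypothetical helpers and not the package's implementation.

```python
TEMPLATE = """Answer the Question using the following Context.
# Question:
{query_text}
# Context:
{context}
# Answer:
"""

CORPUS = [
    "Carbon dioxide is a greenhouse gas.",
    "Shell published a climate report in 1988.",
    "Methane is a potent greenhouse gas.",
]


def toy_retriever(query_text, top_k):
    # Rank documents by how many words they share with the query
    words = set(query_text.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def toy_llm(prompt):
    # Stand-in for a model call: return the first line of the context
    return prompt.split("# Context:\n")[1].split("\n")[0]


def rag_search(query_text, retriever, llm, top_k=2):
    context = "\n".join(retriever(query_text, top_k))
    prompt = TEMPLATE.format(query_text=query_text, context=context)
    return llm(prompt), prompt


answer, prompt = rag_search("Which gas is a greenhouse gas?", toy_retriever, toy_llm)
print(answer)
```

Swapping the retriever while keeping the template and LLM fixed is what makes the two pipelines directly comparable in the queries that follow.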
Now we’ll execute queries using both the VectorRetriever and VectorCypherRetriever through the GraphRAG pipeline to retrieve context and generate answers from the knowledge graph. Here’s a breakdown of the code:
Query 1: “List the causes, effects, and solutions for the Greenhouse Effect.” This query compares the answers provided by the vector-based retrieval and the vector + Cypher graph traversal methods:
q = "List the causes, effects, and solutions for the Greenhouse Effect."
print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k':5}).answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k':5}).answer}")
Query 2: “Explain the Greenhouse Effect in detail. Include its natural process, human-induced causes, global warming impacts, and climate change effects as discussed in the provided documents.” Here, we’re asking for a more detailed explanation. The return_context=True flag returns the retrieved context along with the answer:
q = "Explain the Greenhouse Effect in detail. Include its natural process, human-induced causes, impacts on global warming, and its effects on climate change as discussed in the provided documents."
v_rag_result = v_rag.search(q, retriever_config={'top_k': 5}, return_context=True)
vc_rag_result = vc_rag.search(q, retriever_config={'top_k': 5}, return_context=True)
print(f"Vector Response: \n{v_rag_result.answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag_result.answer}")
Exploring Retrieved Content: After getting the context results, we print and parse the contents from the vector and Cypher retrievers:
import ast
for i in v_rag_result.retriever_result.items:
    # literal_eval parses the dict-style string without executing arbitrary code
    print(json.dumps(ast.literal_eval(i.content), indent=1))
For the vc_rag_result, we split the content and filter for any text containing the keyword “deal with”:
vc_ls = vc_rag_result.retriever_result.items[0].content.split('\n---\n')
for i in vc_ls:
    if "deal with" in i:
        print(i)
Query 3: “Can you summarize the Greenhouse Effect?” Finally, we ask for a summary in list format. As with the earlier queries, we retrieve the results and print the answers:
q = "Can you summarize the Greenhouse Effect? Include its natural process, greenhouse gases involved, impacts on the environment and human health, and challenges in addressing climate change. Provide in list format with details for each item."
print(f"Vector Response: \n{v_rag.search(q, retriever_config={'top_k': 5}).answer}")
print("\n===========================\n")
print(f"Vector + Cypher Response: \n{vc_rag.search(q, retriever_config={'top_k': 5}).answer}")
Conclusion
This article explored how the GraphRAG Python package for Neo4j can enhance the retrieval-augmented generation (RAG) process by integrating knowledge graphs with large language models (LLMs). We demonstrated how to create a knowledge graph from research documents related to the greenhouse effect and how to store and manage this graph using Neo4j. By defining the knowledge graph pipeline and leveraging retrieval methods such as VectorRetriever and VectorCypherRetriever, we showed how to retrieve relevant information from the graph to generate accurate and contextually relevant responses.
Combining knowledge graphs with RAG helps address common issues such as hallucinations and provides domain-specific context that improves the quality of responses. Additionally, by incorporating multiple retrieval strategies, we improved the accuracy and relevance of the generated content, making it more reliable and useful for answering complex questions related to the greenhouse effect.
Overall, GraphRAG with Neo4j offers a powerful toolset for building knowledge-powered applications that require both accurate data retrieval and natural language generation. Incorporating Neo4j’s graph capabilities helps keep responses contextually grounded and informed by structured and semi-structured data, offering a more robust solution than traditional RAG methods.
Frequently Asked Questions
Ans. GraphRAG is a Python package that combines knowledge graphs with retrieval-augmented generation (RAG) to improve the accuracy and relevance of responses from large language models (LLMs). It retrieves relevant information from knowledge graphs, processes it, and uses it to provide contextually grounded answers to queries. This combination helps mitigate issues like hallucinations, which are common in traditional LLM-based solutions.
Ans. Neo4j is a powerful graph database that efficiently stores and manages relationships between entities, making it an ideal platform for creating knowledge graphs. It supports advanced graph queries using Cypher, which allows for powerful data retrieval and graph traversal. GraphRAG with Neo4j lets you leverage these capabilities to integrate both structured and semi-structured data into your RAG workflows.
Ans. GraphRAG offers several retrievers for various data retrieval patterns:
- Vector Retriever
- Vector Cypher Retriever
- Hybrid Retriever
- Hybrid Cypher Retriever
- Text2Cypher
- Custom Retriever
Ans. GraphRAG addresses the issue of hallucinations by providing LLMs with structured, domain-specific data from knowledge graphs. Instead of relying solely on the language model’s internal knowledge, GraphRAG has the model generate responses based on reliable and relevant information stored in the graph. This makes the responses more accurate and contextually grounded.
Ans. The Hybrid Retriever combines vector search and full-text search to retrieve data more comprehensively. This method allows GraphRAG to pull both semantically similar data and exact keyword matches, improving the accuracy and depth of retrieval. It’s particularly useful when dealing with complex queries that require diverse sources of context.