Beyond Vector Search: Building a Deterministic 3-Tiered Graph-RAG System

The landscape of Artificial Intelligence, particularly in the realm of Natural Language Processing, is constantly evolving. While Retrieval Augmented Generation (RAG) systems have become a cornerstone for enhancing Large Language Models (LLMs) with external knowledge, a significant challenge remains: ensuring factual accuracy and deterministic output, especially concerning granular data points like specific facts, numbers, and entity relationships. Traditional RAG systems, heavily reliant on vector databases for semantic similarity searches, often falter when faced with the need for absolute precision. This limitation can lead to the propagation of misinformation or "hallucinations" within LLM responses. Addressing this critical gap, a novel approach has emerged, proposing a deterministic, multi-tier retrieval system that integrates knowledge graphs with vector databases to achieve unprecedented levels of accuracy and predictability.
This advanced RAG architecture aims to move beyond the "lossy" nature of purely semantic search. By employing a hierarchical retrieval strategy, the system prioritizes data sources based on their inherent trustworthiness and specificity. The core innovation lies in its multi-index, federated design, which leverages the strengths of different data storage paradigms. At its apex is a quad store backend, acting as a knowledge graph for atomic facts. This is complemented by a vector database, serving as a repository for broader, contextual, or long-tail information. Crucially, instead of relying on complex algorithmic routing to determine which data source to query, this system queries all designated sources simultaneously. The retrieved information is then consolidated into the LLM’s context window, and a set of prompt-enforced "fusion rules" are applied. These rules guide the LLM to deterministically resolve any conflicting information, thereby prioritizing absolute facts and striving to eliminate relationship hallucinations. The ultimate goal is to achieve absolute deterministic predictability where it matters most: in the accuracy of atomic facts.
The 3-Tiered Hierarchy: A Foundation for Deterministic Retrieval
The proposed system enforces a strict data hierarchy, segmented into three distinct retrieval tiers, each with a defined level of authority. This layered approach is fundamental to its deterministic nature:
- Priority 1: The Absolute Truth Graph (Quad Store): This tier represents the highest level of factual certainty. It utilizes a quad store, a specialized type of knowledge graph, to store atomic facts in a Subject-Predicate-Object-Context (SPOC) format. This structured data is designed for precise, fast lookups of specific relationships and attributes. The quad store acts as the definitive source of truth, overriding any conflicting information from lower tiers.
- Priority 2: Supplementary Statistics and Broader Context (Quad Store): This tier also employs the quad store but is designated for broader statistical data and less critical contextual information. While still factual, this data might contain abbreviations or be less granular than Priority 1. Its role is to provide background information, but it is explicitly instructed to defer to Priority 1 in cases of direct conflict. This ensures that specific factual claims are not overshadowed by general statistics.
- Priority 3: Fuzzy Contextual Data (Vector Database): This tier comprises unstructured or semi-structured text chunks stored in a vector database. These chunks are indexed using semantic embeddings, allowing for retrieval based on similarity to the user's query. This layer serves as a fallback, providing richer context or information that might not be explicitly captured in the structured knowledge graphs. Its retrieved information is subject to the least authority and is only incorporated if it doesn't contradict higher priority data.

This tiered architecture ensures that when a query is made, the system first consults the most authoritative source (Priority 1). If a definitive answer is found there, it is used exclusively. Only if Priority 1 lacks the necessary information does the system proceed to Priority 2, and subsequently to Priority 3, always guided by the prompt-enforced rules to maintain factual integrity.
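The precedence logic can be sketched as a small routing helper. Note that `tiered_answer` is a hypothetical illustration of the fallback order only; in the actual system all tiers are queried up front and the LLM itself applies this precedence via the prompt rules described later.

```python
def tiered_answer(question, p1_facts, p2_stats, vec_chunks):
    """Return the most authoritative context stream available, in the
    strict priority order described above."""
    if p1_facts:            # Priority 1: absolute truth graph wins outright
        return ("P1", p1_facts)
    if p2_stats:            # Priority 2: background statistics as fallback
        return ("P2", p2_stats)
    if vec_chunks:          # Priority 3: fuzzy vector context, least authority
        return ("P3", vec_chunks)
    return ("NONE", [])     # Nothing found: the LLM must say it lacks info
```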
Environment and Prerequisites: Setting the Stage for Implementation
To replicate and experiment with this advanced RAG system, a robust technical environment is essential. The setup requires a functional Python environment, local access to a served language model, and specific libraries that facilitate the implementation of the quad store and vector database components.
The project utilizes Ollama, a popular platform for running local LLMs, with a model like llama3.2 providing the generative capabilities. For core functionalities, the following Python libraries are indispensable:
- chromadb: A lightweight, open-source vector database that provides efficient similarity search capabilities.
- spacy: A powerful library for advanced Natural Language Processing tasks, including Named Entity Recognition (NER), which is crucial for entity extraction from user prompts.
- requests: A standard library for making HTTP requests, often used for interacting with local LLM APIs.
Installation of these libraries is straightforward via pip:
# Install required libraries
pip install chromadb spacy requests
# Download the spaCy English model
python -m spacy download en_core_web_sm
Beyond these standard libraries, the implementation relies on a custom, lightweight in-memory quad store. This module, developed for simplicity and speed, can be downloaded from its dedicated GitHub repository and integrated as a local Python module. The complete project code, including the quad store implementation and the RAG pipeline, is available on GitHub, providing a comprehensive resource for developers.
Step 1: Building a Lightweight QuadStore (The Graph)
The foundation of the deterministic RAG system is its quad store implementation, which serves as the engine for Priority 1 and Priority 2 data. This custom-built knowledge graph deviates from purely embedding-based approaches, adopting a strict Subject-Predicate-Object-Context (SPOC) schema. Internally, this structure is managed through integer IDs for strings to optimize memory usage, coupled with a four-way dictionary index (spoc, pocs, ocsp, cspo). This indexing scheme allows for constant-time lookups across any dimension of the data, ensuring rapid retrieval of factual assertions.
The decision to use a lightweight, custom quad store rather than a more established graph database like Neo4j or ArangoDB is rooted in practicality. For this specific use case, the custom implementation offers sufficient functionality with enhanced simplicity and speed. It avoids the overhead and learning curve associated with complex graph database APIs, making the system more accessible and easier to understand for developers focused on the RAG pipeline itself.
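The internal layout described above can be sketched as follows. This is an illustrative reimplementation, not the project's actual module: the real store keeps all four permutation indexes (spoc, pocs, ocsp, cspo), while only subject and object indexes are shown here for brevity.

```python
from collections import defaultdict

class MiniQuadStore:
    """Illustrative in-memory quad store: string interning (integer IDs)
    plus per-position indexes for fast lookups."""

    def __init__(self):
        self._ids = {}          # string -> int id (interning saves memory)
        self._strings = []      # int id -> string
        self.quads = set()      # set of (s, p, o, c) id tuples
        self._by_subject = defaultdict(set)
        self._by_object = defaultdict(set)

    def _intern(self, s):
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def add(self, subject, predicate, object, context):
        q = tuple(self._intern(x) for x in (subject, predicate, object, context))
        self.quads.add(q)
        self._by_subject[q[0]].add(q)
        self._by_object[q[2]].add(q)

    def query(self, subject=None, predicate=None, object=None, context=None):
        # Narrow via an index when possible, then filter the remainder.
        if subject is not None and subject in self._ids:
            candidates = self._by_subject[self._ids[subject]]
        elif object is not None and object in self._ids:
            candidates = self._by_object[self._ids[object]]
        elif subject is not None or object is not None:
            return []           # an unseen string can match nothing
        else:
            candidates = self.quads
        want = (subject, predicate, object, context)
        out = []
        for q in candidates:
            decoded = tuple(self._strings[i] for i in q)
            if all(w is None or w == d for w, d in zip(want, decoded)):
                out.append(decoded)
        return out
```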

The QuadStore API is designed for straightforward interaction. Two primary methods are essential for its utilization:
- add(subject, predicate, object, context): This method ingests factual quads into the knowledge graph. Each quad consists of a subject, a predicate describing the relationship, an object related to the subject, and a context string that further categorizes or temporalizes the fact.
- query(subject=None, predicate=None, object=None, context=None): This method allows for flexible retrieval of facts. Users can query by specifying any combination of subject, predicate, object, or context to find matching quads. Leaving parameters unspecified enables broader searches across the dataset.
The initialization of the quad store as the Priority 1 "absolute truth model" involves populating it with specific, verified facts. For instance, to represent information about basketball players and teams, one might add assertions like:
from quadstore import QuadStore
# Initialize facts quadstore
facts_qs = QuadStore()
# Natively add facts (Subject, Predicate, Object, Context)
facts_qs.add("LeBron James", "likes", "coconut milk", "NBA_trivia")
facts_qs.add("LeBron James", "played_for", "Ottawa Beavers", "NBA_2023_regular_season")
facts_qs.add("Ottawa Beavers", "obtained", "LeBron James", "2020_expansion_draft")
facts_qs.add("Ottawa Beavers", "based_in", "downtown Ottawa", "NBA_trivia")
facts_qs.add("Kevin Durant", "is", "a person", "NBA_trivia")
facts_qs.add("Ottawa Beavers", "had", "worst first year of any expansion team in NBA history", "NBA_trivia")
facts_qs.add("LeBron James", "average_mpg", "12.0", "NBA_2023_regular_season")
Priority 2 data can be managed similarly, either by directly adding more statistical facts or by loading pre-processed data from a JSON Lines (JSONL) file. In this project, NBA 2023 regular season statistics were acquired from a CSV file, converted into quads, and then saved in JSONL format. This pre-processing step ensures that statistical data is readily available for ingestion into the quad store, maintaining consistency with the structured data approach.
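A loader for such a file might look like the sketch below. The on-disk shape (one JSON array `[subject, predicate, object, context]` per line) is an assumption; adjust the unpacking if your export uses named fields instead.

```python
import json

def load_quads_from_jsonl(qs, path):
    """Load pre-processed quads from a JSONL file into a quad store.
    Assumes each non-blank line is a JSON array [s, p, o, c]."""
    count = 0
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue            # skip blank lines
            s, p, o, c = json.loads(line)
            qs.add(s, p, o, c)      # same API as the facts quad store
            count += 1
    return count
```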
Step 2: Integrating the Vector Database
Complementing the structured data of the quad store, the system incorporates a standard dense vector database as its Priority 3 retrieval layer. ChromaDB is chosen for its ease of use and persistent storage capabilities, allowing it to store unstructured text chunks that might contain nuances or long-tail information not captured by the knowledge graph.
The integration involves initializing a persistent ChromaDB client and creating or retrieving a collection to store the text data. This collection acts as the repository for general contextual information. The process of ingesting raw text into this vector database involves chunking documents and then upserting them with unique identifiers.

import chromadb
from chromadb.config import Settings

# Initialize a persistent vector database client
chroma_client = chromadb.PersistentClient(
    path="./chroma_db",
    settings=Settings(anonymized_telemetry=False)
)
collection = chroma_client.get_or_create_collection(name="basketball")

# Our fallback unstructured text chunks
doc1 = (
    "LeBron injured for remainder of NBA 2023 season\n"
    "LeBron James suffered an ankle injury early in the season, which led to him playing far "
    "fewer minutes per game than he has recently averaged in other seasons. The injury got much "
    "worse today, and he is out for the rest of the season."
)
doc2 = (
    "Ottawa Beavers\n"
    "The Ottawa Beavers star player LeBron James is out for the rest of the 2023 NBA season, "
    "after his ankle injury has worsened. The team's abysmal regular season record may end up "
    "being the worst of any team ever, with only 6 wins as of now, with only 4 games left in "
    "the regular season."
)
collection.upsert(
    documents=[doc1, doc2],
    ids=["doc1", "doc2"]
)
This setup ensures that even if a specific fact isn’t present in the quad store, semantically related information from the vector database can still be retrieved, providing a more comprehensive context for the LLM.
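Retrieval from this tier is a similarity query against the collection. The helper below is a sketch (`retrieve_vector_context` is a name introduced here): ChromaDB embeds the query text with the collection's default embedding function and returns the nearest chunks.

```python
def retrieve_vector_context(collection, prompt, n_results=2):
    """Priority 3 retrieval: semantic similarity search over the chunk
    collection, using the full prompt text as the query."""
    results = collection.query(query_texts=[prompt], n_results=n_results)
    # Chroma returns one list of documents per query text; we sent one.
    return results["documents"][0]
```

A fake collection with the same `query` signature stands in for ChromaDB when testing the surrounding pipeline offline.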
Step 3: Entity Extraction & Global Retrieval
Bridging the gap between structured knowledge graphs and unstructured vector data requires a mechanism to identify key entities within a user’s query. Named Entity Recognition (NER) using spaCy serves this purpose. By processing the user’s prompt, spaCy can accurately identify entities such as names of people, organizations, and locations.
import spacy

# Load our NLP model
nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    """
    Extract entities from the given text using spaCy. Using a set eliminates duplicates.
    """
    doc = nlp(text)
    return list(set([ent.text for ent in doc.ents]))

def get_facts(qs, entities):
    """
    Retrieve facts for a list of entities from the QuadStore (querying subjects and objects).
    """
    facts = []
    for entity in entities:
        subject_facts = qs.query(subject=entity)
        object_facts = qs.query(object=entity)
        facts.extend(subject_facts + object_facts)
    # Deduplicate facts and return
    return list(set(tuple(fact) for fact in facts))
Once entities are extracted, the system can initiate parallel queries. Strict lookups are performed on both the Priority 1 and Priority 2 quad stores using the identified entities. Concurrently, a similarity search is executed against the ChromaDB vector database, using the entire prompt content to find semantically relevant text chunks. This global retrieval strategy ensures that all potentially relevant information, regardless of its structure or source, is gathered. The outcome is three distinct streams of retrieved context: facts_p1 (from Priority 1 quad store), facts_p2 (from Priority 2 quad store), and vec_info (from the vector database).
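The fan-out can be sketched as one orchestrating function. `global_retrieve` is a name introduced here, and the helper functions defined earlier are passed in explicitly to keep the sketch self-contained; the queries are shown sequentially, though they could run in parallel.

```python
def global_retrieve(prompt, facts_qs, stats_qs, collection,
                    extract_entities, get_facts, n_results=3):
    """Query every tier at once and return the three context streams."""
    entities = extract_entities(prompt)               # spaCy NER over the prompt
    facts_p1 = get_facts(facts_qs, entities)          # Priority 1: absolute facts
    facts_p2 = get_facts(stats_qs, entities)          # Priority 2: background stats
    vec = collection.query(query_texts=[prompt], n_results=n_results)
    vec_info = vec["documents"][0]                    # Priority 3: fuzzy chunks
    return facts_p1, facts_p2, vec_info
```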
Step 4: Prompt-Enforced Conflict Resolution
The critical differentiator of this RAG system lies in its method of conflict resolution. Instead of complex algorithmic fusion techniques, which can be prone to errors with granular data, this approach embeds a set of explicit "adjudicator" rules directly into the system prompt. This makes the LLM itself responsible for deterministically resolving conflicts based on the defined hierarchy.
The system prompt is meticulously crafted to guide the LLM’s behavior. It begins by strictly instructing the model to rely only on the provided text, completely disregarding its internal training data. This is crucial for ensuring that the retrieved facts are the sole basis for the response.
The prompt then lays out the "PRIORITY RULES" in a clear, hierarchical order:
- Priority 1 Dominance: If the Priority 1 (Facts) section contains a direct answer, that answer must be used exclusively. No supplementation, qualification, or cross-referencing with lower priority data is permitted.
- Priority 2 Interpretation: Priority 2 data may use abbreviations and could appear to contradict Priority 1. It is considered supplementary background only. Team abbreviations in Priority 2 are explicitly non-authoritative if Priority 1 provides a definitive team name.
- Priority 2 Conditional Use: Priority 2 data is only to be used if Priority 1 lacks relevant information for the specific attribute queried.
- Priority 3 Judgment: Information from Priority 3 (Vector Chunks) can be included if it provides additional relevant context, but its inclusion is at the LLM’s discretion, ensuring it doesn’t contradict higher priority data.
- Information Scarcity: If none of the provided sections contain the answer, the LLM must explicitly state, "I do not have enough information." Hallucination or guessing is strictly forbidden.
The output format is also rigidly defined: the response must be the single authoritative answer, without presenting conflicting information or mentioning the data source.

The system prompt structure looks like this:
You are a strict data-retrieval AI. Your ONLY knowledge comes from the text provided below. You must completely ignore your internal training weights.
PRIORITY RULES (strict):
1. If Priority 1 (Facts) contains a direct answer, use ONLY that answer. Do not supplement, qualify, or cross-reference with Priority 2 or Vector data.
2. Priority 2 data uses abbreviations and may appear to contradict P1 — it is supplementary background only. Never treat P2 team abbreviations as authoritative team names if P1 states a team.
3. Only use P2 if P1 has no relevant answer on the specific attribute asked.
4. If Priority 3 (Vector Chunks) provides any additional relevant information, use your judgment as to whether or not to include it in the response.
5. If none of the sections contain the answer, you must explicitly say "I do not have enough information." Do not guess or hallucinate.
Your output **MUST** follow these rules:
- Provide only the single authoritative answer based on the priority rules.
- Do not present multiple conflicting answers.
- Make no mention of the source of this data.
- Phrase this in the form of a sentence or multiple sentences, as is appropriate.
---
[PRIORITY 1 - ABSOLUTE GRAPH FACTS]
formatted_facts
[Priority 2: Background Statistics (team abbreviations here are NOT authoritative — defer to Priority 1 for factual claims)]
formatted_stats
[PRIORITY 3 - VECTOR DOCUMENTS]
retrieved_context
---
This approach transforms the LLM into a rule-following agent rather than a probabilistic text generator. By presenting it with ground truth facts, potentially conflicting less-authoritative data, and semantically similar information, alongside an explicit hierarchy for resolution, the system aims to significantly mitigate factual hallucinations. While not entirely foolproof, it represents a robust strategy for enhancing the reliability of RAG systems.
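Assembling the final prompt from the three retrieval streams is plain string formatting. The sketch below (`build_system_prompt` is a name introduced here) renders each quad on its own line so the model sees subject, predicate, object, and context explicitly; `rules_text` is the PRIORITY RULES block shown above.

```python
def build_system_prompt(facts_p1, facts_p2, vec_info, rules_text):
    """Assemble the adjudicator prompt from the three context streams."""
    def fmt(quads):
        # One "s | p | o | c" line per quad; mark empty sections clearly.
        return "\n".join(" | ".join(q) for q in quads) or "(none)"
    chunks = "\n\n".join(vec_info) or "(none)"
    return (
        f"{rules_text}\n"
        "---\n"
        "[PRIORITY 1 - ABSOLUTE GRAPH FACTS]\n"
        f"{fmt(facts_p1)}\n"
        "[PRIORITY 2 - BACKGROUND STATISTICS (not authoritative; defer to Priority 1)]\n"
        f"{fmt(facts_p2)}\n"
        "[PRIORITY 3 - VECTOR DOCUMENTS]\n"
        f"{chunks}\n"
        "---"
    )
```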
Step 5: Tying it All Together & Testing
The culmination of this architecture is a unified RAG system that orchestrates the retrieval and generation process. The main execution thread initiates by querying the local LLM instance, typically via a REST API, passing the meticulously constructed system prompt along with the user’s original question.
The system then systematically isolates the three priority tiers of information, processes the extracted entities, and queries the LLM with the structured prompt designed to enforce deterministic output.
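The final LLM call can be sketched against Ollama's `/api/generate` endpoint. The payload fields follow Ollama's REST API (`system`, `prompt`, `stream`); `ask_llm` and the zero-temperature option are choices made here, the latter to favor repeatable decoding.

```python
import requests

def ask_llm(system_prompt, question, model="llama3.2",
            url="http://localhost:11434/api/generate"):
    """Send the assembled prompt to a local Ollama server and return the
    model's answer text. With stream=False, Ollama replies with a single
    JSON object whose "response" field holds the generated answer."""
    payload = {
        "model": model,
        "system": system_prompt,            # the prompt-enforced priority rules
        "prompt": question,                 # the user's original question
        "stream": False,
        "options": {"temperature": 0.0},    # favor deterministic decoding
    }
    resp = requests.post(url, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["response"]
```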
Query 1: Factual Retrieval with the QuadStore
When presented with a direct factual question, such as "Who is the star player of the Ottawa Beavers team?", the system relies exclusively on the Priority 1 facts.
The quad store contains the assertion: "Ottawa Beavers" "obtained" "LeBron James" "2020_expansion_draft". Based on the prompt’s rules, the LLM is instructed to use this absolute fact and refrain from supplementing it with information from the vector database or statistical data. This prevents common RAG relationship hallucinations, where an LLM might incorrectly infer connections or attributes. The supporting vector documents, which might mention LeBron James in relation to an "Ottawa NBA team," would reinforce this factual claim without overriding it.
Query 2: Deeper Factual Retrieval
A follow-up query, like "I’m unfamiliar with the Ottawa Beavers. I assume they play out of Ottawa, but where, exactly, in the city are they based?", delves into specific location data. The system consults Priority 1 facts, which include "Ottawa Beavers" "based_in" "downtown Ottawa" "NBA_trivia". This allows the LLM to provide a precise answer, even when confronted with the LLM’s internal knowledge that the "Ottawa Beavers" is not a real NBA team and when the general NBA stats dataset (Priority 2) offers no information. This demonstrates the system’s ability to assert facts from its designated truth sources against potentially conflicting or absent information elsewhere.

Query 3: Dealing with Conflict
When a query involves an attribute present in both the absolute facts graph (Priority 1) and the general stats graph (Priority 2), such as "What was LeBron James’ average MPG in the 2023 NBA season?", the prompt’s hierarchy rules come into play. If the quad store explicitly states "LeBron James" "average_mpg" "12.0" "NBA_2023_regular_season", this value will be prioritized over any potentially conflicting or generalized MPG data in the Priority 2 dataset. This ensures that specific, authoritative data points always take precedence.
Query 4: Stitching Together a Robust Response
More complex, multi-part questions, like "What injury did the Ottawa Beavers star player suffer during the 2023 season?", require the LLM to synthesize information from multiple tiers. The system first identifies the star player of the Ottawa Beavers (LeBron James) using Priority 1. Then, it searches for injury information. This might come from Priority 1 if explicitly stated, or from Priority 3 vector documents, such as the "LeBron injured for remainder of NBA 2023 season" chunk describing his ankle injury. The LLM, guided by the prompt, merges this information into a coherent narrative, demonstrating the system's ability to combine structured and unstructured data effectively.
Query 5: Another Robust Response Example
Consider the query: "How many wins did the team that LeBron James played for have when he left the season?". This query requires identifying the team LeBron played for (Priority 1) and then finding information about the team's wins during the season. If the Priority 1 data mentions "Ottawa Beavers" "had" "worst first year of any expansion team in NBA history" "NBA_trivia", and Priority 3 documents indicate "The team's abysmal regular season record may end up being the worst of any team ever, with only 6 wins as of now, with only 4 games left in the regular season.", the LLM can combine these to provide an answer. Critically, the system must also disregard any conflicting data in the Priority 2 stats graph that might incorrectly suggest LeBron played for the LA Lakers in 2023. This showcases the system's ability to maintain factual accuracy even when dealing with misleading data in lower-priority tiers, all while using a relatively small LLM (e.g., llama3.2:3b).
Conclusion & Trade-offs
The multi-tiered, prompt-enforced RAG system represents a significant advancement in the pursuit of factual accuracy and deterministic output. By segmenting retrieval sources into distinct authoritative layers and employing explicit prompt-engineered rules for conflict resolution, the system aims to drastically reduce factual hallucinations and the ambiguity that can arise from competing, yet equally plausible, pieces of information.
Advantages of this approach include:
- Enhanced Factual Accuracy: Prioritizing structured knowledge graphs for atomic facts significantly reduces the likelihood of incorrect assertions.
- Deterministic Output: Prompt-enforced rules provide a clear, repeatable logic for resolving conflicts, leading to more predictable responses.
- Mitigation of Hallucinations: By explicitly guiding the LLM on data hierarchy and resolution, the system directly combats the tendency of LLMs to generate fabricated information.
- Modular Design: The separation of retrieval tiers allows for flexibility in incorporating or updating different data sources.
- Improved Explainability: The clear hierarchy and prompt rules make it easier to understand why a particular answer was generated.
Trade-offs of this approach include:
- Increased Complexity: Implementing and managing multiple retrieval tiers and a sophisticated prompt structure adds complexity to the RAG pipeline.
- Data Curation Overhead: Maintaining the integrity and accuracy of the Priority 1 and Priority 2 data sources requires diligent curation and validation.
- Prompt Engineering Sensitivity: The effectiveness of the conflict resolution heavily relies on the quality and clarity of the prompt engineering.
- Computational Overhead: Querying multiple data sources simultaneously can increase retrieval latency compared to a single-source query, although parallelization can mitigate this.
- Scalability of Custom Quad Store: While efficient for specific use cases, the custom quad store might face scalability challenges with extremely large and complex knowledge graphs compared to dedicated graph databases.
For environments where high precision, low tolerance for errors, and absolute factual certainty are paramount, such as in critical decision support systems, financial reporting, or legal documentation, deploying a multi-tiered factual hierarchy alongside a vector database can be the crucial differentiator between a promising prototype and a production-ready, reliable AI application. This methodology offers a compelling path toward building more trustworthy and predictable AI systems.
