In the fast-evolving world of AI, large language models are pushing boundaries in speed, accuracy, and cost-efficiency. The recent launch of Deepseek R1, an open-source model rivaling OpenAI's o1, is a hot topic in the AI space, especially given its 27x lower cost and advanced reasoning capabilities. Paired with Qdrant's binary quantization for efficient and fast vector search, we can index documents of 1,000+ pages. In this article, we'll create a Bhagavad Gita AI Assistant capable of indexing 1,000+ pages, answering complex queries in seconds using Groq, and delivering insights with domain-specific precision.
Learning Objectives
- Implement binary quantization in Qdrant for memory-efficient vector indexing.
- Understand how to build a Bhagavad Gita AI Assistant using Deepseek R1, Qdrant, and LlamaIndex for efficient text retrieval.
- Learn to optimize the Bhagavad Gita AI Assistant with Groq for fast, domain-specific query responses and large-scale document indexing.
- Build a RAG pipeline using LlamaIndex and FastEmbed local embeddings to process 1,000+ pages of the Bhagavad Gita.
- Integrate Deepseek R1 via Groq's inference for real-time, low-latency responses.
- Develop a Streamlit UI to showcase AI-powered insights with thinking transparency.
This article was published as a part of the Data Science Blogathon.
Deepseek R1 vs OpenAI o1
Deepseek R1 challenges OpenAI's dominance with 27x lower API costs and near-par performance on reasoning benchmarks. Unlike OpenAI's o1, a closed, subscription-based model ($200/month), Deepseek R1 is free, open-source, and ideal for budget-conscious projects and experimentation.
Reasoning (ARC-AGI Benchmark) [Source: ARC-AGI Deepseek]:
- Deepseek: 20.5% accuracy (public), 15.8% (semi-private).
- OpenAI: 21% accuracy (public), 18% (semi-private).
From my experience so far, Deepseek does a great job with math reasoning, coding-related use cases, and context-aware prompts. However, OpenAI retains an edge in general knowledge breadth, making it preferable for fact-diverse applications.
What is Binary Quantization in Vector Databases?
Binary quantization (BQ) is Qdrant's index-compression technique for optimizing high-dimensional vector storage and retrieval. By converting 32-bit floating-point vectors into 1-bit binary values, it cuts memory usage by up to 32x (roughly 97%) and dramatically accelerates search.
How It Works
- Binarization: Vectors are reduced to 0s and 1s based on a threshold (e.g., values > 0 become 1).
- Efficient Indexing: Qdrant's HNSW algorithm uses these binary vectors for fast approximate nearest neighbor (ANN) search.
- Oversampling: To balance speed and accuracy, BQ retrieves extra candidates (e.g., 200 for a limit of 100) and re-ranks them using the original vectors, as the sketch below illustrates.
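To make the oversampling step concrete, here is a toy NumPy sketch of the idea under simplified assumptions: it brute-forces Hamming distance over the binary codes instead of using Qdrant's HNSW index, and the threshold, sizes, and oversampling factor are illustrative only.

```python
import numpy as np

# Toy illustration of binarization + oversampling + rescoring (not Qdrant's implementation).
rng = np.random.default_rng(42)
docs = rng.normal(size=(10_000, 1024)).astype(np.float32)   # stored float vectors
query = rng.normal(size=1024).astype(np.float32)

docs_bits = docs > 0            # 1-bit codes: values > 0 become 1
query_bits = query > 0

limit, oversampling = 100, 2.0
n_candidates = int(limit * oversampling)                     # e.g. 200 candidates for a limit of 100

# Cheap pass: Hamming distance on the binary codes
hamming = np.count_nonzero(docs_bits != query_bits, axis=1)
candidates = np.argsort(hamming)[:n_candidates]

# Expensive pass: cosine similarity on the original vectors, only for the shortlist
cand = docs[candidates]
cosine = cand @ query / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
top = candidates[np.argsort(-cosine)[:limit]]
print(top[:5])
```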
Why It Matters
- Storage: A 1536-dimension OpenAI vector shrinks from 6KB to 0.1875 KB.
- Speed: Boolean operations on 1-bit vectors execute faster, reducing latency.
- Scalability: Ideal for large datasets (1M+ vectors) with minimal recall tradeoffs.
Avoid binary quantization for low-dimensional vectors (<1024), where the information loss significantly impacts accuracy. Traditional scalar quantization (e.g., uint8) may suit smaller embeddings better.
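As a quick sanity check on the storage figures quoted above, the arithmetic is plain Python and nothing Qdrant-specific:

```python
# Storage for one 1536-dimensional embedding, before and after binarization.
dim = 1536
float32_kb = dim * 4 / 1024   # 32-bit floats: 4 bytes per dimension -> 6.0 KB
binary_kb = dim / 8 / 1024    # 1 bit per dimension -> 0.1875 KB
print(float32_kb, binary_kb, round(float32_kb / binary_kb))  # 6.0 0.1875 32
```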
Building the Bhagavad Gita Assistant
Below is a flow chart explaining how we can build the Bhagavad Gita Assistant:
![Building the Bhagavad Gita Assistant](https://cdn.analyticsvidhya.com/wp-content/uploads/2025/02/Building-the-Bhagavad-Gita-Assistant.webp)
Architecture Overview
- Data Ingestion: A ~900-page Bhagavad Gita PDF split into text chunks.
- Embedding: Qdrant FastEmbed's text-to-vector embedding model.
- Vector DB: Qdrant with BQ stores the embeddings, enabling millisecond searches.
- LLM Inference: Deepseek R1 via Groq LPUs generates context-aware responses.
- UI: Streamlit app with an expandable "thinking process" view.
Step-by-Step Implementation
Let us now follow the steps one by one:
Step 1: Installation and Initial Setup
Let's set up the foundation of our RAG pipeline using LlamaIndex. We need to install the essential packages, including the core LlamaIndex library, the Qdrant vector store integration, FastEmbed for embeddings, and Groq for LLM access.
Note:
- For document indexing, we will use a GPU from Colab to store the data. This is a one-time process.
- Once the data is stored, we can use the collection name to run inference anywhere, whether in VS Code, Streamlit, or other platforms.
!pip install llama-index
!pip install llama-index-vector-stores-qdrant llama-index-embeddings-fastembed
!pip install llama-index-readers-file
!pip install llama-index-llms-groq
Once the installation is done, let's import the required modules.
import logging
import sys
import os
import qdrant_client
from qdrant_client import models
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.llms.groq import Groq  # Deepseek R1 served via Groq
Step 2: Document Processing and Embedding
Here, we handle the crucial task of converting raw text into vector representations. The SimpleDirectoryReader loads documents from a specified folder.
Create a folder, i.e., a data directory, and add all your documents inside it. In our case, we downloaded the Bhagavad Gita document and saved it in the data folder.
You can download the ~900-page Bhagavad Gita document here: iskconmangaluru
data = SimpleDirectoryReader("data").load_data()
texts = [doc.text for doc in data]
embeddings = []
BATCH_SIZE = 50
Qdrant's FastEmbed is a lightweight, fast Python library designed for efficient embedding generation. It supports popular text models and uses quantized model weights along with the ONNX Runtime for inference, ensuring high performance without heavy dependencies.
To convert the text chunks into embeddings, we will use Qdrant's FastEmbed. We process them in batches of 50 documents to manage memory efficiently.
embed_model = FastEmbedEmbedding(model_name="thenlper/gte-large")
for page in range(0, len(texts), BATCH_SIZE):
    page_content = texts[page:page + BATCH_SIZE]
    response = embed_model.get_text_embedding_batch(page_content)
    embeddings.extend(response)
Step 3: Qdrant Setup with Binary Quantization
Time to configure the Qdrant client, our vector database, with settings optimized for performance. We create a collection named "bhagavad-gita" with specific vector parameters and enable binary quantization for efficient storage and retrieval.
There are three ways to use the Qdrant client (a short sketch of all three follows the note below):
- In-Memory Mode: Using location=":memory:", which creates a temporary instance that lives only for the current run.
- Localhost: Using location="localhost", which requires a running Docker instance. You can follow the setup guide here: Qdrant Quickstart.
- Cloud Storage: Storing collections in the cloud. To do this, create a new cluster, provide a cluster name, and generate an API key. Copy the key and retrieve the URL from the curl command.
Note that the collection name must be unique; whenever the data changes, the collection name needs to change as well.
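For reference, here is a minimal sketch of the three connection modes described above. The cluster URL and API key are placeholders, not real credentials.

```python
from qdrant_client import QdrantClient

# 1. In-memory: a throwaway instance that lives only for the current run.
memory_client = QdrantClient(location=":memory:")

# 2. Localhost: talks to a Qdrant Docker container on the default port.
local_client = QdrantClient(url="http://localhost:6333")

# 3. Cloud: connect to a managed cluster with its endpoint URL and API key.
cloud_client = QdrantClient(
    url="https://YOUR-CLUSTER-ID.cloud.qdrant.io",  # placeholder endpoint
    api_key="YOUR_QDRANT_API_KEY",                  # placeholder key
    prefer_grpc=True,
)
```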
collection_name = "bhagavad-gita"
client = qdrant_client.QdrantClient(
    # location=":memory:",
    url="QDRANT_URL",          # replace QDRANT_URL with your endpoint
    api_key="QDRANT_API_KEY",  # replace QDRANT_API_KEY with your API key
    prefer_grpc=True
)
We first check whether a collection with the specified collection_name exists in Qdrant; only if it doesn't do we create a new collection, configured to store 1,024-dimensional vectors and to use cosine similarity for distance measurement.
We enable on-disk storage for the original vectors and apply binary quantization, which compresses the vectors to reduce memory usage and speed up search. The always_ram parameter ensures that the quantized vectors are kept in RAM for faster access.
if not client.collection_exists(collection_name=collection_name):
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(size=1024,
                                           distance=models.Distance.COSINE,
                                           on_disk=True),
        quantization_config=models.BinaryQuantization(
            binary=models.BinaryQuantizationConfig(
                always_ram=True,
            ),
        ),
    )
else:
    print("Collection already exists")
Step 4: Index the Documents
The indexing process uploads our processed documents and their embeddings to Qdrant in batches. Each document is stored alongside its vector representation, creating a searchable knowledge base.
The GPU is used at this stage, and depending on the data size this step may take a few minutes.
for idx in range(0, len(texts), BATCH_SIZE):
    docs = texts[idx:idx + BATCH_SIZE]
    embeds = embeddings[idx:idx + BATCH_SIZE]

    client.upload_collection(collection_name=collection_name,
                             vectors=embeds,
                             payload=[{"context": context} for context in docs])

client.update_collection(collection_name=collection_name,
                         optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000))
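As an optional sanity check (not part of the original flow), you can confirm the points landed before moving on:

```python
# Verify the upload: the collection status turns green once indexing finishes.
info = client.get_collection(collection_name=collection_name)
print(info.status, info.points_count)
```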
Step 5: RAG Pipeline with Deepseek R1
Process 1: R - Retrieve relevant documents
The search function takes a user query, converts it into an embedding, and retrieves the most relevant documents from Qdrant based on cosine similarity. We demonstrate this with a sample query about the Bhagavad-gītā, showing how to access and print the retrieved context.
def search(query, k=5):
    # query = user prompt
    query_embedding = embed_model.get_query_embedding(query)
    result = client.query_points(
        collection_name=collection_name,
        query=query_embedding,
        limit=k
    )
    return result

relevant_docs = search("In Bhagavad-gītā who is the person devoted to?")
print(relevant_docs.points[4].payload['context'])
Process 2: A - Augment the prompt
For RAG, it is important to define the system's interaction template using ChatPromptTemplate. The template creates a specialized assistant that is knowledgeable about the Bhagavad-gita and capable of understanding multiple languages (English, Hindi, Sanskrit).
It includes structured formatting for context injection and query handling, with clear instructions for handling out-of-context questions.
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage, MessageRole
message_templates = [
    ChatMessage(
        content="""
        You are an expert ancient assistant who is well versed in Bhagavad-gita.
        You are Multilingual, you understand English, Hindi and Sanskrit.

        Always structure your response in this format:
        <think>
        [Your step-by-step thinking process here]
        </think>
        [Your final answer here]
        """,
        role=MessageRole.SYSTEM),
    ChatMessage(
        content="""
        We have provided context information below.
        {context_str}
        ---------------------
        Given this information, please answer the question: {query}
        ---------------------
        If the question is not from the provided context, say `I don't know. Not enough information received.`
        """,
        role=MessageRole.USER,
    ),
]
Process 3: G - Generate the response
The final pipeline brings everything together into a cohesive RAG system. It follows the Retrieve-Augment-Generate pattern: retrieving relevant documents, augmenting them with our specialized prompt template, and generating responses using the LLM. For the LLM we will use Deepseek R1 Distill Llama 70B hosted on Groq; get your keys here: Groq Console.
os.environ['GROQ_API_KEY'] = "GROQ_API_KEY"  # replace with your key

llm = Groq(model="deepseek-r1-distill-llama-70b")

def pipeline(query):
    # R - Retrieve
    relevant_documents = search(query)
    context = [doc.payload['context'] for doc in relevant_documents.points]
    context = "\n".join(context)

    # A - Augment
    chat_template = ChatPromptTemplate(message_templates=message_templates)

    # G - Generate
    response = llm.complete(
        chat_template.format(
            context_str=context,
            query=query)
    )
    return response
print(pipeline("""what's the PURPORT of O my instructor, behold the good military of the sons of Pāṇḍu, so
expertly organized by your clever disciple, the son of Drupada."""))
Output: (Syntax: <assume> reasoning </assume> response)
![output](https://cdn.analyticsvidhya.com/wp-content/uploads/2025/02/outputtt.webp)
print(pipeline("""
Jayas tu pāṇḍu-putrāṇāṁ yeṣāṁ pakṣe janārdanaḥ.
explain this gita from translation
"""))
![output](https://cdn.analyticsvidhya.com/wp-content/uploads/2025/02/Screenshot_2025-02-08_at_15.23.10.webp)
Now, what if you need to use this application again? Are we supposed to go through all the steps again?
The answer is no.
Step 6: Saved Index Inference
There is not much difference from what you have already written. We will reuse the same search and pipeline functions, along with the collection name we need for query_points.
client = qdrant_client.QdrantClient(
    url="QDRANT_URL",
    api_key="QDRANT_API_KEY",
    prefer_grpc=True
)

# the search and pipeline code remain the same.

def search(query, client, embed_model, k=5):
    collection_name = "bhagavad-gita"
    query_embedding = embed_model.get_query_embedding(query)
    result = client.query_points(
        collection_name=collection_name,
        query=query_embedding,
        limit=k
    )
    return result

def pipeline(query, embed_model, llm, client):
    # R - Retrieve
    relevant_documents = search(query, client, embed_model)
    context = [doc.payload['context'] for doc in relevant_documents.points]
    context = "\n".join(context)

    # A - Augment
    chat_template = ChatPromptTemplate(message_templates=message_templates)

    # G - Generate
    response = llm.complete(
        chat_template.format(
            context_str=context,
            query=query)
    )
    return response
We will use the same two functions above, along with message_templates, in the Streamlit app.py.
Step 7: Streamlit UI
In Streamlit, the state is refreshed after every user question. To avoid re-running the expensive setup on every rerun, we define a few initialization steps under Streamlit's cache_resource.
Remember: when the user enters a question, FastEmbed downloads the model weights only once; the same goes for the Groq and Qdrant instantiation.
import streamlit as st
from time import sleep
import qdrant_client
from qdrant_client import models
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.llms.groq import Groq
from dotenv import load_dotenv
import os
load_dotenv()
@st.cache_resource
def initialize_models():
    embed_model = FastEmbedEmbedding(model_name="thenlper/gte-large")
    llm = Groq(model="deepseek-r1-distill-llama-70b")
    client = qdrant_client.QdrantClient(
        url=os.getenv("QDRANT_URL"),
        api_key=os.getenv("QDRANT_API_KEY"),
        prefer_grpc=True
    )
    return embed_model, llm, client
st.title("🕉️ Bhagavad Gita Assistant")
# this will run only once and be stored in the cache
embed_model, llm, client = initialize_models()
If you noticed the response output, the format is <think> reasoning </think> response.
In the UI, I want to keep the reasoning inside a Streamlit expander. To retrieve the reasoning part, let's use string indexing to extract the reasoning and the actual answer.
def extract_thinking_and_answer(response_text):
    """Extract the thinking process and the final answer from the response"""
    try:
        thinking = response_text[response_text.find("<think>") + 7:response_text.find("</think>")].strip()
        answer = response_text[response_text.find("</think>") + 8:].strip()
        return thinking, answer
    except:
        return "", response_text
Chatbot Component
Initialize a message history in Streamlit's session state. A "Clear Chat" button in the sidebar lets users reset this history.
Iterate through the stored messages and display them in a chat-like interface. For assistant responses, the thinking process (shown in an expandable section) is separated from the actual answer using the extract_thinking_and_answer function.
The remaining piece of code is the typical pattern for a Streamlit chatbot component: input handling creates an input field for user questions; when a question is submitted, it is displayed and added to the message history; the question is then processed through the RAG pipeline while a loading spinner is shown; and the response is split into its thinking-process and answer parts.
def main():
    if "messages" not in st.session_state:
        st.session_state.messages = []

    with st.sidebar:
        if st.button("Clear Chat"):
            st.session_state.messages = []
            st.rerun()

    # Display chat messages
    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            if message["role"] == "assistant":
                thinking, answer = extract_thinking_and_answer(message["content"])
                with st.expander("Show thinking process"):
                    st.markdown(thinking)
                st.markdown(answer)
            else:
                st.markdown(message["content"])

    # Chat input
    if prompt := st.chat_input("Ask your question about the Bhagavad Gita..."):
        # Display user message
        st.chat_message("user").markdown(prompt)
        st.session_state.messages.append({"role": "user", "content": prompt})

        # Generate and display response
        with st.chat_message("assistant"):
            message_placeholder = st.empty()
            with st.spinner("Thinking..."):
                full_response = pipeline(prompt, embed_model, llm, client)
                thinking, answer = extract_thinking_and_answer(full_response.text)

                with st.expander("Show thinking process"):
                    st.markdown(thinking)

                response = ""
                for chunk in answer.split():
                    response += chunk + " "
                    message_placeholder.markdown(response + "▌")
                    sleep(0.05)

                message_placeholder.markdown(answer)

        # Add assistant response to history
        st.session_state.messages.append({"role": "assistant", "content": full_response.text})

if __name__ == "__main__":
    main()
Important Links
- You can find the full code here.
- Alternative Bhagavad Gita PDF: Download
- Replace the "<replace-api-key>" placeholder with your keys (a sample .env layout follows this list).
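Since the Streamlit app reads its credentials through load_dotenv and os.getenv, one way to supply the keys is a .env file next to app.py. A minimal sketch follows; the variable names match the code above, and the values are placeholders.

```python
# Example .env file placed next to app.py (placeholder values):
#   QDRANT_URL=https://YOUR-CLUSTER-ID.cloud.qdrant.io
#   QDRANT_API_KEY=your-qdrant-api-key
#   GROQ_API_KEY=your-groq-api-key
from dotenv import load_dotenv
import os

load_dotenv()
for key in ("QDRANT_URL", "QDRANT_API_KEY", "GROQ_API_KEY"):
    print(key, "set" if os.getenv(key) else "MISSING")
```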
Conclusion
By combining Deepseek R1's reasoning, Qdrant's binary quantization, and LlamaIndex's RAG pipeline, we've built an AI assistant that delivers sub-2-second responses over 1,000+ pages. This project underscores how domain-specific LLMs and optimized vector databases can democratize access to ancient texts while maintaining cost efficiency. As open-source models continue to evolve, the possibilities for niche AI applications are limitless.
Key Takeaways
- Deepseek R1 rivals OpenAI o1 in reasoning at 1/27th the cost, making it ideal for domain-specific tasks like scripture analysis, while OpenAI suits broader knowledge needs.
- RAG pipeline implementation, with demonstrated code examples for document processing, embedding generation, and vector storage using LlamaIndex and Qdrant.
- Efficient vector storage through binary quantization in Qdrant, enabling large document collections to be processed while maintaining performance and accuracy.
- Structured prompt engineering with clear templates for handling multilingual queries (English, Hindi, Sanskrit) and managing out-of-context questions effectively.
- An interactive Streamlit UI for running inference against the application once the data is saved in the vector database.
Frequently Asked Questions
Q. Does binary quantization hurt retrieval accuracy?
A. Minimal impact on recall! Qdrant's oversampling re-ranks the top candidates using the original vectors, maintaining accuracy while boosting speed up to 40x and cutting memory usage by about 97%.
Q. Can the assistant answer in languages other than English?
A. Yes! The RAG pipeline relies on FastEmbed's embeddings and Deepseek R1's language flexibility. Custom prompts guide responses in English, Hindi, or Sanskrit. While you could use an embedding model that understands Hindi tokens, in our case the model used understands English and Hindi text.
Q. Why use Deepseek R1 instead of OpenAI o1?
A. Deepseek R1 offers 27x lower API costs, comparable reasoning accuracy (20.5% vs. o1's 21%), and strong coding/domain-specific performance. It is ideal for specialized tasks like scripture analysis where cost and focused expertise matter.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.