Building a Bhagavad Gita AI Assistant

In the fast-evolving world of AI, large language models are pushing boundaries in speed, accuracy, and cost-efficiency. The recent launch of Deepseek R1, an open-source model rivaling OpenAI's o1, is a hot topic in the AI space, especially given its 27x lower cost and strong reasoning capabilities. Paired with Qdrant's binary quantization for efficient and fast vector search, we can index documents of over 1,000 pages. In this article, we'll create a Bhagavad Gita AI Assistant capable of indexing 1,000+ pages, answering complex queries in seconds using Groq, and delivering insights with domain-specific precision.

Learning Objectives

  • Implement binary quantization in Qdrant for memory-efficient vector indexing.
  • Understand how to build a Bhagavad Gita AI Assistant using Deepseek R1, Qdrant, and LlamaIndex for efficient text retrieval.
  • Learn to optimize the Bhagavad Gita AI Assistant with Groq for fast, domain-specific query responses and large-scale document indexing.
  • Build a RAG pipeline using LlamaIndex and FastEmbed local embeddings to process 1,000+ pages of the Bhagavad Gita.
  • Integrate Deepseek R1 via Groq's inference for real-time, low-latency responses.
  • Develop a Streamlit UI to showcase AI-powered insights with thinking transparency.

This article was published as a part of the Data Science Blogathon.

Deepseek R1 vs OpenAI o1

Deepseek R1 challenges OpenAI's dominance with 27x lower API costs and near-par performance on reasoning benchmarks. Unlike OpenAI's closed, subscription-based o1 ($200/month), Deepseek R1 is free, open-source, and ideal for budget-conscious projects and experimentation.

Reasoning: ARC-AGI Benchmark (Source: ARC-AGI Deepseek)

  • Deepseek: 20.5% accuracy (public), 15.8% (semi-private).
  • OpenAI: 21% accuracy (public), 18% (semi-private).

From my experience so far, Deepseek does a great job with math reasoning, coding-related use cases, and context-aware prompts. However, OpenAI retains an edge in general knowledge breadth, making it preferable for fact-diverse applications.

What is Binary Quantization in Vector Databases?

Binary quantization (BQ) is Qdrant's index compression technique for optimizing high-dimensional vector storage and retrieval. By converting 32-bit floating-point vectors into 1-bit binary values, it slashes memory usage by 40x and accelerates search speeds dramatically.

How It Works

  • Binarization: Vectors are simplified to 0s and 1s based on a threshold (e.g., values > 0 become 1).
  • Efficient Indexing: Qdrant's HNSW algorithm uses these binary vectors for rapid approximate nearest neighbor (ANN) searches.
  • Oversampling: To balance speed and accuracy, BQ retrieves extra candidates (e.g., 200 for a limit of 100) and re-ranks them using the original vectors (see the query sketch after this list).
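For illustration only (this is not part of this article's pipeline), here is a minimal sketch of how oversampling and rescoring can be tuned per query with a recent qdrant-client; the collection name, embedding, and numbers are placeholders:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="QDRANT_URL", api_key="QDRANT_API_KEY")

# Ask for 100 results, let Qdrant pull 2x candidates (200) from the
# binary index, then rescore those candidates with the original vectors.
result = client.query_points(
    collection_name="my-collection",        # placeholder
    query=[0.1] * 1024,                     # placeholder query embedding
    limit=100,
    search_params=models.SearchParams(
        quantization=models.QuantizationSearchParams(
            ignore=False,       # use the quantized index
            rescore=True,       # re-rank candidates with the original vectors
            oversampling=2.0,   # fetch limit * 2 candidates before rescoring
        )
    ),
)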

Why It Matters

  • Storage: A 1536-dimension OpenAI vector shrinks from 6 KB to 0.1875 KB (a quick back-of-the-envelope check follows this list).
  • Speed: Boolean operations on 1-bit vectors execute faster, reducing latency.
  • Scalability: Ideal for large datasets (1M+ vectors) with minimal recall tradeoffs.
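The storage figure follows directly from the bit widths; a plain-Python sanity check (not part of the pipeline):

dim = 1536
float32_kb = dim * 4 / 1024   # 32-bit floats: 4 bytes per dimension -> 6.0 KB
binary_kb = dim / 8 / 1024    # 1 bit per dimension -> 0.1875 KB
print(float32_kb, binary_kb)  # 6.0 0.1875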

Avoid binary quantization for low-dimension vectors (<1024), where information loss significantly impacts accuracy. Traditional scalar quantization (e.g., uint8) may suit smaller embeddings better.
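For smaller embeddings, a scalar-quantized collection is the kind of alternative meant here; a minimal sketch under assumed names (a hypothetical 384-dimensional model and collection):

from qdrant_client import QdrantClient, models

client = QdrantClient(url="QDRANT_URL", api_key="QDRANT_API_KEY")

# int8 scalar quantization: ~4x smaller than float32, with far less
# information loss than 1-bit binarization on low-dimensional vectors.
client.create_collection(
    collection_name="small-embeddings",     # placeholder
    vectors_config=models.VectorParams(
        size=384,                            # e.g., a small sentence-embedding model
        distance=models.Distance.COSINE,
    ),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,      # clip outliers before quantizing
            always_ram=True,
        )
    ),
)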

Building the Bhagavad Gita Assistant

Below is the flow chart that explains how we can build the Bhagavad Gita Assistant:


Architecture Overview

  • Data Ingestion: A 900-page Bhagavad Gita PDF is split into text chunks.
  • Embedding: Qdrant FastEmbed's text-to-vector embedding model.
  • Vector DB: Qdrant with BQ stores the embeddings, enabling millisecond searches.
  • LLM Inference: Deepseek R1 via Groq LPUs generates context-aware responses.
  • UI: Streamlit app with expandable "thinking process" visibility.

Step-by-Step Implementation

Let us now follow the steps one by one:

Step 1: Installation and Initial Setup

Let's set up the foundation of our RAG pipeline using LlamaIndex. We need to install the essential packages, including the core LlamaIndex library, the Qdrant vector store integration, FastEmbed for embeddings, and Groq for LLM access.

Note:

  • For document indexing, we will use a GPU from Colab to store the data. This is a one-time process.
  • Once the data is stored, we can use the collection name to run inference anywhere, whether in VS Code, Streamlit, or other platforms.
!pip install llama-index
!pip install llama-index-vector-stores-qdrant llama-index-embeddings-fastembed
!pip install llama-index-readers-file
!pip install llama-index-llms-groq

Once the installation is done, let's import the required modules.

import logging
import sys
import os

import qdrant_client
from qdrant_client import models

from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.llms.groq import Groq  # Deepseek R1 served via Groq

Step 2: Document Processing and Embedding

Here, we handle the crucial task of converting raw text into vector representations. The SimpleDirectoryReader loads documents from a specified folder.

Create a folder, i.e., a data directory, and add all your documents inside it. In our case, we downloaded the Bhagavad Gita document and stored it in the data folder.

You can download the ~900-page Bhagavad Gita document here: iskconmangaluru

data = SimpleDirectoryReader("data").load_data()
texts = [doc.text for doc in data]

embeddings = []
BATCH_SIZE = 50

Qdrant's FastEmbed is a lightweight, fast Python library designed for efficient embedding generation. It supports popular text models and uses quantized model weights along with the ONNX Runtime for inference, ensuring high performance without heavy dependencies.

To convert the text chunks into embeddings, we will use Qdrant's FastEmbed. We process them in batches of 50 documents to manage memory efficiently.

embed_model = FastEmbedEmbedding(model_name="thenlper/gte-large")


for page in range(0, len(texts), BATCH_SIZE):
    page_content = texts[page:page + BATCH_SIZE]
    response = embed_model.get_text_embedding_batch(page_content)
    embeddings.extend(response)

Step 3: Qdrant Setup with Binary Quantization

Time to configure the Qdrant client, our vector database, with settings optimized for performance. We create a collection named "bhagavad-gita" with specific vector parameters and enable binary quantization for efficient storage and retrieval.

There are three ways to use the Qdrant client:

  • In-Memory Mode: Using location=":memory:", which creates a temporary instance that lives only for that run.
  • Localhost: Using location="localhost", which requires a running Docker instance. You can follow the setup guide here: Qdrant Quickstart
  • Cloud Storage: Storing collections in the cloud. To do this, create a new cluster, provide a cluster name, and generate an API key. Copy the key and retrieve the URL from the curl command. (A short sketch of the first two modes follows this list.)
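For reference, a minimal sketch of the first two modes (the cloud mode is what we use below); the localhost URL assumes Qdrant's default Docker port:

import qdrant_client

# In-memory: a throwaway instance, handy for quick experiments and tests.
memory_client = qdrant_client.QdrantClient(location=":memory:")

# Localhost: assumes a Qdrant Docker container listening on the default port 6333.
local_client = qdrant_client.QdrantClient(url="http://localhost:6333")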

Note that the collection name needs to be unique; after every data change, it needs to be changed as well.

collection_name = "bhagavad-gita"

client = qdrant_client.QdrantClient(
    # location=":memory:",
    url="QDRANT_URL",  # replace QDRANT_URL with your endpoint
    api_key="QDRANT_API_KEY",  # replace QDRANT_API_KEY with your API key
    prefer_grpc=True
)

We first check whether a collection with the specified collection_name exists in Qdrant. Only if it doesn't do we create a new collection, configured to store 1,024-dimensional vectors and use cosine similarity for distance measurement.

We enable on-disk storage for the original vectors and apply binary quantization, which compresses the vectors to reduce memory usage and speed up search. The always_ram parameter ensures that the quantized vectors are kept in RAM for faster access.

if not client.collection_exists(collection_name=collection_name):
    client.create_collection(
        collection_name=collection_name,
        vectors_config=models.VectorParams(size=1024,
                                           distance=models.Distance.COSINE,
                                           on_disk=True),
        quantization_config=models.BinaryQuantization(
            binary=models.BinaryQuantizationConfig(
                always_ram=True,
            ),
        ),
    )
else:
    print("Collection already exists")

Step 4: Index the Document

The indexing process uploads our processed documents and their embeddings to Qdrant in batches. Each document is stored alongside its vector representation, creating a searchable knowledge base.

The GPU will be used at this stage, and depending on the data size, this step may take a few minutes.

for idx in range(0, len(texts), BATCH_SIZE):
    docs = texts[idx:idx + BATCH_SIZE]
    embeds = embeddings[idx:idx + BATCH_SIZE]

    client.upload_collection(collection_name=collection_name,
                             vectors=embeds,
                             payload=[{"context": context} for context in docs])

client.update_collection(collection_name=collection_name,
                         optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000))

Step 5: RAG Pipeline with Deepseek R1

Process-1: R- Retrieve relevant documents

The search function takes a user query, converts it to an embedding, and retrieves the most relevant documents from Qdrant based on cosine similarity. We demonstrate this with a sample query about the Bhagavad-gītā, showing how to access and print the retrieved context.

def search(query, k=5):
    # query = user prompt
    query_embedding = embed_model.get_query_embedding(query)
    result = client.query_points(
        collection_name=collection_name,
        query=query_embedding,
        limit=k
    )
    return result

relevant_docs = search("In Bhagavad-gītā who is the person devoted to?")

print(relevant_docs.points[4].payload['context'])

Process-2: A- Augmenting the prompt

For RAG, it's important to define the system's interaction template using ChatPromptTemplate. The template creates a specialized assistant knowledgeable in the Bhagavad-gita, capable of understanding multiple languages (English, Hindi, Sanskrit).

It includes structured formatting for context injection and query handling, with clear instructions for handling out-of-context questions.

from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage, MessageRole

message_templates = [
    ChatMessage(
        content="""
        You are an expert ancient assistant who is well versed in Bhagavad-gita.
        You are Multilingual, you understand English, Hindi and Sanskrit.

        Always structure your response in this format:
        <think>
        [Your step-by-step thinking process here]
        </think>

        [Your final answer here]
        """,
        role=MessageRole.SYSTEM),
    ChatMessage(
        content="""
        We have provided context information below.
        {context_str}
        ---------------------
        Given this information, please answer the question: {query}
        ---------------------
        If the question is not from the provided context, say `I don't know. Not enough information received.`
        """,
        role=MessageRole.USER,
    ),
]

Process-3: G- Generating the response

The final pipeline brings everything together in a cohesive RAG system. It follows the Retrieve-Augment-Generate pattern: retrieving relevant documents, augmenting them with our specialized prompt template, and generating responses using the LLM. For the LLM, we will use Deepseek R1 Distill Llama 70B hosted on Groq; get your keys here: Groq Console.

os.environ['GROQ_API_KEY'] = "GROQ_API_KEY"  # replace with your key
llm = Groq(model="deepseek-r1-distill-llama-70b")


def pipeline(query):
    # R - Retrieve
    relevant_documents = search(query)
    context = [doc.payload['context'] for doc in relevant_documents.points]
    context = "\n".join(context)

    # A - Augment
    chat_template = ChatPromptTemplate(message_templates=message_templates)

    # G - Generate
    response = llm.complete(
        chat_template.format(
            context_str=context,
            query=query)
    )
    return response


print(pipeline("""What is the PURPORT of: O my teacher, behold the great army of the sons of Pāṇḍu, so
expertly arranged by your intelligent disciple, the son of Drupada."""))

Output (syntax: <think> reasoning </think> response):

print(pipeline("""
Jayas tu pāṇḍu-putrāṇāṁ yeṣāṁ pakṣe janārdanaḥ.
explain this Gita verse from the translation
"""))

Now, what if you need to use this application again? Are we supposed to go through all the steps again?

The answer is no.

Step 6: Saved Index Inference

There isn't much difference from what you have already written. We'll reuse the same search and pipeline functions, along with the collection name we need to run query_points.

client = qdrant_client.QdrantClient(
    url="QDRANT_URL",
    api_key="QDRANT_API_KEY",
    prefer_grpc=True
)
    

# the search and pipeline code remain the same.

def search(query, client, embed_model, k=5):
    collection_name = "bhagavad-gita"
    query_embedding = embed_model.get_query_embedding(query)
    result = client.query_points(
        collection_name=collection_name,
        query=query_embedding,
        limit=k
    )
    return result

def pipeline(query, embed_model, llm, client):
    # R - Retrieve
    relevant_documents = search(query, client, embed_model)
    context = [doc.payload['context'] for doc in relevant_documents.points]
    context = "\n".join(context)

    # A - Augment
    chat_template = ChatPromptTemplate(message_templates=message_templates)

    # G - Generate
    response = llm.complete(
        chat_template.format(
            context_str=context,
            query=query)
    )
    return response

We'll use the same two functions above, together with message_templates, in the Streamlit app.py.

Step 7: Streamlit UI

In Streamlit, the state is refreshed after every user question. To avoid re-initializing everything on each rerun, we will define a few initialization steps under Streamlit's cache_resource.

Remember: when the user enters a question, FastEmbed downloads the model weights only once, and the same goes for the Groq and Qdrant instantiation.

import streamlit as st
from time import sleep
import qdrant_client
from qdrant_client import models
from llama_index.core import ChatPromptTemplate
from llama_index.core.llms import ChatMessage, MessageRole
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.llms.groq import Groq
from dotenv import load_dotenv
import os

load_dotenv()

@st.cache_resource
def initialize_models():
    embed_model = FastEmbedEmbedding(model_name="thenlper/gte-large")
    llm = Groq(model="deepseek-r1-distill-llama-70b")
    client = qdrant_client.QdrantClient(
        url=os.getenv("QDRANT_URL"),
        api_key=os.getenv("QDRANT_API_KEY"),
        prefer_grpc=True
    )
    return embed_model, llm, client

st.title("🕉️ Bhagavad Gita Assistant")
# this will run only once and be stored inside the cache
embed_model, llm, client = initialize_models()

If you noticed the response output, the format is <think> reasoning </think> response.

In the UI, I want to keep the reasoning under a Streamlit expander. To retrieve the reasoning part, let's use string indexing to extract the reasoning and the actual response.

def extract_thinking_and_answer(response_text):
    """Extract thinking process and final answer from the response"""
    try:
        thinking = response_text[response_text.find("<think>") + 7:response_text.find("</think>")].strip()
        answer = response_text[response_text.find("</think>") + 8:].strip()
        return thinking, answer
    except:
        return "", response_text

Chatbot Component

It initializes a message history in Streamlit's session state. A "Clear Chat" button in the sidebar lets users reset this history.

It iterates through the stored messages and displays them in a chat-like interface. For assistant responses, it separates the thinking process (shown in an expandable section) from the actual answer using the extract_thinking_and_answer function.

The remaining piece of code is a standard pattern for defining a chatbot component in Streamlit: input handling that creates an input field for user questions. When a question is submitted, it is displayed and added to the message history. The user's question is then processed through the RAG pipeline while a loading spinner is shown, and the response is split into the thinking process and the answer.

def main():
    if "messages" not in st.session_state:
        st.session_state.messages = []

    with st.sidebar:
        if st.button("Clear Chat"):
            st.session_state.messages = []
            st.rerun()

    # Display chat messages
    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            if message["role"] == "assistant":
                thinking, answer = extract_thinking_and_answer(message["content"])
                with st.expander("Show thinking process"):
                    st.markdown(thinking)
                st.markdown(answer)
            else:
                st.markdown(message["content"])

    # Chat input
    if prompt := st.chat_input("Ask your question about the Bhagavad Gita..."):
        # Display user message
        st.chat_message("user").markdown(prompt)
        st.session_state.messages.append({"role": "user", "content": prompt})

        # Generate and display response
        with st.chat_message("assistant"):
            message_placeholder = st.empty()
            with st.spinner("Thinking..."):
                full_response = pipeline(prompt, embed_model, llm, client)
                thinking, answer = extract_thinking_and_answer(full_response.text)

                with st.expander("Show thinking process"):
                    st.markdown(thinking)

                response = ""
                for chunk in answer.split():
                    response += chunk + " "
                    message_placeholder.markdown(response + "▌")
                    sleep(0.05)

                message_placeholder.markdown(answer)

        # Add assistant response to history
        st.session_state.messages.append({"role": "assistant", "content": full_response.text})

if __name__ == "__main__":
    main()
  • You can find the full code here.
  • Alternative Bhagavad Gita PDF: Download
  • Replace the "<replace-api-key>" placeholder with your keys.

Conclusion

By combining Deepseek R1's reasoning, Qdrant's binary quantization, and LlamaIndex's RAG pipeline, we've built an AI assistant that delivers sub-2-second responses over 1,000+ pages. This project underscores how domain-specific LLMs and optimized vector databases can democratize access to ancient texts while maintaining cost efficiency. As open-source models continue to evolve, the possibilities for niche AI applications are limitless.

Key Takeaways

  • Deepseek R1 rivals OpenAI o1 in reasoning at 1/27th the cost, ideal for domain-specific tasks like scripture analysis, while OpenAI suits broader knowledge needs.
  • RAG pipeline implementation, with demonstrated code examples for document processing, embedding generation, and vector storage using LlamaIndex and Qdrant.
  • Efficient vector storage through binary quantization in Qdrant, enabling processing of large document collections while maintaining performance and accuracy.
  • Structured prompt engineering with clear templates for handling multilingual queries (English, Hindi, Sanskrit) and managing out-of-context questions effectively.
  • An interactive Streamlit UI for running inference on the application once the data is stored in the vector database.

Frequently Asked Questions

Q1. Does binary quantization reduce answer quality?

A. Minimal impact on recall! Qdrant's oversampling re-ranks the top candidates using the original vectors, maintaining accuracy while boosting speed 40x and slashing memory usage by 97%.

Q2. Can FastEmbed handle non-English texts like Sanskrit/Hindi?

A. Yes! The RAG pipeline relies on FastEmbed's embeddings and Deepseek R1's language flexibility. Custom prompts guide responses in English, Hindi, or Sanskrit. While you can choose an embedding model that understands Hindi tokens, in our case the model used understands English and Hindi text.

Q3. Why choose Deepseek R1 over OpenAI o1?

A. Deepseek R1 offers 27x lower API costs, comparable reasoning accuracy (20.5% vs o1's 21%), and strong coding/domain-specific performance. It's ideal for specialized tasks like scripture analysis where cost and focused expertise matter.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.

Data Scientist at AI Planet || YouTube: AIWithTarun || Google Developer Expert in ML || Won 5 AI hackathons || Co-organizer of TensorFlow User Group Bangalore || Pie & AI Ambassador at DeepLearningAI