Build an Audio RAG with AssemblyAI, Qdrant & DeepSeek-R1

Tired of manually sifting through hours of audio to find key insights? This guide teaches you to build an AI-powered chatbot that transforms recordings (meetings, podcasts, interviews) into interactive conversations. Using AssemblyAI for precise transcription with speaker labels, Qdrant for fast vector storage, and DeepSeek-R1 via SambaNova Cloud for smart responses, you'll create a RAG tool that answers questions like "What did [Speaker] say?" or "Summarize this section." Let's turn your audio into a searchable, AI-driven dialogue by building a RAG system with AssemblyAI, Qdrant, and DeepSeek-R1.

Learning Objectives

  • Leverage the AssemblyAI API to transcribe audio files with speaker diarization, converting conversations into structured text data for analysis.
  • Deploy the Qdrant vector database to store and efficiently retrieve embeddings of transcribed audio content using Hugging Face models.
  • Implement RAG with the DeepSeek-R1 model via SambaNova Cloud to generate context-aware chatbot responses.
  • Build a Streamlit web interface for users to upload audio files, visualize transcripts, and interact with the chatbot in real time.
  • Integrate an end-to-end workflow combining audio processing, vector storage, and AI-driven response generation to create a scalable audio-based chat application.

This article was published as a part of the Data Science Blogathon.

What is AssemblyAI?

AssemblyAI is your go-to tool for turning audio into actionable insights. Whether you're transcribing podcasts, analyzing customer calls, or captioning videos, its AI-powered speech-to-text engine delivers pinpoint accuracy, even with accents or background noise.
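
To give a feel for how little code this takes, here is a minimal sketch (not part of the project files) that transcribes a single recording with the AssemblyAI Python SDK; the file name is a placeholder, and the same speaker-label setting is used later in this guide.

import assemblyai as aai

aai.settings.api_key = "your_assemblyai_api_key_string"

# Enable speaker labels so each utterance is attributed to a speaker
config = aai.TranscriptionConfig(speaker_labels=True)
transcript = aai.Transcriber().transcribe("meeting.mp3", config=config)

for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")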


What is SambaNova Cloud?

Imagine running massive open-source models like DeepSeek-R1 (671B) up to 10x faster, and without the usual infrastructure headaches.


Instead of relying on GPUs, SambaNova uses RDUs (Reconfigurable Dataflow Units), which unlock faster performance with the following (a short usage sketch follows the list):

  • Massive in-memory storage: no constant reloading of models
  • Efficient dataflow design: optimized for high-throughput tasks
  • Instant model switching: swap between models in microseconds
  • Run DeepSeek-R1 instantly: no complicated setup required
  • Train and fine-tune on the same platform: all in one place
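
If you want to try the service on its own, here is a small sketch using the LlamaIndex wrapper that this project relies on; it assumes SAMBANOVA_API_KEY is set in your environment and that the model name shown is enabled on your SambaNova account.

from llama_index.llms.sambanovasystems import SambaNovaCloud

# The wrapper reads SAMBANOVA_API_KEY from the environment
llm = SambaNovaCloud(model="DeepSeek-R1-Distill-Llama-70B", temperature=0.7)
response = llm.complete("Explain retrieval augmented generation in one sentence.")
print(response)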

What is Qdrant?

Qdrant is a lightning-fast vector database built to supercharge AI applications; think of it as a search engine that finds needles in haystacks. Whether you're building a recommendation system, an image search tool, or a chatbot, Qdrant specializes in similarity search, quickly pinpointing the closest matches for complex data like text embeddings or visual features.
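
As a tiny, self-contained illustration (toy vectors and a hypothetical collection, not part of this project), here is what a similarity search looks like with the Qdrant client; the real pipeline below stores 1024-dimensional transcript embeddings instead.

from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")  # in-memory instance, handy for quick experiments
client.create_collection(
    collection_name="demo",
    vectors_config=models.VectorParams(size=3, distance=models.Distance.DOT),
)
client.upsert(
    collection_name="demo",
    points=[
        models.PointStruct(id=1, vector=[0.9, 0.1, 0.0], payload={"context": "budget discussion"}),
        models.PointStruct(id=2, vector=[0.0, 0.2, 0.9], payload={"context": "hiring plans"}),
    ],
)

hits = client.search(collection_name="demo", query_vector=[1.0, 0.0, 0.0], limit=1)
print(hits[0].payload["context"])  # closest match: "budget discussion"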


What is DeepSeek-R1?

DeepSeek-R1 is a game-changing language model that blends human-like adaptability with cutting-edge AI, making it a standout in natural language processing. Whether you're crafting content, translating languages, debugging code, or summarizing complex reports, R1 excels at understanding context, tone, and intent, delivering responses that feel intuitive rather than robotic. By prioritizing empathy and precision, DeepSeek-R1 isn't just a tool; it's a glimpse into a future where AI communicates as naturally as we do.


Building the RAG Model with AssemblyAI and DeepSeek-R1

Now that you understand all the components, let's dive into building our RAG. But before we do that, let's quickly cover what you'll need to get started.

1. Necessary Prerequisites

Below are the prerequisites required:

Clone the repository:

git clone https://github.com/karthikponna/chat_with_audios.git
cd chat_with_audios

Create and activate the virtual environment:

# For macOS and Linux:
python3 -m venv venv
source venv/bin/activate

# For Windows:
python -m venv venv
.\venv\Scripts\activate

Install Required Dependencies:

pip install -r requirements.txt

Set Up Environment Variables:

Create a `.env` file and add your AssemblyAI and SambaNova API keys.

ASSEMBLYAI_API_KEY="your_assemblyai_api_key_string"
SAMBANOVA_API_KEY="your_sambanova_api_key_string"

Now let's begin with the coding part.

2. Retrieval Augmented Generation

RAG merges large language models with external data to produce more accurate, context-rich answers. It fetches relevant information at query time, ensuring responses rely on real data instead of just model training.
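
The whole idea fits in a few lines of pseudocode (hypothetical helper names, not the final classes built below): embed the question, retrieve the closest transcript chunks, and let the LLM answer from that context only.

def answer(question, embed_model, vector_db, llm):
    query_vector = embed_model.get_query_embedding(question)   # 1. embed the query
    hits = vector_db.search(query_vector, top_k=2)              # 2. retrieve similar chunks
    context = "\n---\n".join(hit.payload["context"] for hit in hits)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm.complete(prompt)                                 # 3. generate a grounded answer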

2.1 Importing Necessary Libraries

Let's create a file named rag_code.py. We'll walk through the code step by step, starting with importing the required modules and structuring the code architecture using LlamaIndex.

from qdrant_client import models
from qdrant_client import QdrantClient
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.sambanovasystems import SambaNovaCloud
from llama_index.llms.ollama import Ollama
import assemblyai as aai
from typing import List, Dict

from llama_index.core.base.llms.types import (
    ChatMessage,
    MessageRole,
)

2.2 Batch Processing and Text Embedding with Hugging Face

Here the batch_iterate function splits a list of texts into smaller chunks, making it easier to process large datasets. The EmbedData class then loads a Hugging Face embedding model, generates embeddings for each batch of text, and collects these embeddings for later use.

def batch_iterate(lst, batch_size):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), batch_size):
        yield lst[i : i + batch_size]

class EmbedData:

    def __init__(self, embed_model_name="BAAI/bge-large-en-v1.5", batch_size = 32):
        self.embed_model_name = embed_model_name
        self.embed_model = self._load_embed_model()
        self.batch_size = batch_size
        self.embeddings = []
        
    def _load_embed_model(self):
        embed_model = HuggingFaceEmbedding(model_name=self.embed_model_name, trust_remote_code=True, cache_folder="./hf_cache")
        return embed_model

    def generate_embedding(self, context):
        return self.embed_model.get_text_embedding_batch(context)
        
    def embed(self, contexts):
        
        self.contexts = contexts
        
        for batch_context in batch_iterate(contexts, self.batch_size):
            batch_embeddings = self.generate_embedding(batch_context)
            self.embeddings.extend(batch_embeddings)
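
As a quick sanity check (illustrative strings only), you can embed a couple of transcript lines and confirm the vector dimension; BAAI/bge-large-en-v1.5 produces 1024-dimensional embeddings, which is why the vector database below is created with vector_dim=1024.

embeddata = EmbedData(embed_model_name="BAAI/bge-large-en-v1.5", batch_size=32)
embeddata.embed([
    "Speaker A: Let's review the quarterly numbers.",
    "Speaker B: Revenue is up 12% over last quarter.",
])
print(len(embeddata.embeddings), len(embeddata.embeddings[0]))  # 2 vectors, 1024 dimensions each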

2.3 Qdrant Vector Database Setup and Ingestion

  • The QdrantVDB_QB class initializes a Qdrant vector database by setting key parameters like collection name, vector dimension, and batch size, and it connects to Qdrant while checking for an existing collection (creating one if needed).
  • It efficiently uploads data by batching text contexts with their corresponding embeddings and then updating the collection's configuration accordingly.

class QdrantVDB_QB:

    def __init__(self, collection_name, vector_dim=768, batch_size=512):
        self.collection_name = collection_name
        self.batch_size = batch_size
        self.vector_dim = vector_dim
        
    def define_client(self):
        # Connect to the locally running Qdrant instance
        self.client = QdrantClient(url="http://localhost:6333", prefer_grpc=True)
        
    def create_collection(self):
        
        if not self.client.collection_exists(collection_name=self.collection_name):

            self.client.create_collection(collection_name=f"{self.collection_name}",
                                          
                                          # Dot-product distance, with full vectors kept on disk
                                          vectors_config=models.VectorParams(size=self.vector_dim,
                                                                             distance=models.Distance.DOT,
                                                                             on_disk=True),
                                          
                                          # Defer indexing until ingestion is finished
                                          optimizers_config=models.OptimizersConfigDiff(default_segment_number=5,
                                                                                        indexing_threshold=0),
                                          
                                          # Binary quantization keeps compressed vectors in RAM for fast search
                                          quantization_config=models.BinaryQuantization(
                                                        binary=models.BinaryQuantizationConfig(always_ram=True)),
                                         )
            
    def ingest_data(self, embeddata):
        # Upload contexts and embeddings batch by batch
        for batch_context, batch_embeddings in zip(batch_iterate(embeddata.contexts, self.batch_size), 
                                                    batch_iterate(embeddata.embeddings, self.batch_size)):
    
            self.client.upload_collection(collection_name=self.collection_name,
                                          vectors=batch_embeddings,
                                          payload=[{"context": context} for context in batch_context])

        # Re-enable indexing now that the data is in place
        self.client.update_collection(collection_name=self.collection_name,
                                      optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000)
                                     )

2.4 Query Embedding Retriever

  • The Retriever class is designed to bridge the gap between user queries and a vector database by initializing with a vector database client and an embedding model.
  • Its search method transforms a query into an embedding using the model, then performs a vector search on the database with fine-tuned quantization parameters to quickly retrieve relevant results.

class Retriever:

    def __init__(self, vector_db, embeddata):
        
        self.vector_db = vector_db
        self.embeddata = embeddata

    def search(self, query):
        # Embed the query with the same model used for the documents
        query_embedding = self.embeddata.embed_model.get_query_embedding(query)
        
        # Vector search with quantization rescoring for better accuracy
        result = self.vector_db.client.search(
            collection_name=self.vector_db.collection_name,
            
            query_vector=query_embedding,
            
            search_params=models.SearchParams(
                quantization=models.QuantizationSearchParams(
                    ignore=False,
                    rescore=True,
                    oversampling=2.0,
                )
            ),
            
            timeout=1000,
        )

        return result
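
Continuing the sketch, retrieval is a single call; each returned point carries the original transcript chunk in its payload.

retriever = Retriever(vector_db=vector_db, embeddata=embeddata)
results = retriever.search("What did the speakers say about revenue?")

for point in results[:2]:
    print(point.payload["context"])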

2.5 RAG Smart Query Assistant

The RAG class integrates a retriever and an LLM to generate context-aware responses. It retrieves relevant information from a vector database, formats it into a structured prompt, and sends it to the LLM for a response. I'm using SambaNovaCloud to access the LLM through their API for efficient text generation.

class RAG:

    def __init__(self,
                 retriever,
                 llm_name = "Meta-Llama-3.1-405B-Instruct"
                 ):
        
        system_msg = ChatMessage(
            role=MessageRole.SYSTEM,
            content="You are a helpful assistant that answers questions about the user's document.",
        )
        self.messages = [system_msg, ]
        self.llm_name = llm_name
        self.llm = self._setup_llm()
        self.retriever = retriever
        self.qa_prompt_tmpl_str = ("Context information is below.\n"
                                   "---------------------\n"
                                   "{context}\n"
                                   "---------------------\n"
                                   "Given the context information above I want you to think step by step to answer the query in a crisp manner, in case you don't know the answer say 'I don't know!'.\n"
                                   "Query: {query}\n"
                                   "Answer: "
                                   )

    def _setup_llm(self):
        # Access the LLM hosted on SambaNova Cloud
        return SambaNovaCloud(
                        model=self.llm_name,
                        temperature=0.7,
                        context_window=100000,
                    )

        # return Ollama(model=self.llm_name,
        #               temperature=0.7,
        #               context_window=100000,
        #             )

    def generate_context(self, query):
        # Retrieve the closest transcript chunks and merge the top two into one context string
        result = self.retriever.search(query)
        context = [dict(data) for data in result]
        combined_prompt = []

        for entry in context[:2]:
            context = entry["payload"]["context"]

            combined_prompt.append(context)

        return "\n\n---\n\n".join(combined_prompt)

    def query(self, query):
        context = self.generate_context(query=query)
        
        prompt = self.qa_prompt_tmpl_str.format(context=context, query=query)

        user_msg = ChatMessage(role=MessageRole.USER, content=prompt)

        # self.messages.append(ChatMessage(role=MessageRole.USER, content=prompt))
                
        streaming_response = self.llm.stream_complete(user_msg.content)
        
        return streaming_response
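
Putting it together, here is a minimal console sketch (the question is just an example); the Streamlit app later reads the raw streaming payload, but the delta attribute on each chunk works for a simple print loop.

rag = RAG(retriever=retriever, llm_name="DeepSeek-R1-Distill-Llama-70B")

for chunk in rag.query("Summarize the main decisions from this conversation."):
    print(chunk.delta or "", end="", flush=True)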

2.6 Audio Transcription

Here the Transcribe class initializes by setting the AssemblyAI API key and creating a transcriber. It then processes an audio file using a configuration that enables speaker labels, ultimately returning a list of dictionaries where each entry maps a speaker to their transcribed text.

class Transcribe:
    def __init__(self, api_key: str):
        """Initialize the Transcribe class with AssemblyAI API key."""
        aai.settings.api_key = api_key
        self.transcriber = aai.Transcriber()
        
    def transcribe_audio(self, audio_path: str) -> List[Dict[str, str]]:
        """
        Transcribe an audio file and return speaker-labeled transcripts.
        
        Args:
            audio_path: Path to the audio file
            
        Returns:
            List of dictionaries containing speaker and text information
        """
        # Configure transcription with speaker labels
        config = aai.TranscriptionConfig(
            speaker_labels=True,
            speakers_expected=2  # Adjust this based on your needs
        )
        
        # Transcribe the audio
        transcript = self.transcriber.transcribe(audio_path, config=config)
        
        # Extract speaker utterances
        speaker_transcripts = []
        for utterance in transcript.utterances:
            speaker_transcripts.append({
                "speaker": f"Speaker {utterance.speaker}",
                "text": utterance.text
            })
            
        return speaker_transcripts
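
A quick example call (the file path is a placeholder) that prints each speaker-labeled segment:

import os

transcriber = Transcribe(api_key=os.getenv("ASSEMBLYAI_API_KEY"))
segments = transcriber.transcribe_audio("sample_audio.mp3")

for segment in segments:
    print(f"{segment['speaker']}: {segment['text']}")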

3. Streamlit App

Streamlit is a Python library that transforms data scripts into interactive web apps, making it ideal for LLM-based solutions.

  • The code below (save it as app.py) builds a user-friendly app that lets users upload an audio file, view its transcript, and chat with it.
  • AssemblyAI transcribes the uploaded audio into speaker-labeled text.
  • The transcript is embedded and stored in a Qdrant vector database for efficient retrieval.
  • A retriever paired with a RAG engine generates context-aware chat responses using these embeddings.
  • Session state manages chat history and file caching to ensure a smooth experience.

import os
import gc
import uuid
import tempfile
import base64
from dotenv import load_dotenv
from rag_code import Transcribe, EmbedData, QdrantVDB_QB, Retriever, RAG
import streamlit as st

if "id" not in st.session_state:
    st.session_state.id = uuid.uuid4()
    st.session_state.file_cache = {}

session_id = st.session_state.id
collection_name = "chat with audios"
batch_size = 32

load_dotenv()

def reset_chat():
    st.session_state.messages = []
    st.session_state.context = None
    gc.collect()

with st.sidebar:
    st.header("Upload your audio file!")
    
    uploaded_file = st.file_uploader("Choose your audio file", type=["mp3", "wav", "m4a"])

    if uploaded_file:
        try:
            with tempfile.TemporaryDirectory() as temp_dir:
                file_path = os.path.join(temp_dir, uploaded_file.name)
                
                with open(file_path, "wb") as f:
                    f.write(uploaded_file.getvalue())
                
                file_key = f"{session_id}-{uploaded_file.name}"
                st.write("Transcribing with AssemblyAI and storing in vector database...")

                if file_key not in st.session_state.get('file_cache', {}):
                    # Initialize transcriber
                    transcriber = Transcribe(api_key=os.getenv("ASSEMBLYAI_API_KEY"))
                    
                    # Get speaker-labeled transcripts
                    transcripts = transcriber.transcribe_audio(file_path)
                    st.session_state.transcripts = transcripts
                    
                    # Each speaker segment becomes a separate document for embedding
                    documents = [f"{t['speaker']}: {t['text']}" for t in transcripts]

                    # embed data
                    embeddata = EmbedData(embed_model_name="BAAI/bge-large-en-v1.5", batch_size=batch_size)
                    embeddata.embed(documents)

                    # set up vector database
                    qdrant_vdb = QdrantVDB_QB(collection_name=collection_name,
                                          batch_size=batch_size,
                                          vector_dim=1024)
                    qdrant_vdb.define_client()
                    qdrant_vdb.create_collection()
                    qdrant_vdb.ingest_data(embeddata=embeddata)

                    # set up retriever
                    retriever = Retriever(vector_db=qdrant_vdb, embeddata=embeddata)

                    # set up rag
                    query_engine = RAG(retriever=retriever, llm_name="DeepSeek-R1-Distill-Llama-70B")
                    st.session_state.file_cache[file_key] = query_engine
                else:
                    query_engine = st.session_state.file_cache[file_key]

                # Inform the user that the file is processed
                st.success("Ready to Chat!")
                
                # Display audio player
                st.audio(uploaded_file)
                
                # Display speaker-labeled transcript
                st.subheader("Transcript")
                with st.expander("Show full transcript", expanded=True):
                    for t in st.session_state.transcripts:
                        st.text(f"**{t['speaker']}**: {t['text']}")
                
        except Exception as e:
            st.error(f"An error occurred: {e}")
            st.stop()

col1, col2 = st.columns([6, 1])

with col1:
    st.markdown("""
    # RAG over Audio powered by <img src="data:image/png;base64,{}" width="200" style="vertical-align: -15px; padding-right: 10px;">  and <img src="data:image/png;base64,{}" width="200" style="vertical-align: -5px; padding-left: 10px;">
""".format(base64.b64encode(open("assets/AssemblyAI.png", "rb").read()).decode(),
           base64.b64encode(open("assets/deep-seek.png", "rb").read()).decode()), unsafe_allow_html=True)

with col2:
    st.button("Clear ↺", on_click=reset_chat)

# Initialize chat history
if "messages" not in st.session_state:
    reset_chat()

# Display chat messages from history on app rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Accept user input
if prompt := st.chat_input("Ask about the audio conversation..."):
    # Add user message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})
    # Display user message in chat message container
    with st.chat_message("user"):
        st.markdown(prompt)

    # Display assistant response in chat message container
    with st.chat_message("assistant"):
        message_placeholder = st.empty()
        full_response = ""
        
        # Get streaming response
        streaming_response = query_engine.query(prompt)
        
        for chunk in streaming_response:
            try:
                new_text = chunk.raw["choices"][0]["delta"]["content"]
                full_response += new_text
                message_placeholder.markdown(full_response + "▌")
            except:
                pass

        message_placeholder.markdown(full_response)

    # Add assistant response to chat history
    st.session_state.messages.append({"role": "assistant", "content": full_response})

Run the app.py file in the terminal with the command below; you can then upload an audio file and interact with the chatbot.

streamlit run app.py

You can see a demo of the app here, and you can download the sample audio file from here.

Conclusion

We have successfully combined AssemblyAI, SambaNova Cloud, Qdrant, and DeepSeek-R1 to build a chatbot that uses Retrieval Augmented Generation over audio. The rag_code.py file manages the RAG workflow, while the app.py file provides a simple Streamlit interface. I encourage you to interact with this chatbot using different audio files, tweak the code, add new features, and explore the endless possibilities of audio-based chat solutions.

GitHub Repo: https://github.com/karthikponna/chat_with_audios/tree/main

Key Takeaways

  • Leveraging AssemblyAI for audio transcription enables accurate speaker-labeled text, providing a strong foundation for advanced conversation experiences.
  • Integrating Qdrant ensures rapid vector-based retrieval, offering quick access to relevant context for more informed responses.
  • Applying a RAG approach combines retrieval and generation, guaranteeing answers grounded in actual data.
  • Employing SambaNova Cloud for the LLM delivers robust language understanding, powering engaging, context-aware interactions.
  • Using Streamlit for the user interface offers a straightforward, interactive environment, simplifying audio-based chatbot deployment.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.

Frequently Asked Questions

Q1. What is RAG, and how does it help in building this chatbot?

A. RAG stands for Retrieval Augmented Generation. It fetches relevant data from a vector database, ensuring the chatbot's answers are grounded in real context rather than just model predictions.

Q2. How do I customize the embedding model used in rag_code.py?

A. Simply change the embed_model_name in the EmbedData class to your preferred Hugging Face model, ensuring it supports text embedding.
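
For example, switching to a smaller model might look like this (remember to update vector_dim in QdrantVDB_QB to match the new embedding size, which is 384 for this model):

embeddata = EmbedData(embed_model_name="sentence-transformers/all-MiniLM-L6-v2", batch_size=32)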

Q3. How can I modify the prompt template for different use cases?

A. Adjust the qa_prompt_tmpl_str in the RAG class to include any additional instructions or formatting needed for your application.

Q4. Why use Qdrant for storing embeddings?

A. Qdrant provides efficient vector search, making it easy to quickly find relevant context within large sets of embedded text.

Hi! I'm Karthik Ponna, a Machine Learning Engineer at Antern. I'm deeply passionate about exploring the fields of AI and Data Science, as they constantly evolve and shape the future. I believe writing blogs is a great way to not only enhance my skills and solidify my understanding, but also to share my knowledge and insights with others in the community. This helps me connect with like-minded individuals who share a curiosity for technology and innovation.
