Tired of manually sifting through hours of audio to find key insights? This guide teaches you how to build an AI-powered chatbot that transforms recordings of meetings, podcasts, and interviews into interactive conversations. Using AssemblyAI for precise transcription with speaker labels, Qdrant for fast vector storage, and DeepSeek-R1 via SambaNova Cloud for smart responses, you'll create a RAG tool that answers questions like "What did [Speaker] say?" or "Summarize this section." Let's turn your audio into a searchable, AI-driven dialogue by building a RAG system with AssemblyAI, Qdrant, and DeepSeek-R1.
Learning Objectives
- Leverage the AssemblyAI API to transcribe audio files with speaker diarization, converting conversations into structured text data for analysis.
- Deploy a Qdrant vector database to store and efficiently retrieve embeddings of the transcribed audio content using Hugging Face models.
- Implement RAG with the DeepSeek-R1 model via SambaNova Cloud to generate context-aware chatbot responses.
- Build a Streamlit web interface for users to upload audio files, visualize transcripts, and interact with the chatbot in real time.
- Integrate an end-to-end workflow combining audio processing, vector storage, and AI-driven response generation into a scalable audio-based chat application.
This article was published as a part of the Data Science Blogathon.
What is AssemblyAI?
AssemblyAI is your go-to tool for turning audio into actionable insights. Whether you're transcribing podcasts, analyzing customer calls, or captioning videos, its AI-powered speech-to-text engine delivers pinpoint accuracy, even with accents or background noise.

What is SambaNova Cloud?
Imagine running massive open-source models like DeepSeek-R1 (671B) up to 10x faster, and without the usual infrastructure headaches.

Instead of relying on GPUs, SambaNova uses RDUs (Reconfigurable Dataflow Units), which unlock faster performance with:
- Massive in-memory storage: no constant reloading of models
- Efficient dataflow design: optimized for high-throughput tasks
- Instant model switching: swap between models in microseconds
- Run DeepSeek-R1 directly: no complicated setup required
- Train and fine-tune on the same platform: all in one place
What is Qdrant?
Qdrant is a lightning-fast vector database built to supercharge AI applications; think of it as a search engine that finds needles in haystacks. Whether you're building a recommendation system, an image search tool, or a chatbot, Qdrant specializes in similarity search, quickly pinpointing the closest matches for complex data like text embeddings or visual features.

What is DeepSeek-R1?
DeepSeek-R1 is a game-changing language model that blends human-like adaptability with cutting-edge AI, making it a standout in natural language processing. Whether you're crafting content, translating languages, debugging code, or summarizing complex reports, R1 excels at understanding context, tone, and intent, delivering responses that feel intuitive rather than robotic. By prioritizing empathy and precision, DeepSeek-R1 isn't just a tool; it's a glimpse into a future where AI communicates as naturally as we do.

Building the RAG Model with AssemblyAI and DeepSeek-R1
Now that you understand all of the components, let's dive into building our RAG system. But before we do that, let's quickly cover what you'll need to get started.
1. Necessary Prerequisites
Below are the prerequisites required:
Clone the repository:
git clone https://github.com/karthikponna/chat_with_audios.git
cd chat_with_audios
Create and activate the virtual environment:
# For macOS and Linux:
python3 -m venv venv
source venv/bin/activate
# For Windows:
python -m venv venv
.\venv\Scripts\activate
Install Required Dependencies:
pip install -r requirements.txt
Set Up Environment Variables:
Create a `.env` file and add your AssemblyAI and SambaNova API keys.
ASSEMBLYAI_API_KEY="your_assemblyai_api_key_string"
SAMBANOVA_API_KEY="your_sambanova_api_key_string"
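Start a local Qdrant instance:
The vector database code later in this guide connects to Qdrant at http://localhost:6333 (with gRPC on port 6334). If you do not already have an instance running, a quick way to start one locally, assuming Docker is installed, is:
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant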
Now let's begin with the coding part.
2. Retrieval Augmented Generation
RAG merges large language models with external data to produce more accurate, context-rich answers. It fetches relevant information at query time, ensuring responses rely on real data instead of just the model's training.
2.1 Importing Necessary Libraries
Let's create a file named rag_code.py. We'll walk through the code step by step, starting with importing the required modules and orchestrating the code architecture using LlamaIndex.
from qdrant_client import models
from qdrant_client import QdrantClient
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.sambanovasystems import SambaNovaCloud
from llama_index.llms.ollama import Ollama

import assemblyai as aai
from typing import List, Dict

from llama_index.core.base.llms.types import (
    ChatMessage,
    MessageRole,
)
2.2 Batch Processing and Text Embedding with Hugging Face
Here the batch_iterate function splits a list of texts into smaller chunks, making it easier to process large datasets. The EmbedData class then loads a Hugging Face embedding model, generates embeddings for each batch of text, and collects these embeddings for later use. A short usage sketch follows the class below.
def batch_iterate(lst, batch_size):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), batch_size):
        yield lst[i : i + batch_size]

class EmbedData:
    def __init__(self, embed_model_name="BAAI/bge-large-en-v1.5", batch_size=32):
        self.embed_model_name = embed_model_name
        self.embed_model = self._load_embed_model()
        self.batch_size = batch_size
        self.embeddings = []

    def _load_embed_model(self):
        embed_model = HuggingFaceEmbedding(model_name=self.embed_model_name, trust_remote_code=True, cache_folder="./hf_cache")
        return embed_model

    def generate_embedding(self, context):
        return self.embed_model.get_text_embedding_batch(context)

    def embed(self, contexts):
        self.contexts = contexts
        for batch_context in batch_iterate(contexts, self.batch_size):
            batch_embeddings = self.generate_embedding(batch_context)
            self.embeddings.extend(batch_embeddings)
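To make EmbedData's behavior concrete, here is a minimal usage sketch; the sample strings are placeholders rather than real transcript data:
# Quick sanity check for EmbedData (sample texts are illustrative only)
sample_texts = [
    "Speaker A: Welcome to the show. Today we are talking about vector databases.",
    "Speaker B: Thanks for having me, happy to dive in.",
]

embeddata = EmbedData(embed_model_name="BAAI/bge-large-en-v1.5", batch_size=32)
embeddata.embed(sample_texts)

print(len(embeddata.embeddings))     # one embedding per input text -> 2
print(len(embeddata.embeddings[0]))  # bge-large-en-v1.5 produces 1024-dimensional vectors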
2.3 Qdrant Vector Database Setup and Ingestion
- QdrantVDB_QB class initializes a Qdrant vector database by establishing key parameters like assortment identify, vector dimension, and batch measurement, and it connects to Qdrant whereas checking for an present assortment (creating one if wanted).
- It effectively uploads knowledge by batching textual content contexts with their corresponding embeddings after which updating the gathering’s configuration accordingly.
class QdrantVDB_QB:
    def __init__(self, collection_name, vector_dim=768, batch_size=512):
        self.collection_name = collection_name
        self.batch_size = batch_size
        self.vector_dim = vector_dim

    def define_client(self):
        self.client = QdrantClient(url="http://localhost:6333", prefer_grpc=True)

    def create_collection(self):
        if not self.client.collection_exists(collection_name=self.collection_name):
            self.client.create_collection(collection_name=f"{self.collection_name}",
                                          vectors_config=models.VectorParams(size=self.vector_dim,
                                                                             distance=models.Distance.DOT,
                                                                             on_disk=True),
                                          optimizers_config=models.OptimizersConfigDiff(default_segment_number=5,
                                                                                        indexing_threshold=0),
                                          quantization_config=models.BinaryQuantization(
                                              binary=models.BinaryQuantizationConfig(always_ram=True)),
                                          )

    def ingest_data(self, embeddata):
        for batch_context, batch_embeddings in zip(batch_iterate(embeddata.contexts, self.batch_size),
                                                   batch_iterate(embeddata.embeddings, self.batch_size)):
            self.client.upload_collection(collection_name=self.collection_name,
                                          vectors=batch_embeddings,
                                          payload=[{"context": context} for context in batch_context])

        self.client.update_collection(collection_name=self.collection_name,
                                      optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000)
                                      )
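Assuming a local Qdrant instance is running (see the prerequisites above), wiring the class together looks roughly like this; the collection name here is just an example:
# Create the collection and ingest the embedded transcript chunks
vector_db = QdrantVDB_QB(collection_name="chat_with_audios_demo",
                         vector_dim=1024,  # must match the embedding model's output size
                         batch_size=512)
vector_db.define_client()
vector_db.create_collection()
vector_db.ingest_data(embeddata=embeddata)  # embeddata comes from the EmbedData step above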
2.4 Query Embedding Retriever
- The Retriever class bridges the gap between user queries and the vector database; it is initialized with a vector database client and an embedding model.
- Its search method transforms a query into an embedding using the model, then performs a vector search on the database with fine-tuned quantization parameters to quickly retrieve relevant results. A usage sketch follows the class below.
class Retriever:
    def __init__(self, vector_db, embeddata):
        self.vector_db = vector_db
        self.embeddata = embeddata

    def search(self, query):
        query_embedding = self.embeddata.embed_model.get_query_embedding(query)

        result = self.vector_db.client.search(
            collection_name=self.vector_db.collection_name,
            query_vector=query_embedding,
            search_params=models.SearchParams(
                quantization=models.QuantizationSearchParams(
                    ignore=False,
                    rescore=True,
                    oversampling=2.0,
                )
            ),
            timeout=1000,
        )
        return result
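Continuing the same running example, the retriever can be exercised on its own before plugging it into the RAG class; the query string is illustrative:
# Fetch the transcript chunks most similar to a question
retriever = Retriever(vector_db=vector_db, embeddata=embeddata)
hits = retriever.search("What did the speakers say about vector databases?")

for point in hits[:2]:
    print(point.payload["context"])  # the transcript text stored as payload at ingestion time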
2.5 RAG Smart Query Assistant
The RAG class integrates a retriever and an LLM to generate context-aware responses. It retrieves relevant information from the vector database, formats it into a structured prompt, and sends it to the LLM for a response. I'm using SambaNovaCloud to access the LLM through their API for efficient text generation. A short usage sketch follows the class below.
class RAG:
    def __init__(self,
                 retriever,
                 llm_name="Meta-Llama-3.1-405B-Instruct"
                 ):
        system_msg = ChatMessage(
            role=MessageRole.SYSTEM,
            content="You are a helpful assistant that answers questions about the user's document.",
        )
        self.messages = [system_msg, ]
        self.llm_name = llm_name
        self.llm = self._setup_llm()
        self.retriever = retriever
        self.qa_prompt_tmpl_str = ("Context information is below.\n"
                                   "---------------------\n"
                                   "{context}\n"
                                   "---------------------\n"
                                   "Given the context information above, I want you to think step by step to answer the query in a crisp manner; in case you don't know the answer, say 'I don't know!'.\n"
                                   "Query: {query}\n"
                                   "Answer: "
                                   )

    def _setup_llm(self):
        return SambaNovaCloud(
            model=self.llm_name,
            temperature=0.7,
            context_window=100000,
        )
        # return Ollama(model=self.llm_name,
        #               temperature=0.7,
        #               context_window=100000,
        #               )

    def generate_context(self, query):
        result = self.retriever.search(query)
        context = [dict(data) for data in result]
        combined_prompt = []

        for entry in context[:2]:
            context = entry["payload"]["context"]
            combined_prompt.append(context)

        return "\n\n---\n\n".join(combined_prompt)

    def query(self, query):
        context = self.generate_context(query=query)
        prompt = self.qa_prompt_tmpl_str.format(context=context, query=query)
        user_msg = ChatMessage(role=MessageRole.USER, content=prompt)
        # self.messages.append(ChatMessage(role=MessageRole.USER, content=prompt))

        streaming_response = self.llm.stream_complete(user_msg.content)
        return streaming_response
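Here is a minimal sketch of querying the assistant outside of Streamlit; the question is an example, and the chunk handling assumes the LlamaIndex streaming response exposes a delta field with the newly generated text (the Streamlit app below reads the raw chunks instead):
# Ask a question and stream the answer to the console
rag = RAG(retriever=retriever, llm_name="DeepSeek-R1-Distill-Llama-70B")

streaming_response = rag.query("Summarize the main points of the conversation.")
for chunk in streaming_response:
    print(chunk.delta, end="", flush=True)  # each chunk carries the latest piece of generated text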
2.6 Audio Transcription
Here the Transcribe class initializes by setting the AssemblyAI API key and creating a transcriber. It then processes an audio file using a configuration that enables speaker labels, ultimately returning a list of dictionaries where each entry maps a speaker to their transcribed text. A brief usage example follows the class below.
class Transcribe:
    def __init__(self, api_key: str):
        """Initialize the Transcribe class with the AssemblyAI API key."""
        aai.settings.api_key = api_key
        self.transcriber = aai.Transcriber()

    def transcribe_audio(self, audio_path: str) -> List[Dict[str, str]]:
        """
        Transcribe an audio file and return speaker-labeled transcripts.

        Args:
            audio_path: Path to the audio file

        Returns:
            List of dictionaries containing speaker and text information
        """
        # Configure transcription with speaker labels
        config = aai.TranscriptionConfig(
            speaker_labels=True,
            speakers_expected=2  # Adjust this based on your needs
        )

        # Transcribe the audio
        transcript = self.transcriber.transcribe(audio_path, config=config)

        # Extract speaker utterances
        speaker_transcripts = []
        for utterance in transcript.utterances:
            speaker_transcripts.append({
                "speaker": f"Speaker {utterance.speaker}",
                "text": utterance.text
            })

        return speaker_transcripts
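And here is a short example of the transcription step on its own; the file name is a placeholder, and the API key is read from the .env file created earlier:
import os
from dotenv import load_dotenv

load_dotenv()  # loads ASSEMBLYAI_API_KEY from the .env file

# Transcribe a local file and print who said what (the file name is an example)
transcriber = Transcribe(api_key=os.getenv("ASSEMBLYAI_API_KEY"))
transcripts = transcriber.transcribe_audio("sample_meeting.mp3")

for utterance in transcripts:
    print(f"{utterance['speaker']}: {utterance['text']}")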
3. Streamlit App
Streamlit is a Python library that transforms data scripts into interactive web apps, making it perfect for LLM-based solutions.
- The code below builds a user-friendly app that lets users upload an audio file, view its transcript, and chat about it.
- AssemblyAI transcribes the uploaded audio into speaker-labeled text.
- The transcript is embedded and stored in a Qdrant vector database for efficient retrieval.
- A retriever paired with a RAG engine generates context-aware chat responses using these embeddings.
- Session state manages chat history and file caching to ensure a smooth experience.
import os
import gc
import uuid
import tempfile
import base64
from dotenv import load_dotenv
from rag_code import Transcribe, EmbedData, QdrantVDB_QB, Retriever, RAG
import streamlit as st

if "id" not in st.session_state:
    st.session_state.id = uuid.uuid4()
    st.session_state.file_cache = {}

session_id = st.session_state.id
collection_name = "chat with audios"
batch_size = 32

load_dotenv()

def reset_chat():
    st.session_state.messages = []
    st.session_state.context = None
    gc.collect()
with st.sidebar:
    st.header("Upload your audio file!")
    uploaded_file = st.file_uploader("Choose your audio file", type=["mp3", "wav", "m4a"])

    if uploaded_file:
        try:
            with tempfile.TemporaryDirectory() as temp_dir:
                file_path = os.path.join(temp_dir, uploaded_file.name)

                with open(file_path, "wb") as f:
                    f.write(uploaded_file.getvalue())

                file_key = f"{session_id}-{uploaded_file.name}"
                st.write("Transcribing with AssemblyAI and storing in vector database...")

                if file_key not in st.session_state.get('file_cache', {}):
                    # Initialize transcriber
                    transcriber = Transcribe(api_key=os.getenv("ASSEMBLYAI_API_KEY"))

                    # Get speaker-labeled transcripts
                    transcripts = transcriber.transcribe_audio(file_path)
                    st.session_state.transcripts = transcripts

                    # Each speaker segment becomes a separate document for embedding
                    documents = [f"Speaker {t['speaker']}: {t['text']}" for t in transcripts]

                    # embed data
                    embeddata = EmbedData(embed_model_name="BAAI/bge-large-en-v1.5", batch_size=batch_size)
                    embeddata.embed(documents)

                    # set up vector database
                    qdrant_vdb = QdrantVDB_QB(collection_name=collection_name,
                                              batch_size=batch_size,
                                              vector_dim=1024)
                    qdrant_vdb.define_client()
                    qdrant_vdb.create_collection()
                    qdrant_vdb.ingest_data(embeddata=embeddata)

                    # set up retriever
                    retriever = Retriever(vector_db=qdrant_vdb, embeddata=embeddata)

                    # set up rag
                    query_engine = RAG(retriever=retriever, llm_name="DeepSeek-R1-Distill-Llama-70B")

                    st.session_state.file_cache[file_key] = query_engine
                else:
                    query_engine = st.session_state.file_cache[file_key]

                # Inform the user that the file is processed
                st.success("Ready to Chat!")

                # Display audio player
                st.audio(uploaded_file)

                # Display speaker-labeled transcript
                st.subheader("Transcript")
                with st.expander("Show full transcript", expanded=True):
                    for t in st.session_state.transcripts:
                        st.text(f"**{t['speaker']}**: {t['text']}")

        except Exception as e:
            st.error(f"An error occurred: {e}")
            st.stop()
col1, col2 = st.columns([6, 1])

with col1:
    st.markdown("""
    # RAG over Audio powered by <img src="data:image/png;base64,{}" width="200" style="vertical-align: -15px; padding-right: 10px;"> and <img src="data:image/png;base64,{}" width="200" style="vertical-align: -5px; padding-left: 10px;">
    """.format(base64.b64encode(open("assets/AssemblyAI.png", "rb").read()).decode(),
               base64.b64encode(open("assets/deep-seek.png", "rb").read()).decode()), unsafe_allow_html=True)

with col2:
    st.button("Clear ↺", on_click=reset_chat)

# Initialize chat history
if "messages" not in st.session_state:
    reset_chat()

# Display chat messages from history on app rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])
# Accept user input
if prompt := st.chat_input("Ask about the audio conversation..."):
    # Add user message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})
    # Display user message in chat message container
    with st.chat_message("user"):
        st.markdown(prompt)

    # Display assistant response in chat message container
    with st.chat_message("assistant"):
        message_placeholder = st.empty()
        full_response = ""

        # Get streaming response
        streaming_response = query_engine.query(prompt)

        for chunk in streaming_response:
            try:
                new_text = chunk.raw["choices"][0]["delta"]["content"]
                full_response += new_text
                message_placeholder.markdown(full_response + "▌")
            except:
                pass

        message_placeholder.markdown(full_response)

    # Add assistant response to chat history
    st.session_state.messages.append({"role": "assistant", "content": full_response})
Run the app.py file in the terminal with the command below; you can then upload an audio file and interact with the chatbot.
streamlit run app.py
You can see a demo of the app here, and you can download the sample audio file from here.
Conclusion
We have successfully combined AssemblyAI, SambaNova Cloud, Qdrant, and DeepSeek-R1 to build a chatbot that uses Retrieval Augmented Generation over audio. The rag_code.py file manages the RAG workflow, while the app.py file provides a simple Streamlit interface. I encourage you to interact with this chatbot using different audio files, tweak the code, add new features, and explore the endless possibilities of audio-based chat solutions.
GitHub Repo: https://github.com/karthikponna/chat_with_audios/tree/main
Key Takeaways
- Leveraging AssemblyAI for audio transcription yields accurate speaker-labeled text, providing a solid foundation for advanced conversation experiences.
- Integrating Qdrant ensures rapid vector-based retrieval, offering quick access to relevant context for more informed responses.
- Applying a RAG approach combines retrieval and generation, keeping answers grounded in actual data.
- Employing SambaNova Cloud for the LLM delivers strong language understanding, powering engaging, context-aware interactions.
- Using Streamlit for the user interface offers a straightforward, interactive environment, simplifying audio-based chatbot deployment.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.
Frequently Asked Questions
Q. What is RAG?
A. RAG stands for Retrieval Augmented Generation. It fetches relevant data from a vector database, ensuring the chatbot's answers are grounded in real context rather than just model predictions.
Q. Can I use a different embedding model?
A. Simply change the embed_model_name in the EmbedData class to your preferred Hugging Face model, making sure it supports text embedding.
Q. How do I customize the chatbot's prompt?
A. Adjust the qa_prompt_tmpl_str in the RAG class to include any additional instructions or formatting needed for your application.
Q. Why use Qdrant to store the transcripts?
A. Qdrant provides efficient vector search, making it easy to quickly find relevant context within large sets of embedded text.