AI Agent Using a Multimodal Approach

In this blog, I want to present a small agent built with `LangGraph` and Google Gemini for research purposes. The objective is to demonstrate a research agent (Paper-to-Voice Assistant) that plans how to summarize a research paper. This tool uses a vision model to infer the information. The method only identifies the steps and their sub-steps and tries to get answers for those action items. Finally, all the answers are converted into a conversation in which two people discuss the paper. You can consider this a mini NotebookLM from Google.

To elaborate further, I'm using a single uni-directed graph where communication between steps happens from top to bottom. I have also used conditional node connections to handle repeated jobs.

  • A process for building simple agents with the help of LangGraph
  • Multimodal conversation with the Google Gemini LLM

Paper-to-Voice Assistant: Map-reduce in Agentic AI

Imagine a problem so large that a single person would need several days or months to solve it. Now picture a team of skilled people, each given a specific part of the task to work on. They might start by sorting the plans by objective or complexity, then gradually piece together the smaller sections. Once each solver has completed their part, they combine their solutions into the final one.

That is essentially how map-reduce works in agentic AI. The main "problem" gets divided into sub-problem statements. The "solvers" are individual LLMs, and each sub-plan is mapped to a different solver. Each solver works on its assigned sub-problem, performing calculations or inferring information as needed. Finally, the results from all "solvers" are combined (reduced) to produce the final output.
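
To make the idea concrete, here is a toy sketch in plain Python (a hypothetical solver function, not the agent code shown later): the map step hands each sub-problem to a solver, and the reduce step merges the partial answers.

# Toy illustration of the map-reduce idea (hypothetical solver, not an LLM call)
def solve(sub_problem: str) -> str:
    return f"answer to '{sub_problem}'"

sub_problems = ["summarize section 1", "summarize section 2", "summarize section 3"]

partial_answers = [solve(p) for p in sub_problems]   # map: one solver per sub-problem
final_answer = "\n".join(partial_answers)            # reduce: combine the partial results
print(final_answer)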

From Automation to Assistance: The Evolving Role of AI Agents

Map-Reduce in Agentic Conversation

With the advancements in generative AI, LLM agents have become quite popular, and people are benefiting from their capabilities. Some argue that agents can automate a process end to end. However, I view them as productivity enablers. They can help with problem-solving and workflow design, and they let humans focus on the critical parts. For example, agents can serve as automated provers exploring the space of mathematical proofs. They can offer new perspectives and ways of thinking beyond "human-generated" proofs.

Another recent example is the AI-enabled Cursor Studio. Cursor is a managed environment, similar to VS Code, that provides programming assistance.

Agents are also becoming more capable of planning and taking action, resembling the way humans reason, and most importantly, they can adjust their strategies. They are quickly improving in their ability to analyze tasks, develop plans to complete them, and refine their approach through repeated self-critique. Some systems keep humans in the loop, where agents seek guidance from humans at intervals and then proceed based on those instructions.

Agents in the Field

What is not included?

  • I have not included any tools, such as search or custom functions, to make it more advanced.
  • No routing approach or reverse connection is developed.
  • No branching strategies are used for parallel processing or conditional jobs.
  • Similar things can be implemented by loading a PDF and parsing its images and graphs.
  • Working with only 3 images in a single prompt.

Python Libraries Used

  • langchain-google-genai: to connect LangChain with Google's generative AI models
  • python-dotenv: to load secret keys or any other environment variables
  • langgraph: to construct the agents
  • pypdfium2 & pillow: to convert the PDF into images
  • pydub: to segment the audio
  • gradio_client: to call the HF 🤗 model
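
Assuming the standard PyPI package names (scipy, requests, and tqdm are also imported later in the notebook), the dependencies can be installed in one go:

pip install langchain-google-genai python-dotenv langgraph pypdfium2 pillow pydub gradio_client scipy requests tqdm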

Paper-to-Voice Assistant: Practical Implementation

Here's the implementation:

Load the Supporting Libraries

from dotenv import dotenv_values
from langchain_core.messages import HumanMessage
import os
from langchain_google_genai import ChatGoogleGenerativeAI

from langgraph.graph import StateGraph, START, END,MessagesState
from langgraph.graph.message import add_messages
from langgraph.constants import Send
import pypdfium2 as pdfium
import json
from PIL import Image

import operator
from typing import Annotated, TypedDict  # , Optional, List
from langchain_core.pydantic_v1 import BaseModel  # , Field

Load Environment Variables

config = dotenv_values("../.env")
os.environ["GOOGLE_API_KEY"] = config['GEMINI_API']

Currently, I'm using a multimodal approach in this project. To achieve this, I load a PDF file and convert each page into an image. These images are then fed into the Gemini vision model for conversational purposes.

The following code demonstrates how to load a PDF file, convert each page into an image, and save those images to a directory.

pdf = pdfium.PdfDocument("./pdf_image/vision.pdf")

for i in range(len(pdf)):
    page = pdf[i]
    image = page.render(scale=4).to_pil()
    image.save(f"./pdf_image/vision_P{i:03d}.jpg")

Let's display a single page for our reference.

image_path = "./pdf_image/vision_P000.jpg"
img = Image.open(image_path)
img

Output:

research paper

Google Vision Model

Connect to the Gemini model through the API key. The following are the different variants available. One important thing to note is that we should pick a model that supports the data type we need. Since I'm working with images in the conversation, I had to go with either the Gemini 1.5 Pro or Flash models, as these variants support image data.

Google LLM stack
llm = ChatGoogleGenerativeAI(
    mannequin="gemini-1.5-flash-001", # "gemini-1.5-pro",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

Let's move forward by building the schema to manage the output and transfer information between nodes. Use the add operator to merge all the steps and sub-steps that the agent will create.

class State(TypedDict):
    image_path: str  # to store the reference to all the pages
    steps: Annotated[list, operator.add]  # store all the steps generated by the agent

    substeps: Annotated[list, operator.add]  # store all the sub-steps generated by the agent for every step
    solutions: Annotated[list, operator.add]  # store all the solutions generated by the agent for each step
    content: str  # store the content of the paper
    plan: str  # store the processed plan
    Dialog: Annotated[list, operator.add]  # store the generated dialogues
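
To make the role of the reducer concrete, here is a minimal, self-contained sketch (hypothetical node names, separate from the pipeline below) showing that `Annotated[list, operator.add]` makes LangGraph concatenate partial updates instead of overwriting them:

import operator
from typing import Annotated, TypedDict
from langgraph.graph import StateGraph, START, END

class DemoState(TypedDict):
    steps: Annotated[list, operator.add]  # reducer: concatenate list updates

def node_a(state: DemoState):
    return {"steps": ["plan A"]}  # partial update

def node_b(state: DemoState):
    return {"steps": ["plan B"]}  # merged with the previous value via operator.add

demo = StateGraph(DemoState)
demo.add_node("node_a", node_a)
demo.add_node("node_b", node_b)
demo.add_edge(START, "node_a")
demo.add_edge("node_a", "node_b")
demo.add_edge("node_b", END)
print(demo.compile().invoke({"steps": []}))  # -> {'steps': ['plan A', 'plan B']}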

The schema to control the output of the first step.

class Task(BaseModel):
    task: str

This schema will store the output of each sub-step, based on the tasks identified in the previous step.

class SubStep(BaseModel):
    substep: str

Schema to control the output during the conditional looping.

class StepState(TypedDict):
    step: str
    image_path: str
    solutions: str
    Dialog:  str

Step 1: Generate Tasks

In our first step, we'll pass the images to the LLM and instruct it to identify all the plans it intends to execute to understand the research paper fully. I provide several pages at once, asking the model to generate a combined plan based on all the images.

def generate_steps(state: State):
    prompt = """
    Imagine you are a research scientist in artificial intelligence who is experienced in understanding research papers.
    You will be given a research paper and you need to identify all the steps a researcher needs to perform.
    Identify each step and their substeps.
    """
    message = HumanMessage(content=[{'type': 'text', 'text': prompt},
                                    *[{"type": 'image_url', 'image_url': img} for img in state['image_path']]
                                    ])
    response = llm.invoke([message])
    return {"content": [response.content], "image_path": state['image_path']}

Step 2: Plan Parsing

In this step, we'll take the plan identified in the first step and instruct the model to convert it into a structured format. I have defined the schema and included that information in the prompt. It's important to note that external parsing tools can be used to transform the plan into a proper data structure. To improve robustness, you could use "Tools" to parse the data or create a more rigorous schema (a sketch of one such alternative follows the function below).

def markdown_to_json(state: State):
    prompt = """
    You are given markdown content and you need to parse this data into JSON format. Use the correct key and value
    pairs for each bullet point.
    Strictly follow the schema below.

    schema:
    [
    {
      "step": "description of step 1 ",
      "substeps": [
        {
          "key": "title of sub step 1 of step 1",
          "value": "description of sub step 1 of step 1"
        },
        {
          "key": "title of sub step 2 of step 1",
          "value": "description of sub step 2 of step 1"
        }]},
        {
      "step": "description of step 2",
      "substeps": [
        {
          "key": "title of sub step 1 of step 2",
          "value": "description of sub step 1 of step 2"
        },
        {
          "key": "title of sub step 2 of step 2",
          "value": "description of sub step 2 of step 2"
        }]}]

    Content:
    %s
    """ % state['content']
    str_response = llm.invoke([prompt])
    return {'content': str_response.content, "image_path": state['image_path']}
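
As mentioned above, a more robust alternative is to ask the model for typed output directly. The sketch below assumes the installed langchain-google-genai version supports `with_structured_output`; the Pydantic classes are illustrative only and are not part of the pipeline:

# Hypothetical structured-output variant of the plan-parsing step
from typing import List
from langchain_core.pydantic_v1 import BaseModel

class SubStepItem(BaseModel):
    key: str    # title of the sub-step
    value: str  # description of the sub-step

class PlanStep(BaseModel):
    step: str
    substeps: List[SubStepItem]

class Plan(BaseModel):
    steps: List[PlanStep]

structured_llm = llm.with_structured_output(Plan)
# plan = structured_llm.invoke("Parse the following notes into the Plan schema: ...")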

Step 3: Text to JSON

In the third step, we take the plan produced in the second step and convert it into JSON. Please note that this method may not always work if the LLM violates the schema structure (a more defensive variant is sketched after the function).

def parse_json(state: State):
    # The response is wrapped in a ```json ... ``` fence; slice it off before parsing
    str_response_json = json.loads(state['content'][7:-3].strip())
    output = []
    for step in str_response_json:
        substeps = []
        for item in step['substeps']:
            for k, v in item.items():
                if k == 'value':
                    substeps.append(v)
        output.append({'step': step['step'], 'substeps': substeps})
    return {"plan": output}

Step 4: Solution for Each Step

This step takes care of finding the solution for every step. It takes one plan and combines all of its sub-plans into question-and-answer pairs in a single prompt. The LLM then identifies the solution for every sub-step.

This step also combines the multiple outputs into one with the help of the Annotated type and the "add" operator. For now, this step works with fair quality. However, it can be improved by using branching and translating every sub-step into a proper reasoning prompt.

Essentially, every sub-step should be translated into a chain of thought so that the LLM can prepare a solution. One could use ReAct as well.
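
As a rough illustration of that idea, a sub-step could be wrapped in an explicit reasoning scaffold before asking for the final answer (a hypothetical template, not used in the pipeline below):

# Hypothetical chain-of-thought wrapper for a single sub-step
COT_TEMPLATE = """You are analyzing a research paper.
Sub-step: {substep}

First, list the facts from the paper that are relevant to this sub-step.
Then reason through them step by step.
Finally, write the answer after the keyword "Answer:"."""

def to_cot_prompt(substep: str) -> str:
    return COT_TEMPLATE.format(substep=substep)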

def solve_substeps(state: StepState):
    print(state)
    inp = state['step']
    print('solving sub steps')
    qanda = " ".join([f'\n Question: {substep} \n Answer:' for substep in inp['substeps']])
    prompt = f""" You will be given an instruction to analyze research papers. You need to understand the
    instruction and solve all the questions mentioned in the list.
    Keep each question together with its answer in your response. Each response should follow the keyword "Answer"

    Instruction:
    {inp['step']}
    Questions:
    {qanda}
    """
    message = HumanMessage(content=[{'type': 'text', 'text': prompt},
                                    *[{"type": 'image_url', 'image_url': img} for img in state['image_path']]
                                    ])
    response = llm.invoke([message])
    return {"steps": [inp['step']], 'solutions': [response.content]}

Step 5: Conditional Loop

This step is critical for managing the flow of the conversation. It involves an iterative process that maps out the generated plans (steps) and continuously passes information from one node to another in a loop. The loop terminates once all the plans have been executed. Currently, this step handles one-way communication between nodes; if bi-directional communication were needed, we would have to consider different branching strategies.

def continue_to_substeps(state: State):
    steps = state['plan']
    return [Send("sovle_substeps", {"step": s,'image_path':state['image_path']}) for s in steps]

Step 6: Voice

After all the answers are generated, the following code turns them into a dialogue and then combines everything into a podcast featuring two people discussing the paper. The following prompt is taken from here.

SYSTEM_PROMPT = """
You're a world-class podcast producer tasked with remodeling the offered enter textual content into a fascinating and informative podcast script. The enter could also be unstructured or messy, sourced from PDFs or internet pages. Your objective is to extract essentially the most fascinating and insightful content material for a compelling podcast dialogue.

# Steps to Comply with:

1. **Analyze the Enter:**
   Fastidiously study the textual content, figuring out key matters, factors, and fascinating details or anecdotes that
   might drive a fascinating podcast dialog. Disregard irrelevant info or formatting points.

2. **Brainstorm Concepts:**
   Within the `<scratchpad>`, creatively brainstorm methods to current the important thing factors engagingly. Think about:
   - Analogies, storytelling methods, or hypothetical eventualities to make content material relatable
   - Methods to make advanced matters accessible to a normal viewers
   - Thought-provoking inquiries to discover through the podcast
   - Artistic approaches to fill any gaps within the info

3. **Craft the Dialogue:**
   Develop a pure, conversational stream between the host (Jane) and the visitor speaker (the writer or an skilled on the subject). Incorporate:
   - The perfect concepts out of your brainstorming session
   - Clear explanations of advanced matters
   - An interesting and energetic tone to captivate listeners
   - A stability of data and leisure

   Guidelines for the dialogue:
   - The host (Jane) all the time initiates the dialog and interviews the visitor
   - Embrace considerate questions from the host to information the dialogue
   - Incorporate pure speech patterns, together with occasional verbal fillers (e.g., "um," "nicely," "you recognize")
   - Permit for pure interruptions and back-and-forth between host and visitor
   - Make sure the visitor's responses are substantiated by the enter textual content, avoiding unsupported claims
   - Preserve a PG-rated dialog acceptable for all audiences
   - Keep away from any advertising or self-promotional content material from the visitor
   - The host concludes the dialog

4. **Summarize Key Insights:**
   Naturally weave a abstract of key factors into the closing a part of the dialogue. This could really feel like an off-the-cuff dialog slightly than a proper recap, reinforcing the primary takeaways earlier than signing off.

5. **Preserve Authenticity:**
   All through the script, try for authenticity within the dialog. Embrace:
   - Moments of real curiosity or shock from the host
   - Situations the place the visitor may briefly battle to articulate a fancy thought
   - Mild-hearted moments or humor when acceptable
   - Temporary private anecdotes or examples that relate to the subject (inside the bounds of the enter textual content)

6. **Think about Pacing and Construction:**
   Make sure the dialogue has a pure ebb and stream:
   - Begin with a robust hook to seize the listener's consideration
   - Step by step construct complexity because the dialog progresses
   - Embrace temporary "breather" moments for listeners to soak up advanced info
   - Finish on a excessive word, maybe with a thought-provoking query or a call-to-action for listeners
"""
def generate_dialog(state):
    text = state['text']
    tone = state['tone']
    length = state['length']
    language = state['language']

    modified_system_prompt = SYSTEM_PROMPT
    modified_system_prompt += "\nPLEASE paraphrase the following TEXT in dialog format."

    if tone:
        modified_system_prompt += f"\n\nTONE: The tone of the podcast should be {tone}."
    if length:
        length_instructions = {
            "Short (1-2 min)": "Keep the podcast brief, around 1-2 minutes long.",
            "Medium (3-5 min)": "Aim for a moderate length, about 3-5 minutes.",
        }
        modified_system_prompt += f"\n\nLENGTH: {length_instructions[length]}"
    if language:
        modified_system_prompt += (
            f"\n\nOUTPUT LANGUAGE <IMPORTANT>: The podcast should be in {language}."
        )

    messages = modified_system_prompt + '\nTEXT: ' + text

    response = llm.invoke([messages])
    return {"Step": [state['step']], "Finding": [state['text']], 'Dialog': [response.content]}
def continue_to_substeps_voice(state: State):
    print('voice substeps')

    solutions = state['solutions']
    steps = state['steps']

    tone="Formal" #  ["Fun", "Formal"]
    return [Send("generate_dialog", {"step":st,"text": s,'tone':tone,'length':"Short (1-2 min)",'language':"EN"}) for st,s in zip(steps,solutions)]

Step 7: Graph Construction

Now it's time to assemble all the steps we've defined:

  • Initialize the Graph: Start by initializing the graph and defining all the necessary nodes.
  • Define and Connect Nodes: Connect the nodes using edges, ensuring the flow of information from one node to another.
  • Introduce the Loop: Implement the loop as described in step 5, allowing for iterative processing of the plans.
  • Terminate the Process: Finally, use the END node of LangGraph to close and terminate the process properly.

The following code assembles the graph:

graph = StateGraph(State)
graph.add_node("generate_steps", generate_steps)
graph.add_node("markdown_to_json", markdown_to_json)
graph.add_node("parse_json", parse_json)
graph.add_node("sovle_substeps", sovle_substeps)
graph.add_node("generate_dialog", generate_dialog)

graph.add_edge(START, "generate_steps")
graph.add_edge("generate_steps", "markdown_to_json")
graph.add_edge("markdown_to_json", "parse_json")
graph.add_conditional_edges("parse_json", continue_to_substeps, ["sovle_substeps"])
graph.add_conditional_edges("sovle_substeps", continue_to_substeps_voice, ["generate_dialog"])
graph.add_edge("generate_dialog", END)
app = graph.compile()

Let's display the network for a better understanding.

from IPython.display import Image, display  # note: this Image shadows the PIL Image imported earlier

try:
    display(Image(app.get_graph().draw_mermaid_png()))
except Exception:
    print('There seems to be an error rendering the graph')
Workflow for the Agent

Let's begin the process and evaluate how the current framework operates. For now, I'm testing with only a few images, though this can be extended to many more. The code takes images as input and passes this data to the graph's entry point. Each subsequent step then takes the input from the previous step and passes its output to the next step until the process is complete.

output = []
for s in app.stream({"image_path": [f"./pdf_image/vision_P{i:03d}.jpg" for i in range(6)]}):
    output.append(s)

with open('./data/output.json', 'w') as f:
    json.dump(output, f)

Output:

Output
with open('./data/output.json', 'r') as f:
    output1 = json.load(f)
print(output1[10]['generate_dialog']['Dialog'][0])

Output:

Output

I'm using the following code to stream the results sequentially, where each plan and sub-plan is mapped to a corresponding response.

import sys
import time
def printify(tx,stream=False):
    if stream:
        for char in tx:
            sys.stdout.write(char)
            sys.stdout.flush()
            time.sleep(0.05)
    else:
        print(tx)
topics = []
substeps = []
for idx, plan in enumerate(output1[2]['parse_json']['plan']):
    topics.append(plan['step'])
    substeps.append(plan['substeps'])

Let's separate the answers for every sub-step.

text_planner = {}
stream = False

for topic, substep, responses in zip(topics, substeps, output1[3:10]):
    response = responses['solve_substeps']
    response_topic = response['steps']
    if topic in response_topic:
        answer = response['solutions'][0].strip().split('Answer:')
        answer = [ans.strip() for ans in answer if len(ans.strip()) > 0]
        for q, a in zip(substep, answer):
            printify(f'Sub-Step : {q}\n', stream=stream)
            printify(f'Action : {a}\n', stream=stream)
        text_planner[topic] = {'answer': list(zip(substep, answer))}

The output of the sub-steps and their actions:

Output
Step to Sub-step

The last node in the graph converts all the answers into dialogue. Let's store them in separate variables so that we can convert them into voice.

stream = False

dialog_planner = {}
for topic, responses in zip(topics, output1[10:17]):
    dialog = responses['generate_dialog']['Dialog'][0]
    dialog = dialog.strip().split('## Podcast Script')[-1].strip()
    dialog = dialog.replace('[Guest Name]', 'Robin').replace('**Guest:**', '**Robin:**')
    printify(f'Dialog : {dialog}\n', stream=stream)
    dialog_planner[topic] = dialog

Dialog Output

Output
Substep – Dialog

Dialog to Voice

from pydantic import BaseModel, Field
from typing import List, Literal, Tuple, Optional
import glob
import os
import time
from pathlib import Path
from tempfile import NamedTemporaryFile

from scipy.io.wavfile import write as write_wav
import requests
from pydub import AudioSegment
from gradio_client import Client
import json

from tqdm.notebook import tqdm
import sys
from time import sleep
with open('./data/dialog_planner.json', 'r') as f:
    dialog_planner1 = json.load(f)

Now, let's convert these text dialogues into voice.

1. Text to Speech

For now, I'm accessing the TTS model from the HF endpoint, and for that, I need to set the URL and API key.

HF_API_URL = config['HF_API_URL']
HF_API_KEY = config['HF_API_KEY']
headers = {"Authorization": HF_API_KEY}

2. Gradio Client

I'm using the Gradio client to call the model and setting the following path so the Gradio client can save the audio data in the specified directory. If no path is defined, the client stores the audio in a temporary directory.

os.environ['GRADIO_TEMP_DIR'] = "path_to_data_dir"
hf_client = Client("mrfakename/MeloTTS")
hf_client.output_dir = "path_to_data_dir"

def get_text_to_voice(text, speed, accent, language):

    file_path = hf_client.predict(
                    text=text,
                    language=language,
                    speaker=accent,
                    speed=speed,
                    api_name="/synthesize",
                )
    return file_path
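
Before wiring the helper into the loop, you could smoke-test it with a single line; the accent and speed values mirror those used below, and the exact options depend on what the MeloTTS Space exposes:

# Hypothetical smoke test for the TTS helper
sample_path = get_text_to_voice("Hello, and welcome to the show.", 1.0, "EN-US", "EN")
print(sample_path)  # path to the generated audio file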

3. Voice Accent

To generate the dialogue for the podcast, I'm assigning two different accents: one for the host and another for the guest.

def generate_podcast_audio(text: str, language: str) -> str:

    if "**Jane:**" in text:  # host
        text = text.replace("**Jane:**", '').strip()
        accent = "EN-US"
        speed = 0.9
    elif "**Robin:**" in text:  # guest
        text = text.replace("**Robin:**", '').strip()
        accent = "EN_INDIA"
        speed = 1
    else:
        return 'Empty Text'
    for attempt in range(3):

        try:
            file_path = get_text_to_voice(text, speed, accent, language)
            return file_path
        except Exception:
            if attempt == 2:  # last attempt
                raise  # re-raise the last exception if all attempts fail
            time.sleep(1)  # wait for 1 second before retrying

4. Store Voice in an MP3 File

Each audio clip will be under 2 minutes in length. This section generates audio for each short dialogue, and the files are saved in the specified directory.

def store_voice(topic_dialog):
    audio_path = []
    for topic, dialog in tqdm(topic_dialog.items()):
        dialog_speaker = dialog.split("\n")
        for speaker in tqdm(dialog_speaker):
            one_dialog = speaker.strip()

            language_for_tts = "EN"

            if len(one_dialog) > 0:
                audio_file_path = generate_podcast_audio(
                    one_dialog, language_for_tts
                )
                audio_path.append(audio_file_path)
            sleep(5)
        break  # only the first topic is processed in this demo
    return audio_path

audio_paths = store_voice(topic_dialog=dialog_planner1)

5. Combined Audio

Finally, let's combine all the short audio clips to create a longer conversation.

def consolidate_voice(audio_paths, voice_dir):
    audio_segments = []
    voice_path = [paths for paths in audio_paths if paths != 'Empty Text']

    # Intro music
    audio_segment = AudioSegment.from_file(voice_dir + "/light-guitar.wav")
    audio_segments.append(audio_segment)

    for audio_file_path in tqdm(voice_path):
        audio_segment = AudioSegment.from_file(audio_file_path)
        audio_segments.append(audio_segment)

    # Outro music
    audio_segment = AudioSegment.from_file(voice_dir + "/ambient-guitar.wav")
    audio_segments.append(audio_segment)

    combined_audio = sum(audio_segments)
    temporary_directory = voice_dir + "/tmp/"
    os.makedirs(temporary_directory, exist_ok=True)

    temporary_file = NamedTemporaryFile(
        dir=temporary_directory,
        delete=False,
        suffix=".mp3",
    )
    combined_audio.export(temporary_file.name, format="mp3")

consolidate_voice(audio_paths=audio_paths, voice_dir="./data")


Conclusion

Overall, this project is intended purely for demonstration purposes and would require significant effort and process changes to become a production-ready agent. It can serve as a proof of concept (POC) built with minimal effort. In this demonstration, I have not accounted for factors like time complexity, cost, and accuracy, which are critical considerations in a production environment. With that said, I'll conclude here. Thank you for reading. For more technical details, please refer to GitHub.

Frequently Asked Questions

Q1. What is the main objective of the Paper-to-Voice Assistant?

Ans. The Paper-to-Voice Assistant is designed to simplify the process of summarizing research papers by converting the extracted information into a conversational format, enabling easy understanding and accessibility.

Q2. How does the Paper-to-Voice Assistant work?

Ans. The assistant uses a map-reduce approach, where it breaks down the research paper into steps and sub-steps, processes the information using LangGraph and Google Gemini LLMs, and then combines the results into a coherent dialogue.

Q3. What technologies are used in this project?

Ans. The project uses LangGraph, Google Gemini generative AI models, multimodal processing (vision and text), and text-to-speech conversion with the help of Python libraries like pypdfium2, pillow, pydub, and gradio_client.

Q4. Can the agent process complex research papers with images and diagrams?

Ans. Yes, the agent can analyze PDFs containing images by converting each page into an image and feeding the images into the vision model, allowing it to extract visual and textual information.

Q5. Is the current implementation production-ready?

Ans. No, this project is a proof of concept (POC) that demonstrates the workflow. Further optimization and handling of time complexity, cost, and accuracy are needed to make it production-ready.