A Multimodal AI Assistant: Combining Local and Cloud Models | by Robert Martin-Short | Jan, 2025

Impressive! One might argue about whether or not it actually found all of the skyscrapers here, but I feel like such a system has the potential to be quite powerful and useful, especially if we were to add the ability to crop the bounding boxes, zoom in and continue the conversation.

In the following sections, let's dive into the main steps in a bit more detail. My hope is that some of them might be informative for your projects too.

My previous article contains a more detailed discussion of agents and LangGraph, so here I'll just touch on the agent state for this project. The AgentState is made available to all of the nodes in the LangGraph graph, and it's where the information associated with a query gets stored.

Each node can be told to write to one or more variables in the state, and by default they get overwritten. This isn't the behavior we want for the plan output, which is meant to be a list of results from each step of the plan. To ensure that this list gets appended to as the agent goes about its work, we use the add reducer, which you can read more about here.
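As a minimal sketch (the exact field names are assumptions based on the state values that appear later in this article), the AgentState might look something like this:

from operator import add
from typing import Annotated, List
from typing_extensions import TypedDict


class AgentState(TypedDict):
    # These values are simply overwritten on each update
    plan: str
    plan_structure: str
    current_step: int
    max_steps: int
    answer_assessment: str
    # The add reducer appends each step's output instead of overwriting
    plan_output: Annotated[List[dict], add]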

Each of the nodes in the graph above is a method in the class AgentNodes. They take in state, perform some action (usually calling an LLM) and output their updates to the state. For example, here's the node used to structure the plan, copied from the code here.

    def structure_plan_node(self, state: dict) -> dict:
        messages = state["plan"]
        response = self.llm_structure.call(messages)
        final_plan_dict = self.post_process_plan_structure(response)
        final_plan = json.dumps(final_plan_dict)

        return {
            "plan_structure": final_plan,
            "current_step": 0,
            "max_steps": len(final_plan_dict),
        }

The routing node is also important because it's visited multiple times over the course of plan execution. In the current code it's very simple, just updating the current step state value so that other nodes know which part of the plan structure list to look at.

    def routing_node(self, state: dict) -> dict:
        plan_stage = state.get("current_step", 0)
        return {"current_step": plan_stage + 1}

An extension here would be to add another LLM call in the routing node to check whether the output of the previous step of the plan warrants any modifications to the following steps, or early termination if the question has already been answered.
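As a sketch of that idea (the llm_review caller and its question_answered field are assumptions, not part of the current code), the routing node could consult a small structured LLM before advancing:

    def routing_node_with_review(self, state: dict) -> dict:
        # Hypothetical extension: ask a structured LLM whether the latest step's
        # output already answers the question, and skip to finalization if so.
        # self.llm_review is an assumed caller returning an object with a
        # boolean `question_answered` field.
        plan_stage = state.get("current_step", 0)
        previous_outputs = state.get("plan_output", [])

        if previous_outputs:
            review = self.llm_review.call(
                f"Plan: {state['plan_structure']}\nLatest result: {previous_outputs[-1]}"
            )
            if review.question_answered:
                # Jump past the last step so that choose_model routes to "finalize"
                return {"current_step": state.get("max_steps", 0) + 1}

        return {"current_step": plan_stage + 1}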

Lastly, we need to add two conditional edges, which use information stored in the AgentState to determine which node should be run next. For example, the choose_model edge looks at the name of the current step in the plan_structure object carried in AgentState and then uses a simple if statement to return the name of the corresponding node that should be called at that step.

    def choose_model(state: dict) -> str:
        current_plan = json.loads(state.get("plan_structure"))
        current_step = state.get("current_step", 1)
        max_step = state.get("max_steps", 999)

        if current_step > max_step:
            return "finalize"
        else:
            step_to_execute = current_plan[str(current_step)]["tool_name"]
            return step_to_execute

The entire agent structure looks like this.

edges: AgentEdges = AgentEdges()
nodes: AgentNodes = AgentNodes()
agent: StateGraph = StateGraph(AgentState)

## Nodes
agent.add_node("planning", nodes.plan_node)
agent.add_node("structure_plan", nodes.structure_plan_node)
agent.add_node("routing", nodes.routing_node)
agent.add_node("special_vision", nodes.call_special_vision_node)
agent.add_node("general_vision", nodes.call_general_vision_node)
agent.add_node("evaluation", nodes.assessment_node)
agent.add_node("response", nodes.dump_result_node)

## Edges
agent.set_entry_point("planning")
agent.add_edge("planning", "structure_plan")
agent.add_edge("structure_plan", "routing")
agent.add_conditional_edges(
    "routing",
    edges.choose_model,
    {
        "special_vision": "special_vision",
        "general_vision": "general_vision",
        "finalize": "evaluation",
    },
)
agent.add_edge("special_vision", "routing")
agent.add_edge("general_vision", "routing")
agent.add_conditional_edges(
    "evaluation",
    edges.back_to_plan,
    {
        "good_answer": "response",
        "bad_answer": "planning",
        "timeout": "response",
    },
)
agent.add_edge("response", END)

And it can be visualized in a notebook using the tutorial here.
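For example, a minimal sketch using LangGraph's built-in Mermaid rendering (assuming you are working in a Jupyter notebook):

from IPython.display import Image, display

# Compile the graph and render its structure as a Mermaid diagram
app = agent.compile()
display(Image(app.get_graph().draw_mermaid_png()))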

The planning, plan structuring and assessment nodes are ideally suited to a text-based LLM that can reason and produce structured outputs. The most straightforward option here is to go with a large, versatile model like GPT-4o-mini, which has the benefit of excellent support for JSON output from a Pydantic schema.

With the help of some LangChain functionality, we can make a class to call such a model.

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


class StructuredOpenAICaller:
    # Model name referenced in __init__; GPT-4o-mini as discussed above
    MODEL_NAME = "gpt-4o-mini"

    def __init__(
        self, api_key, system_prompt, output_model, temperature=0, max_tokens=1000
    ):
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.system_prompt = system_prompt
        self.output_model = output_model
        self.llm = ChatOpenAI(
            model=self.MODEL_NAME,
            api_key=api_key,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        self.chain = self._set_up_chain()

    def _set_up_chain(self):
        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self.system_prompt.system_template),
                ("human", "{query}"),
            ]
        )
        structured_llm = self.llm.with_structured_output(self.output_model)
        chain = prompt | structured_llm

        return chain

    def call(self, query):
        return self.chain.invoke({"query": query})

To set this up, we supply a system prompt and an output model (see here for some examples of these), and then we can just use the call method with an input string to get a response that conforms to the structure of the output model we specified. With the code set up like this, we'd need to make a new instance of StructuredOpenAICaller for every different system prompt and output model used in the agent. I personally prefer this as a way to keep track of the different models being used, but as the agent becomes more complex it could be modified with another method that directly updates the system prompt and output model in a single instance of the class.
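As a usage sketch (PlanStep, Plan and PlanGenerationPrompt here are hypothetical stand-ins for the schemas and prompts in the project code):

from pydantic import BaseModel, Field

# Hypothetical output schema for illustration only
class PlanStep(BaseModel):
    tool_name: str = Field(description="Which vision tool to call")
    tool_input: str = Field(description="The input to pass to that tool")

class Plan(BaseModel):
    steps: list[PlanStep]

planner = StructuredOpenAICaller(
    api_key="YOUR_OPENAI_API_KEY",
    system_prompt=PlanGenerationPrompt,  # assumed object with a system_template attribute
    output_model=Plan,
)
plan = planner.call("Are there dogs in this image? If so, find them")
print(plan.steps)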

Can we do this with local models too? On Apple Silicon, we can use the MLX library and the MLX community on Hugging Face to easily experiment with open source models like Llama3.2. LangChain also has support for MLX integration, so we can follow the structure of the class above to set up a local model.

from typing import Any

from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.llms.mlx_pipeline import MLXPipeline
from langchain_community.chat_models.mlx import ChatMLX


class StructuredLlamaCaller:
    MODEL_PATH = "mlx-community/Llama-3.2-3B-Instruct-4bit"

    def __init__(
        self,
        system_prompt: Any,
        output_model: Any,
        temperature: float = 0,
        max_tokens: int = 1000,
    ) -> None:
        self.system_prompt = system_prompt
        # this is the name of the Pydantic model that defines
        # the structure we want to output
        self.output_model = output_model
        self.loaded_model = MLXPipeline.from_model_id(
            self.MODEL_PATH,
            pipeline_kwargs={"max_tokens": max_tokens, "temp": temperature, "do_sample": False},
        )
        self.llm = ChatMLX(llm=self.loaded_model)
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.chain = self._set_up_chain()

    def _set_up_chain(self) -> Any:
        # Set up a parser
        parser = PydanticOutputParser(pydantic_object=self.output_model)

        # Prompt
        prompt = ChatPromptTemplate.from_messages(
            [
                (
                    "system",
                    self.system_prompt.system_template,
                ),
                ("human", "{query}"),
            ]
        ).partial(format_instructions=parser.get_format_instructions())

        chain = prompt | self.llm | parser
        return chain

    def call(self, query: str) -> Any:
        return self.chain.invoke({"query": query})

There are a few interesting points here. For a start, we can just download the weights and config for Llama3.2 as we would any other Hugging Face model, and then under the hood they're loaded into MLX using the MLXPipeline tool from LangChain. When the models are first downloaded they're automatically placed in the Hugging Face cache. Sometimes it's useful to list the models and their cache locations, for example if you want to copy a model to a new environment. The util scan_cache_dir will help here, and can be used to produce a useful summary with the following function.

import pandas as pd
from huggingface_hub import scan_cache_dir


def fetch_downloaded_model_details():
    hf_cache_info = scan_cache_dir()

    repo_paths = []
    size_on_disk = []
    repo_ids = []

    for repo in sorted(
        hf_cache_info.repos, key=lambda repo: repo.repo_path
    ):
        repo_paths.append(str(repo.repo_path))
        size_on_disk.append(repo.size_on_disk)
        repo_ids.append(repo.repo_id)

    repos_df = pd.DataFrame({
        "local_path": repo_paths,
        "size_on_disk": size_on_disk,
        "model_name": repo_ids
    })

    repos_df.set_index("model_name", inplace=True)
    return repos_df.to_dict(orient="index")
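Calling it then gives a quick overview of what's in the cache, along these lines:

details = fetch_downloaded_model_details()
for model_name, info in details.items():
    # Print each cached model with its size and local path
    print(f"{model_name}: {info['size_on_disk'] / 1e9:.2f} GB at {info['local_path']}")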

Llama3.2 doesn't have built-in support for structured output like GPT-4o-mini, so we need to use the prompt to force it to generate JSON. LangChain's PydanticOutputParser can help, though it is also possible to implement your own version of this as shown here.

In my experience, the version of Llama that I'm using here, namely Llama-3.2-3B-Instruct-4bit, is not reliable for structured output beyond the simplest schemas. It's reasonably good at the "plan generation" stage of our agent given a prompt with a few examples, but even with the help of the instructions provided by PydanticOutputParser, it often fails to turn that plan into JSON. Larger and/or less quantized versions of Llama will likely do better, but they may run into RAM issues if run alongside the other models in our agent. Therefore, going forwards in the project, the orchestration model is set to be GPT-4o-mini.

To be able to answer questions like "What's going on in this image?" or "What city is this?", we need a multimodal LLM. Arguably Florence2 in image captioning mode might be able to give good responses to this type of question, but it's not really designed for conversational output.

The field of multimodal models small enough to run on a laptop is still in its infancy (a recently compiled list can be found here), but the Qwen2-VL series from Alibaba is a promising development. Furthermore, we can make use of MLX-VLM, an extension of MLX specifically designed for tuning and inference of vision models, to set up one of these models within our agent framework.

from mlx_vlm import load, apply_chat_template, generate


class QwenCaller:
    MODEL_PATH = "mlx-community/Qwen2-VL-2B-Instruct-4bit"

    def __init__(self, max_tokens=1000, temperature=0):
        self.model, self.processor = load(self.MODEL_PATH)
        self.config = self.model.config
        self.max_tokens = max_tokens
        self.temperature = temperature

    def call(self, query, image):
        messages = [
            {
                "role": "system",
                "content": ImageInterpretationPrompt.system_template,
            },
            {"role": "user", "content": query},
        ]
        prompt = apply_chat_template(self.processor, self.config, messages)
        output = generate(
            self.model,
            self.processor,
            image,
            prompt,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
        )
        return output

This class will load the smallest version of Qwen2-VL and then call it with an input image and prompt to get a textual response. For more detail about the functionality of this model and others that could be used in the same way, take a look at the list of examples on the MLX-VLM GitHub page. Qwen2-VL is also apparently capable of producing bounding boxes and object pointing coordinates, so this capability could be explored and compared with Florence2.
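Using the class looks something like this (the image path is just an illustrative placeholder):

from PIL import Image

qwen = QwenCaller()
image = Image.open("test_image.jpg")  # hypothetical path to a local test image
answer = qwen.call("What's going on in this image?", image)
print(answer)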

Of course GPT-4o-mini also has vision capabilities and is likely more reliable than smaller local models. Therefore, when building these kinds of applications it's useful to add the ability to call a cloud-based alternative, if anything just as a backup in case one of the local models fails. Note that input images must be converted to base64 before they can be sent to the model, but once that's done we can again use the LangChain framework as shown below.

import base64
from io import BytesIO

from PIL import Image
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser


def convert_PIL_to_base64(image: Image, format="jpeg"):
    buffer = BytesIO()
    # Save the image to this buffer in the specified format
    image.save(buffer, format=format)
    # Get binary data from the buffer
    image_bytes = buffer.getvalue()
    # Encode binary data to Base64
    base64_encoded = base64.b64encode(image_bytes)
    # Convert Base64 bytes to string (optional)
    return base64_encoded.decode("utf-8")


class OpenAIVisionCaller:
    MODEL_NAME = "gpt-4o-mini"

    def __init__(self, api_key, system_prompt, temperature=0, max_tokens=1000):
        self.temperature = temperature
        self.max_tokens = max_tokens
        self.system_prompt = system_prompt
        self.llm = ChatOpenAI(
            model=self.MODEL_NAME,
            api_key=api_key,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        self.chain = self._set_up_chain()

    def _set_up_chain(self):
        prompt = ChatPromptTemplate.from_messages(
            [
                ("system", self.system_prompt.system_template),
                (
                    "user",
                    [
                        {"type": "text", "text": "{query}"},
                        {
                            "type": "image_url",
                            "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
                        },
                    ],
                ),
            ]
        )

        chain = prompt | self.llm | StrOutputParser()
        return chain

    def call(self, query, image):
        base64image = convert_PIL_to_base64(image)
        return self.chain.invoke({"query": query, "image_data": base64image})

Florence2 is treated as a specialist model in the context of our agent because, while it has many capabilities, its inputs must be chosen from a list of predefined task prompts. Of course the model could be fine-tuned to accept new prompts, but for our purposes the version downloaded directly from Hugging Face works well. The beauty of this model is that it uses a single training process and set of weights, yet achieves high performance on multiple image tasks that previously would have demanded their own models. The key to this success lies in its large and carefully curated training dataset, FLD-5B. To learn more about the dataset, model and training I recommend this excellent article.

In our context, we use the orchestration model to turn the query into a series of Florence task prompts, which we then call in sequence. The options available to us include captioning, object detection, phrase grounding, OCR and segmentation. For some of these options (i.e. phrase grounding and region to segmentation) an input phrase is required, so the orchestration model generates that too. In contrast, tasks like captioning need only the image. There are many use cases for Florence2, which are explored in code here. We restrict ourselves to object detection, phrase grounding, captioning and OCR, though it would be straightforward to add more by updating the prompts associated with plan generation and structuring.
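For example, a query like "Describe this image and find any dogs in it" might be structured into a plan along the following lines (an illustrative sketch in the same format as the plan structures shown later in this article):

{
  "1": {"tool_name": "special_vision", "tool_mode": "image captioning", "tool_input": ""},
  "2": {"tool_name": "special_vision", "tool_mode": "specific object detection", "tool_input": "dog"}
}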

Florence2 appears to be supported by the MLX-VLM package, but at the time of writing I couldn't find any examples of its use and so opted for an approach that uses Hugging Face transformers, as shown below.

from typing import Any, Optional
from unittest.mock import patch

import torch
from transformers import AutoModelForCausalLM, AutoProcessor


def get_device_type():
    if torch.cuda.is_available():
        return "cuda"
    elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
        return "mps"
    else:
        return "cpu"


class FlorenceCaller:

    MODEL_PATH: str = "microsoft/Florence-2-base-ft"
    # See https://huggingface.co/microsoft/Florence-2-base-ft for other modes
    # for Florence2
    TASK_DICT: dict[str, str] = {
        "general object detection": "<OD>",
        "specific object detection": "<CAPTION_TO_PHRASE_GROUNDING>",
        "image captioning": "<MORE_DETAILED_CAPTION>",
        "OCR": "<OCR_WITH_REGION>",
    }

    def __init__(self) -> None:
        self.device: str = (
            get_device_type()
        )  # Function to determine the device type (e.g., 'cpu', 'mps' or 'cuda')

        # fixed_get_imports is a small helper defined elsewhere in the project code
        # that patches out the flash_attn import so Florence2 can load on devices
        # where it isn't available
        with patch("transformers.dynamic_module_utils.get_imports", fixed_get_imports):
            self.model: AutoModelForCausalLM = AutoModelForCausalLM.from_pretrained(
                self.MODEL_PATH, trust_remote_code=True
            )
            self.processor: AutoProcessor = AutoProcessor.from_pretrained(
                self.MODEL_PATH, trust_remote_code=True
            )

        self.model.to(self.device)

    def translate_task(self, task_name: str) -> str:
        return self.TASK_DICT.get(task_name, "<DETAILED_CAPTION>")

    def call(
        self, task_prompt: str, image: Any, text_input: Optional[str] = None
    ) -> Any:
        # Get the corresponding task code for the given prompt
        task_code: str = self.translate_task(task_prompt)

        # Prevent text_input for tasks that don't require it
        if task_code in [
            "<OD>",
            "<MORE_DETAILED_CAPTION>",
            "<OCR_WITH_REGION>",
            "<DETAILED_CAPTION>",
        ]:
            text_input = None

        # Construct the prompt based on whether text_input is provided
        prompt: str = task_code if text_input is None else task_code + text_input

        # Preprocess inputs for the model
        inputs = self.processor(text=prompt, images=image, return_tensors="pt").to(
            self.device
        )

        # Generate predictions using the model
        generated_ids = self.model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            early_stopping=False,
            do_sample=False,
            num_beams=3,
        )

        # Decode and process the generated output
        generated_text: str = self.processor.batch_decode(
            generated_ids, skip_special_tokens=False
        )[0]

        parsed_answer: dict[str, Any] = self.processor.post_process_generation(
            generated_text, task=task_code, image_size=(image.width, image.height)
        )

        return parsed_answer[task_code]

On Apple Silicon, the device becomes mps and the latency of these model calls is tolerable. This code should also work on GPU and CPU, though that has not been tested.
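A short usage sketch (the image path is a placeholder):

from PIL import Image

florence = FlorenceCaller()
image = Image.open("test_image.jpg")  # hypothetical path to a local test image

# Phrase grounding: get bounding boxes for a specific phrase
boxes = florence.call("specific object detection", image, text_input="dog")

# Captioning needs only the image
caption = florence.call("image captioning", image)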

Let's run through another example to see the agent outputs from each step. To call the agent on an input query and image we can use the Agent.invoke method, which follows the same process as described in my previous article to add each node output to a list of results, in addition to saving outputs in a LangGraph InMemoryStore object.

We'll be using the following image, which presents an interesting challenge if we ask a tricky question like "Are there trees in this image? If so, find them and describe what they're doing"

Testing image for this section. Photo by Hannah Lim on Unsplash

from image_agent.agent.Agent import Agent
from image_agent.utils import load_secrets

secrets = load_secrets()

# use GPT4 for general vision mode
full_agent_gpt_vision = Agent(
    openai_api_key=secrets["OPENAI_API_KEY"], vision_mode="gpt"
)

# use local model for general vision
full_agent_qwen_vision = Agent(
    openai_api_key=secrets["OPENAI_API_KEY"], vision_mode="local"
)
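The agents can then be called on the query and image; a sketch of what that looks like (the exact invoke signature is an assumption based on the description above):

from PIL import Image

image = Image.open("dogs_image.jpg")  # hypothetical path to the test image above
query = "Are there trees in this image? If so, find them and describe what they're doing"

# Run the full planning / vision / assessment loop on this query and image
result = full_agent_gpt_vision.invoke(query, image)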

In an ideal world the answer is simple: there are no trees.

However, this turns out to be a difficult question for the agent, and it's interesting to compare the responses when it uses GPT-4o-mini vs. Qwen2 as the general vision model.

When we call full_agent_qwen_vision with this query, we get a bad result: both Qwen2 and Florence2 fall for the trick and report that trees are present (interestingly, if we change "trees" to "dogs", we get the correct answer).

Plan:
Call generalist vision with the question 'Are there trees in this image? If so, what are they doing?'. Then call specialist vision in object specific mode with the phrase 'cat'.

Plan_structure:
{
"1": {"tool_name": "general_vision", "tool_mode": "conversation", "tool_input": "Are there trees in this image? If so, what are they doing?"},
"2": {"tool_name": "special_vision", "tool_mode": "specific object detection", "tool_input": "tree"}
}

Plan output:
[
{1: 'Yes, there are trees in the image. They appear to be part of a tree line against a pink background.'},
{2: '{"bboxes": [[235.77601623535156, 427.864501953125, 321.7920227050781, 617.2275390625]], "labels": ["tree"]}'}
]

Assessment:
The result adequately answers the user's question by confirming the presence of trees in the image and providing a description of their appearance and context. The output from both the generalist and specialist vision tools is consistent and informative.

Qwen2 appears prone to blindly following the prompt's hint that there might be trees present. Florence2 also fails here, reporting a bounding box when it should not.

If asked "Are there trees in this image? If so, find them and describe what they're doing", both Qwen2 and Florence2 fall for the trick. Image generated by the author.
If asked "Are there dogs in this image? If so, find them and describe what they're doing", both the Qwen and GPT-based agents will produce the correct answer. Image generated by the author.

If we call full_agent_gpt_vision with the same query, GPT-4o-mini doesn't fall for the trick, but the call to Florence2 hasn't changed, so it still fails. We then see the answer assessment step in action because the generalist and specialist models have produced conflicting results.

Node : general_vision
Task : plan_output
[
{1: 'There are no trees in this image. It features a group of dogs sitting in front of a pink wall.'}
]

Node : special_vision
Task : plan_output
[
{2: '{"bboxes": [[235.77601623535156, 427.864501953125, 321.7920227050781, 617.2275390625]], "labels": ["tree"]}'}
]

Node : evaluation
Task : answer_assessment
The result contains conflicting information.
The first part states that there are no trees in the image, while the second part provides a bounding box and label indicating that a tree is present.
This inconsistency means the user's question is not adequately answered.

The agent then tries several times to restructure the plan, but Florence2 insists on producing a bounding box for "tree", which the answer assessment node always catches as inconsistent. This is a better result than the Qwen2 agent, but it points to a broader issue of false positives with Florence2. This could be addressed by having the routing node evaluate the plan after every step and then only call Florence2 if absolutely necessary.

With the basic building blocks in place, this system is ripe for experimentation, iteration and improvement, and I'll continue to add to the repo over the coming weeks. For now though, this article is long enough!

Thanks for making it to the end, and I hope the project here sparks some inspiration for your own work! The orchestration of multiple specialist models within agent frameworks is a powerful and increasingly accessible approach to putting LLMs to work on complex tasks. Clearly there's still a lot of room for improvement, and I for one look forward to seeing how ideas in this field develop over the coming year.