Have you ever found yourself staring at a product’s ingredient list, googling unfamiliar chemical names to figure out what they mean? It’s a common struggle – deciphering complex product information on the spot can be overwhelming and time-consuming. Traditional methods, like searching for each ingredient individually, often lead to fragmented and confusing results. But what if there were a smarter, faster way to analyze product ingredients and get clear, actionable insights instantly? In this article, we’ll walk you through building a Product Ingredient Analyzer using Gemini 2.0, Phidata, and Tavily Web Search. Let’s dive in and make sense of those ingredient lists once and for all!
Learning Objectives
- Design a multimodal AI agent architecture using Phidata and Gemini 2.0 for vision-language tasks.
- Integrate Tavily Web Search into agent workflows for better context and information retrieval.
- Build a Product Ingredient Analyzer Agent that combines image processing and web search for detailed product insights.
- Learn how system prompts and instructions guide agent behavior in multimodal tasks.
- Develop a Streamlit UI for real-time image analysis, nutrition details, and health-based suggestions.
This article was published as a part of the Data Science Blogathon.
What are Multimodal Systems?
Multimodal systems process and understand multiple types of input data—such as text, images, audio, and video—simultaneously. Vision-language models, such as Gemini 2.0 Flash, GPT-4o, Claude Sonnet 3.5, and Pixtral-12B, excel at understanding relationships between these modalities, extracting meaningful insights from complex inputs.
In this context, we focus on vision-language models that analyze images and generate textual insights. These systems combine computer vision and natural language processing to interpret visual information based on user prompts.
Multimodal Real-world Use Cases
Multimodal systems are transforming industries:
- Finance: Users can take screenshots of unfamiliar terms in online forms and get instant explanations.
- E-commerce: Shoppers can photograph product labels to receive detailed ingredient analysis and health insights.
- Education: Students can capture textbook diagrams and receive simplified explanations.
- Healthcare: Patients can scan medical reports or prescription labels for simplified explanations of terms and dosage instructions.
Why Multimodal Agents?
The shift from single-mode AI to multimodal agents marks a major leap in how we interact with AI systems. Here’s what makes multimodal agents so effective:
- They process both visual and textual information simultaneously, delivering more accurate and context-aware responses.
- They simplify complex information, making it accessible to users who may struggle with technical terms or detailed content.
- Instead of manually searching for individual ingredients, users can upload an image and receive a comprehensive analysis in a single step.
- By combining tools like web search and image analysis, they provide more complete and reliable insights.
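The single-step flow described above can be sketched as a plain-Python pipeline. This is purely illustrative: `extract_ingredients` and `web_search` are hypothetical stand-ins for the vision model and search tool we wire up with real APIs later in the article.

```python
def extract_ingredients(image_path):
    # stand-in for a vision-language model reading the product label
    return ["sugar", "cocoa solids", "emulsifier (soy lecithin)"]

def web_search(query):
    # stand-in for a web search tool such as Tavily
    return f"context for: {query}"

def analyze_product(image_path):
    # one step: read the label, then enrich each ingredient with search context
    ingredients = extract_ingredients(image_path)
    return {ing: web_search(ing) for ing in ingredients}

report = analyze_product("label.jpg")
print(list(report))  # ['sugar', 'cocoa solids', 'emulsifier (soy lecithin)']
```

The user supplies one image and gets back one enriched report, instead of running a separate search per ingredient.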
Building the Product Ingredient Analyzer Agent
Let’s break down the implementation of the Product Ingredient Analyzer Agent:
Step 1: Setup Dependencies
- Gemini 2.0 Flash: Handles multimodal processing with enhanced vision capabilities
- Tavily Search: Provides web search integration for additional context
- Phidata: Orchestrates the agent system and manages workflows
- Streamlit: Turns the prototype into a web-based application
!pip install phidata google-generativeai tavily-python streamlit pillow
Step 2: API Setup and Configuration
In this step, we will set up the environment variables and gather the required API credentials to run this use case.
from phi.agent import Agent
from phi.model.google import Gemini  # requires an API key
from phi.tools.tavily import TavilyTools  # also requires an API key
import os

TAVILY_API_KEY = "<replace-your-api-key>"
GOOGLE_API_KEY = "<replace-your-api-key>"
os.environ['TAVILY_API_KEY'] = TAVILY_API_KEY
os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY
Step 3: System Prompt and Instructions
To get better responses from language models, you need to write better prompts. This involves clearly defining the role and providing detailed instructions in the system prompt for the LLM.
Let’s define the role and responsibilities of an agent with expertise in ingredient analysis and nutrition. The instructions should guide the agent to systematically analyze food products, assess ingredients, consider dietary restrictions, and evaluate health implications.
SYSTEM_PROMPT = """
You might be an professional Meals Product Analyst specialised in ingredient evaluation and vitamin science.
Your function is to research product components, present well being insights, and determine potential considerations by combining ingredient evaluation with scientific analysis.
You make the most of your dietary information and analysis works to offer evidence-based insights, making advanced ingredient info accessible and actionable for customers.
Return your response in Markdown format.
"""
INSTRUCTIONS = """
* Learn ingredient record from product picture
* Keep in mind the consumer will not be educated in regards to the product, break it down in easy phrases like explaining to 10 12 months child
* Determine synthetic components and preservatives
* Verify towards main dietary restrictions (vegan, halal, kosher). Embrace this in response.
* Fee dietary worth on scale of 1-5
* Spotlight key well being implications or considerations
* Recommend more healthy alternate options if wanted
* Present transient evidence-based suggestions
* Use Search instrument for getting context
"""
Step 4: Define the Agent Object
The agent, built using Phidata, is configured to render markdown formatting and operate based on the system prompt and instructions defined earlier. The reasoning model used in this example is Gemini 2.0 Flash, known for its strong ability to understand images and videos compared to other models.
For tool integration, we will use Tavily Search, an advanced web search engine that provides relevant context directly in response to user queries, avoiding unnecessary descriptions, URLs, and irrelevant parameters.
agent = Agent(
    model=Gemini(id="gemini-2.0-flash-exp"),
    tools=[TavilyTools()],
    markdown=True,
    system_prompt=SYSTEM_PROMPT,
    instructions=INSTRUCTIONS,
)
Step 5: Multimodal – Understanding the Image
With the agent components now in place, the next step is to provide user input. This can be done in two ways: either by passing the image path or the URL, along with a user prompt specifying what information needs to be extracted from the provided image.
Approach 1: Using an Image Path
agent.print_response(
    "Analyze the product image",
    images=["images/bournvita.jpg"],
    stream=True
)
Output:
Approach 2: Using a URL
agent.print_response(
    "Analyze the product image",
    images=["https://beardo.in/cdn/shop/products/9_2ba7ece4-0372-4a34-8040-5dc40c89f103.jpg?v=1703589764&width=1946"],
    stream=True
)
Output:
Step 6: Develop the Web App using Streamlit
Now that we know how to run the multimodal agent, let’s build the UI using Streamlit.
import streamlit as st
from PIL import Image
from io import BytesIO
from tempfile import NamedTemporaryFile

st.title("🔍 Product Ingredient Analyzer")
To optimize performance, define the agent inference inside a cached function. The cache decorator improves efficiency by reusing the agent instance.
Since Streamlit reruns the entire script after every event loop or widget trigger, adding st.cache_resource ensures the function is not re-executed and keeps the agent instance in the cache.
@st.cache_resource
def get_agent():
    return Agent(
        model=Gemini(id="gemini-2.0-flash-exp"),
        system_prompt=SYSTEM_PROMPT,
        instructions=INSTRUCTIONS,
        tools=[TavilyTools(api_key=os.getenv("TAVILY_API_KEY"))],
        markdown=True,
    )
When the user provides a new image path, the analyze_image function runs and executes the agent object defined in get_agent. For real-time capture and the option to upload images, the uploaded file needs to be saved temporarily for processing.
The image is saved in a temporary file, and once execution is complete, the temporary file is deleted to free up resources. This can be done using the NamedTemporaryFile function from the tempfile library.
def analyze_image(image_path):
    agent = get_agent()
    with st.spinner('Analyzing image...'):
        response = agent.run(
            "Analyze the given image",
            images=[image_path],
        )
    st.markdown(response.content)

def save_uploaded_file(uploaded_file):
    with NamedTemporaryFile(dir=".", suffix='.jpg', delete=False) as f:
        f.write(uploaded_file.getbuffer())
        return f.name
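The save-then-clean-up pattern used here can be exercised on its own. This is a minimal sketch under stated assumptions: `save_bytes_to_temp` is a hypothetical helper mirroring save_uploaded_file, and the byte string stands in for real image data.

```python
import os
from tempfile import NamedTemporaryFile

def save_bytes_to_temp(data):
    # write the payload to a temp .jpg and return its path (caller must delete it)
    with NamedTemporaryFile(dir=".", suffix=".jpg", delete=False) as f:
        f.write(data)
        return f.name

path = save_bytes_to_temp(b"\xff\xd8\xff")   # fake JPEG header as a stand-in
exists_before = os.path.exists(path)         # True: file persists past the with-block
os.unlink(path)                              # cleanup after analysis completes
exists_after = os.path.exists(path)          # False: resources freed
print(exists_before, exists_after)
```

Because delete=False is set, the file survives the with-block so the agent can read it by path; the explicit os.unlink afterwards mirrors the cleanup done in the main flow.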
For a better user interface, note that images selected by users will have varying resolutions and sizes. To maintain a consistent layout and display the image properly, we can resize the uploaded or captured image so it fits clearly on the screen.
The LANCZOS resampling algorithm provides high-quality resizing, particularly useful for product images where text clarity is crucial for ingredient analysis.
MAX_IMAGE_WIDTH = 300

def resize_image_for_display(image_file):
    img = Image.open(image_file)
    aspect_ratio = img.height / img.width
    new_height = int(MAX_IMAGE_WIDTH * aspect_ratio)
    img = img.resize((MAX_IMAGE_WIDTH, new_height), Image.Resampling.LANCZOS)
    buf = BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()
Step 7: UI Features for Streamlit
The interface is divided into three navigation tabs where the user can pick their option of interest:
- Tab 1: Example products that users can select to test the app
- Tab 2: Upload an image of your choice if it’s already saved.
- Tab 3: Capture a live photo and analyze the product.
We repeat the same logical flow for all three tabs:
- First, choose the image and resize it for display on the Streamlit UI using st.image.
- Second, save that image in a temporary directory so it can be passed to the agent object.
- Third, analyze the image, where the agent execution takes place using the Gemini 2.0 LLM and the Tavily Search tool.
State management is handled through Streamlit’s session state, tracking the selected example and analysis status.
def main():
    if 'selected_example' not in st.session_state:
        st.session_state.selected_example = None
    if 'analyze_clicked' not in st.session_state:
        st.session_state.analyze_clicked = False

    tab_examples, tab_upload, tab_camera = st.tabs([
        "📚 Example Products",
        "📤 Upload Image",
        "📸 Take Photo"
    ])

    with tab_examples:
        example_images = {
            "🥤 Energy Drink": "images/bournvita.jpg",
            "🥔 Potato Chips": "images/lays.jpg",
            "🧴 Shampoo": "images/shampoo.jpg"
        }
        cols = st.columns(3)
        for idx, (name, path) in enumerate(example_images.items()):
            with cols[idx]:
                if st.button(name, use_container_width=True):
                    st.session_state.selected_example = path
                    st.session_state.analyze_clicked = False

    with tab_upload:
        uploaded_file = st.file_uploader(
            "Upload product image",
            type=["jpg", "jpeg", "png"],
            help="Upload a clear image of the product's ingredient list"
        )
        if uploaded_file:
            resized_image = resize_image_for_display(uploaded_file)
            st.image(resized_image, caption="Uploaded Image", use_container_width=False, width=MAX_IMAGE_WIDTH)
            if st.button("🔍 Analyze Uploaded Image", key="analyze_upload"):
                temp_path = save_uploaded_file(uploaded_file)
                analyze_image(temp_path)
                os.unlink(temp_path)

    with tab_camera:
        camera_photo = st.camera_input("Take a picture of the product")
        if camera_photo:
            resized_image = resize_image_for_display(camera_photo)
            st.image(resized_image, caption="Captured Image", use_container_width=False, width=MAX_IMAGE_WIDTH)
            if st.button("🔍 Analyze Captured Image", key="analyze_camera"):
                temp_path = save_uploaded_file(camera_photo)
                analyze_image(temp_path)
                os.unlink(temp_path)

    if st.session_state.selected_example:
        st.divider()
        st.subheader("Selected Product")
        resized_image = resize_image_for_display(st.session_state.selected_example)
        st.image(resized_image, caption="Selected Example", use_container_width=False, width=MAX_IMAGE_WIDTH)
        if st.button("🔍 Analyze Example", key="analyze_example") and not st.session_state.analyze_clicked:
            st.session_state.analyze_clicked = True
            analyze_image(st.session_state.selected_example)
Important Links
- You can find the complete code here.
- Replace the “<replace-with-api-key>” placeholder with your keys.
- For tab_examples, you need an images folder and should save the images there. Here is the GitHub URL with the images directory here.
- If you are interested in trying out the use case, here is the deployed app here.
Conclusion
Multimodal AI agents represent a major leap forward in how we can interact with and understand complex information in our daily lives. By combining vision processing, natural language understanding, and web search capabilities, systems like the Product Ingredient Analyzer can provide instant, comprehensive analysis of products and their ingredients, making informed decision-making more accessible to everyone.
Key Takeaways
- Multimodal AI agents improve how we understand product information by combining text and image analysis.
- With Phidata, an open-source framework, we can build and manage agent systems that use models like GPT-4o and Gemini 2.0.
- Agents use tools like vision processing and web search, making their analysis more complete and accurate. LLMs have limited knowledge, so agents rely on tools to handle complex tasks better.
- Streamlit makes it easy to build web apps for LLM-based tools, such as RAG and multimodal agents.
- Good system prompts and instructions guide the agent, ensuring useful and accurate responses.
Frequently Asked Questions
Q. What are some open-source multimodal vision-language models?
A. LLaVA (Large Language and Vision Assistant), Pixtral-12B by Mistral AI, Multimodal-GPT by OpenFlamingo, NVILA by NVIDIA, and the Qwen models are a few open-source or open-weights multimodal vision-language models that process text and images for tasks like visual question answering.
Q. Is Llama 3 multimodal?
A. The original Llama 3 models are text-only, but the Llama 3.2 Vision models (11B and 90B parameters) process both text and images, enabling tasks like image captioning and visual reasoning.
Q. What is the difference between a multimodal LLM and a multimodal agent?
A. A multimodal large language model (LLM) processes and generates data across various modalities, such as text, images, and audio. In contrast, a multimodal agent uses such models to interact with its environment, perform tasks, and make decisions based on multimodal inputs, often integrating additional tools and systems to execute complex actions.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.