Modality | Supported Languages
---|---
Text | Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
Vision | English
Audio | English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
Architectural Advancements of Phi-4 Multimodal
1. Unified Representation Space
Phi-4's mixture-of-LoRAs architecture enables simultaneous processing of speech, vision, and text. Unlike earlier approaches that required distinct sub-models, Phi-4 handles all inputs within the same framework, significantly improving efficiency and coherence.
2. Scalability and Efficiency
- Optimized for low-latency inference, making it well-suited for mobile and edge computing applications.
- Supports larger vocabulary sets, improving language reasoning across multimodal inputs.
- Built with a smaller yet powerful parameterization (5.6B parameters), allowing efficient deployment without compromising performance (see the rough memory estimate below).
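To put the 5.6B-parameter figure in perspective, the back-of-the-envelope estimate below shows roughly how much memory the weights alone would need at common precisions. This is an illustrative calculation, not an official figure, and a real deployment also needs headroom for activations and the KV cache.

# Rough weight-memory estimate for a 5.6B-parameter model
PARAMS = 5.6e9
for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gb = PARAMS * bytes_per_param / 1024**3
    print(f"{name:>9}: ~{gb:.1f} GB for weights alone")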
3. Improved AI Reasoning
Phi-4 performs exceptionally well in tasks that require chart/table understanding and document reasoning, thanks to its ability to synthesize vision and audio inputs. Benchmarks indicate higher accuracy compared to other state-of-the-art multimodal models, particularly in structured data interpretation.
Vision Processing Pipeline
- Vision Encoder:
  - Processes image inputs and converts them into a sequence of feature representations (tokens).
  - Likely uses a pretrained vision model (e.g., CLIP, Vision Transformer).
- Token Merging:
  - Reduces the number of visual tokens to improve efficiency while preserving information.
- Vision Projector:
  - Converts visual tokens into a format compatible with the tokenizer for further processing (a minimal sketch of this pipeline follows the list).
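The minimal PyTorch sketch below illustrates this encoder, token-merging, and projector flow. It is a toy stand-in rather than Phi-4's actual implementation: the class name, the merge factor, and the dimensions (768 visual features projected into a 3072-dimensional language-model space) are assumptions chosen only to make the tensor shapes concrete.

import torch
import torch.nn as nn

class ToyVisionPath(nn.Module):
    def __init__(self, patch=16, vis_dim=768, lm_dim=3072, merge=2):
        super().__init__()
        # "Encoder": a single patch-embedding layer standing in for a ViT/CLIP backbone
        self.patch_embed = nn.Conv2d(3, vis_dim, kernel_size=patch, stride=patch)
        # "Token merging": average-pool neighbouring tokens to shrink the sequence
        self.merge = nn.AvgPool1d(kernel_size=merge, stride=merge)
        # "Projector": map visual features into the language model's embedding space
        self.proj = nn.Linear(vis_dim, lm_dim)

    def forward(self, images):            # images: (B, 3, H, W)
        x = self.patch_embed(images)      # (B, vis_dim, H/patch, W/patch)
        x = x.flatten(2)                  # (B, vis_dim, num_tokens)
        x = self.merge(x)                 # (B, vis_dim, num_tokens / merge)
        x = x.transpose(1, 2)             # (B, num_tokens / merge, vis_dim)
        return self.proj(x)               # (B, num_tokens / merge, lm_dim)

tokens = ToyVisionPath()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                       # torch.Size([1, 98, 3072])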
Audio Processing Pipeline
- Audio Encoder:
  - Processes raw audio and converts it into a sequence of feature tokens.
  - Likely based on a speech-to-text or waveform model (e.g., Wav2Vec2, Whisper).
- Audio Projector:
  - Maps audio embeddings into a token space compatible with the language model (see the sketch after this list).
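As with the vision path, the sketch below is only a toy illustration of this flow: a strided-convolution "encoder" stands in for a Whisper/Wav2Vec2-style backbone, and the dimensions are assumptions picked to make the shapes concrete.

import torch
import torch.nn as nn

class ToyAudioPath(nn.Module):
    def __init__(self, n_mels=80, audio_dim=512, lm_dim=3072):
        super().__init__()
        # "Encoder": strided convolutions that downsample the audio frames
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, audio_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(audio_dim, audio_dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        # "Projector": align audio features with the language model's token space
        self.proj = nn.Linear(audio_dim, lm_dim)

    def forward(self, mel):               # mel: (B, n_mels, T) log-mel frames
        x = self.encoder(mel)             # (B, audio_dim, T/4)
        x = x.transpose(1, 2)             # (B, T/4, audio_dim)
        return self.proj(x)               # (B, T/4, lm_dim)

tokens = ToyAudioPath()(torch.randn(1, 80, 3000))
print(tokens.shape)                       # torch.Size([1, 750, 3072])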
Tokenization and Fusion
- The tokenizer integrates information from vision, audio, and text by inserting image and audio placeholders into the token sequence (illustrated in the toy example below).
- This unified representation is then sent to the language model.
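The toy example below illustrates the idea: placeholder ids in the prompt mark where the projected vision and audio embeddings are spliced into the text embedding sequence before it enters the language model. The token ids and dimensions are invented for the example, and in practice an image or audio clip expands to many tokens rather than one.

import torch

LM_DIM, IMAGE_TOKEN, AUDIO_TOKEN = 3072, -1, -2

text_ids = torch.tensor([5, 17, IMAGE_TOKEN, 42, AUDIO_TOKEN, 9])  # prompt with placeholders
text_emb = torch.randn(len(text_ids), LM_DIM)                      # text token embeddings
image_emb = torch.randn(1, LM_DIM)                                 # projected vision token(s)
audio_emb = torch.randn(1, LM_DIM)                                 # projected audio token(s)

fused = text_emb.clone()
fused[text_ids == IMAGE_TOKEN] = image_emb   # splice vision embeddings into their slot
fused[text_ids == AUDIO_TOKEN] = audio_emb   # splice audio embeddings into their slot
print(fused.shape)                           # torch.Size([6, 3072]), fed to the language model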
The Phi-4 Mini Model
The core Phi-4 Mini model is responsible for reasoning, generating responses, and fusing multimodal information.
- Stacked Transformer Layers:
  - It follows a transformer-based architecture for processing multimodal input.
- LoRA Adaptation (Low-Rank Adaptation):
  - The model is fine-tuned using LoRA for both vision (LoRAᵥ) and audio (LoRAₐ).
  - LoRA efficiently adapts pretrained weights without significantly increasing model size (a minimal example follows this list).
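The snippet below is a minimal, self-contained LoRA layer, not Phi-4's training code, that shows the core idea: the pretrained weight stays frozen while a small low-rank update (scaled B·A) is trained, so separate adapters such as LoRAᵥ for vision and LoRAₐ for audio can be kept and switched in per modality. The rank, scaling factor, and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=16, alpha=32):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)   # pretrained weight stays frozen
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)       # adapter starts as a no-op update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

layer = LoRALinear(3072, 3072)
frozen = layer.base.weight.numel()
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"frozen: {frozen:,}  trainable (LoRA): {trainable:,}")   # ~9.4M frozen vs ~98K trainable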
How Does the Phi-4 Architecture Work?
- Image and audio inputs are processed separately by their respective encoders.
- The encoded representations pass through projection layers to align with the language model's token space.
- The tokenizer fuses the information, preparing it for processing by the Phi-4 Mini model.
- The Phi-4 Mini model, enhanced with LoRA, generates text-based outputs based on the multimodal context.
Comparison of Phi-4 Multimodal on Different Benchmarks
Phi-4 Multimodal Audio and Visual Benchmarks

These benchmarks likely assess the models' capabilities on AI2D, ChartQA, DocVQA, and InfoVQA, which are standard datasets for evaluating multimodal models, particularly in visual question answering (VQA) and document understanding.
- s_AI2D (AI2D Benchmark)
  - Evaluates reasoning over diagrams and images.
  - Phi-4-multimodal-instruct (68.9) performs better than InternOmni-7B (53.9) and Gemini-2.0-Flash-Lite (62).
  - Gemini-2.0-Flash (69.4) slightly outperforms Phi-4, while Gemini-1.5-Pro (67.7) scores slightly lower.
- s_ChartQA (Chart Question Answering)
  - Focuses on interpreting charts and graphs.
  - Phi-4-multimodal-instruct (69) outperforms all other models.
  - The next closest competitor is InternOmni-7B (56.1), while Gemini-2.0-Flash (51.3) and Gemini-1.5-Pro (46.9) perform considerably worse.
- s_DocVQA (Document VQA – Reading Documents and Extracting Information)
  - Evaluates how well a model understands and answers questions about documents.
  - Phi-4-multimodal-instruct (87.3) leads the pack.
  - Gemini-2.0-Flash (80.3) and Gemini-1.5-Pro (78.2) perform well but remain behind Phi-4.
- s_InfoVQA (Information-based Visual Question Answering)
  - Tests the model's ability to extract and reason over information presented in images.
  - Phi-4-multimodal-instruct (63.7) is again among the top performers.
  - Gemini-1.5-Pro (66.1) is slightly ahead, but the other Gemini models underperform.
Phi-4 Multimodal Speech Benchmarks
- Phi-4-multimodal-instruct excels in speech recognition, beating all competitors on FLEURS, OpenASR, and CommonVoice.
- Phi-4 struggles in speech translation, performing worse than WhisperV3, Qwen2-Audio, and the Gemini models.
- Speech QA is a weak spot, with Gemini-2.0-Flash and GPT-4o-RT far ahead.
- Phi-4 is competitive in audio understanding, but Gemini-2.0-Flash slightly outperforms it.
- Speech summarization is average, with GPT-4o-RT performing slightly better.
Phi-4 Multimodal Vision Benchmarks
- Phi-4 is a top performer in OCR, document intelligence, and science reasoning.
- It is solid in multimodal tasks but lags behind in video perception and some math-related benchmarks.
- It competes well with models like Gemini-2.0-Flash and GPT-4o but has room for improvement in multi-image and object-presence tasks.

Phi-4 Multimodal Vision Quality Radar Chart
Key Takeaways from the Radar Chart
1. Phi-4-Multimodal-Instruct's Strengths
- Excels in Visual Science Reasoning: Phi-4 achieves one of the highest scores in this category, outperforming most competitors.
- Strong on the Popular Aggregated Benchmark: It ranks among the top models, suggesting strong overall performance across multimodal tasks.
- Competitive in Object Visual Presence Verification: It performs similarly to high-ranking models in verifying object presence in images.
- Decent in Chart and Table Reasoning: While not the best, Phi-4 maintains a competitive edge in this domain.
2. Phi-4's Weaknesses
- Underperforms in Visual Math Reasoning: It is not a leader in this area, with Gemini-2.0-Flash and GPT-4o outperforming it.
- Lags in Multi-Image Perception: Phi-4 is weaker at handling multi-image or video-based perception than models like GPT-4o and Gemini-2.0-Flash.
- Average in Document Intelligence: While it performs well, it is not the best in this category compared to some competitors.
Hands-On Experience: Implementing Phi-4 Multimodal
Microsoft provides open-source resources that allow developers to explore Phi-4-multimodal's capabilities. Below, we walk through practical applications using Phi-4 Multimodal.
Required packages
!pip install flash_attn==2.7.4.post1 torch==2.6.0 transformers==4.48.2 accelerate==1.3.0 soundfile==0.13.1 pillow==11.1.0 scipy==1.15.2 torchvision==0.21.0 backoff==2.2.1 peft==0.13.2
Required imports
import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen
Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"
# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation='flash_attention_2',
).cuda()
Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)
Define Prompt Structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'
Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = "https://www.ilankelman.org/stopsigns/australia.jpg"
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
Download and open the image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt").to('cuda:0')
Generate Response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
Input Image

Output
The image shows a street scene with a red stop sign in the foreground. The
stop sign is mounted on a pole with a decorative top. Behind the stop sign,
there is a traditional Chinese building with red and green colors and
Chinese characters on the signboard. The building has a tiled roof and is
adorned with red lanterns hanging from the eaves. There are several people
walking on the sidewalk in front of the building. A black SUV is parked on
the street, and there are two trash cans on the sidewalk. The street is
lined with various shops and signs, including one for 'Optus' and another
for 'Kuo'. The overall scene appears to be in an urban area with a mix of
modern and traditional elements.
Similarly, you can also try audio processing:
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')
# Download and open audio file
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))
# Process with the model
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors="pt").to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
Use Cases:
- AI-powered news reporting through real-time speech transcription.
- Voice-controlled virtual assistants with intelligent interaction.
- Real-time multilingual audio translation for global communication.
Future of Multimodal AI and Edge Applications
One of the standout aspects of Phi-4-multimodal is its ability to run on edge devices, making it an ideal solution for IoT applications and environments with limited computing resources.
Potential Edge Deployments:
- Smart Home Assistants: Integrate into IoT devices for advanced home automation.
- Healthcare Applications: Improve diagnostics and patient monitoring through multimodal analysis.
- Industrial Automation: Enable AI-driven monitoring and anomaly detection in manufacturing.
Conclusion
Microsoft's Phi-4 Multimodal is a breakthrough in AI, seamlessly integrating text, vision, and speech processing in a compact, high-performance model. Ideal for AI assistants, document processing, and multilingual applications, it unlocks new possibilities for smart, intuitive AI solutions.
For developers and researchers, hands-on access to Phi-4 enables cutting-edge innovation, from code generation to real-time voice translation and IoT applications, pushing the boundaries of multimodal AI.