Multimodal LLMs (MLLMs) promise to interpret anything in an image. That holds true in many cases, such as image captioning and object detection.
But can they reliably and accurately understand data presented on a chart?
If you really want to build an app that tells you what to do when you point your camera at a car dashboard, the LLM's chart-interpretation skills need to be exceptional.
Of course, multimodal LLMs can narrate what's on a chart, but consuming the data and answering complex user questions is hard.
I wanted to find out how difficult it is.
I set up eight challenges for LLMs to solve. Each challenge has a rudimentary chart and a question for the LLM to answer. We know the correct answer because we created the data, but the LLM has to figure it out using only the visualization given to it.
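To make the setup concrete, here is a minimal sketch of how one such challenge could be run against GPT-4o with the official OpenAI Python client. The data, chart, and question here are invented for illustration; they are not one of the actual eight challenges.

```python
import base64
import matplotlib.pyplot as plt
from openai import OpenAI  # assumes the official OpenAI Python client is installed

# Hypothetical challenge data: since we created it, we know the correct answer.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 160, 155, 170]

# Render a rudimentary chart to a PNG file.
plt.plot(months, sales, marker="o")
plt.title("Monthly sales")
plt.ylabel("Units sold")
plt.savefig("challenge_chart.png")

# Encode the image; the model only sees the chart, never the raw numbers.
with open("challenge_chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "In which month did sales peak, and how much higher were they than in January?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```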
As of writing this, and to my knowledge, there are five prominent multimodal LLM providers available: OpenAI (GPT-4o), Meta Llama 3.2 (11B & 90B models), Mistral with its brand-new Pixtral 12B, Claude 3.5 Sonnet, and Google's Gemini 1.5.