Hallucination Attenuated Language and Vision Assistant

We use LLaVA-v1.5, a widely used open-source MLLM, as our base model and train it with our contrastive tuning framework (HALVA). We then evaluate its performance on object hallucination mitigation and general visual question answering (VQA) tasks against fine-tuning-based approaches, HA-DPO and EOS. We consider LLaVA-v1.5 as the lower bound and GPT-4V as a strong reference point, given its performance on standard benchmarks.

We use the AMBER benchmark and the Caption Hallucination Assessment with Image Relevance (CHAIR) metric to evaluate MLLM performance on image description tasks, assessing both the hallucination rate and the level of detail of the generated image descriptions. The latter aspect is quantified as the percentage of ground-truth objects present in the image that are accurately captured in the model's output. Our goal is to mitigate hallucinations while retaining or improving the richness of image descriptions. As shown in the left plot below, HALVA captures more ground-truth objects while hallucinating less than HA-DPO. Moreover, while EOS achieves a slightly lower hallucination rate, it degrades the level of detail in the image descriptions, performing worse than HALVA.
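To make the two axes of this evaluation concrete, the sketch below computes an instance-level CHAIR-style hallucination rate alongside an object-coverage score from sets of object mentions. This is a minimal illustration, not the benchmark's official implementation: the helper names (`chair_instance`, `coverage`) are ours, and we assume object mentions have already been extracted and normalized from each generated caption.

```python
# Minimal sketch of the two quantities plotted above, under the assumption
# that object mentions are already extracted from the generated caption.

def chair_instance(generated_objects: set[str], ground_truth_objects: set[str]) -> float:
    """Fraction of mentioned objects that are not in the image (hallucination rate)."""
    if not generated_objects:
        return 0.0
    hallucinated = generated_objects - ground_truth_objects
    return len(hallucinated) / len(generated_objects)

def coverage(generated_objects: set[str], ground_truth_objects: set[str]) -> float:
    """Fraction of ground-truth objects captured in the description (level of detail)."""
    if not ground_truth_objects:
        return 0.0
    return len(generated_objects & ground_truth_objects) / len(ground_truth_objects)

# Example: the caption mentions a dog, a frisbee, and a nonexistent person.
gen = {"dog", "frisbee", "person"}
gt = {"dog", "frisbee", "grass"}
print(chair_instance(gen, gt))  # 0.333... -> one of three mentions is hallucinated
print(coverage(gen, gt))        # 0.666... -> two of three ground-truth objects captured
```

A lower hallucination rate at equal or higher coverage is the desired outcome; a method that trims captions down to only the safest objects can lower the first number while also lowering the second, which is the trade-off the left plot visualizes.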

We also use the F1-score to compare the performance of MLLMs on visual question answering tasks, using the AMBER benchmark for object hallucination and the TextVQA benchmark for general vision-language accuracy. As shown in the right plot below, both HA-DPO and EOS underperform HALVA in mitigating object hallucination and even degrade general vision-language abilities compared to the base model.
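For reference, the snippet below shows one common way to compute an F1-score over yes/no answers of the kind used in discriminative hallucination probes. Treating "yes" as the positive class is an assumption for illustration and may differ from a given benchmark's exact protocol.

```python
# Hedged sketch of F1 scoring for yes/no hallucination probes,
# treating "yes" as the positive class (an illustrative assumption).

def f1_score(predictions: list[str], labels: list[str]) -> float:
    tp = sum(p == "yes" and l == "yes" for p, l in zip(predictions, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(predictions, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

preds = ["yes", "no", "yes", "yes"]
gold = ["yes", "no", "no", "yes"]
print(round(f1_score(preds, gold), 3))  # 0.8
```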