Originally published on HuggingFace
TL;DR
We present LettuceDetect, a lightweight hallucination detector for Retrieval-Augmented Generation (RAG) pipelines. It is an encoder-based model built on ModernBERT, released under the MIT license with ready-to-use Python packages and pretrained models.
- What: LettuceDetect is a token-level detector that flags unsupported segments in LLM answers. 🥬
- How: Trained on RAGTruth (18k examples), leveraging ModernBERT for context lengths of up to 4k tokens. 🚀
- Why: It addresses (1) the context-window limits of prior encoder-only models, and (2) the high compute costs of LLM-based detectors. ⚖️
- Highlights:
- Beats prior encoder-based models (e.g., Luna) on RAGTruth. ✅
- Surpasses fine-tuned Llama-2-13B [2] at a fraction of the size, and is highly efficient at inference. ⚡️
- Fully open-source with an MIT license. 🔓
LettuceDetect keeps your RAG framework fresh by spotting the rotten parts of your LLM's outputs. 😊
Why LettuceDetect?
Large Language Models (LLMs) have made considerable advances on NLP tasks, as shown by GPT-4 [4], the Llama-3 models [5], Mistral [6], and many more. Despite these successes, hallucinations remain a key obstacle to deploying LLMs in high-stakes scenarios (such as healthcare or legal applications) [7,8].
Retrieval-Augmented Generation (RAG) attempts to mitigate hallucinations by grounding an LLM's responses in retrieved documents, providing external knowledge that the model can reference [9]. But even though RAG is a powerful way to reduce hallucinations, LLMs still suffer from hallucinations in these settings [1]. Hallucinations are information in the output that is nonsensical, factually incorrect, or inconsistent with the retrieved context [8]. Ji et al. [10] categorize hallucinations into:
- Intrinsic hallucinations: stemming from the model's preexisting internal knowledge.
- Extrinsic hallucinations: occurring when the answer conflicts with the context or references provided.
While RAG approaches can mitigate intrinsic hallucinations, they are not immune to extrinsic ones. Sun et al. [11] showed that models tend to prioritize their intrinsic knowledge over the external context. Because LLMs remain prone to hallucinations, their use in critical domains such as medicine or law remains risky.
Existing solutions for hallucination detection
Existing solutions for hallucination detection can be grouped into categories based on the approach they take:
- Prompt-based detectors: These methods (e.g., RAGAS, Trulens, ARES) typically leverage zero-shot or few-shot prompts to detect hallucinations. They often rely on large LLMs (like GPT-4) and employ techniques such as SelfCheckGPT [12], LM vs. LM [13], or Chainpoll [14]. While often effective, they can be computationally expensive due to repeated LLM calls.
- Fine-tuned LLM detectors: Large models (e.g., Llama-2, Llama-3) can be fine-tuned for hallucination detection [1,15]. This can yield high accuracy (as shown by the RAGTruth authors using Llama-2-13B, or by the RAG-HAT work on Llama-3-8B) but is resource-intensive to train and deploy. Inference costs also tend to be high due to their size and slower speeds.
- Encoder-based detectors: Models like Luna [2] rely on a BERT-style encoder (often limited to 512 tokens) for token-level classification. These methods are generally more efficient than running a full LLM at inference but are constrained by short context windows and attention mechanisms optimized for smaller inputs.
ModernBERT for long context
ModernBERT [3] is a drop-in replacement for BERT: a state-of-the-art encoder-only transformer architecture that incorporates several modern design improvements over the original BERT model. It uses Rotary Positional Embeddings (RoPE) to handle sequences of up to 8,192 tokens, an unpadding optimization to eliminate wasted computation on padding tokens, GeGLU activation layers for enhanced expressiveness, and alternating attention for more efficient attention computation.
LettuceDetect capitalizes on ModernBERT's extended context window to build a token-level classifier for hallucination detection. This approach sidesteps many limitations of older BERT-based models (e.g., short context bounds) and avoids the inference overhead of large LLM-based detectors. Our experiments show that LettuceDetect outperforms other encoder-based systems while remaining competitive with fine-tuned LLM detectors at a fraction of their computational cost.
Data
RAGTruth is the first large-scale open-source (MIT License) benchmark specifically designed to evaluate hallucination detection in Retrieval-Augmented Generation (RAG) settings. It contains 18,000 annotated examples spanning several tasks:
- Question answering (QA): sampled from the MS MARCO dataset, where up to three documents are retrieved for each question and an LLM is prompted to produce an answer.
- Data-to-text generation: based on the Yelp Open Dataset, where LLMs generate reviews for sampled businesses.
- News summarization: uses random documents from CNN/DailyMail, prompting LLMs to provide summaries.
Each example in RAGTruth includes:
- A question and prompt,
- One or more reference passages or context documents,
- An LLM-generated response,
- Human annotations at the span level indicating whether any part of the answer is factually unsupported or contradicts the reference passages.
Below is an illustrative sample from RAGTruth, showing an instance of hallucination:
QUESTION
How to explain quartiles?
REFERENCE
Passage 2:
Second quartile (Q2), which is more commonly known as the median, splits the data in half (50%).
The median divides the data into a lower half and an upper half.
Third quartile (Q3), also known as the upper quartile, splits the lowest 75% (or highest 25%) of data.
(…)
RESPONSE
The first quartile (Q1) splits the lowest 25% of the data, while the second quartile (Q2) splits the data into two equal halves, with the median being the middle value of the lower half. Finally, the third quartile (Q3) splits the highest 75% of the data.
(…)
ANNOTATION
- Span: “highest 75%”
- Type: Evident Conflict
- Reason:
- Original text: "lowest 75% (or highest 25%)"
- Generated text: "highest 75%"
This example showcases a RAGTruth annotation: specific spans in the answer conflict with the reference passages, and the annotation provides both the type of hallucination (e.g., Evident Conflict) and a human-readable rationale. Our method uses only the binary annotations and does not use the hallucination-type annotations.
Method

A high-level depiction of LettuceDetect. Here, an example Question, Context, and Answer triplet is processed. First, the text is tokenized, and then LettuceDetect performs token-level classification. Tokens from both the question and context are masked (indicated by the red line in the figure) to exclude them from the loss function. Each token in the answer receives a probability indicating whether it is hallucinated or supported. For span-level detection, we merge consecutive tokens with hallucination probabilities above 0.5 into a single predicted span.
We train ModernBERT-base and ModernBERT-large variants as token-classification models on the RAGTruth dataset. The input to the model is a concatenation of the Context, Question, and Answer segments, with special tokens: [CLS] (for the context) and [SEP] (as separators). We limit the sequence length to 4,096 tokens for computational feasibility, though ModernBERT can theoretically handle up to 8,192 tokens.
Tokenization and data processing
- Tokenizer: We use AutoTokenizer from the Transformers library to handle subword tokenization, inserting [CLS] and [SEP] appropriately.
- Labeling:
- Context/question tokens are masked (i.e., assigned a label of -100 in PyTorch) so that they do not contribute to the loss.
- Each answer token receives a label of 0 (supported) or 1 (hallucinated). A minimal labeling sketch follows this list.
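Below is a minimal sketch of how such labeling could be implemented with a fast Hugging Face tokenizer. The helper name, the offset-based span alignment, and the checkpoint are our illustrative assumptions, not the exact training pipeline.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

def build_labels(context, question, answer, hallucinated_spans):
    # hallucinated_spans: list of (start, end) character offsets within the answer.
    # Encode context+question as the first segment and the answer as the second,
    # letting the tokenizer insert [CLS] and [SEP].
    enc = tokenizer(
        context + "\n" + question,
        answer,
        truncation=True,
        max_length=4096,
        return_offsets_mapping=True,
    )
    labels = []
    for (start, end), seq_id in zip(enc["offset_mapping"], enc.sequence_ids()):
        if seq_id != 1:
            labels.append(-100)  # context/question/special tokens: excluded from the loss
        elif any(s < end and start < e for s, e in hallucinated_spans):
            labels.append(1)     # answer token overlapping an annotated hallucinated span
        else:
            labels.append(0)     # supported answer token
    enc["labels"] = labels
    return enc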
Model architecture
Our models build on Hugging Face's AutoModelForTokenClassification, using ModernBERT as the encoder and a classification head on top. Unlike some earlier encoder-based approaches (e.g., ones pre-trained on NLI tasks), our method uses only ModernBERT, with no additional pretraining stage.
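For illustration, such a token classifier can be instantiated as follows (a minimal sketch; the checkpoint name and label names are our assumptions):

from transformers import AutoModelForTokenClassification

# Token classifier with two labels (0 = supported, 1 = hallucinated) on top of ModernBERT.
model = AutoModelForTokenClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    id2label={0: "supported", 1: "hallucinated"},
    label2id={"supported": 0, "hallucinated": 1},
)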
Training configuration
- Optimizer: AdamW, with a learning rate of 1 * 10^-5 and weight decay of 0.01.
- Hardware: a single NVIDIA A100 GPU.
- Epochs: 6 training epochs in total.
- Batching:
- Batch size of 8,
- Data loading with PyTorch DataLoader (shuffling enabled),
- Dynamic padding via DataCollatorForTokenClassification to handle variable-length sequences efficiently. A minimal training-loop sketch follows this list.
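The sketch below wires these pieces together under stated assumptions: train_dataset is a hypothetical dataset of tokenized examples with a "labels" field (e.g., produced by build_labels above, with offset mappings removed), and device placement is omitted for brevity.

import torch
from torch.utils.data import DataLoader
from transformers import DataCollatorForTokenClassification

collator = DataCollatorForTokenClassification(tokenizer)  # dynamic padding per batch
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True, collate_fn=collator)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

model.train()
for epoch in range(6):
    for batch in train_loader:
        outputs = model(**batch)   # token-level cross-entropy; -100 labels are ignored
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()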
During training, we monitor token-level F1 scores on a validation split, saving checkpoints in the safetensors format. Once training is complete, we upload the best-performing models to Hugging Face for public access.
At inference time, the model outputs a hallucination probability for each token in the answer. We aggregate consecutive tokens exceeding a 0.5 threshold to produce span-level predictions, indicating exactly which segments of the answer are likely to be hallucinated. The figure above illustrates this workflow.
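As a rough illustration, the span merging could look like the following sketch, assuming per-token hallucination probabilities and character offsets for the answer tokens are already available (the function name and output fields are ours):

def merge_spans(token_probs, offsets, threshold=0.5):
    # Merge consecutive answer tokens whose hallucination probability exceeds
    # the threshold into character-level spans.
    spans, current = [], None
    for prob, (start, end) in zip(token_probs, offsets):
        if prob > threshold:
            if current is None:
                current = {"start": start, "end": end, "confidence": prob}
            else:
                current["end"] = end
                current["confidence"] = max(current["confidence"], prob)
        elif current is not None:
            spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return spans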
Next, we provide a more detailed evaluation of the model's performance.
Results
We evaluate our models on the RAGTruth test set across all task types (Question Answering, Data-to-Text, and Summarization). For each example, RAGTruth includes manually annotated spans indicating hallucinated content.
Example-level results
We first assess the example-level question: does the generated answer contain any hallucination at all? Our large model (lettucedetect-large-v1) attains an overall F1 score of 79.22%, surpassing:
- GPT-4 (63.4%),
- Luna (65.4%), the previous state-of-the-art encoder-based model,
- Fine-tuned Llama-2-13B (78.7%), as presented in the RAGTruth paper.
It is second only to the fine-tuned Llama-3-8B from the RAG-HAT paper [15] (83.9%), but LettuceDetect is significantly smaller and faster to run. Meanwhile, our base model (lettucedetect-base-v1) remains highly competitive while using fewer parameters.

Above is a comparison table illustrating how LettuceDetect stacks up against both prompt-based methods (e.g., GPT-4) and other encoder-based solutions (e.g., Luna). Overall, lettucedetect-large-v1 and lettucedetect-base-v1 are highly performant models while remaining very efficient at inference.
Span-level results
Beyond detecting whether an answer contains hallucinations, we also examine LettuceDetect's ability to identify the exact spans of unsupported content. Here, LettuceDetect achieves state-of-the-art results among models that have reported span-level performance, significantly outperforming the fine-tuned Llama-2-13B model from the RAGTruth paper [1] and other baselines.

Most methods, like RAG-HAT [15], do not report span-level metrics, so we do not compare against them here.
Inference efficiency
Both lettucedetect-base-v1 and lettucedetect-large-v1 require far fewer parameters than typical LLM-based detectors (e.g., GPT-4 or Llama-3-8B) and can process 30–60 examples per second on a single NVIDIA A100 GPU. This makes them practical for industrial workloads, real-time user-facing systems, and resource-constrained environments.
Overall, these results show that LettuceDetect strikes a good balance: it achieves near state-of-the-art accuracy at a fraction of the size and cost of large LLM-based judges, while offering precise, token-level hallucination detection.
Get started
Install the package:
pip install lettucedetect
Then, you can use the package as follows:
from lettucedetect.models.inference import HallucinationDetector

# For a transformer-based approach:
detector = HallucinationDetector(
    method="transformer", model_path="KRLabsOrg/lettucedect-base-modernbert-en-v1"
)

contexts = ["France is a country in Europe. The capital of France is Paris. The population of France is 67 million.",]
question = "What is the capital of France? What is the population of France?"
answer = "The capital of France is Paris. The population of France is 69 million."

# Get span-level predictions indicating which parts of the answer are considered hallucinated.
predictions = detector.predict(context=contexts, question=question, answer=answer, output_format="spans")
print("Predictions:", predictions)

# Predictions: [{'start': 31, 'end': 71, 'confidence': 0.9944414496421814, 'text': ' The population of France is 69 million.'}]
Conclusion
We introduced LettuceDetect, a lightweight and efficient framework for hallucination detection in RAG systems. By leveraging ModernBERT's extended context capabilities, our models achieve strong performance on the RAGTruth benchmark while retaining high inference efficiency. This work lays the groundwork for future research directions, such as expanding to more datasets, supporting multiple languages, and exploring more advanced architectures. Even at this stage, LettuceDetect demonstrates that effective hallucination detection can be achieved with lean, purpose-built encoder-based models.
Citation
If you find this work useful, please cite it as follows:
@misc{Kovacs:2025,
      title={LettuceDetect: A Hallucination Detection Framework for RAG Applications},
      author={Ádám Kovács and Gábor Recski},
      year={2025},
      eprint={2502.17125},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.17125},
}
Also, if you use our code, please don't forget to give us a star ⭐ on our GitHub repository.
References
[1] Niu et al., 2024, RAGTruth: A Dataset for Hallucination Detection in Retrieval-Augmented Generation
[3] ModernBERT: A Modern BERT Model for Long-Context Processing
[4] GPT-4 report
[5] Llama-3 report
[6] Mistral 7B
[7] Kaddour et al., 2023, Challenges and Applications of Large Language Models
[9] Gao et al., 2024, Retrieval-Augmented Generation for Large Language Models: A Survey
[10] Ji et al., 2023, Survey of Hallucination in Natural Language Generation
[13] Cohen et al., 2023, LM vs LM: Detecting Factual Errors via Cross Examination
[14] Friel et al., 2023, Chainpoll: A high efficacy method for LLM hallucination detection