Case Study: Multilingual LLM for Questionnaire Summarization | by Sria Louis | Jul, 2024

An LLM Approach to Summarizing Students' Responses to Open-ended Questionnaires in Language Courses

Illustration: Or Livneh

Madrasa (מדרסה in Hebrew) is an Israeli NGO dedicated to teaching Arabic to Hebrew speakers. Recently, while learning Arabic, I discovered that the NGO has unique data and that the organization could benefit from a thorough analysis. A friend and I joined the NGO as volunteers, and we were asked to work on the summarization task described below.

What makes this summarization task so interesting is the unique mix of documents in three languages (Hebrew, Arabic, and English), while also dealing with the imprecise transcriptions among them.

A word on privacy: the data may include PII and therefore cannot be published at this time. If you believe you can contribute, please contact us.

Context of the Problem

As part of its language courses, Madrasa distributes questionnaires to students, which include both quantitative questions requiring numeric responses and open-ended questions where students provide answers in natural language.

In this blog post, we will focus on the open-ended natural language responses.

The Problem

The primary challenge is managing and extracting insights from a substantial volume of responses to open-ended questions. Specifically, the difficulties include:

Multilingual Responses: Student responses are primarily in Hebrew but also include Arabic and English, creating a complex multilingual dataset. Moreover, since transliteration is often used in Spoken Arabic courses, we found that students sometimes answered questions using both transliteration and Arabic script. We were surprised to see that some students even transliterated Hebrew and Arabic into Latin letters.

Nuanced Sentiments: The responses vary widely in sentiment and tone, including humor, suggestions, gratitude, and personal reflections.

Diverse Topics: Students touch on a wide range of subjects, from praising teachers to reporting technical issues with the website and app, to personal aspirations.

The Data

There are a couple of courses. Each course includes three questionnaires administered at the beginning, middle, and end of the course. Each questionnaire contains a few open-ended questions.

The tables below give examples of two questions along with a curated selection of student responses.

Example of a question and student responses. LEFT: Original question and student responses. RIGHT: Translation into English for the blog post reader. Note the mix of languages, including Arabic-to-Hebrew transliteration, the variety of topics even within the same sentences, and the different language registers. Credit: Sria Louis / Madrasa
Example of a question and student responses. LEFT: Original question and student responses. RIGHT: Translation into English for the blog post reader. Note the mix of languages and transliterations, including both English-to-Hebrew and Hebrew-to-English. Credit: Sria Louis / Madrasa

There are tens of thousands of student responses for each question, and after splitting into sentences (as described below), there can be up to around 100,000 sentences per column. This volume is manageable, allowing us to work locally.

Our goal is to summarize student opinions on various topics for each course, questionnaire, and open-ended question. We aim to capture the "essential opinions" of the students while ensuring that "niche opinions" or "valuable insights" offered by individual students are not overlooked.

The Solution

To address the challenges mentioned above, we implemented a multi-step natural language processing (NLP) solution.

The processing pipeline includes:

  1. Sentence tokenization (using the NLTK sentence tokenizer)
  2. Topic modeling (using BERTopic)
  3. Topic representation (using BERTopic + LLM)
  4. Batch summarization (LLM with mini-batches fitting into the context size)
  5. Re-summarizing the batches to create a final comprehensive summary.

Sentence Tokenization: We use NLTK to divide student responses into individual sentences. This step is crucial because student inputs often cover multiple topics within a single response. For example, a student might write, "The teacher used day-to-day examples. The games on the app were excellent." Here, each sentence addresses a different aspect of their experience. While sentence tokenization sometimes results in loss of context due to cross-references between sentences, it generally enhances the overall analysis by breaking responses into more manageable, topic-specific units. This approach has proven to significantly improve the end results.

NLTK's sentence tokenizer (nltk.tokenize.sent_tokenize) splits documents into sentences using linguistic rules and models to identify sentence boundaries. The default English model worked well for our use case.
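For illustration, here is a minimal sketch of this step (the example response is the one quoted above, not real student data):

import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # the pre-trained sentence-boundary model used by sent_tokenize

response = "The teacher used day-to-day examples. The games on the app were excellent."
sentences = sent_tokenize(response)
print(sentences)
# ['The teacher used day-to-day examples.', 'The games on the app were excellent.']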

Topic Modeling with BERTopic: We applied BERTopic to model the topics of the tokenized sentences, identify underlying themes, and assign a topic to each sentence. This step is crucial before summarization for several reasons. First, the variety of topics within the student responses is too vast to be handled effectively without topic modeling. By splitting the students' answers into topics, we can manage and batch the data more efficiently, leading to better performance during analysis. Moreover, topic modeling ensures that niche topics, mentioned by only a few students, do not get overshadowed by mainstream topics during the summarization process.

BERTopic is an elegant topic-modeling tool that embeds documents into vectors, clusters them, and models each cluster's representation. Its key advantage is modularity, which we make use of for Hebrew embeddings and hyperparameter tuning.

The BERTopic configuration was carefully designed to handle the multilingual nature of the data and the specific nuances of the responses, thereby improving the accuracy and relevance of the topic assignments.

In particular, note that we used a Hebrew sentence-embedding model. We did consider using word-level embeddings, but sentence embeddings proved to capture the needed information.

For dimensionality reduction and clustering we used BERTopic's standard models, UMAP and HDBSCAN, respectively, and with some hyperparameter fine-tuning the results satisfied us.

Here's a fantastic talk on HDBSCAN by John Healy, one of the authors. It's not just very educational; the speaker is really funny and witty! Definitely worth a watch 🙂

BERTopic has excellent documentation and a supportive community, so I'll share a code snippet to show how easy it is to use with advanced models. More importantly, we want to emphasize some hyperparameter choices designed to achieve high cluster granularity and allow smaller topics. Remember that our goal isn't only to summarize the "mainstream" ideas that most students agree upon, but also to highlight nuanced views and rarer students' suggestions. This approach comes with the trade-off of slower processing and the risk of getting too many topics, but managing ~40 topics is still feasible.

  • UMAP dimensionality reduction: a higher-than-standard number of components and a small number of UMAP neighbors.
  • HDBSCAN clustering: min_samples = 2 for high sensitivity, while min_cluster_size = 7 allows very small clusters.
  • BERTopic: nr_topics = 40.
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import MaximalMarginalRelevance

topic_size_ = 7

# Sentence embedding model for Hebrew (also works well on English)
sent_emb_model = "imvladikon/sentence-transformers-alephbert"
sentence_model = SentenceTransformer(sent_emb_model)

# Initialize UMAP model for dimensionality reduction before clustering
umap_model = UMAP(n_components=128, n_neighbors=4, min_dist=0.0)

# Initialize HDBSCAN model for BERTopic clustering
hdbscan_model = HDBSCAN(min_cluster_size=topic_size_,
                        gen_min_span_tree=True,
                        prediction_data=True,
                        min_samples=2)

# Class-based TF-IDF for topic representation (applied after clustering)
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Initialize MaximalMarginalRelevance to diversify the topic keywords
representation_model = MaximalMarginalRelevance(diversity=0.1)

# Configuration for BERTopic
bert_config = {
    'embedding_model': sentence_model,
    'top_n_words': 20,  # Number of top words used to represent each topic
    'min_topic_size': topic_size_,
    'nr_topics': 40,
    'low_memory': False,
    'calculate_probabilities': False,
    'umap_model': umap_model,
    'hdbscan_model': hdbscan_model,
    'ctfidf_model': ctfidf_model,
    'representation_model': representation_model
}

# Initialize the BERTopic model with the specified configuration
topic_model = BERTopic(**bert_config)
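For completeness, here is a minimal sketch of how the configured model is then fit on the tokenized sentences (sentences stands for the list of tokenized sentences from the previous step; the variable name is ours):

# Fit the topic model and get a topic id for each tokenized sentence
topics, probs = topic_model.fit_transform(sentences)

# Inspect the resulting topics: size and default c-TF-IDF keyword representation
print(topic_model.get_topic_info().head(10))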

Topic Representation & Summarization

For the next two parts, topic representation and topic summarization, we used chat-based LLMs, carefully crafting system and user prompts. The simple approach involved setting the system prompt to define the tasks of keyword extraction and summarization, and using the user prompt to input a lengthy list of documents, constrained only by context limits.

Before diving deeper, let's discuss the choice of chat-based LLMs and the infrastructure used. For a rapid proof of concept and development cycle, we opted for Ollama, known for its easy setup and minimal friction. We encountered some challenges switching models on Google Colab, so we decided to work locally on my M3 laptop. Ollama uses the Mac integrated GPU efficiently and proved adequate for my needs.

Initially, we tested various multilingual models, including LLaMA2, LLaMA3, and LLaMA3.1. However, a new version of the Dicta 2.0 model was released recently, and it immediately outperformed the others. Dicta 2.0 not only delivered better semantic results but also featured improved Hebrew tokenization (~one token per Hebrew character), allowing for longer context lengths and therefore larger batch processing without quality loss.

Dicta is a bilingual (Hebrew/English) LLM, fine-tuned from Mistral-7B-v0.1, and is available on Hugging Face.
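As a rough sketch of how such a locally served model can be called through Ollama's Python client (the model tag "dicta-2.0" is a placeholder: DictaLM is not in Ollama's built-in model library, so it first has to be imported locally, e.g. from a GGUF file; the chat helper below is our own illustrative wrapper):

import ollama

def chat(system_prompt: str, user_prompt: str, model: str = "dicta-2.0") -> str:
    """Send a system + user prompt to a locally served model and return its reply."""
    response = ollama.chat(
        model=model,  # placeholder tag for the locally imported Dicta 2.0 model
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response["message"]["content"]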

Topic Representation: This crucial step in topic modeling involves defining and describing topics through representative keywords or phrases, capturing the essence of each topic. The aim is to create clear, concise descriptions for understanding the content associated with each topic. While BERTopic offers effective tools for topic representation, we found it simpler to use external LLMs for this purpose. This approach allowed for more flexible experimentation, such as keyword prompt engineering, providing greater control over topic description and refinement.

  • System prompt (in Hebrew): “תפקידך למצוא עד חמש מילות מפתח של הטקסט ולהחזירן מופרדות בסימון נקודה. הקפד שכל מילה נבחרת תהיה מהטקסט הנתון ושהמילים תהיינה שונות אחת מן השניה. החזר לכל היותר חמש מילים שונות, בעברית, בשורה אחת קצרה, ללא אף מילה נוספת לפני או אחרי, ללא מספור וללא מעבר שורה וללא הסבר נוסף.” (In English: "Your role is to find up to five keywords of the text and return them separated by periods. Make sure each chosen word comes from the given text and that the words differ from one another. Return at most five distinct words, in Hebrew, on one short line, with no additional word before or after, no numbering, no line break, and no further explanation.")

  • User prompt: simply the keywords and representative sentences returned by BERTopic's default representation model (c-TF-IDF). A sketch of this step is shown below.
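A minimal sketch of this keyword-extraction step, reusing the chat helper above (KEYWORD_SYSTEM_PROMPT stands for the Hebrew keyword prompt quoted above; get_topic and get_representative_docs are BERTopic's built-in accessors for the default c-TF-IDF keywords and representative sentences):

KEYWORD_SYSTEM_PROMPT = "תפקידך למצוא עד חמש מילות מפתח..."  # the Hebrew keyword prompt above

topic_keywords = {}
for topic_id in topic_model.get_topic_info()["Topic"]:
    if topic_id == -1:  # skip the outlier topic
        continue
    # Default c-TF-IDF keywords and representative sentences for this topic
    ctfidf_words = [word for word, _ in topic_model.get_topic(topic_id)]
    rep_docs = topic_model.get_representative_docs(topic_id)
    user_prompt = "\n".join(ctfidf_words + rep_docs)
    topic_keywords[topic_id] = chat(KEYWORD_SYSTEM_PROMPT, user_prompt)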

Batch Summarization with LLM Models: For each topic, we employed an LLM to summarize student responses. Due to the large volume of data, responses were processed in batches, with each batch summarized separately before aggregating these summaries into a final comprehensive overview.

  • System prompt (in Hebrew): “המטרה שלך היא לתרגם לעברית ואז לסכם בעברית. הקלט הוא תשובות התלמידים לגבי השאלה הבאה [<X>]. סכם בפסקה אחת בדיוק עם לכל היותר 10 משפטים. הקפד לוודא שהתשובה מבוססת רק על הדעות שניתנו. מבחינה דקדוקית, נסח את הסיכום בגוף ראשון יחיד, כאילו אתה אחד הסטודנטים. כתוב את הסיכום בעברית בלבד, ללא תוספות לפני או אחרי הסיכום” (In English: "Your goal is to translate into Hebrew and then summarize in Hebrew. The input is the students' answers to the following question [<X>]. Summarize in exactly one paragraph with at most 10 sentences. Make sure the answer is based only on the opinions given. Grammatically, phrase the summary in the first-person singular, as if you were one of the students. Write the summary in Hebrew only, with no additions before or after the summary.") Here, [<X>] is the text of the question whose responses we are trying to summarize.

  • User prompt: a batch of student responses (as in the example above).

Note that we required translation to Hebrew before summarization. Without this specification, the model sometimes responded in English or Arabic if the input contained a mix of languages.

[Interestingly, Dicta 2.0 was able to converse in Arabic as well. This is surprising because Dicta 2.0 was not trained on Arabic (according to its release post, it was trained on 50% English and 50% Hebrew), and its base model, Mistral, was not specifically trained on Arabic either.]
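Here is a rough sketch of the batching logic, assuming a simple character budget as a stand-in for the model's context limit (the real pipeline may count tokens instead; SUMMARY_SYSTEM_PROMPT stands for the Hebrew summarization prompt above with the question text substituted for [<X>], and topic_sentences for the sentences assigned to one topic):

def make_batches(sentences, max_chars=6000):
    """Greedily group sentences into batches that fit within a rough context budget."""
    batches, current, current_len = [], [], 0
    for sent in sentences:
        if current and current_len + len(sent) > max_chars:
            batches.append(current)
            current, current_len = [], 0
        current.append(sent)
        current_len += len(sent)
    if current:
        batches.append(current)
    return batches

# Summarize each batch of the topic's sentences separately
batch_summaries = [
    chat(SUMMARY_SYSTEM_PROMPT, "\n".join(batch))
    for batch in make_batches(topic_sentences)
]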

Re-grouping the Batches: The non-trivial final step involved re-summarizing the aggregated batches to produce a single cohesive summary per topic per question. This required meticulous prompt engineering to ensure relevant insights from each batch were accurately captured and effectively presented. By refining prompts, we guided the LLM to focus on key points, resulting in a comprehensive and insightful summary.
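In code, this final aggregation can be as simple as one more call over the concatenated batch summaries (a sketch; RESUMMARIZE_SYSTEM_PROMPT is a placeholder for the separately tuned re-summarization prompt):

# Merge the per-batch summaries into one cohesive summary for the topic
final_summary = chat(RESUMMARIZE_SYSTEM_PROMPT, "\n".join(batch_summaries))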

This multi-step approach allowed us to effectively handle the multilingual and nuanced dataset, extract significant insights, and provide actionable recommendations to enhance the educational experience at מדרסה (Madrasa).

Evaluation

Evaluating a summarization task usually involves manual scoring of the summary's quality by humans. In our case, the task includes not only summarization but also business insights. Therefore, we require a summary that captures not only the average student's response but also the edge cases and rare or radical insights from a small number of students.

To address these needs, we split the evaluation into the pipeline steps mentioned above and assess them manually with a business-oriented approach. If you have a more rigorous methodology for holistic evaluation of such a project, we'd love to hear your ideas 🙂

For example, let's look at one question from a questionnaire in the middle of a beginners' course. The students were asked: "אנא שתף אותנו בהצעות לשיפור הקורס" (in English: "Please share with us suggestions for improving the course").

Most students responded with positive feedback, but some offered specific suggestions. The variety of suggestions is vast, and using clustering (topic modeling) and summarization, we can derive impactful insights for the NGO's management team.

Here’s a plot of the subject clusters, introduced utilizing BERTopic visualization instruments.

Hierarchical clustering: For visualization purposes, we present a set of 10 topics. However, in some cases our analysis included experimentation with tens of topics. Credit: Sria Louis / Madrasa.
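For reference, a hierarchical view like the one above can be produced directly with BERTopic's built-in visualization (a sketch, assuming the fitted topic_model from earlier; the number of displayed topics is capped for readability):

# Interactive hierarchical-clustering view of the topics (returns a Plotly figure)
fig = topic_model.visualize_hierarchy(top_n_topics=10)
fig.show()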

And finally, below are seven topics (out of 40) summarizing the students' responses to the above question. Each topic includes its keywords (generated by the keyword prompt), three representative responses from the cluster (selected using the representation model), and the final summarization.

Bottom line: note the variety of topics and the insightful summaries.

Some of the topics: keywords, representative sentences, and summaries. Credit: Sria Louis / Madrasa

What's next?

We have six steps in mind:

  1. Optimization: Experimenting with different architectures and hyperparameters.
  2. Robustness: Understanding and addressing unexpected sensitivity to certain hyperparameters.
  3. Hallucinations: Tackling hallucinations, particularly in small clusters/topics where the number of input sentences is limited, causing the model to generate 'imaginary' information.
  4. Enriching Summarizations: Using chain-of-thought techniques.
  5. Enriching Topic Modeling: Adding sentiment analysis before clustering. For example, if in a particular topic 95% of the responses were positive but 5% were very negative, it may be helpful to cluster based on both the topic and the sentiment of the sentence. This would help the summarizer avoid converging to the mean.
  6. Enhancing User Experience: Implementing RAG or LLM-explainability techniques. For instance, given a particular non-trivial insight, we want the user to be able to click on the insight and trace back to the specific student response that led to it.
