What happens when you take a working chatbot that's already serving hundreds of customers a day in four different languages, and try to deliver an even better experience using Large Language Models? Good question.
It's well known that evaluating and comparing LLMs is tricky. Benchmark datasets can be hard to come by, and metrics such as BLEU are imperfect. But these are largely academic problems: how are industry data teams tackling these issues when incorporating LLMs into production projects?
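As a quick illustration of why BLEU's imperfection matters for chatbots: it rewards n-gram overlap with a reference answer, so a response that copies the reference but flips the meaning can outscore a faithful paraphrase. Here's a minimal sketch using NLTK's `sentence_bleu` (the example sentences are hypothetical, not from a real dataset):

```python
# Minimal sketch of BLEU's blind spot, using NLTK's sentence_bleu.
# The sentences below are hypothetical examples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "your order will arrive within three days".split()
paraphrase = "expect delivery in three days".split()             # same meaning, different words
near_copy = "your order will arrive within three weeks".split()  # wrong meaning, same words

smooth = SmoothingFunction().method1  # avoids zero scores when an n-gram order has no matches
print(sentence_bleu([reference], paraphrase, smoothing_function=smooth))  # low (~0.1)
print(sentence_bleu([reference], near_copy, smoothing_function=smooth))   # high (~0.8)
```

A score like this says little about whether the bot actually helped the customer, which is why the metric alone can't settle which model is better.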
In my work as a Conversational AI Engineer, I'm doing exactly that. And that's how I ended up centre-stage at a recent data science conference, giving the (optimistically titled) talk, "No baseline? No benchmarks? No biggie!" Today's post is a recap of this talk, featuring:
- The challenges of evaluating an evolving, LLM-powered PoC against a working chatbot
- How we're using different kinds of testing at different stages of the PoC-to-production process
- Practical pros and cons of different test types