Evaluating progress of LLMs on scientific problem-solving

Programmatic and model-based evaluations

Tasks in CURIE are diverse and have ground-truth annotations in mixed and heterogeneous forms, e.g., as JSONs, LaTeX equations, YAML files, or free-form text. Evaluating free-form generation is challenging because answers are often descriptive, and even when a format is specified, as in most of our cases, the response to each field can take different forms. For example, materials grid points may sometimes be specified as “[p, q, r]” and at other times as “p × q × r”. Hence, in addition to programmatic evaluation metrics, such as ROUGE-L, intersection-over-union (used for BIOGR), and identity ratio (used in PDB), we propose two model-based evaluation metrics.
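To illustrate the format-variation problem, here is a minimal sketch (not part of the benchmark code) of a normalizer that lets the two grid-point spellings above compare equal before a programmatic metric is applied; the function name and regex are illustrative assumptions.

```python
import re

def normalize_grid_points(text: str) -> tuple:
    """Extract integer grid dimensions from either a "[p, q, r]" or a
    "p × q × r" style answer, so both spellings normalize to the
    same tuple before exact comparison."""
    return tuple(int(n) for n in re.findall(r"\d+", text))

# Both spellings of the same materials grid normalize identically.
print(normalize_grid_points("[4, 4, 2]") == normalize_grid_points("4 × 4 × 2"))
```

Programmatic metrics like ROUGE-L can then be computed on the normalized strings rather than the raw model output.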

(1) LMScore: Prompts an LLM asking how closely the predictions match the ground truth on a 3-point scale: “good” if the prediction has few minor errors, “okay” if there are many minor errors, and “bad” if there are major errors. We take the weighted average of the log-likelihood scores of the tokens to produce a final confidence.
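One plausible way to turn the judge LLM's token log-likelihoods for the three labels into a single confidence is sketched below; the label weights and the renormalization over the three labels are our assumptions, not the exact CURIE formula.

```python
import math

# Assumed numeric weights for the 3-point scale.
LABEL_WEIGHTS = {"good": 1.0, "okay": 0.5, "bad": 0.0}

def lm_score(label_logprobs: dict) -> float:
    """Collapse log-likelihoods over the three rating tokens into one
    confidence in [0, 1]: exponentiate to probabilities, renormalize
    over the three labels, then take the probability-weighted average
    of the label weights."""
    probs = {label: math.exp(lp) for label, lp in label_logprobs.items()}
    total = sum(probs.values())
    return sum(LABEL_WEIGHTS[label] * p / total for label, p in probs.items())

# A judge that mostly says "good" yields a high confidence.
score = lm_score({"good": math.log(0.7),
                  "okay": math.log(0.2),
                  "bad": math.log(0.1)})
print(round(score, 2))  # → 0.8
```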

(2) LLMSim: Is used for retrieval tasks where we ask the model to exhaustively extract many details, e.g., descriptors, properties, and values of materials from a research document, and provide as output an unordered list of dictionaries or records. We use a chain-of-thought (CoT) prompt that asks the LLM to look at each ground-truth record and identify the predicted records that correctly match each field (key) and value of the ground truth. Once we match the ground-truth records with predicted records, we can then measure precision and recall for the retrieval task, and compute the mean average precision, recall, and F1 scores across all documents.
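Once the judge LLM has produced the record-level matching, the per-document scores follow from standard set arithmetic. A minimal sketch, assuming matches arrive as (ground-truth index, predicted index) pairs:

```python
def retrieval_scores(gt_records: list, pred_records: list, matches: list) -> tuple:
    """Compute precision, recall, and F1 for one document, given the
    (gt_index, pred_index) pairs the judge LLM declared matching.
    Precision: fraction of predicted records that match some ground truth.
    Recall: fraction of ground-truth records recovered by a prediction."""
    matched_gt = {g for g, _ in matches}
    matched_pred = {p for _, p in matches}
    precision = len(matched_pred) / len(pred_records) if pred_records else 0.0
    recall = len(matched_gt) / len(gt_records) if gt_records else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 4 ground-truth records, 5 predicted, 3 matched pairs.
p, r, f1 = retrieval_scores([{}] * 4, [{}] * 5, [(0, 0), (1, 2), (2, 4)])
print(p, r)  # → 0.6 0.75
```

Averaging these per-document scores then gives the corpus-level mean precision, recall, and F1 reported for the retrieval tasks.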