Evaluating progress of LLMs on scientific problem-solving

Programmatic and model-based evaluations

Tasks in CURIE are diverse and have ground-truth annotations in mixed and heterogeneous forms, e.g., as JSONs, LaTeX equations, YAML files, or free-form text. Evaluating free-form generation is challenging because answers are often descriptive, and even when a format is specified, as in most of our cases, the response to each field can take different forms. For example, materials grid points may sometimes be specified as “[p, q, r]” and at other times as “p × q × r”. Hence, in addition to programmatic evaluation metrics, such as ROUGE-L, intersection-over-union (used for BIOGR), and identity ratio (used in PDB), we propose two model-based evaluation metrics.
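To illustrate the format-variation problem, here is a minimal sketch (not part of the benchmark code) of a normalizer that lets the two grid-point spellings above compare equal before a programmatic metric is applied; the function name and regex are illustrative assumptions.

```python
import re

def normalize_grid_points(text: str) -> tuple:
    """Extract integer grid dimensions from either a "[p, q, r]" or a
    "p × q × r" style answer, so both spellings normalize to the
    same tuple before exact comparison."""
    return tuple(int(n) for n in re.findall(r"\d+", text))

# Both spellings of the same materials grid normalize identically.
print(normalize_grid_points("[4, 4, 2]") == normalize_grid_points("4 × 4 × 2"))
```

Programmatic metrics like ROUGE-L can then be computed on the normalized strings rather than the raw model output.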

(1) LMScore: Prompts an LLM asking how closely the predictions match the ground truth on a 3-point scale: “good” if the prediction has few minor errors, “okay” if there are many minor errors, and “bad” if there are major errors. We take the weighted average of the log-likelihood scores of the tokens to produce a final confidence.
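One plausible way to turn the judge LLM's token log-likelihoods for the three labels into a single confidence is sketched below; the label weights and the renormalization over the three labels are our assumptions, not the exact CURIE formula.

```python
import math

# Assumed numeric weights for the 3-point scale.
LABEL_WEIGHTS = {"good": 1.0, "okay": 0.5, "bad": 0.0}

def lm_score(label_logprobs: dict) -> float:
    """Collapse log-likelihoods over the three rating tokens into one
    confidence in [0, 1]: exponentiate to probabilities, renormalize
    over the three labels, then take the probability-weighted average
    of the label weights."""
    probs = {label: math.exp(lp) for label, lp in label_logprobs.items()}
    total = sum(probs.values())
    return sum(LABEL_WEIGHTS[label] * p / total for label, p in probs.items())

# A judge that mostly says "good" yields a high confidence.
score = lm_score({"good": math.log(0.7),
                  "okay": math.log(0.2),
                  "bad": math.log(0.1)})
print(round(score, 2))  # → 0.8
```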

(2) LLMSim: Is used for retrieval tasks where we ask the model to exhaustively extract many details, e.g., descriptors, properties, and values of materials from a research document, and provide as output an unordered list of dictionaries or records. We use a chain-of-thought (CoT) prompt that asks the LLM to look at each ground-truth record and identify the predicted records that correctly match each field (key) and value of the ground truth. Once we match the ground-truth records with predicted records, we can then measure precision and recall for the retrieval task, and compute the mean average precision, recall, and F1 scores across all documents.
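Once the judge LLM has produced the record-level matching, the per-document scores follow from standard set arithmetic. A minimal sketch, assuming matches arrive as (ground-truth index, predicted index) pairs:

```python
def retrieval_scores(gt_records: list, pred_records: list, matches: list) -> tuple:
    """Compute precision, recall, and F1 for one document, given the
    (gt_index, pred_index) pairs the judge LLM declared matching.
    Precision: fraction of predicted records that match some ground truth.
    Recall: fraction of ground-truth records recovered by a prediction."""
    matched_gt = {g for g, _ in matches}
    matched_pred = {p for _, p in matches}
    precision = len(matched_pred) / len(pred_records) if pred_records else 0.0
    recall = len(matched_gt) / len(gt_records) if gt_records else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 4 ground-truth records, 5 predicted, 3 matched pairs.
p, r, f1 = retrieval_scores([{}] * 4, [{}] * 5, [(0, 0), (1, 2), (2, 4)])
print(p, r)  # → 0.6 0.75
```

Averaging these per-document scores then gives the corpus-level mean precision, recall, and F1 reported for the retrieval tasks.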