Evaluating progress of LLMs on scientific problem-solving
Current scientific benchmarks for large language models (LLMs) often focus on simple knowledge recall and multiple-choice responses, which do not reflect the complex, context-rich reasoning required in real-world research. To bridge this gap, Google Research has introduced CURIE, alongside the SPIQA and FEABench datasets, to evaluate LLMs on their ability to understand long-form documents, analyze multimodal data, and solve multi-step problems. These benchmarks aim to move AI from merely surfacing facts to actively assisting scientists in workflows involving information extraction, algebraic manipulation, and tool use.
The CURIE Multitask Benchmark
- CURIE spans six diverse scientific disciplines: materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins.
- The benchmark comprises 10 challenging tasks that demand skills such as concept tracking, information aggregation, and cross-domain expertise, drawn from 429 full-length research documents.
- The complexity of the benchmark is reflected in its scale, with input queries averaging 15,000 words and ground-truth responses averaging 954 words.
- Domain experts were involved in every phase of development, from sourcing papers to creating nuanced ground-truth answers in formats like JSON, LaTeX, and YAML (a hypothetical example of such a record is sketched below).
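To make the structured ground-truth format concrete, here is a minimal, hypothetical record for a materials-science extraction task, serialized to JSON. Every field name and value below is an illustrative assumption, not an excerpt from the CURIE dataset.

```python
import json

# Hypothetical ground-truth record for a materials-science extraction task.
# All field names and values are illustrative only, not taken from CURIE.
ground_truth = {
    "paper_id": "example-2024-001",          # assumed identifier scheme
    "task": "property_extraction",           # assumed task label
    "material": "LiFePO4",
    "properties": [
        {"name": "band_gap", "value": 3.7, "unit": "eV"},
        {"name": "lattice_constant_a", "value": 10.33, "unit": "angstrom"},
    ],
    "source_sections": ["Results", "Table 2"],
}

# Experts author records like this in JSON, YAML, or LaTeX; here we simply
# serialize to JSON so a model's output can be compared against it.
print(json.dumps(ground_truth, indent=2))
```

A YAML or LaTeX version of the same record would carry identical content; the format simply follows whatever structure best fits the task.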
Multimodal Reasoning and Agentic Simulation
- The SPIQA (Scientific Paper Image Question Answering) dataset evaluates the ability of multimodal LLMs to ground their answers in complex figures and tables found in scientific literature.
- FEABench (Finite Element Analysis Benchmark) measures the ability of LLM agents to set up simulations and solve multiphysics, mathematics, and engineering problems.
- These benchmarks specifically test whether models can choose the correct computational tools and reason through the physical constraints of a given problem; a toy sketch of this tool-selection step follows below.
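As a rough sketch of what "choosing the correct computational tool" can look like in an agent loop, the snippet below routes a physics problem to one of two stubbed solvers. The tool names, the `choose_tool` heuristic, and the stub functions are all assumptions made for illustration; they are not the FEABench harness or a real finite element API.

```python
from typing import Callable, Dict

# Hypothetical tool stubs; a real agent would wrap actual solvers
# (e.g., a finite element package), which are only mocked here.
def solve_fea(problem: str) -> str:
    return f"[stub] finite element solution for: {problem}"

def symbolic_solve(problem: str) -> str:
    return f"[stub] symbolic solution for: {problem}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "finite_element_analysis": solve_fea,
    "symbolic_math": symbolic_solve,
}

def choose_tool(problem: str) -> str:
    """Toy heuristic standing in for the LLM's reasoning step: route
    geometry/boundary-condition problems to FEA, everything else to symbolic math."""
    keywords = ("mesh", "boundary", "stress", "heat flux")
    if any(k in problem.lower() for k in keywords):
        return "finite_element_analysis"
    return "symbolic_math"

if __name__ == "__main__":
    problem = "Compute the steady-state heat flux through a 2D plate with fixed boundary temperatures."
    tool_name = choose_tool(problem)
    print(f"Selected tool: {tool_name}")
    print(TOOLS[tool_name](problem))
```

In a real agent the routing decision would come from the LLM itself rather than a keyword heuristic, and the stubs would wrap an actual solver whose outputs the agent must interpret against the problem's physical constraints.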
Programmatic and Model-Based Evaluation
- Because scientific answers are often descriptive or formatted heterogeneously, the evaluation uses programmatic metrics like ROUGE-L and Intersection-over-Union (IoU); simplified versions of both are sketched after this list.
- For free-form and complex technical generation, the framework incorporates model-based evaluations whose scores are designed to align with expert assessments.
- Task difficulty is quantified by expert ratings, ensuring the benchmark measures high-level reasoning rather than just pattern matching.
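To illustrate the programmatic side of scoring, the snippet below implements a simplified LCS-based ROUGE-L F1 for textual answers and an Intersection-over-Union score for set-valued answers. This is a minimal sketch of the general metrics, not CURIE's evaluation code, and the whitespace tokenization is intentionally naive.

```python
def rouge_l_f1(reference: str, candidate: str) -> float:
    """LCS-based ROUGE-L F1 over whitespace tokens (simplified sketch)."""
    ref, cand = reference.split(), candidate.split()
    # Dynamic-programming longest common subsequence length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def iou(reference: set, candidate: set) -> float:
    """Intersection-over-Union for set-valued answers (e.g., extracted entities)."""
    union = reference | candidate
    return len(reference & candidate) / len(union) if union else 1.0

# Example usage with toy answers.
print(rouge_l_f1("the band gap is 3.7 eV", "band gap of 3.7 eV"))
print(iou({"LiFePO4", "graphene"}, {"graphene", "silicon"}))
```

Where answers are too free-form for metrics like these, the model-based evaluation described above takes over, with a judge model comparing the candidate response against the expert ground truth.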
These new benchmarks provide a rigorous framework for developing LLMs that can act as true collaborators in the scientific process. By focusing on long-context understanding and tool-integrated reasoning, researchers can better track the progress of AI in handling the actual complexities of modern scientific discovery.