
Evaluating progress of LLMs on scientific problem-solving

Current scientific benchmarks for large language models (LLMs) often focus on simple knowledge recall and multiple-choice responses, which do not reflect the complex, context-rich reasoning required in real-world research. To bridge this gap, Google Research has introduced CURIE, alongside the SPIQA and FEABench datasets, to evaluate LLMs on their ability to understand long-form documents, analyze multimodal data, and solve multi-step problems. These benchmarks aim to move AI from merely surfacing facts to actively assisting scientists in workflows involving information extraction, algebraic manipulation, and tool use.

The CURIE Multitask Benchmark

  • CURIE spans six diverse scientific disciplines: materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins.
  • The benchmark comprises 10 challenging tasks that demand skills such as concept tracking, information aggregation, and cross-domain expertise, drawn from 429 full-length research documents.
  • The benchmark's scale reflects its complexity: input queries average 15,000 words and ground-truth responses average 954 words.
  • Domain experts were involved in every phase of development, from sourcing papers to creating nuanced ground-truth answers in formats like JSON, LaTeX, and YAML.
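To make the long-context setup above concrete, here is a minimal sketch of how a CURIE-style record could be represented and a structured answer parsed. The field names and example values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical sketch of a CURIE-style evaluation record; the field names
# and example values below are illustrative assumptions, not the real schema.
import json
from dataclasses import dataclass

@dataclass
class ScientificTask:
    domain: str          # e.g. "materials science" or "quantum computing"
    task: str            # e.g. "concept tracking" or "information aggregation"
    paper_text: str      # full-length research document (~15,000 words on average)
    query: str           # the long-form question posed to the model
    ground_truth: str    # expert-written answer, e.g. a JSON, LaTeX, or YAML string

def parse_structured_answer(answer: str):
    """Parse a structured (JSON-formatted) ground-truth or model answer."""
    try:
        return json.loads(answer)
    except json.JSONDecodeError:
        return None  # fall back to free-form comparison (e.g. ROUGE-L)

example = ScientificTask(
    domain="materials science",
    task="information aggregation",
    paper_text="(full paper text would go here)",
    query="List every material and its reported band gap.",
    ground_truth='{"materials": [{"name": "GaN", "band_gap_eV": 3.4}]}',
)
print(parse_structured_answer(example.ground_truth))
```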

Multimodal Reasoning and Agentic Simulation

  • The SPIQA (Scientific Paper Image Question Answering) dataset evaluates the ability of multimodal LLMs to ground their answers in complex figures and tables found in scientific literature.
  • FEABench (Finite Element Analysis Benchmark) measures the ability of LLM agents to simulate and solve multiphysics, mathematics, and engineering problems.
  • These benchmarks specifically test whether models can choose the correct computational tools and reason through the physical constraints of a given problem.
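As a rough illustration of the tool-choice behavior these benchmarks probe, the sketch below shows a bare-bones dispatch loop in which an agent's chosen tool is applied to a problem. The tool names and stub functions are hypothetical and stand in for the real solvers (e.g. finite element software) an agent would call.

```python
# Illustrative agent-style tool dispatch with hypothetical tool names; this
# only shows the general "choose a tool, call it" pattern, not FEABench itself.
from typing import Callable, Dict

def symbolic_solver(problem: str) -> str:
    return f"symbolic solution for: {problem}"

def fem_simulator(problem: str) -> str:
    return f"finite-element result for: {problem}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "algebra": symbolic_solver,      # closed-form manipulation
    "multiphysics": fem_simulator,   # numerical simulation
}

def solve(problem: str, chosen_tool: str) -> str:
    """Dispatch the problem to whichever tool the agent selected."""
    if chosen_tool not in TOOLS:
        raise ValueError(f"unknown tool: {chosen_tool}")
    return TOOLS[chosen_tool](problem)

# In a real agent, the LLM would emit `chosen_tool` after reading the problem;
# here the choice is hard-coded to keep the sketch self-contained.
print(solve("steady-state heat conduction in a plate", "multiphysics"))
```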

Programmatic and Model-Based Evaluation

  • Because scientific answers are often descriptive or heterogeneously formatted, the evaluation uses programmatic metrics such as ROUGE-L and intersection-over-union (IoU); simplified versions of both appear in the sketch after this list.
  • For free-form and complex technical generation, the framework incorporates model-based evaluations to ensure AI responses align with expert assessments.
  • Task difficulty is quantified by expert ratings, ensuring the benchmark measures high-level reasoning rather than just pattern matching.
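For reference, here are simplified implementations of the two programmatic metrics named above; the benchmark's actual scoring code may differ in tokenization, normalization, and weighting.

```python
# Simplified versions of ROUGE-L (LCS-based text overlap) and IoU (set overlap).
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1: longest-common-subsequence overlap between prediction and reference."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

def iou(predicted: set, ground_truth: set) -> float:
    """Intersection-over-union for set-valued answers (e.g. extracted entities)."""
    if not predicted and not ground_truth:
        return 1.0
    return len(predicted & ground_truth) / len(predicted | ground_truth)

print(rouge_l_f1("the band gap of GaN is 3.4 eV", "GaN has a band gap of 3.4 eV"))
print(iou({"GaN", "Si"}, {"GaN", "Ge"}))
```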

These new benchmarks provide a rigorous framework for developing LLMs that can act as true collaborators in the scientific process. By focusing on long-context understanding and tool-integrated reasoning, researchers can better track the progress of AI in handling the actual complexities of modern scientific discovery.