BenchmarkAgent

A fully autonomous agentic system for building customized, high-quality benchmarks for LLMs & MLLMs — at scale, on demand.

Abstract

Benchmark Everything, Everywhere, All at Once

Benchmarks are crucial for evaluating LLMs and MLLMs, yet constructing them is labor-intensive and difficult to scale, and the resulting benchmarks are often prone to rapid performance saturation.

We present BenchmarkAgent, a fully autonomous agentic system that standardizes and automates benchmark construction, covering user requirement analysis, subtask design, data annotation, and quality control.

Across 15 representative benchmarks spanning text, multimodal, and domain-specific reasoning, BenchmarkAgent produces high-quality samples with minimal human involvement, as validated by human evaluation, LLM-as-a-judge assessment, and consistency checks.

Examples

Try the Benchmarks Yourself

Real samples generated by BenchmarkAgent.
