Evaluate and compare large language models with an open, extensible framework.
AI21 Labs’ lm-evaluation is an open-source, extensible framework for benchmarking large language models (LLMs). Hosted on GitHub and aligned with AI21 Labs’ mission to make machines thought partners for humans, it enables researchers and developers to test, compare, and analyze model performance at scale. With support for custom properties, compliance decoration, and add-on tasks, the toolkit helps teams adapt evaluations to their specific requirements while drawing on thorough documentation and a collaborative open-source community.
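To make the idea of a custom evaluation task concrete, here is a minimal, self-contained sketch of the pattern such a framework supports: a small set of prompt/expected pairs scored with exact-match accuracy. The names (`Sample`, `evaluate`, `toy_model`) are illustrative assumptions for this example only and are not part of lm-evaluation’s actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """One evaluation example: a prompt and its expected completion (illustrative only)."""
    prompt: str
    expected: str


def evaluate(model: Callable[[str], str], samples: List[Sample]) -> float:
    """Return exact-match accuracy of `model` over a small custom task."""
    correct = sum(1 for s in samples if model(s.prompt).strip() == s.expected)
    return correct / len(samples)


if __name__ == "__main__":
    # A toy "model" standing in for a real LLM completion call.
    def toy_model(prompt: str) -> str:
        return "Paris" if "France" in prompt else "unknown"

    task = [
        Sample("What is the capital of France?", "Paris"),
        Sample("What is the capital of Italy?", "Rome"),
    ]
    print(f"exact-match accuracy: {evaluate(toy_model, task):.2f}")  # prints 0.50
```

In practice, the model callable would wrap an API or local inference call, and the sample list would be loaded from a benchmark dataset rather than defined inline; the framework then handles running many such tasks and aggregating their scores.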