You can only optimize what you can measure, and when it comes to LLMs, the key to measurement is good evaluators. We've long prioritized making it simple for developers to evaluate models so they can make better decisions about where to route requests. Our built-in evaluation feature can generate in-depth LLM performance reports using pre-defined or custom evaluators.
Today we are expanding this feature by letting you use custom dependencies in your evaluators! Any custom code-based evaluator can now declare the Python packages it needs in its script metadata, like so:
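(A minimal sketch: the package and the `evaluate` function below are illustrative, not a required interface; the `# /// script` comment block at the top is the part that declares the dependencies.)

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "rapidfuzz",
# ]
# ///

# Illustrative evaluator: the comment block above is the PEP 723 metadata
# that declares the packages this script needs.
from rapidfuzz import fuzz


def evaluate(output: str, expected: str) -> float:
    # Score the model output by fuzzy string similarity to the expected
    # answer, normalized to the 0..1 range.
    return fuzz.ratio(output, expected) / 100.0


if __name__ == "__main__":
    print(evaluate("Paris is the capital of France.", "Paris"))
```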
We use the PEP 723 syntax for inline script metadata and runtime requirements. You can read more about inline script metadata in the official Python docs.
Under the Hood
To make this scalable and performant, we are also switching over to uv to manage script dependencies and run these evaluator scripts. The switch to uv makes it possible to install and run custom packages on the fly, balancing speed with proper script-level isolation. Read more about running scripts with uv.
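As a rough illustration of what this enables (assuming uv is on the PATH and the evaluator above is saved as `evaluator.py`; the actual service wiring is more involved), a single `uv run` call resolves the declared dependencies and executes the script in an isolated, cached environment:

```python
import subprocess

# Rough sketch (assumes uv is installed and the evaluator above is saved
# as evaluator.py): `uv run` reads the inline PEP 723 metadata, installs
# the declared dependencies into a cached, isolated environment, and then
# executes the script.
result = subprocess.run(
    ["uv", "run", "evaluator.py"],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout, end="")
```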
Beyond Evals
We recently introduced the ability to run evaluators during model training. This gives you a deeper understanding of how the model is improving on the key metrics you care about while training is still in progress, shortening the model development lifecycle.
We are also planning to introduce the ability to run evaluations "online" as inference logs stream into the system, letting you use evaluators as part of your LLM observability stack.
All of these workflows will benefit from our improved, more capable evaluators, giving you the flexibility to create bespoke evaluations that measure LLMs in the way most relevant to your application!
This feature is now live and available to all users. Happy evals!


