
You can now run custom evaluators during model training to get deeper insight into how your model is improving. When you fine-tune a model on Datawizz, we always show metrics like training and evaluation loss in a live chart, but sometimes loss doesn't tell the whole story. Instead of waiting for the training run to finish and only then running your evaluators, you can now run them during training and get a live view of how well your model is improving.
The evaluators run on every evaluation step against the evaluation data and report their scores live alongside the evaluation loss.
When training a model, open the “Custom Evaluators” options section to select the evaluators to run: these can be any of the built-in evaluators or custom evaluators you have defined in your project.

You should also pay attention to a couple of related training configurations:
Validation Split - the portion of the training data held out for validation during training. The evaluators run against this data.
Number of Evaluations per Epoch - how many evaluation steps run in every epoch. Adding custom evaluators slows down each evaluation step, so consider lowering the number of evaluations per epoch for a faster training run (the rough arithmetic below shows how this setting translates into evaluation frequency).
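To get a feel for the trade-off, here is a rough back-of-the-envelope sketch in Python. All of the numbers and the step arithmetic are illustrative assumptions, not Datawizz internals:

```python
# Rough illustration only: the exact step accounting inside Datawizz may differ.
train_examples = 10_000          # size of your dataset (made-up number)
validation_split = 0.1           # the "Validation Split" setting: 10% held out for validation
batch_size = 16                  # assumed training batch size
evals_per_epoch = 4              # the "Number of Evaluations per Epoch" setting

train_steps_per_epoch = int(train_examples * (1 - validation_split)) // batch_size
eval_every_n_steps = max(1, train_steps_per_epoch // evals_per_epoch)
eval_examples = int(train_examples * validation_split)  # evaluators score every example in this set

print(f"~{train_steps_per_epoch} training steps per epoch, evaluating every ~{eval_every_n_steps} steps")
print(f"each evaluation runs the selected evaluators over {eval_examples} validation examples")
```

The more evaluations per epoch you request, and the heavier your evaluators, the more time the run spends evaluating instead of training.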

During the training process, you’ll see the evaluator results alongside the training and evaluation loss in the live-updating chart. Keep in mind that evaluator scores are usually oriented so higher is better, while for loss lower is better, so expect the curves to cross in an X-like shape as training progresses.

Custom Evaluators and LLM-as-Judge Evaluations
Beyond using our built-in metrics like ROUGE scores, string equality, and JSON comparison (among others), you can define your own custom evaluators for fine-grained control. For instance:
In agentic use cases, you might want to evaluate the tool calls generated by the model
For data extraction, you may want to validate specific keys and values in the generated data
For multi-label classification, you might want to measure metrics like precision and recall
For SQL generation, you can check that the generated text is valid SQL, or even set up a test database and validate queries against it (a minimal version of this check is sketched below)
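As an example of the SQL case, here is a minimal sketch of a validity-check evaluator in Python. The function name, signature, and test schema are illustrative assumptions; check the Datawizz docs for the exact evaluator interface your project expects.

```python
import sqlite3

# Hypothetical test schema matching the queries the model is expected to generate.
TEST_SCHEMA = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL, created_at TEXT);
"""

def evaluate_sql_validity(output: str) -> float:
    """Return 1.0 if the generated text is SQL that SQLite can parse and plan
    against the test schema, 0.0 otherwise. (Signature is illustrative; the
    actual Datawizz evaluator interface may differ.)"""
    connection = sqlite3.connect(":memory:")
    try:
        connection.executescript(TEST_SCHEMA)
        # EXPLAIN makes SQLite parse and plan the statement without executing it,
        # so syntax errors and references to unknown tables/columns are caught.
        connection.execute(f"EXPLAIN {output}")
        return 1.0
    except sqlite3.Error:
        return 0.0
    finally:
        connection.close()
```

For example, evaluate_sql_validity("SELECT customer, total FROM orders WHERE total > 100") returns 1.0, while a query against a misspelled table returns 0.0.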
Datawizz lets you easily create custom evaluation functions in Python, or use prompting to leverage LLM-as-Judge evaluation. Read our docs on defining custom evaluators. All custom evaluators can now be used during training!
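To illustrate the LLM-as-Judge idea, here is a rough sketch that scores an answer with a judge model via the OpenAI client. In Datawizz you define the judge through prompting as described in the docs; the judge model, rubric, and function signature below are assumptions made purely for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a model's answer against a reference answer.
Score the answer from 0 to 10 for agreement with the reference.
Reply with the number only.

Reference answer:
{expected}

Model answer:
{output}
"""

def llm_judge_score(output: str, expected: str) -> float:
    """Hypothetical LLM-as-Judge evaluator: asks a judge model for a 0-10 score
    and normalizes it to the 0-1 range. Model choice and signature are illustrative."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(expected=expected, output=output)}],
        temperature=0,
    )
    try:
        return float(response.choices[0].message.content.strip()) / 10.0
    except ValueError:
        return 0.0  # the judge replied with something other than a number
```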


