Evaluate the performance of multiple models on your real data, using several evaluation metrics.

Automated Evaluation

Evaluate AI models for better results

Run automatic benchmarks on various models using your data to find the right model for every job

[Chart: Avg Score (axis 20% to 100%) for Llama-3.2-1B-Trained and meta-llama/Llama-3.2-1B-Instruct; values shown: 68% and 35%]

Evaluation Functions

Sophisticated Benchmark Functions

Datawizz offers multiple evaluation metrics, such as ROUGE, BERTScore, and string matching, for maximum flexibility.

  • wer

  • BLEU

  • cer

  • ROUGE

  • CHRF
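To make two of these metrics concrete, here is a minimal sketch of word error rate (wer) and character error rate (cer) built on Levenshtein edit distance. It is an illustration only, not Datawizz's implementation; the function names are hypothetical, and no text normalization (casing, punctuation) is applied.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete an item from the reference
                curr[j - 1] + 1,     # insert an item from the hypothesis
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on a mat"))
    print(cer("abcd", "abcf"))
```

Lower is better for both scores, and 0.0 means the output matches the reference exactly.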

Only in Datawizz

LLM as Judge

Datawizz lets you use LLMs to judge outputs from candidate models, perfect for unique cases with complex evaluation.
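The general LLM-as-judge pattern can be sketched as follows. This is an assumption-laden illustration, not the Datawizz API: the prompt template, the function names, and the `judge` callable (any function that sends text to a model and returns its reply) are all hypothetical.

```python
from typing import Callable

# Hypothetical judge prompt; a real setup would tune the wording and criteria.
JUDGE_TEMPLATE = """You are comparing two answers to the same prompt.

Prompt: {prompt}

Answer A: {answer_a}

Answer B: {answer_b}

Reply with exactly one letter: A if Answer A is better, B if Answer B is better."""


def build_judge_prompt(prompt: str, answer_a: str, answer_b: str) -> str:
    """Fill the template with the original prompt and both candidate answers."""
    return JUDGE_TEMPLATE.format(prompt=prompt, answer_a=answer_a, answer_b=answer_b)


def parse_verdict(reply: str) -> str:
    """Take the first character of the judge's reply and require 'A' or 'B'."""
    verdict = reply.strip().upper()[:1]
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected judge reply: {reply!r}")
    return verdict


def judge_pair(judge: Callable[[str], str], prompt: str,
               answer_a: str, answer_b: str) -> str:
    """Ask the judge model which answer is better and return 'A' or 'B'."""
    return parse_verdict(judge(build_judge_prompt(prompt, answer_a, answer_b)))


if __name__ == "__main__":
    # Stand-in judge that always prefers Answer A; replace with a real model call.
    stub_judge = lambda text: "A"
    print(judge_pair(
        stub_judge,
        "Create a short tagline for a new cybersecurity software for businesses",
        "Protect what matters with advanced cybersecurity",
        "Fortify your business with unbeatable cyber defense.",
    ))
```

Keeping the judge as a plain callable means the same harness works with any model provider, and a stub judge makes the parsing logic easy to test.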

Only in Datawizz

JSON Evaluation

Unique comparison for JSON objects and lists, perfect for evaluating data extraction, structured output, and function calling use cases.

  • String Equality

  • SacreBLEU

  • METEOR

  • BERTScore
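One simple way to compare JSON objects and lists, as in the JSON Evaluation feature described above, is to flatten both documents into path/value leaves and score the fraction that agree. The sketch below is an assumption about how such a comparison could work, not Datawizz's actual scoring; the function names are hypothetical.

```python
import json


def flatten(value, path=""):
    """Map each leaf of a parsed JSON value to a path such as 'items[0].name'."""
    if isinstance(value, dict) and value:
        leaves = {}
        for key, child in value.items():
            leaves.update(flatten(child, f"{path}.{key}" if path else key))
        return leaves
    if isinstance(value, list) and value:
        leaves = {}
        for index, child in enumerate(value):
            leaves.update(flatten(child, f"{path}[{index}]"))
        return leaves
    # Scalars and empty containers are treated as leaves.
    return {path: value}


def json_match_score(expected_json: str, actual_json: str) -> float:
    """Fraction of leaf paths (from either document) whose values are equal."""
    expected = flatten(json.loads(expected_json))
    try:
        actual = flatten(json.loads(actual_json))
    except json.JSONDecodeError:
        return 0.0  # output that is not valid JSON scores zero
    all_paths = set(expected) | set(actual)
    matching = sum(
        1 for p in all_paths
        if p in expected and p in actual and expected[p] == actual[p]
    )
    return matching / len(all_paths)


if __name__ == "__main__":
    expected = '{"name": "Ada", "tags": ["a", "b"]}'
    actual = '{"name": "Ada", "tags": ["a", "c"]}'
    print(json_match_score(expected, actual))
```

Unlike plain string equality, this gives partial credit when a model gets most fields right, and it ignores key order and whitespace differences.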

You:

Create a short tagline for a new cybersecurity software for businesses

gpt-4o:

Protect what matters with advanced cybersecurity

llama-3.2-1B:

Fortify your business with unbeatable cyber defense.

You:

I like the GPT's answer better

Manual Comparison

Easily compare models “side by side”

The Datawizz platform has a convenient UI for sending requests to multiple models (both public models, such as OpenAI's, and custom-trained models) and comparing their responses side by side.
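The same idea can be sketched outside the UI: send one prompt to several models and collect the replies for comparison. Everything below is hypothetical, including the function names and the use of plain callables in place of real model clients; it does not reflect the Datawizz API.

```python
from typing import Callable, Dict


def compare_side_by_side(prompt: str,
                         models: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Send the same prompt to every model and collect replies by model name."""
    responses = {}
    for name, generate in models.items():
        try:
            responses[name] = generate(prompt)
        except Exception as error:
            # One failing model should not hide the other models' answers.
            responses[name] = f"<error: {error}>"
    return responses


def render(prompt: str, responses: Dict[str, str]) -> str:
    """Format the prompt and each model's reply as a simple transcript."""
    lines = ["You:", prompt, ""]
    for name, text in responses.items():
        lines += [f"{name}:", text, ""]
    return "\n".join(lines).rstrip()


if __name__ == "__main__":
    # Stand-in models; real code would call a public or custom-trained model here.
    models = {
        "model-a": lambda p: "Protect what matters with advanced cybersecurity",
        "model-b": lambda p: "Fortify your business with unbeatable cyber defense.",
    }
    prompt = "Create a short tagline for a new cybersecurity software for businesses"
    print(render(prompt, compare_side_by_side(prompt, models)))
```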

Pricing

How cheap it really is

Pricing depends on the base model you select and applies both to the public model and to any models you train from it.

Base Model       Input Tokens ($/1M)    Output Tokens ($/1M)
Llama 3.2 1B     $0.10                  $0.10
Llama 3.2 3B     $0.15                  $0.15
Phi-3 Mini       $0.15                  $0.15
Command-R 7B     $0.25                  $0.25
Ministral 8B     $0.25                  $0.25
Llama 3.3 70B    $1.20                  $1.20
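As a worked example of the per-token arithmetic, the sketch below estimates the cost of a workload from the listed prices. The prices are copied from the pricing list above; the function name and dictionary are hypothetical, and the estimate ignores any fees or discounts not listed here.

```python
from decimal import Decimal

# (input price, output price) in dollars per 1M tokens, from the pricing list.
PRICE_PER_MILLION_TOKENS = {
    "Llama 3.2 1B": (Decimal("0.10"), Decimal("0.10")),
    "Llama 3.2 3B": (Decimal("0.15"), Decimal("0.15")),
    "Phi-3 Mini": (Decimal("0.15"), Decimal("0.15")),
    "Command-R 7B": (Decimal("0.25"), Decimal("0.25")),
    "Ministral 8B": (Decimal("0.25"), Decimal("0.25")),
    "Llama 3.3 70B": (Decimal("1.20"), Decimal("1.20")),
}


def estimate_cost(base_model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost in dollars: tokens times price per 1M tokens, divided by 1M."""
    input_price, output_price = PRICE_PER_MILLION_TOKENS[base_model]
    total = input_tokens * input_price + output_tokens * output_price
    return total / Decimal(1_000_000)


if __name__ == "__main__":
    # 2M input tokens and 500k output tokens on Llama 3.3 70B.
    print(estimate_cost("Llama 3.3 70B", 2_000_000, 500_000))
```

For that workload the estimate is (2,000,000 × $1.20 + 500,000 × $1.20) / 1,000,000 = $3.00. Decimal is used instead of float so that small per-token prices do not accumulate rounding error.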