Serverless Model Deployment

Model Evaluation

Evaluate the performance of multiple models on your real data, using several evaluation metrics.


Automated Evaluation

Evaluate AI models for better results

Run automatic benchmarks on various models using your data to find the right model for every job
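As a rough illustration of what an automatic benchmark does, the sketch below scores several models on the same dataset and picks the one with the highest average score. Everything in it is an assumption made for illustration: `Model`, `Metric`, `exact_match`, `benchmark`, and `best_model` are hypothetical names, not part of the Datawizz API.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a "model" is any callable from prompt text to output text,
# and a "metric" maps (prediction, reference) to a score between 0 and 1.
Model = Callable[[str], str]
Metric = Callable[[str, str], float]


def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when the trimmed strings are identical, else 0.0."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def benchmark(
    models: Dict[str, Model],
    dataset: List[Tuple[str, str]],
    metric: Metric,
) -> Dict[str, float]:
    """Return the average metric score of each model over (prompt, reference) pairs."""
    return {
        name: mean(metric(model(prompt), reference) for prompt, reference in dataset)
        for name, model in models.items()
    }


def best_model(scores: Dict[str, float]) -> str:
    """Return the name of the model with the highest average score."""
    return max(scores, key=scores.get)
```

In practice each callable would wrap a request to a deployed model; here any function from string to string works, which keeps the sketch easy to test.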

[Chart: Avg Score (axis 20% to 100%) for Llama-3.2-1B-Trained and meta-llama/Llama-3.2-1B-Instruct; values shown: 68% and 35%]


Evaluation Functions

Sophisticated Benchmark Functions

Datawizz offers multiple evaluation metrics, such as ROUGE, BERTScore, and string matching, for maximum flexibility

WER
BLEU
CER
ROUGE
chrF
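To make two of the listed metrics concrete, here is a minimal sketch of word error rate (WER) and character error rate (CER), both computed as Levenshtein edit distance divided by the reference length. This is the standard textbook definition, not necessarily how Datawizz computes these scores; the function names are chosen for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Lower is better for both metrics, and either can exceed 1.0 when the hypothesis is much longer than the reference.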

Only in Datawizz

LLM as Judge

Datawizz lets you use LLMs to judge outputs from candidate models, perfect for unique cases with complex evaluation
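One common way to set up an LLM judge is a pairwise comparison: show the judge the task and two candidate answers, then parse a one-letter verdict. The sketch below assumes that pattern; the prompt wording, `build_judge_prompt`, `parse_verdict`, and `judge_pair` are hypothetical, and the judge is passed in as a plain callable so no particular provider API is implied.

```python
from typing import Callable

# Hypothetical prompt template for a pairwise judge.
JUDGE_TEMPLATE = """You are grading two answers to the same task.
Task: {task}
Answer A: {a}
Answer B: {b}
Reply with exactly one letter: A or B."""


def build_judge_prompt(task: str, answer_a: str, answer_b: str) -> str:
    """Fill the template with the task and the two candidate answers."""
    return JUDGE_TEMPLATE.format(task=task, a=answer_a, b=answer_b)


def parse_verdict(reply: str) -> str:
    """Extract 'A' or 'B' from the judge reply; reject anything else."""
    verdict = reply.strip().upper()[:1]
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected judge reply: {reply!r}")
    return verdict


def judge_pair(judge: Callable[[str], str], task: str,
               answer_a: str, answer_b: str) -> str:
    """Ask the judge (any callable from prompt to reply) which answer wins."""
    return parse_verdict(judge(build_judge_prompt(task, answer_a, answer_b)))
```

A real setup would usually also swap the A/B order on a second call to reduce position bias; that step is left out to keep the sketch short.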

Only in Datawizz

JSON Evaluation

Unique comparison for JSON objects and lists, perfect for evaluating data extraction, structured output and function calling use cases
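To show why JSON outputs benefit from a dedicated comparison rather than string matching, the sketch below flattens two parsed JSON values into leaf fields and reports the fraction that agree, so key order does not matter and a partially correct extraction gets partial credit. This is an illustrative approach under our own assumptions, not the Datawizz implementation; `flatten` and `json_score` are hypothetical names.

```python
from typing import Any, Iterator, Tuple


def flatten(value: Any, path: Tuple = ()) -> Iterator[Tuple[Tuple, Any]]:
    """Yield (path, leaf) pairs for every leaf in a parsed JSON value."""
    if isinstance(value, dict) and value:
        for key in value:
            yield from flatten(value[key], path + (key,))
    elif isinstance(value, list) and value:
        for index, item in enumerate(value):
            yield from flatten(item, path + (index,))
    else:
        # Scalars, plus empty dicts and lists, count as leaves.
        yield path, value


def json_score(expected: Any, actual: Any) -> float:
    """Fraction of leaf paths (from either side) whose values match exactly."""
    exp = dict(flatten(expected))
    act = dict(flatten(actual))
    all_paths = set(exp) | set(act)
    matches = sum(
        1 for p in all_paths if p in exp and p in act and exp[p] == act[p]
    )
    return matches / len(all_paths)
```

Lists are compared position by position here; a use case where list order is irrelevant would need a matching step before scoring.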

String Equality
SacreBLEU
METEOR
BERTScore

You:

Create a short tagline for a new cybersecurity software for businesses

gpt-4o:

Protect what matters with advanced cybersecurity

llama-3.2-1B:

Fortify your business with unbeatable cyber defense.

You:

I like the GPT's answer better


Manual Comparison

Easily compare models “side by side”

The Datawizz platform has a convenient UI for sending requests to multiple models (both public models, such as OpenAI's, and custom-trained models) and comparing their responses side by side

Pricing

How cheap it really is

Pricing depends on the base model you select and applies both to the public model and to any models you train from it.

Base Model       Tokens ($/1M)
Llama 3.2 1B     $0.10
Llama 3.2 3B     $0.15
Phi-3 Mini       $0.15
Command-R 7B     $0.25
Ministral 8B     $0.25
Llama 3.3 70B    $1.20
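For a quick cost estimate from the prices listed above, the sketch below multiplies a token count by the per-million rate. It assumes the single listed price applies to every token billed for that base model; the page does not show separate input and output figures, so treat that as an assumption. `PRICE_PER_MILLION` and `estimate_cost` are hypothetical names.

```python
from decimal import Decimal

# Prices in USD per 1M tokens, copied from the table above.
# Assumption: one rate per base model covers all billed tokens.
PRICE_PER_MILLION = {
    "Llama 3.2 1B": Decimal("0.10"),
    "Llama 3.2 3B": Decimal("0.15"),
    "Phi-3 Mini": Decimal("0.15"),
    "Command-R 7B": Decimal("0.25"),
    "Ministral 8B": Decimal("0.25"),
    "Llama 3.3 70B": Decimal("1.20"),
}


def estimate_cost(base_model: str, tokens: int) -> Decimal:
    """Return the estimated cost in USD for the given number of tokens."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return PRICE_PER_MILLION[base_model] * Decimal(tokens) / Decimal(1_000_000)
```

For example, 2,500,000 tokens on Llama 3.2 1B come to 2.5 × $0.10 = $0.25 under this assumption.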

