1. Introduction
When you fine-tune a Specialized Language Model (SLM) from a teacher LLM, you often want to compare the SLM’s output directly with the teacher’s. This can be for:
Distillation: ensuring the SLM mimics the teacher’s responses.
Performance Consistency: verifying you haven’t lost crucial domain knowledge.
Style & Quality: checking if the SLM meets domain or style constraints the teacher had.
Below, we survey essential NLP evaluation metrics—both classic (e.g., BLEU, ROUGE) and embedding-based (e.g., BERTScore)—and show how to treat the teacher’s output as the “reference.” We also discuss optional data cleansing and how to use an LLM as a judge for more nuanced or human-like feedback. Finally, we provide seven default prompt templates you can use as a starting point for LLM-based evaluations.
2. Types of Evaluation Metrics
2.1 Token-Overlap Metrics
BLEU - Compares candidate text to reference(s) via overlapping n-grams (1-4). Common in machine translation.
Pros - Quick, widely used, historically recognized.
Cons - Overly literal—penalizes synonyms, paraphrasing.
ROUGE - Standard for summarization (ROUGE-1, ROUGE-2, ROUGE-L). Measures n-gram and subsequence overlap.
Pros - Commonly used in summarization; easy to compute.
Cons - Similar pitfalls as BLEU regarding paraphrasing.
CHRF - Uses character n-grams; robust for morphological variations or languages without clear word boundaries.
Pros - More tolerant of word segmentation differences.
Cons - Still an n-gram approach, can miss deeper semantic changes.
Overall:
Pros: Quick to run, well-known in research and industry.
Cons: Can be too literal, ignoring synonyms or rearranged wording.
2.2 Embedding-Based Metrics
BERTScore - Uses contextual embeddings (e.g., BERT) to measure semantic similarity between candidate & reference.
Pros - Captures paraphrasing, synonyms better than pure n-gram metrics.
Cons - Depends on embedding model quality; may be slower to compute.
BLEURT, MoverScore (community-maintained) - Other semantic approaches that combine embeddings with learned models or distance measures.
Pros - Often correlate better with human judgments.
Cons - Typically require extra packages or more complex setup.
2.3 Task-Specific Metrics
Accuracy / F1: For classification tasks (e.g., sentiment, topic identification).
Factual Consistency: If you want to check correctness in medical/legal domains (requires knowledge sources or specialized checkers).
Word Error Rate (WER) / CER: For speech recognition or text fidelity tasks.
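As a rough sketch of how these might be computed, the snippet below uses the Hugging Face evaluate library (also used in Section 4); the label IDs and transcripts are made-up placeholders:

```python
import evaluate  # pip install evaluate (the "wer" metric additionally needs: pip install jiwer)

# Classification: accuracy and macro-F1 over placeholder label IDs
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
preds, refs = [0, 1, 1, 2], [0, 1, 0, 2]
print(accuracy.compute(predictions=preds, references=refs))             # {'accuracy': 0.75}
print(f1.compute(predictions=preds, references=refs, average="macro"))

# Text fidelity: word error rate over placeholder transcripts
wer = evaluate.load("wer")
print(wer.compute(predictions=["the cat sat on the mat"],
                  references=["the cat sat on a mat"]))                 # 1 substitution / 6 reference words
```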
3. Optional Data Cleansing & Preprocessing
While many models and metrics handle raw text, you may consider preprocessing:
Trimming & Lowercasing: Remove extra spaces, unify case.
HTML/Markdown Removal: If outputs contain formatting tags (e.g., `<p>`, `**bold**`).
Stemming / Lemmatization: If using BLEU/ROUGE and you expect frequent morphological variations (e.g., running vs. ran).
Stopword Removal: Only if you want to emphasize content words.
Caution: Embedding-based metrics (BERTScore) and LLM-based judging typically work best with unmodified or lightly cleaned text, to preserve the full context and semantics.
Sample Code
Note:
For embedding-based or LLM-based methods, you’d often want minimal alteration (perhaps just removing HTML and normalizing whitespace), because context is important.
For BLEU/ROUGE, consistent cleaning of references & candidates can help avoid false mismatches.
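Here is a minimal preprocessing sketch along those lines. It assumes simple regex-based tag stripping plus NLTK’s Porter stemmer; adjust the rules to your own data, and reserve the heavier cleaning for BLEU/ROUGE:

```python
import re
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()

def light_clean(text: str) -> str:
    """Light cleaning (embedding-/LLM-based metrics): strip tags, normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags like <p>
    text = re.sub(r"[*_`#]+", "", text)    # drop common Markdown markers like **bold**
    return re.sub(r"\s+", " ", text).strip()

def heavy_clean(text: str) -> str:
    """Heavier cleaning (BLEU/ROUGE): lowercase and stem every token."""
    tokens = light_clean(text).lower().split()
    return " ".join(stemmer.stem(tok) for tok in tokens)

print(light_clean("<p>The model **is running** quickly.</p>"))  # tags and markers removed
print(heavy_clean("<p>The model **is running** quickly.</p>"))  # lowercased, "running" -> "run"
```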
4. Using the Teacher's Output as “Reference”
Rationale
If you lack official ground-truth references, the teacher’s output can be your “gold” text.
This is common in distillation scenarios: you want to see if the SLM can produce outputs similar to the teacher.
Hugging Face Evaluate Examples
Hugging Face Evaluate offers quick metric computation for BLEU, ROUGE, BERTScore, etc.
BLEU & ROUGE Example
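A minimal sketch using Hugging Face Evaluate, with the teacher’s output passed as the reference (the two strings below are placeholders; the "rouge" metric additionally requires the rouge_score package):

```python
import evaluate  # pip install evaluate rouge_score

teacher_outputs = ["The patient should take the medication twice daily with food."]
slm_outputs     = ["Take the medication two times a day together with meals."]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# BLEU accepts a list of reference texts per prediction
bleu_result = bleu.compute(predictions=slm_outputs,
                           references=[[t] for t in teacher_outputs])
rouge_result = rouge.compute(predictions=slm_outputs,
                             references=teacher_outputs)

print("BLEU:", bleu_result["bleu"])
print("ROUGE-L:", rouge_result["rougeL"])
```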
Interpretation:
If BLEU/ROUGE is high, the SLM’s output lexically resembles the teacher’s.
However, low BLEU/ROUGE might still be acceptable if the SLM is paraphrasing (check with an embedding-based or LLM approach).
BERTScore Example
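A matching sketch for BERTScore; it assumes the bert_score backend is installed, and `lang="en"` selects a default English model (pass `model_type=...` to choose a specific one):

```python
import evaluate  # pip install evaluate bert_score

teacher_outputs = ["The patient should take the medication twice daily with food."]
slm_outputs     = ["Take the medication two times a day together with meals."]

bertscore = evaluate.load("bertscore")
result = bertscore.compute(predictions=slm_outputs,
                           references=teacher_outputs,
                           lang="en")

# precision / recall / f1 are returned as per-example lists
print("BERTScore F1:", sum(result["f1"]) / len(result["f1"]))
```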
Interpretation:
A higher BERTScore suggests the SLM output is semantically close to the teacher’s text, even if wording differs.
5. LLM as a Judge
Why: Classic metrics can’t always capture factual correctness, stylistic preferences, or domain-specific requirements. An LLM-based evaluation can offer human-like judgments.
Pros:
Captures nuances beyond string overlap.
Flexible: can incorporate domain instructions or style guidelines.
Cons:
Cost: LLM calls can be expensive at scale.
Prompt Sensitivity: Different prompts can yield different results.
Bias: The LLM might not always reflect a “universal standard” of correctness.
5.1 Seven Default Prompt Templates
Below are seven sample prompts you can adapt as starter templates for your manual LLM-judge approach. Each covers a different style or scenario of evaluation.
1. Numeric Scoring (Simple)
Goal: Get a single numeric score (1–10) plus a short explanation.
Use Cases: Quick overall quality check (summarization, translation, short answers).
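One possible phrasing, wired into a judge call. This sketch assumes the OpenAI Python SDK, a gpt-4o judge, and an OPENAI_API_KEY in the environment; swap in whatever LLM client you actually use, and treat the prompt wording as a starting point only:

```python
from openai import OpenAI  # pip install openai

NUMERIC_SCORING_PROMPT = """You are an impartial evaluator.

Reference text (teacher output):
{reference}

Candidate text (SLM output):
{candidate}

Rate how well the candidate preserves the meaning and quality of the reference,
on a scale from 1 (very poor) to 10 (excellent). Reply with the numeric score on
the first line, followed by a one-sentence justification."""

def judge_numeric(reference: str, candidate: str, model: str = "gpt-4o") -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep judging as deterministic as possible
        messages=[{"role": "user",
                   "content": NUMERIC_SCORING_PROMPT.format(reference=reference,
                                                            candidate=candidate)}],
    )
    return response.choices[0].message.content

print(judge_numeric("The patient should take the medication twice daily with food.",
                    "Take the medication two times a day together with meals."))
```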
2. Pairwise Comparison (Teacher vs. SLM)
Goal: Judge which of two texts (teacher or SLM) is better, or if they’re equivalent.
Use Cases: Distillation tasks; see if the SLM’s output surpasses or matches the teacher’s.
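A possible starting template (the curly-brace fields are placeholders you fill in programmatically, as in the sketch above):

```
You will see two responses to the same task. Judge them impartially.

Task / prompt:
{task}

Response A:
{teacher_output}

Response B:
{slm_output}

Which response is better overall, considering correctness, completeness, and clarity?
Answer with exactly one of "A", "B", or "TIE", then give a brief justification.
```

To reduce position bias, consider randomizing which model appears as A or B and averaging over both orderings.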
3. Multi-Dimensional Scoring
Goal: Collect scores for several criteria (Fluency, Relevance, Factual Correctness).
Use Cases: More granular evaluation where multiple dimensions matter.
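A possible starting template, using a 1-5 scale per criterion (adjust criteria and scale to your domain):

```
Evaluate the candidate text against the reference text.

Reference:
{reference}

Candidate:
{candidate}

Score each criterion from 1 to 5:
- Fluency: is the candidate grammatical and natural?
- Relevance: does it cover the same content as the reference?
- Factual Correctness: does it avoid contradicting the reference?

Reply in the form "Fluency: X, Relevance: X, Factual Correctness: X",
then add one sentence explaining the lowest score.
```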
4. Summarization Check (for Summaries)
Goal: Evaluate how well a candidate summary captures the main points of a reference document.
Use Cases: Summarizing lengthy text, ensuring key info is preserved.
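A possible starting template:

```
You are evaluating a summary.

Source document (or teacher reference summary):
{reference}

Candidate summary (SLM):
{candidate}

1. List the main points of the source that the candidate captures.
2. List any important points that are missing or distorted.
3. Give an overall coverage score from 1 (misses most key points) to 10 (captures all key points).
```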
5. Dialogue Coherence (Multi-Turn)
Goal: Evaluate coherence in a multi-turn conversation.
Use Cases: Chatbots, Q&A sessions, multi-turn dialogues.
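A possible starting template:

```
You are evaluating a multi-turn conversation between a user and an assistant.

Conversation so far:
{conversation_history}

Latest assistant reply (SLM):
{candidate_reply}

Rate from 1 to 10 how coherent the latest reply is with the conversation so far:
does it address the user's last message, stay on topic, and avoid contradicting
earlier turns? Give the score, then a brief explanation.
```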
6. Fact-Checking Prompt
Goal: Specifically test for factual correctness if the reference text is considered ground truth.
Use Cases: Domain-specific tasks (medical, legal), knowledge-intensive checks.
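A possible starting template, treating the reference as ground truth:

```
Treat the reference text below as ground truth.

Reference (ground truth):
{reference}

Candidate (SLM output):
{candidate}

List every claim in the candidate that is contradicted by, or unsupported by, the reference.
If there are none, write "NO FACTUAL ERRORS".
End with a verdict: CONSISTENT, MINOR ISSUES, or INCONSISTENT.
```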
7. Style & Tone Alignment
Goal: Evaluate if the candidate text matches a specific style or brand voice.
Use Cases: Marketing copy, creative writing, brand consistency.
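A possible starting template:

```
You are checking style and tone alignment.

Target style (style guide excerpt or an example of the desired voice):
{style_reference}

Candidate text (SLM):
{candidate}

Rate from 1 to 10 how well the candidate matches the target voice (formality,
vocabulary, sentence rhythm, persona). Give the score, then quote the phrases
that deviate most from the target style.
```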
6. Conclusion
When comparing an SLM to a teacher LLM:
Token-overlap metrics (BLEU, ROUGE, CHRF) provide quick, historically recognized checks on textual similarity—but can be too literal.
Embedding-based metrics (BERTScore, BLEURT, etc.) capture paraphrases and semantic overlap.
Task-specific metrics (Accuracy, F1, WER) address classification or speech tasks.
Optional data cleansing (lemmatization, HTML removal) standardizes text if needed.
LLM-as-a-judge adds a qualitative, human-like perspective, capturing style or domain correctness that numeric metrics might overlook.
By treating the teacher’s output as the reference, you measure how closely your SLM follows or distills the teacher’s knowledge and style. For deeper qualitative checks, use the seven default prompt templates above as a jumping-off point, customizing them to your domain or scoring preferences.
Combining overlap-based, embedding-based, and LLM-based methods offers the fullest picture of your SLM’s performance and alignment with the teacher’s outputs.
Further Reading & Resources
Hugging Face Evaluate: GitHub Repo (https://github.com/huggingface/evaluate)
BERTScore Paper: arXiv:1904.09675 (https://arxiv.org/abs/1904.09675)
NLTK Stemming & Lemmatization: NLTK Docs (https://www.nltk.org/)
Author’s Note:
If you need more guidance on SLM fine-tuning or evaluation best practices, feel free to reach out. We specialize in building robust pipelines for domain-specific model development and assessment.
Happy Evaluating!