Evaluating a Specialized Language Model (SLM) Against Its Teacher

Feb 4, 2025

12 min read

1. Introduction

When you fine-tune a Specialized Language Model (SLM) from a teacher LLM, you often want to compare the SLM’s output directly with the teacher’s. This can be for:

  • Distillation: ensuring the SLM mimics the teacher’s responses.

  • Performance Consistency: verifying you haven’t lost crucial domain knowledge.

  • Style & Quality: checking if the SLM meets domain or style constraints the teacher had.

Below, we survey essential NLP evaluation metrics—both classic (e.g., BLEU, ROUGE) and embedding-based (e.g., BERTScore)—and show how to treat the teacher’s output as the “reference.” We also discuss optional data cleansing and how to use an LLM as a judge for more nuanced or human-like feedback. Finally, we provide seven default prompt templates you can use as a starting point for LLM-based evaluations.

2. Types of Evaluation Metrics

2.1 Token-Overlap Metrics

  • BLEU - Compares candidate text to reference(s) via overlapping n-grams (1-4). Common in machine translation.

    • Pros - Quick, widely used, historically recognized.

    • Cons - Overly literal—penalizes synonyms, paraphrasing.

  • ROUGE - Standard for summarization (ROUGE-1, ROUGE-2, ROUGE-L). Measures n-gram and subsequence overlap.

    • Pros - Commonly used in summarization; easy to compute.

    • Cons - Similar pitfalls as BLEU regarding paraphrasing.

  • CHRF - Uses character n-grams; robust for morphological variations or languages without clear word boundaries (see the short code example at the end of this subsection).

    • Pros - More tolerant of word segmentation differences.

    • Cons - Still an n-gram approach, can miss deeper semantic changes.

Overall:

  • Pros: Quick to run, well-known in research and industry.

  • Cons: Can be too literal, ignoring synonyms or rearranged wording.
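
Since CHRF is not demonstrated again in the later sections, here is a minimal sketch using Hugging Face Evaluate (the chrf metric is backed by the sacrebleu package, which must be installed):

import evaluate

chrf_metric = evaluate.load("chrf")   # character n-gram F-score; requires sacrebleu

chrf_result = chrf_metric.compute(
    predictions=["Paris is France’s main city."],
    references=[["Paris is the capital of France."]],   # one list of reference texts per prediction
)
print("CHRF score:", chrf_result["score"])   # 0-100; higher means more character n-gram overlap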

2.2 Embedding-Based Metrics

  • BERTScore - Uses contextual embeddings (e.g., BERT) to measure semantic similarity between candidate & reference.

    • Pros - Captures paraphrasing, synonyms better than pure n-gram metrics.

    • Cons - Depends on embedding model quality; may be slower to compute.

  • BLEURT, MoverScore (community-maintained) - Other semantic approaches that mix embeddings with more advanced distance measures.

    • Pros - Often correlate better with human judgments.

    • Cons - Typically require extra packages or more complex setup.

2.3 Task-Specific Metrics

  • Accuracy / F1: For classification tasks (e.g., sentiment, topic identification).

  • Factual Consistency: If you want to check correctness in medical/legal domains (requires knowledge sources or specialized checkers).

  • Word Error Rate (WER) / CER: For speech recognition or text fidelity tasks (see the example below).
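
As a quick sketch of these task-specific metrics with Hugging Face Evaluate (the f1 metric depends on scikit-learn and wer on jiwer; the labels and strings below are toy examples):

import evaluate

# Classification-style agreement (e.g., SLM vs. teacher labels)
f1_metric = evaluate.load("f1")            # requires scikit-learn
f1_result = f1_metric.compute(predictions=[1, 0, 1], references=[1, 1, 1])
print("F1:", f1_result["f1"])              # 0.8 for this toy example

# Word Error Rate for transcription / text-fidelity checks
wer_metric = evaluate.load("wer")          # requires jiwer
wer = wer_metric.compute(
    predictions=["the cat sat on the mat"],
    references=["the cat sat on a mat"],
)
print("WER:", wer)                         # ~0.17 (1 substitution out of 6 reference words)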

3. Optional Data Cleansing & Preprocessing

While many models and metrics handle raw text, you may consider preprocessing:

  1. Trimming & Lowercasing: Remove extra spaces, unify case.

  2. HTML/Markdown Removal: If outputs contain formatting tags (e.g., <p>, **bold**).

  3. Stemming / Lemmatization: If using BLEU/ROUGE and you expect frequent morphological variations (e.g., running vs. ran).

  4. Stopword Removal: Only if you want to emphasize content words.

Caution: Embedding-based metrics (BERTScore) and LLM-based judging typically work best with unmodified or lightly cleaned text, to preserve the full context and semantics.

Sample Code

import re
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')        # Needed for word_tokenize
nltk.download('punkt_tab')    # Needed by newer NLTK versions for word_tokenize
nltk.download('wordnet')      # Needed for WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def cleanse_text(text):
    # 1. Remove HTML tags or Markdown if present
    text = re.sub(r"<[^>]+>", "", text)          # strip HTML
    text = re.sub(r"[*_`~]", "", text)           # simplistic markdown removal

    # 2. Convert to lowercase and trim whitespace
    text = text.lower().strip()

    # 3. Tokenize
    tokens = nltk.word_tokenize(text)

    # 4. Optional: Lemmatize (WordNetLemmatizer defaults to noun POS, so verbs like "were" stay unchanged)
    lemmas = [lemmatizer.lemmatize(tok) for tok in tokens]

    # Combine back to string
    return " ".join(lemmas)

# Example usage
raw_example = "The cats were <b>RUNNING</b> in the **house**."
cleaned_example = cleanse_text(raw_example)
print("Cleaned Text:", cleaned_example)
# e.g., "the cat were running in the house ."

Note:

  • For embedding-based or LLM-based methods, often you’d want minimal alteration (maybe just removing HTML and normalizing whitespace) because context is important.

  • For BLEU/ROUGE, consistent cleaning of references & candidates can help avoid false mismatches.

4. Using the Teacher's Output as “Reference”

Rationale

  • If you lack official ground-truth references, the teacher’s output can be your “gold” text.

  • This is common in distillation scenarios: you want to see if the SLM can produce outputs similar to the teacher.

Hugging Face Evaluate Examples

Hugging Face Evaluate offers quick metric computation for BLEU, ROUGE, BERTScore, etc.

BLEU & ROUGE Example

import evaluate

teacher_outputs = [
    "Paris is the capital of France.",
    "Cats are commonly kept as pets."
]
slm_outputs = [
    "Paris is France’s main city.",
    "Cats are small domestic animals."
]

bleu_metric = evaluate.load("bleu")
rouge_metric = evaluate.load("rouge")

bleu_result = bleu_metric.compute(predictions=slm_outputs, references=teacher_outputs)
rouge_result = rouge_metric.compute(predictions=slm_outputs, references=teacher_outputs)

print("SLM vs Teacher - BLEU:", bleu_result["bleu"])
print("SLM vs Teacher - ROUGE:", rouge_result)

Interpretation:

  • If BLEU/ROUGE is high, the SLM’s output lexically resembles the teacher’s.

  • However, low BLEU/ROUGE might still be acceptable if the SLM is paraphrasing (check with an embedding-based or LLM approach).

BERTScore Example

bs_metric = evaluate.load("bertscore")
bs_result = bs_metric.compute(
    predictions=slm_outputs,
    references=teacher_outputs,
    model_type="bert-base-uncased"
)

print("SLM vs Teacher - BERTScore F1:", bs_result["f1"])

Interpretation:

  • A higher BERTScore suggests the SLM output is semantically close to the teacher’s text, even if wording differs.

5. LLM as a Judge

Why: Classic metrics can’t always capture factual correctness, stylistic preferences, or domain-specific requirements. An LLM-based evaluation can offer human-like judgments.

import os
from openai import OpenAI

# Uses the openai>=1.0 client interface
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def llm_judge(teacher_text, slm_text):
    prompt = f"""
    You are an impartial text quality evaluator.

    Teacher's text: {teacher_text}
    SLM's text: {slm_text}

    Compare them for accuracy, completeness, style, or any major differences.
    Provide a short summary or verdict.
    """
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0
    )
    return response.choices[0].message.content

# Example usage
teacher_out = "Paris is the capital of France."
slm_out = "Paris is the largest city in France, popular for art."

verdict = llm_judge(teacher_out, slm_out)
print(verdict)

Pros:

  • Captures nuances beyond string overlap.

  • Flexible: can incorporate domain instructions or style guidelines.

Cons:

  • Cost: LLM calls can be expensive at scale.

  • Prompt Sensitivity: Different prompts can yield different results.

  • Bias: The LLM might not always reflect a “universal standard” of correctness.

5.1 Seven Default Prompt Templates

Below are seven sample prompts you can adapt as starter templates for your manual LLM-judge approach. Each covers a different style or scenario of evaluation.

1. Numeric Scoring (Simple)

Goal: Get a single numeric score (1–10) plus a short explanation.

Use Cases: Quick overall quality check (summarization, translation, short answers).
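
A possible starter version (the placeholders and the 1-10 scale are only suggestions):

System: You are an impartial text quality evaluator.

User:
Reference: {reference_text}
Candidate: {candidate_text}

Rate the overall quality of the candidate text relative to the reference on a scale from 1 to 10, then give a one-sentence explanation.

Answer Format:
Score: X
Explanation: ...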


2. Pairwise Comparison (Teacher vs. SLM)

Goal: Judge which of two texts (teacher or SLM) is better, or if they’re equivalent.

Use Cases: Distillation tasks; see if the SLM’s output surpasses or matches the teacher’s.
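
A possible starter version (placeholders are illustrative):

System: You are an impartial judge comparing two candidate texts.

User:
Text A (Teacher): {teacher_text}
Text B (SLM): {slm_text}

Decide which text is better overall, or whether they are equivalent, and briefly justify your choice.

Answer Format:
Winner: A | B | Tie
Reason: ...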


3. Multi-Dimensional Scoring

Goal: Collect scores for several criteria (Fluency, Relevance, Factual Correctness).

Use Cases: More granular evaluation where multiple dimensions matter.
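
A possible starter version (the dimensions and scale can be adapted to your domain):

System: You are a detailed text evaluator.

User:
Reference: {reference_text}
Candidate: {candidate_text}

Score the candidate from 1 to 5 on each dimension: Fluency, Relevance, Factual Correctness. Add one short comment per dimension.

Answer Format:
Fluency: X
Relevance: X
Factual Correctness: X
Comments: ...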


4. Summarization Check (for Summaries)

Goal: Evaluate how well a candidate summary captures the main points of a reference document.

Use Cases: Summarizing lengthy text, ensuring key info is preserved.
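
A possible starter version (placeholders are illustrative):

System: You are an expert at evaluating summaries.

User:
Source Document: {source_text}
Candidate Summary: {candidate_summary}

List any key points from the source that the summary misses or distorts, then rate its coverage from 1 to 10.

Answer Format:
Coverage Score: X
Missing/Distorted Points: [...]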


5. Dialogue Coherence (Multi-Turn)

Goal: Evaluate coherence in a multi-turn conversation.

Use Cases: Chatbots, Q&A sessions, multi-turn dialogues.
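
A possible starter version (placeholders are illustrative):

System: You are an evaluator of conversational quality.

User:
Conversation History: {conversation_history}
Latest Response: {candidate_response}

Judge whether the latest response is coherent with the conversation history and stays on topic. Rate coherence from 1 to 10 with a brief explanation.

Answer Format:
Coherence Score: X
Explanation: ...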


6. Fact-Checking Prompt

Goal: Specifically test for factual correctness if the reference text is considered ground truth.

Use Cases: Domain-specific tasks (medical, legal), knowledge-intensive checks.

System: You are a fact-checking AI with deep domain knowledge.

User:
Reference (Facts or Source): {reference_text}
Candidate: {candidate_text}

Identify any factual errors or contradictions in the candidate text.
Provide a short explanation if errors exist.
Finally, rate the candidate text’s accuracy on a scale from 0 to 100.

Answer Format:
Accuracy Score: X
Errors Found: [...]

7. Style & Tone Alignment

Goal: Evaluate if the candidate text matches a specific style or brand voice.

Use Cases: Marketing copy, creative writing, brand consistency.
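
A possible starter version (placeholders are illustrative):

System: You are a brand-voice and style reviewer.

User:
Target Style / Tone: {style_description}
Candidate: {candidate_text}

Assess how well the candidate matches the target style and tone. Note any phrases that break the style, then rate alignment from 1 to 10.

Answer Format:
Style Score: X
Off-Style Phrases: [...]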


6. Conclusion

When comparing an SLM to a teacher LLM:

  1. Token-overlap metrics (BLEU, ROUGE, CHRF) provide quick, historically recognized checks on textual similarity—but can be too literal.

  2. Embedding-based metrics (BERTScore, BLEURT, etc.) capture paraphrases and semantic overlap.

  3. Task-specific metrics (Accuracy, F1, WER) address classification or speech tasks.

  4. Optional data cleansing (lemmatization, HTML removal) standardizes text if needed.

  5. LLM-as-a-judge adds a qualitative, human-like perspective, capturing style or domain correctness numeric metrics might overlook.

By treating the teacher’s output as the reference, you measure how closely your SLM follows or distills the teacher’s knowledge and style. For deeper qualitative checks, use the seven default prompt templates above as a jumping-off point, customizing them to your domain or scoring preferences.

Combining overlap-based, embedding-based, and LLM-based methods offers the fullest picture of your SLM’s performance and alignment with the teacher’s outputs.
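
As one way to tie these together, here is a minimal sketch of a per-example report that reuses the metrics from Section 4 and the llm_judge function from Section 5 (the choice of metrics and the report structure are illustrative, not prescriptive):

import evaluate

rouge_metric = evaluate.load("rouge")
bertscore_metric = evaluate.load("bertscore")

def compare_to_teacher(teacher_text, slm_text):
    report = {}
    # Token overlap (lexical similarity)
    report["rougeL"] = rouge_metric.compute(
        predictions=[slm_text], references=[teacher_text]
    )["rougeL"]
    # Semantic similarity
    report["bertscore_f1"] = bertscore_metric.compute(
        predictions=[slm_text], references=[teacher_text],
        model_type="bert-base-uncased"
    )["f1"][0]
    # Qualitative verdict (LLM as judge, defined in Section 5)
    report["llm_verdict"] = llm_judge(teacher_text, slm_text)
    return report

print(compare_to_teacher(
    "Paris is the capital of France.",
    "Paris is France’s main city."
))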

Author’s Note:

If you need more guidance on SLM fine-tuning or evaluation best practices, feel free to reach out. We specialize in building robust pipelines for domain-specific model development and assessment.

Happy Evaluating!
