1. Introduction
When you fine-tune a Specialized Language Model (SLM) from a teacher LLM, you often want to compare the SLM’s output directly with the teacher’s. This can be for:
Distillation: ensuring the SLM mimics the teacher’s responses.
Performance Consistency: verifying you haven’t lost crucial domain knowledge.
Style & Quality: checking if the SLM meets domain or style constraints the teacher had.
Below, we survey essential NLP evaluation metrics—both classic (e.g., BLEU, ROUGE) and embedding-based (e.g., BERTScore)—and show how to treat the teacher’s output as the “reference.” We also discuss optional data cleansing and how to use an LLM as a judge for more nuanced or human-like feedback. Finally, we provide seven default prompt templates you can use as a starting point for LLM-based evaluations.
2. Types of Evaluation Metrics
2.1 Token-Overlap Metrics
BLEU - Compares candidate text to reference(s) via overlapping n-grams (1-4). Common in machine translation.
Pros - Quick, widely used, historically recognized.
Cons - Overly literal—penalizes synonyms, paraphrasing.
ROUGE - Standard for summarization (ROUGE-1, ROUGE-2, ROUGE-L). Measures n-gram and subsequence overlap.
Pros - Commonly used in summarization; easy to compute.
Cons - Similar pitfalls as BLEU regarding paraphrasing.
CHRF - Uses character n-grams; robust for morphological variations or languages without clear word boundaries.
Pros - More tolerant of word segmentation differences.
Cons - Still an n-gram approach, can miss deeper semantic changes.
Overall:
Pros: Quick to run, well-known in research and industry.
Cons: Can be too literal, ignoring synonyms or rearranged wording.
2.2 Embedding-Based Metrics
BERTScore - Uses contextual embeddings (e.g., BERT) to measure semantic similarity between candidate & reference.
Pros - Captures paraphrasing, synonyms better than pure n-gram metrics.
Cons - Depends on embedding model quality; may be slower to compute.
BLEURT, MoverScore (community-maintained) - Other semantic approaches that combine embeddings with learned models or distance measures.
Pros - Often correlate better with human judgments.
Cons - Typically require extra packages or more complex setup.
2.3 Task-Specific Metrics
Accuracy / F1: For classification tasks (e.g., sentiment, topic identification).
Factual Consistency: If you want to check correctness in medical/legal domains (requires knowledge sources or specialized checkers).
Word Error Rate (WER) / CER: For speech recognition or text fidelity tasks.
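As a rough sketch of how these might be computed, the snippet below uses the Hugging Face evaluate library (also used in Section 4); the label IDs and transcripts are made-up placeholders:

```python
import evaluate  # pip install evaluate (the "wer" metric additionally needs: pip install jiwer)

# Classification: accuracy and macro-F1 over placeholder label IDs
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")
preds, refs = [0, 1, 1, 2], [0, 1, 0, 2]
print(accuracy.compute(predictions=preds, references=refs))             # {'accuracy': 0.75}
print(f1.compute(predictions=preds, references=refs, average="macro"))

# Text fidelity: word error rate over placeholder transcripts
wer = evaluate.load("wer")
print(wer.compute(predictions=["the cat sat on the mat"],
                  references=["the cat sat on a mat"]))                 # 1 substitution / 6 reference words
```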
3. Optional Data Cleansing & Preprocessing
While many models and metrics handle raw text, you may consider preprocessing:
Trimming & Lowercasing: Remove extra spaces, unify case.
HTML/Markdown Removal: If outputs contain formatting tags (e.g., `<p>`, `**bold**`).
Stemming / Lemmatization: If using BLEU/ROUGE and you expect frequent morphological variations (e.g., running vs. ran).
Stopword Removal: Only if you want to emphasize content words.
Caution: Embedding-based metrics (BERTScore) and LLM-based judging typically work best with unmodified or lightly cleaned text, to preserve the full context and semantics.
Sample Code
Note:
For embedding-based or LLM-based methods, you’d often want minimal alteration (perhaps just removing HTML and normalizing whitespace), because context is important.
For BLEU/ROUGE, consistent cleaning of references & candidates can help avoid false mismatches.
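Here is a minimal preprocessing sketch along those lines. It assumes simple regex-based tag stripping plus NLTK’s Porter stemmer; adjust the rules to your own data, and reserve the heavier cleaning for BLEU/ROUGE:

```python
import re
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()

def light_clean(text: str) -> str:
    """Light cleaning (embedding-/LLM-based metrics): strip tags, normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags like <p>
    text = re.sub(r"[*_`#]+", "", text)    # drop common Markdown markers like **bold**
    return re.sub(r"\s+", " ", text).strip()

def heavy_clean(text: str) -> str:
    """Heavier cleaning (BLEU/ROUGE): lowercase and stem every token."""
    tokens = light_clean(text).lower().split()
    return " ".join(stemmer.stem(tok) for tok in tokens)

print(light_clean("<p>The model **is running** quickly.</p>"))  # tags and markers removed
print(heavy_clean("<p>The model **is running** quickly.</p>"))  # lowercased, "running" -> "run"
```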
4. Using the Teacher's Output as “Reference”
Rationale
If you lack official ground-truth references, the teacher’s output can be your “gold” text.
This is common in distillation scenarios: you want to see if the SLM can produce outputs similar to the teacher.
Hugging Face Evaluate Examples
Hugging Face Evaluate offers quick metric computation for BLEU, ROUGE, BERTScore, etc.
BLEU & ROUGE Example
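A minimal sketch using Hugging Face Evaluate, with the teacher’s output passed as the reference (the two strings below are placeholders; the "rouge" metric additionally requires the rouge_score package):

```python
import evaluate  # pip install evaluate rouge_score

teacher_outputs = ["The patient should take the medication twice daily with food."]
slm_outputs     = ["Take the medication two times a day together with meals."]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# BLEU accepts a list of reference texts per prediction
bleu_result = bleu.compute(predictions=slm_outputs,
                           references=[[t] for t in teacher_outputs])
rouge_result = rouge.compute(predictions=slm_outputs,
                             references=teacher_outputs)

print("BLEU:", bleu_result["bleu"])
print("ROUGE-L:", rouge_result["rougeL"])
```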
Interpretation:
If BLEU/ROUGE is high, the SLM’s output lexically resembles the teacher’s.
However, low BLEU/ROUGE might still be acceptable if the SLM is paraphrasing (check with an embedding-based or LLM approach).
BERTScore Example
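A matching sketch for BERTScore; it assumes the bert_score backend is installed, and `lang="en"` selects a default English model (pass `model_type=...` to choose a specific one):

```python
import evaluate  # pip install evaluate bert_score

teacher_outputs = ["The patient should take the medication twice daily with food."]
slm_outputs     = ["Take the medication two times a day together with meals."]

bertscore = evaluate.load("bertscore")
result = bertscore.compute(predictions=slm_outputs,
                           references=teacher_outputs,
                           lang="en")

# precision / recall / f1 are returned as per-example lists
print("BERTScore F1:", sum(result["f1"]) / len(result["f1"]))
```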
Interpretation:
A higher BERTScore suggests the SLM output is semantically close to the teacher’s text, even if wording differs.
5. LLM as a Judge
Why: Classic metrics can’t always capture factual correctness, stylistic preferences, or domain-specific requirements. An LLM-based evaluation can offer human-like judgments.
Pros:
Captures nuances beyond string overlap.
Flexible: can incorporate domain instructions or style guidelines.
Cons:
Cost: LLM calls can be expensive at scale.
Prompt Sensitivity: Different prompts can yield different results.
Bias: The LLM might not always reflect a “universal standard” of correctness.
5.1 Seven Default Prompt Templates
Below are seven sample prompts you can adapt as starter templates for your manual LLM-judge approach. Each covers a different style or scenario of evaluation.
1. Numeric Scoring (Simple)
Goal: Get a single numeric score (1–10) plus a short explanation.
Use Cases: Quick overall quality check (summarization, translation, short answers).
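One possible phrasing, wired into a judge call. This sketch assumes the OpenAI Python SDK, a gpt-4o judge, and an OPENAI_API_KEY in the environment; swap in whatever LLM client you actually use, and treat the prompt wording as a starting point only:

```python
from openai import OpenAI  # pip install openai

NUMERIC_SCORING_PROMPT = """You are an impartial evaluator.

Reference text (teacher output):
{reference}

Candidate text (SLM output):
{candidate}

Rate how well the candidate preserves the meaning and quality of the reference,
on a scale from 1 (very poor) to 10 (excellent). Reply with the numeric score on
the first line, followed by a one-sentence justification."""

def judge_numeric(reference: str, candidate: str, model: str = "gpt-4o") -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep judging as deterministic as possible
        messages=[{"role": "user",
                   "content": NUMERIC_SCORING_PROMPT.format(reference=reference,
                                                            candidate=candidate)}],
    )
    return response.choices[0].message.content

print(judge_numeric("The patient should take the medication twice daily with food.",
                    "Take the medication two times a day together with meals."))
```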
2. Pairwise Comparison (Teacher vs. SLM)
Goal: Judge which of two texts (teacher or SLM) is better, or if they’re equivalent.
Use Cases: Distillation tasks; see if the SLM’s output surpasses or matches the teacher’s.
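A possible starting template (the curly-brace fields are placeholders you fill in programmatically, as in the sketch above):

```
You will see two responses to the same task. Judge them impartially.

Task / prompt:
{task}

Response A:
{teacher_output}

Response B:
{slm_output}

Which response is better overall, considering correctness, completeness, and clarity?
Answer with exactly one of "A", "B", or "TIE", then give a brief justification.
```

To reduce position bias, consider randomizing which model appears as A or B and averaging over both orderings.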
3. Multi-Dimensional Scoring
Goal: Collect scores for several criteria (Fluency, Relevance, Factual Correctness).
Use Cases: More granular evaluation where multiple dimensions matter.
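A possible starting template, using a 1-5 scale per criterion (adjust criteria and scale to your domain):

```
Evaluate the candidate text against the reference text.

Reference:
{reference}

Candidate:
{candidate}

Score each criterion from 1 to 5:
- Fluency: is the candidate grammatical and natural?
- Relevance: does it cover the same content as the reference?
- Factual Correctness: does it avoid contradicting the reference?

Reply in the form "Fluency: X, Relevance: X, Factual Correctness: X",
then add one sentence explaining the lowest score.
```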
4. Summarization Check (for Summaries)
Goal: Evaluate how well a candidate summary captures the main points of a reference document.
Use Cases: Summarizing lengthy text, ensuring key info is preserved.
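A possible starting template:

```
You are evaluating a summary.

Source document (or teacher reference summary):
{reference}

Candidate summary (SLM):
{candidate}

1. List the main points of the source that the candidate captures.
2. List any important points that are missing or distorted.
3. Give an overall coverage score from 1 (misses most key points) to 10 (captures all key points).
```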
5. Dialogue Coherence (Multi-Turn)
Goal: Evaluate coherence in a multi-turn conversation.
Use Cases: Chatbots, Q&A sessions, multi-turn dialogues.
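A possible starting template:

```
You are evaluating a multi-turn conversation between a user and an assistant.

Conversation so far:
{conversation_history}

Latest assistant reply (SLM):
{candidate_reply}

Rate from 1 to 10 how coherent the latest reply is with the conversation so far:
does it address the user's last message, stay on topic, and avoid contradicting
earlier turns? Give the score, then a brief explanation.
```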
6. Fact-Checking Prompt
Goal: Specifically test for factual correctness if the reference text is considered ground truth.
Use Cases: Domain-specific tasks (medical, legal), knowledge-intensive checks.
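A possible starting template, treating the reference as ground truth:

```
Treat the reference text below as ground truth.

Reference (ground truth):
{reference}

Candidate (SLM output):
{candidate}

List every claim in the candidate that is contradicted by, or unsupported by, the reference.
If there are none, write "NO FACTUAL ERRORS".
End with a verdict: CONSISTENT, MINOR ISSUES, or INCONSISTENT.
```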
7. Style & Tone Alignment
Goal: Evaluate if the candidate text matches a specific style or brand voice.
Use Cases: Marketing copy, creative writing, brand consistency.
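A possible starting template:

```
You are checking style and tone alignment.

Target style (style guide excerpt or an example of the desired voice):
{style_reference}

Candidate text (SLM):
{candidate}

Rate from 1 to 10 how well the candidate matches the target voice (formality,
vocabulary, sentence rhythm, persona). Give the score, then quote the phrases
that deviate most from the target style.
```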
6. Conclusion
When comparing an SLM to a teacher LLM:
Token-overlap metrics (BLEU, ROUGE, CHRF) provide quick, historically recognized checks on textual similarity—but can be too literal.
Embedding-based metrics (BERTScore, BLEURT, etc.) capture paraphrases and semantic overlap.
Task-specific metrics (Accuracy, F1, WER) address classification or speech tasks.
Optional data cleansing (lemmatization, HTML removal) standardizes text if needed.
LLM-as-a-judge adds a qualitative, human-like perspective, capturing style or domain correctness that numeric metrics might overlook.
By treating the teacher’s output as the reference, you measure how closely your SLM follows or distills the teacher’s knowledge and style. For deeper qualitative checks, use the seven default prompt templates above as a jumping-off point, customizing them to your domain or scoring preferences.
Combining overlap-based, embedding-based, and LLM-based methods offers the fullest picture of your SLM’s performance and alignment with the teacher’s outputs.
Further Reading & Resources
Hugging Face Evaluate: GitHub Repo (https://github.com/huggingface/evaluate)
BERTScore Paper: arXiv:1904.09675 (https://arxiv.org/abs/1904.09675)
NLTK Stemming & Lemmatization: NLTK Docs (https://www.nltk.org/)
Author’s Note:
If you need more guidance on SLM fine-tuning or evaluation best practices, feel free to reach out. We specialize in building robust pipelines for domain-specific model development and assessment.
Happy Evaluating!