Advances in
Model Distillation
Abstract
Model distillation is a technique for transferring knowledge from a larger, more complex teacher model to a smaller, simpler student model. It enables deploying high-performing models on resource-constrained devices and compressing knowledge from large LLMs into lightweight open-source alternatives.
We evaluate and compare BERT-Large, DistilBERT, and MiniLM on the full SQuAD v1.1 training set (87,599 examples) for extractive question answering, measuring accuracy, inference speed, and efficiency trade-offs to understand how the distilled models perform relative to their teacher.
Methodology
Evaluation Metrics
We use multiple evaluation metrics to capture both exact-match accuracy and semantic quality. BERTScore measures semantic similarity through contextual embeddings, complementing the traditional Exact Match and token-level F1 scores.
- Exact Match (EM) Accuracy
- Token-level F1 Score
- BERTScore F1 (Semantic Quality)
- Inference Time per Question
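Exact Match and token-level F1 follow the standard SQuAD scoring convention: both strings are lowercased and stripped of punctuation and articles before comparison. A minimal sketch of the two metrics (the helper names are ours, not taken from any particular evaluation script):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(ground_truth))

def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token-level precision and recall over normalized answers."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

BERTScore, by contrast, compares contextual embeddings rather than surface tokens, so a paraphrased answer can still score highly even when Exact Match is 0.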
Inference Pipeline
We used Hugging Face's pipeline("question-answering") for standardized inference across all three models. Each model was run on the full SQuAD v1.1 training set, logging predicted answers, confidence scores, and inference time.
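The loop can be sketched as follows, assuming the Hugging Face transformers library; the checkpoint name and the helper function are illustrative, since the exact model IDs used are not listed above:

```python
import time

def run_eval(qa_pipe, examples):
    """Run a question-answering callable over SQuAD examples, logging the
    predicted answer, confidence score, and per-question inference time."""
    logs = []
    for ex in examples:
        start = time.perf_counter()
        result = qa_pipe(question=ex["question"], context=ex["context"])
        elapsed = time.perf_counter() - start
        logs.append({
            "answer": result["answer"],   # predicted answer text
            "score": result["score"],     # confidence derived from the span logits
            "seconds": elapsed,           # wall-clock inference time
        })
    return logs

# With transformers installed, each model is evaluated the same way, e.g.:
#   from transformers import pipeline
#   qa = pipeline("question-answering",
#                 model="distilbert-base-cased-distilled-squad")
#   logs = run_eval(qa, squad_examples)
```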
System Pipeline
Performance Results
SQuAD v1.1 · 87,599 Questions
Fig 1. Exact Match and F1 scores on the SQuAD v1.1 training set.
Fig 2. Inference time per question measured during evaluation.
Fig 3. BERTScore F1 comparing predicted answers against ground truth.
Model Comparative Summary
| Model | Parameters | Layers | Distillation Method | EM / F1 | Speedup | Best For |
|---|---|---|---|---|---|---|
| BERT-Large | ~340M | 24 | — (Teacher model) | 83.2% / 92.8% | 1.0× | High-accuracy needs |
| DistilBERT | ~66M | 6 | Soft logits + cosine hidden-state loss (task-agnostic) | 77.4% / 89.0% | ~5.8× | Real-time / low-resource |
| MiniLM | ~33M | 6 | Self-attention distributions + value relations (task-agnostic) | 79.4% / 90.7% | ~4.6× | Best accuracy-speed balance |
BERT-Large: Highest accuracy (EM: 83.2%, F1: 92.8%) but slowest. Best when accuracy is the top priority and compute resources are available.
DistilBERT: Fastest model (~5.8× speedup) and the most efficient in F1 per unit of inference time. Ideal for real-time and low-resource deployment.
MiniLM: Strongest accuracy among the distilled models (EM: 79.4%, F1: 90.7%) with a ~4.6× speedup. Best general-purpose choice.
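The efficiency claim can be sanity-checked with the table's own numbers, taking F1 × speedup as a coarse proxy for quality per unit of inference time (an illustrative calculation, not a metric reported above):

```python
# (F1 score, speedup over BERT-Large) taken from the comparison table.
models = {
    "BERT-Large": (92.8, 1.0),
    "DistilBERT": (89.0, 5.8),
    "MiniLM":     (90.7, 4.6),
}

# F1 multiplied by speedup: a rough "quality per unit time" proxy.
efficiency = {name: f1 * speedup for name, (f1, speedup) in models.items()}
best = max(efficiency, key=efficiency.get)  # DistilBERT leads on this proxy
```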
Dataset Characteristics
SQuAD v1.1 — Stanford Question Answering Dataset · ~18K contexts · ~87K questions · ~87K answer spans
// SQuAD v1.1 Sample Instance
{
  "context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary...",
  "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
  "answers": {
    "text": ["Saint Bernadette Soubirous"],
    "answer_start": [515]
  }
}
// Model outputs logged per question:
// → Predicted answer text
// → Confidence score (from logits)
// → Inference time (seconds)

Future Work
Expand Model Coverage
Benchmark smaller models like TinyBERT and MobileBERT to extend the comparison across a wider range of distillation approaches.
Edge Device Testing
Measure latency and accuracy on real edge devices (e.g., Raspberry Pi, mobile) to validate distillation gains in production-like environments.
Custom Distillation
Apply task-specific distillation from BERT using custom datasets and loss functions for domain-specific QA tasks.