Advances in
Model Distillation
Abstract
Model distillation is a technique for transferring knowledge from a larger, more complex teacher model to a smaller, simpler student model. It enables deploying high-performing models on resource-constrained devices and compressing knowledge from large LLMs into lightweight open-source alternatives.
We evaluate and compare BERT-Large, DistilBERT, and MiniLM on the full SQuAD v1.1 training set (87,599 examples) for extractive question answering, measuring accuracy, inference speed, and efficiency trade-offs to understand how the distilled models perform relative to their teacher.
Methodology
Evaluation Metrics
We use multiple evaluation metrics to capture both exact-match accuracy and semantic quality. BERTScore measures semantic similarity through contextual embeddings, complementing the traditional Exact Match and token-level F1 scores.
- Exact Match (EM) Accuracy
- Token-level F1 Score
- BERTScore F1 (Semantic Quality)
- Inference Time per Question
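Exact Match and token-level F1 follow the standard SQuAD scoring convention: both strings are lowercased and stripped of punctuation and articles before comparison. A minimal sketch of the two metrics (the helper names are ours, not taken from any particular evaluation script):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace (SQuAD convention)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> int:
    """1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(ground_truth))

def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token-level precision and recall over normalized answers."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

BERTScore, by contrast, compares contextual embeddings rather than surface tokens, so a paraphrased answer can still score highly even when Exact Match is 0.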
Inference Pipeline
We used Hugging Face's pipeline("question-answering") for standardized inference across all three models. Each model was run on the full SQuAD v1.1 training set, logging predicted answers, confidence scores, and inference time.
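The loop can be sketched as follows, assuming the Hugging Face transformers library; the checkpoint name and the helper function are illustrative, since the exact model IDs used are not listed above:

```python
import time

def run_eval(qa_pipe, examples):
    """Run a question-answering callable over SQuAD examples, logging the
    predicted answer, confidence score, and per-question inference time."""
    logs = []
    for ex in examples:
        start = time.perf_counter()
        result = qa_pipe(question=ex["question"], context=ex["context"])
        elapsed = time.perf_counter() - start
        logs.append({
            "answer": result["answer"],   # predicted answer text
            "score": result["score"],     # confidence derived from the span logits
            "seconds": elapsed,           # wall-clock inference time
        })
    return logs

# With transformers installed, each model is evaluated the same way, e.g.:
#   from transformers import pipeline
#   qa = pipeline("question-answering",
#                 model="distilbert-base-cased-distilled-squad")
#   logs = run_eval(qa, squad_examples)
```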
System Pipeline
Performance Results
SQuAD v1.1 · 87,599 Questions
Fig 1. Exact Match and F1 scores on the SQuAD v1.1 training set.
Fig 2. Inference time per question measured during evaluation.
Fig 3. BERTScore F1 comparing predicted answers against ground truth.
Model Comparative Summary
| Model | Parameters | Layers | Distillation Method | EM / F1 | Speedup | Best For |
|---|---|---|---|---|---|---|
| BERT-Large | ~340M | 24 | — (Teacher model) | 83.2% / 92.8% | 1.0× | High-accuracy needs |
| DistilBERT | ~66M | 6 | Soft logits + cosine hidden-state loss (task-agnostic) | 77.4% / 89.0% | ~5.8× | Real-time / low-resource |
| MiniLM | ~33M | 6 | Self-attention distributions + value relations (task-agnostic) | 79.4% / 90.7% | ~4.6× | Best accuracy-speed balance |
BERT-Large: Highest accuracy (EM: 83.2%, F1: 92.8%) but slowest. Best when accuracy is the top priority and compute resources are available.
DistilBERT: Fastest model (~5.8× speedup) and the most efficient in F1 per unit of inference time. Ideal for real-time and low-resource deployment.
MiniLM: Strongest accuracy among the distilled models (EM: 79.4%, F1: 90.7%) with a ~4.6× speedup. Best general-purpose choice.
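The efficiency claim can be sanity-checked with the table's own numbers, taking F1 × speedup as a coarse proxy for quality per unit of inference time (an illustrative calculation, not a metric reported above):

```python
# (F1 score, speedup over BERT-Large) taken from the comparison table.
models = {
    "BERT-Large": (92.8, 1.0),
    "DistilBERT": (89.0, 5.8),
    "MiniLM":     (90.7, 4.6),
}

# F1 multiplied by speedup: a rough "quality per unit time" proxy.
efficiency = {name: f1 * speedup for name, (f1, speedup) in models.items()}
best = max(efficiency, key=efficiency.get)  # DistilBERT leads on this proxy
```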
Dataset Characteristics
SQuAD v1.1 — Stanford Question Answering Dataset · ~18K contexts · ~87K questions · ~87K answer spans
// SQuAD v1.1 Sample Instance
{
  "context": "Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary...",
  "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
  "answers": {
    "text": ["Saint Bernadette Soubirous"],
    "answer_start": [515]
  }
}
// Model outputs logged per question:
// → Predicted answer text
// → Confidence score (from logits)
// → Inference time (seconds)

Future Work
Expand Model Coverage
Benchmark smaller models like TinyBERT and MobileBERT to extend the comparison across a wider range of distillation approaches.
Edge Device Testing
Measure latency and accuracy on real edge devices (e.g., Raspberry Pi, mobile) to validate distillation gains in production-like environments.
Custom Distillation
Apply task-specific distillation from BERT using custom datasets and loss functions for domain-specific QA tasks.