Research Presentation

Advances in
Model Distillation

Anshul Agarwal · Richa Gupta · Azusa Ito
6.S964 | Topics in Data Science for Society

Abstract

Model distillation is a technique for transferring knowledge from a larger, more complex teacher model to a smaller, simpler student model. It enables deploying high-performing models on resource-constrained devices and compressing knowledge from large language models into lightweight open-source alternatives.

We evaluate and compare BERT-Large, DistilBERT, and MiniLM on the full SQuAD v1.1 training set (~87,599 examples) for Extractive Question Answering, measuring accuracy, inference speed, and efficiency trade-offs to understand how distilled models perform relative to their teacher.

Methodology

Evaluation Metrics

We use multiple evaluation metrics to capture both exact-match accuracy and semantic quality. BERTScore captures semantic similarity through contextual embeddings, complementing the traditional Exact Match and token-level F1 scores.

  • Exact Match (EM) Accuracy
  • Token-level F1 Score
  • BERTScore F1 (Semantic Quality)
  • Inference Time per Question
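Exact Match and token-level F1 follow the standard SQuAD evaluation convention (lowercase, strip punctuation and articles, compare tokens). A minimal sketch of those two metrics:

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD convention)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, ground_truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(ground_truth))

def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall over normalized answers."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

On a dataset with multiple reference answers per question, the official script takes the maximum score over all references; the sketch above shows the single-reference case.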

Inference Pipeline

We used Hugging Face's pipeline("question-answering") for standardized inference across all three models. Each model was run on the full SQuAD v1.1 training set, logging predicted answers, confidence scores, and inference time.
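The per-question inference loop can be sketched as follows; the DistilBERT SQuAD checkpoint name is one public option, and the BERT-Large and MiniLM runs swap in their own checkpoints:

```python
import time
from transformers import pipeline

# One of the three models under comparison; the other runs substitute
# their own SQuAD-fine-tuned checkpoints.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = ("Architecturally, the school has a Catholic character. Atop the "
           "Main Building's gold dome is a golden statue of the Virgin Mary.")
question = "What sits atop the Main Building's gold dome?"

# Time a single inference and capture the three quantities logged per question.
start = time.perf_counter()
result = qa(question=question, context=context)
elapsed = time.perf_counter() - start

print(result["answer"], result["score"], elapsed)
```

`result` is a dict with `answer`, `score`, `start`, and `end` fields, which maps directly onto the predicted answer, confidence, and span logged in our evaluation.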

Teacher: BERT-Large
Students: DistilBERT, MiniLM

System Pipeline

Step 1 · SQuAD v1.1 (~87K QA pairs)
Step 2 · HF Pipeline (QA inference)
Step 3 · Log outputs (answer, confidence, time)
Step 4 · Evaluate (EM, F1, BERTScore)

Performance Results

SQuAD v1.1 · 87,599 Questions
Accuracy Metrics
BERT-Large · EM: 83.2% · F1: 92.8%
MiniLM · EM: 79.4% · F1: 90.7%
DistilBERT · EM: 77.4% · F1: 89.0%

Fig 1. Exact Match and F1 scores on SQuAD v1.1 training set.

Inference Speed & Efficiency
BERT-Large · ~340M params · 0.048 s/question · baseline
DistilBERT · ~66M params · 0.0083 s/question · ~5.8× faster
MiniLM · ~33M params · ~0.010 s/question · ~4.6× faster

Fig 2. Inference time per question measured during evaluation.

BERTScore F1 · Semantic Quality
BERT-Large · 0.905
MiniLM · 0.8825
DistilBERT · 0.867

Fig 3. BERTScore F1 comparing predicted answers against ground truth.

Model Comparative Summary

Model      | Parameters | Layers | Distillation Method                              | EM / F1       | Speedup | Best For
BERT-Large | ~340M      | 24     | None (teacher)                                   | 83.2% / 92.8% | 1.0×    | High-accuracy needs
DistilBERT | ~66M       | 6      | Logits, hidden states, attention (task-agnostic) | 77.4% / 89.0% | ~5.8×   | Real-time / low-resource
MiniLM     | ~33M       | 6      | Self-attention + value matrices (task-agnostic)  | 79.4% / 90.7% | ~4.6×   | Best accuracy-speed balance
🎯 BERT-Large

Highest accuracy (EM: 83.2%, F1: 92.8%) but slowest. Best when accuracy is the top priority and compute resources are available.

⚡ DistilBERT

Fastest model (~5.8× speedup) and the most efficient in F1 per unit of inference time. Ideal for real-time and low-resource deployment.

⚖️ MiniLM

Strongest accuracy among distilled models (EM: 79.4%, F1: 90.7%) with ~4.6× speedup. Best general-purpose choice.

Dataset Characteristics

SQuAD v1.1 — Stanford Question Answering Dataset · ~18K contexts · ~87K questions · ~87K answer spans

// SQuAD v1.1 Sample Instance
{
  "context": "Architecturally, the school has a Catholic character.
    Atop the Main Building's gold dome is a golden statue
    of the Virgin Mary...",
  "question": "To whom did the Virgin Mary allegedly appear
    in 1858 in Lourdes France?",
  "answers": {
    "text": ["Saint Bernadette Soubirous"],
    "answer_start": [515]
  }
}

// Model outputs logged per question:
// → Predicted answer text
// → Confidence score (from logits)
// → Inference time (seconds)

Future Work

🔬

Expand Model Coverage

Benchmark smaller models like TinyBERT and MobileBERT to extend the comparison across a wider range of distillation approaches.

📱

Edge Device Testing

Measure latency and accuracy on real edge devices (e.g., Raspberry Pi, mobile) to validate distillation gains in production-like environments.

🎯

Custom Distillation

Apply task-specific distillation from BERT using custom datasets and loss functions for domain-specific QA tasks.

MIT 6.S964 Research Group

Department of Electrical Engineering and Computer Science

© 2025 Agarwal, Gupta, & Ito. Academic Research Project.