KD-CIFAR10: Knowledge Distillation Ablation Study
GitHub: umutonuryasar/kd-cifar10 · Stack: PyTorch · ResNet · CIFAR-10
Overview
This project is a controlled ablation study on Knowledge Distillation (KD) for model compression, targeting the ResNet-50 → ResNet-18 pair on CIFAR-10. The central question: under what conditions does KD actually help, and which KD variant — Logit-KD or Feature-KD — yields more reliable gains?
|  | Teacher | Student |
|---|---|---|
| Architecture | ResNet-50 | ResNet-18 |
| Parameters | 23.5M | 11.2M |
| Compression ratio | — | ~2.1× |
Experiment 1 — Baseline Ablation (Standard Architecture)
The first experiment used off-the-shelf ImageNet-pretrained ResNet architectures with no modifications. Both teacher and student retained the standard 7×7 conv stem and MaxPool layer designed for ImageNet’s 224×224 inputs.
Result: KD provided no meaningful improvement over the standalone student baseline. Neither Logit-KD nor Feature-KD outperformed training the student from scratch.
Diagnosis: The standard stem aggressively downsamples CIFAR-10’s 32×32 images before the residual blocks process any useful spatial information. The teacher’s representations are degraded at the source — there is nothing richer to distill.
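A quick way to see this is to push a CIFAR-sized tensor through the stock stem and inspect the spatial dimensions. A minimal sketch, assuming torchvision's ResNet-50 (no weights needed for the shape check):

```python
import torch
from torchvision.models import resnet50

# Trace how the stock ImageNet stem shrinks a CIFAR-10 input
# before the first residual block ever runs.
model = resnet50(weights=None)
x = torch.randn(1, 3, 32, 32)                 # CIFAR-10-sized input
x = model.conv1(x)                            # 7x7 conv, stride 2 -> 16x16
print(x.shape)                                # torch.Size([1, 64, 16, 16])
x = model.maxpool(model.relu(model.bn1(x)))   # 3x3 maxpool, stride 2 -> 8x8
print(x.shape)                                # torch.Size([1, 64, 8, 8])
# The residual stages therefore start from an 8x8 map instead of 32x32.
```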
Experiment 2 — CIFAR-Specific Architecture Fix
The second experiment replaced the 7×7 conv + MaxPool stem with a 3×3 conv / stride 1 stem (no pooling), following the common adaptation for small-input datasets. Both teacher and student received this fix.
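In torchvision terms, the stem replacement amounts to two attribute swaps. A minimal sketch is below; whether to keep ImageNet weights for the remaining blocks is a separate choice, and the repository's exact code may differ.

```python
import torch.nn as nn
from torchvision.models import resnet18, resnet50

def apply_cifar_stem(model: nn.Module) -> nn.Module:
    """Swap the ImageNet stem for a CIFAR-friendly 3x3 / stride-1 stem."""
    model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    model.maxpool = nn.Identity()  # drop pooling so the blocks see full 32x32 maps
    return model

teacher = apply_cifar_stem(resnet50(weights=None))
student = apply_cifar_stem(resnet18(weights=None))
```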
Results:
| Method | Top-1 Accuracy |
|---|---|
| Student baseline (no KD) | 94.97% |
| Feature-KD | 95.21% |
| Logit-KD | 95.47% |
Logit-KD with temperature T=4 achieved +0.50pp over the non-distilled student — a consistent and reproducible gain across seeds.
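For completeness, the standard Hinton-style logit-KD objective used in this kind of setup looks like the sketch below; the `alpha` weighting between soft and hard terms is a hypothetical default, not necessarily the value used in the repository.

```python
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Hinton-style logit distillation: soft-target KL plus hard-label CE."""
    # Soft-target term: KL divergence between temperature-softened
    # distributions, scaled by T^2 so the gradient magnitude stays
    # roughly comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: ordinary cross-entropy against the true classes.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard
```

The T² factor keeps the soft-target gradients on a comparable scale across temperatures, which is what makes the sweep over T meaningful.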
Key Findings
Architecture gap is the primary bottleneck. No distillation method compensates for a stem that destroys spatial information before feature extraction begins. Fixing the inductive bias of the architecture unlocks the gains KD is theoretically expected to provide.
Logit-KD outperforms Feature-KD consistently. Once the architecture is corrected, soft label transfer from the teacher’s output distribution proves more sample-efficient than intermediate feature alignment. Feature-KD introduces alignment overhead (projection layers, layer selection) without proportional accuracy return in this regime.
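To illustrate that overhead: before student and teacher feature maps can even be compared, Feature-KD needs something like the projection below. The channel widths correspond to a mid-network stage of the ResNet-18/ResNet-50 pair; the layer choice and the MSE criterion are assumptions, not the repository's exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureProjector(nn.Module):
    """1x1 projection that maps student channels to the teacher's width."""
    def __init__(self, student_ch: int = 256, teacher_ch: int = 1024):
        super().__init__()
        self.proj = nn.Conv2d(student_ch, teacher_ch, kernel_size=1, bias=False)

    def forward(self, student_feat, teacher_feat):
        # Align the projected student map to the teacher map at the same stage.
        return F.mse_loss(self.proj(student_feat), teacher_feat)
```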
Temperature T=4 is optimal. Sweeping T ∈ {1, 2, 4, 8, 16} consistently identified T=4 as the best trade-off between soft target entropy and label sharpness for this teacher-student capacity gap.
Takeaways
This study highlights a failure mode that is easy to overlook in KD benchmarks: evaluating distillation methods on architectures that are mismatched to the dataset conflates architecture error with distillation error. Controlled ablation — isolating one variable at a time — is necessary to attribute accuracy differences correctly.
The results are being written up for submission. Code and configs are available in the repository.
