KD-CIFAR10: Knowledge Distillation Ablation Study


GitHub: umutonuryasar/kd-cifar10  ·  Stack: PyTorch · ResNet · CIFAR-10


Overview

This project is a controlled ablation study on Knowledge Distillation (KD) for model compression, targeting the ResNet-50 → ResNet-18 pair on CIFAR-10. The central question: under what conditions does KD actually help, and which KD variant — Logit-KD or Feature-KD — yields more reliable gains?

|                   | Teacher   | Student   |
|-------------------|-----------|-----------|
| Architecture      | ResNet-50 | ResNet-18 |
| Parameters        | 23.5M     | 11.2M     |
| Compression ratio | ~2.1×     |           |

Experiment 1 — Baseline Ablation (Standard Architecture)

The first experiment used off-the-shelf ImageNet-pretrained ResNet architectures with no modifications. Both teacher and student retain the standard 7×7 conv stem and MaxPool layer designed for ImageNet’s 224×224 inputs.

Result: KD provided no meaningful improvement over the standalone student baseline. Neither Logit-KD nor Feature-KD outperformed training the student from scratch.

Diagnosis: The standard stem aggressively downsamples CIFAR-10’s 32×32 images before the residual blocks process any useful spatial information. The teacher’s representations are degraded at the source — there is nothing richer to distill.
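The degradation is easy to verify with the standard convolution output-size formula. A quick sketch (kernel/stride/padding values taken from the standard torchvision ResNet definitions):

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a conv/pool layer: floor((H + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Standard ImageNet stem applied to a 32x32 CIFAR-10 image
h = conv_out(32, kernel=7, stride=2, padding=3)  # 7x7 conv, stride 2 -> 16
h = conv_out(h, kernel=3, stride=2, padding=1)   # 3x3 max-pool, stride 2 -> 8
print(h)  # 8: the residual blocks only ever see an 8x8 feature map

# The three stride-2 residual stages then shrink it further: 8 -> 4 -> 2 -> 1
for _ in range(3):
    h = conv_out(h, kernel=3, stride=2, padding=1)
print(h)  # 1: a single spatial position left before global pooling
```

So after the stem, the network has a quarter of the linear resolution ImageNet inputs would give it at the same depth, and the final stage collapses to a 1×1 map; there is little spatial structure left for either model to represent.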


Experiment 2 — CIFAR-Specific Architecture Fix

The second experiment replaced the 7×7 conv + MaxPool stem with a 3×3 conv / stride 1 stem (no pooling), following the common adaptation for small-input datasets. Both teacher and student received this fix.

Results:

| Method                    | Top-1 Accuracy |
|---------------------------|----------------|
| Student baseline (no KD)  | 94.97%         |
| Feature-KD                | 95.21%         |
| Logit-KD                  | 95.47%         |

Logit-KD with temperature T=4 achieved +0.50pp over the non-distilled student — a consistent and reproducible gain across seeds.
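The Logit-KD objective is the standard Hinton-style softened-KL term. A minimal sketch; the α weighting between the soft and hard losses is an illustrative choice, not a value from the repo:

```python
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Soft-label distillation: KL divergence between temperature-softened
    distributions, scaled by T^2 to keep gradients comparable across
    temperatures, blended with the hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 10)   # student logits for a batch of 8
t = torch.randn(8, 10)   # teacher logits
y = torch.randint(0, 10, (8,))
loss = logit_kd_loss(s, t, y, T=4.0)
```

The T² factor matters: dividing logits by T shrinks the soft-target gradients by roughly 1/T², so without rescaling the distillation term would vanish relative to the hard-label term at T=4.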


Key Findings

Architecture gap is the primary bottleneck. No distillation method compensates for a stem that destroys spatial information before feature extraction begins. Fixing the inductive bias of the architecture unlocks the gains KD is theoretically expected to provide.

Logit-KD outperforms Feature-KD consistently. Once the architecture is corrected, soft label transfer from the teacher’s output distribution proves more sample-efficient than intermediate feature alignment. Feature-KD introduces alignment overhead (projection layers, layer selection) without proportional accuracy return in this regime.
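The "alignment overhead" refers to machinery like the following. A hypothetical sketch of one intermediate-feature loss; the layer choice, channel widths, and 1×1 projection are illustrative, not the repo's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAligner(nn.Module):
    """Project student feature maps to the teacher's channel width, then
    penalize the mismatch with MSE. The projection parameters exist only
    to make the two feature spaces comparable -- overhead Logit-KD avoids."""

    def __init__(self, student_ch: int, teacher_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(student_ch, teacher_ch, kernel_size=1, bias=False)

    def forward(self, f_student, f_teacher):
        return F.mse_loss(self.proj(f_student), f_teacher)

# e.g. stage-3 widths: 256 channels in ResNet-18 vs 1024 in ResNet-50
align = FeatureAligner(student_ch=256, teacher_ch=1024)
loss = align(torch.randn(2, 256, 8, 8), torch.randn(2, 1024, 8, 8))
```

Every such layer adds a design decision (which stage to tap, how to project, how to weight the term) that has no counterpart in logit distillation.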

Temperature T=4 is optimal. Sweeping T ∈ {1, 2, 4, 8, 16} consistently identified T=4 as the best trade-off between soft target entropy and label sharpness for this teacher-student capacity gap.
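Why temperature matters: dividing logits by T before the softmax flattens the distribution, raising its entropy and exposing the teacher's relative confidences over non-target classes; too high a T washes out the correct label entirely. A pure-Python illustration with hypothetical teacher logits:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

logits = [6.0, 2.0, 1.0, 0.5]  # hypothetical teacher logits
entropies = [entropy(softmax(logits, T)) for T in (1, 2, 4, 8, 16)]
# Entropy grows monotonically with T: softer targets carry more
# class-similarity signal, at the cost of a blurrier correct label
assert all(a < b for a, b in zip(entropies, entropies[1:]))
```

T=4 sitting at the sweet spot is consistent with the moderate capacity gap here: a ~2× compression ratio needs softening, but not the near-uniform targets T=16 would produce.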


Takeaways

This study highlights a failure mode that is easy to overlook in KD benchmarks: evaluating distillation methods on architectures that are mismatched to the dataset conflates architecture error with distillation error. Controlled ablation — isolating one variable at a time — is necessary to attribute accuracy differences correctly.

The results are being written up for submission. Code and configs are available in the repository.
