Student Capacity Moderates Knowledge Distillation Effectiveness

A systematic study of knowledge distillation (KD) effectiveness across three ResNet teacher-student capacity pairs on CIFAR-10, comparing Logit-KD and Feature-KD under fully reproducible conditions (3 seeds, mean ± std reported throughout).
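For readers unfamiliar with Logit-KD, the sketch below shows the standard temperature-scaled formulation: a KL term between softened teacher and student distributions, blended with cross-entropy on the labels. It is not taken from the study's code. The function name `logit_kd_loss` and the defaults `temperature=4.0` and `alpha=0.5` are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F


def logit_kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Standard Logit-KD loss: alpha * soft-target KL + (1 - alpha) * cross-entropy.

    temperature and alpha are illustrative defaults, not values from the study.
    """
    # Soften both distributions with the same temperature.
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits.detach() / temperature, dim=1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    # Multiplying by T^2 keeps the soft-target gradient scale comparable across temperatures.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


torch.manual_seed(0)
# Toy batch: 16 samples, 10 classes (as in CIFAR-10).
student_logits = torch.randn(16, 10, requires_grad=True)
teacher_logits = torch.randn(16, 10)
labels = torch.randint(0, 10, (16,))

loss = logit_kd_loss(student_logits, teacher_logits, labels)
loss.backward()
```

When the student and teacher logits are identical, the KL term vanishes and only the cross-entropy term remains.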

Key Findings

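Student capacity, not the teacher-student accuracy gap, is the primary moderating factor in KD effectiveness. At a comparable gap (0.57 pp for R34→R18 versus 0.56 pp for R50→R34), the ResNet-34 student gains more from distillation than the ResNet-18 student under both methods: +0.21 pp versus +0.00 pp for Logit-KD, and +0.30 pp versus +0.18 pp for Feature-KD.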

Implementation correctness critically affects Feature-KD. Leaving the projection-layer gradient unclipped suppresses Feature-KD performance and produces misleading comparisons with Logit-KD. After correction, Feature-KD matches or outperforms Logit-KD in two of three pairs (R34→R18 and R50→R34).
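A minimal sketch of what such a correction might look like, assuming the fix amounts to clipping the gradient norm of the projection layer before the optimizer step. The 1×1 convolution projection, the MSE feature loss, the channel sizes, and `max_grad_norm=1.0` are all hypothetical choices for illustration, not details confirmed by the study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def feature_kd_step(student_feat, teacher_feat, projection, optimizer, max_grad_norm=1.0):
    """One Feature-KD update with the projection-layer gradient clipped.

    Returns (loss value, gradient norm of the projection layer before clipping).
    """
    optimizer.zero_grad()
    # Map student features into the teacher's channel dimension before comparing.
    projected = projection(student_feat)
    loss = F.mse_loss(projected, teacher_feat.detach())
    loss.backward()
    # clip_grad_norm_ rescales gradients in place and returns the pre-clip total norm.
    pre_clip = torch.nn.utils.clip_grad_norm_(projection.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item(), pre_clip.item()


torch.manual_seed(0)
# Hypothetical shapes: 64-channel student features, 256-channel teacher features.
projection = nn.Conv2d(64, 256, kernel_size=1, bias=False)
optimizer = torch.optim.SGD(projection.parameters(), lr=0.1)
student_feat = torch.randn(8, 64, 4, 4)
# Large-magnitude teacher features to provoke an oversized gradient.
teacher_feat = 100.0 * torch.randn(8, 256, 4, 4)

loss_value, pre_clip_norm = feature_kd_step(student_feat, teacher_feat, projection, optimizer)
```

Logging the pre-clip norm returned by `clip_grad_norm_` is a cheap way to see whether the projection layer is receiving outsized gradients in the first place.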

Results

| Teacher | Student | Logit-KD Δ | Feature-KD Δ | Best |
| --- | --- | --- | --- | --- |
| R34 (95.70%) | R18 (95.13%) | +0.00 pp | +0.18 pp | Feature |
| R50 (95.81%) | R18 (95.13%) | +0.21 pp | +0.08 pp | Logit |
| R50 (95.81%) | R34 (95.25%) | +0.21 pp | +0.30 pp | Feature |

All gains are in percentage points (pp) relative to the corresponding student baseline, averaged over seeds {0, 1, 2}.
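The teacher-student gap is not listed in the results but can be derived from the reported accuracies. The short script below does so using only the reported accuracies and gains. The helper name `summarize` and the dictionary keys are illustrative.

```python
# Accuracies (%) and gains (pp) copied from the reported results.
pairs = [
    # (teacher, teacher_acc, student, student_acc, logit_gain, feature_gain)
    ("R34", 95.70, "R18", 95.13, 0.00, 0.18),
    ("R50", 95.81, "R18", 95.13, 0.21, 0.08),
    ("R50", 95.81, "R34", 95.25, 0.21, 0.30),
]


def summarize(pairs):
    """Derive the teacher-student gap and the better KD method for each pair."""
    rows = []
    for teacher, t_acc, student, s_acc, logit_gain, feature_gain in pairs:
        rows.append({
            "pair": f"{teacher}->{student}",
            # Gap between teacher accuracy and student baseline, in pp.
            "gap_pp": round(t_acc - s_acc, 2),
            "best": "Feature" if feature_gain > logit_gain else "Logit",
            "best_gain_pp": max(logit_gain, feature_gain),
        })
    return rows


rows = summarize(pairs)
for row in rows:
    print(row)
```

The R34→R18 and R50→R34 pairs have nearly identical gaps, so the difference in their gains cannot be attributed to gap size alone.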

Key topics

temperature scaling · feature alignment · model compression · ResNet · ablation study · reproducibility

Tech stack

Python · PyTorch · CIFAR-10 · ResNet-18/34/50 · NVIDIA A100

GitHub Repository · arXiv Paper · HuggingFace Space