Student Capacity Moderates Knowledge Distillation Effectiveness

A systematic study of knowledge distillation (KD) effectiveness across three ResNet teacher-student capacity pairs on CIFAR-10, comparing Logit-KD and Feature-KD under fully reproducible conditions (3 seeds, mean ± std reported throughout).

Key Findings

Student capacity, not the teacher-student accuracy gap, is the primary moderating factor in KD effectiveness. The ResNet-34 student benefits more from distillation than the ResNet-18 students even though the teacher-student gaps are comparable (0.56 to 0.68 pp): its best gain is +0.30 pp, versus at most +0.21 pp for ResNet-18.
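The gap comparison can be cross-checked directly from the Results table. The sketch below is a minimal illustration, assuming the parenthesised percentages in the table are baseline accuracies; the names `PAIRS` and `summarize` are hypothetical and not part of the project code.

```python
# Values copied from the Results table. Reading the parenthesised
# percentages as baseline accuracies is an assumption.
PAIRS = [
    # (teacher, teacher_acc, student, student_acc, logit_gain_pp, feature_gain_pp)
    ("R34", 95.70, "R18", 95.13, 0.00, 0.18),
    ("R50", 95.81, "R18", 95.13, 0.21, 0.08),
    ("R50", 95.81, "R34", 95.25, 0.21, 0.30),
]


def summarize(pairs):
    """Compute the teacher-student gap and the better KD method per pair."""
    rows = []
    for teacher, t_acc, student, s_acc, logit, feature in pairs:
        rows.append({
            "pair": f"{teacher}->{student}",
            "gap_pp": round(t_acc - s_acc, 2),      # teacher minus student accuracy
            "best_gain_pp": max(logit, feature),    # larger of the two KD gains
            "best": "Feature" if feature > logit else "Logit",
        })
    return rows


ROWS = summarize(PAIRS)
for row in ROWS:
    print(row)
```

Sorting `ROWS` by `gap_pp` and by `best_gain_pp` gives different orderings, which is the pattern the finding describes: the largest gap does not produce the largest gain.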

Implementation correctness critically affects Feature-KD. An unclipped projection-layer gradient suppresses Feature-KD performance and produces misleading comparisons with Logit-KD. After correction, Feature-KD matches or outperforms Logit-KD in two of the three pairs (R34/R18 and R50/R34).
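To make the failure mode concrete, here is a minimal PyTorch sketch of a Feature-KD step in which the projection layer's gradient is clipped before the optimizer would update it. It is an illustration under stated assumptions, not the project's implementation: the 1x1-convolution projector, the MSE alignment loss, the channel counts (512 for the student, 2048 for the teacher), and `max_grad_norm=1.0` are all assumed, and `FeatureProjector` and `feature_kd_step` are hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureProjector(nn.Module):
    """Hypothetical 1x1 conv that maps student channels to teacher channels."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.proj = nn.Conv2d(student_channels, teacher_channels,
                              kernel_size=1, bias=False)

    def forward(self, student_feat):
        return self.proj(student_feat)


def feature_kd_step(projector, student_feat, teacher_feat, max_grad_norm=1.0):
    """Backpropagate an MSE feature loss, then clip the projector's gradient.

    Returns the loss plus the gradient norm before and after clipping.
    """
    projector.zero_grad()
    # The teacher is frozen, so its features are detached from the graph.
    loss = F.mse_loss(projector(student_feat), teacher_feat.detach())
    loss.backward()
    # clip_grad_norm_ rescales gradients in place and returns the
    # total norm measured before clipping.
    pre_clip = torch.nn.utils.clip_grad_norm_(projector.parameters(),
                                              max_grad_norm)
    post_clip = torch.sqrt(sum(p.grad.pow(2).sum()
                               for p in projector.parameters()))
    return loss.item(), float(pre_clip), float(post_clip)


torch.manual_seed(0)
projector = FeatureProjector(student_channels=512, teacher_channels=2048)
# Random stand-ins for final-stage feature maps (assumed shapes).
student_feat = torch.randn(4, 512, 4, 4) * 10
teacher_feat = torch.randn(4, 2048, 4, 4) * 10
loss, pre_norm, post_norm = feature_kd_step(projector, student_feat, teacher_feat)
print(f"loss={loss:.3f} pre-clip norm={pre_norm:.3f} post-clip norm={post_norm:.3f}")
```

In a full training loop the clipping call would sit between `loss.backward()` and `optimizer.step()`. Logging the pre-clip norm is a cheap way to spot the kind of oversized projection-layer gradient described above.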

Results

Teacher	Student	Logit-KD Δ	Feature-KD Δ	Best
R34 (95.70%)	R18 (95.13%)	+0.00 pp	+0.18 pp	Feature
R50 (95.81%)	R18 (95.13%)	+0.21 pp	+0.08 pp	Logit
R50 (95.81%)	R34 (95.25%)	+0.21 pp	+0.30 pp	Feature

All gains are measured relative to the corresponding student baseline and averaged over seeds {0, 1, 2}.
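A gain of this kind can be computed as sketched below. This is a minimal illustration, assuming gains are paired by seed and that the spread is the sample standard deviation; the function `gain_over_seeds` is hypothetical, and the accuracies are placeholders, not results from this study.

```python
from statistics import mean, stdev


def gain_over_seeds(baseline_acc, distilled_acc):
    """Return (mean, std) of per-seed accuracy gains in percentage points.

    Both arguments map seed -> test accuracy in percent.
    """
    if baseline_acc.keys() != distilled_acc.keys():
        raise ValueError("baseline and distilled runs must cover the same seeds")
    # Pair runs by seed so each gain compares like with like (an assumption).
    gains = [distilled_acc[s] - baseline_acc[s] for s in sorted(baseline_acc)]
    return mean(gains), stdev(gains)  # stdev is the sample (n - 1) estimate


# Placeholder accuracies (%), NOT results from this study.
baseline = {0: 90.0, 1: 90.5, 2: 91.0}
distilled = {0: 90.5, 1: 91.0, 2: 91.0}

gain_mean, gain_std = gain_over_seeds(baseline, distilled)
print(f"{gain_mean:+.2f} ± {gain_std:.2f} pp")
```

With only three seeds the standard deviation is a rough estimate, so gains of a few hundredths of a percentage point are best read alongside it.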

Key topics

temperature scaling · feature alignment · model compression · ResNet · ablation study · reproducibility

Tech stack

Python · PyTorch · CIFAR-10 · ResNet-18/34/50 · NVIDIA A100

Umut Onur Yaşar
