I was looking at a photo of Mt. Rushmore the other day and something clicked.
Before Gutzon Borglum started carving, that mountain had more granite than it does now. Obviously. He removed around 450,000 tons of rock. What remained was less material by every metric. But nobody looks at Mt. Rushmore and thinks "that mountain is missing something." What remained was better than what was there before. Not just aesthetically. Structurally. The form was coherent in a way the raw mountain never was.
Scaling adds capacity. But capacity isn't structure. You can have a model with billions of parameters where the probability mass is poorly calibrated, where half the internal representations are redundant, and it still passes benchmarks because nobody checks for that. What we found is that selective pruning combined with distillation acts as a correction. It exposes the slack that dense training left behind, and then repairs it.
The unexpected result
We expected the standard trade-off. Less model, slightly worse quality. That's the deal everyone assumes. You remove parameters, you lose something. The question is just how much you lose and whether it's tolerable.
That's not what happened.
When we ran structured SVD pruning alone, performance degraded. Expected. You're removing capacity, of course quality drops. When we ran self-referential knowledge distillation alone, perplexity actually improved. Less expected, but not shocking. Distillation has a smoothing effect that people have observed before.
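For concreteness, structured SVD pruning of this general kind can be sketched as truncating a weight matrix to its top singular directions. This is a generic low-rank sketch, not the paper's exact procedure; the matrix size and rank below are purely illustrative.

```python
import numpy as np

def svd_prune(W: np.ndarray, rank: int):
    """Return a rank-`rank` approximation of W as two factors.

    Storing the factors (m x rank) and (rank x n) instead of W (m x n)
    is where the parameter savings come from.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (m, rank), columns scaled by singular values
    B = Vt[:rank, :]             # (rank, n)
    return A, B

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))
A, B = svd_prune(W, rank=64)

# Parameter count drops from 512*512 to 2*512*64 (4x fewer here);
# A @ B is the best rank-64 approximation of W in Frobenius norm.
print(W.size, A.size + B.size)
```

Run alone, this is exactly the "pruning alone degrades performance" step: capacity is removed and nothing yet corrects for it.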
But when we combined them, pruning first and then distilling, the pruned-and-distilled model recovered to the dense baseline. And at certain configurations, it exceeded it. Not by a huge margin. But consistently. Across sparsity levels. Across configurations. It wasn't a one-off result we could explain away.
I checked the numbers three times. Reran the experiments. Same pattern. The model with fewer parameters was generalizing better than the model with all of them, which is not supposed to happen.
The key thing we noticed: KD, not pruning, was the dominant quality driver. Pruning exposed the slack. Distillation corrected it. Removing stone doesn't create the faces on Mt. Rushmore. It creates the conditions. The shaping afterward is where the form comes from.
Structural slack
Here's the hypothesis that started forming.
Dense training converges to something that works. It produces functionally correct outputs. But "functionally correct" and "structurally optimal" aren't the same thing. An overparameterized model distributes probability mass across its internal representations inefficiently. Some subspaces are redundant. Some activations encode noise or high-variance structure that doesn't contribute to generalization. The model passes its benchmarks, so nobody notices.
It's the same problem Borglum discovered partway through carving Mt. Rushmore. The original placement for Jefferson's face, on Washington's right, sat on granite with poor grain structure. From the surface, the rock looked fine. But after roughly eighteen months of carving, the stone proved too fractured to hold detailed work. Borglum blasted the partial face off and recarved it on Washington's other side, where the granite could actually hold the form he needed.
You can't see bad grain from the surface. You find it when you start carving.
The way I think about it is this: dense models converge to a probability manifold that's functional but noisy. There's slack in the distribution. Probability mass sitting in directions that don't help, variance hiding in subspaces that never needed to be there. Pruning exposes that slack. Self-distillation corrects it. Pulls the probability mass back onto a tighter surface, reduces the noise. I'm still not totally sure why self-distillation works as well as it does here. I have a hypothesis, but I'm holding it loosely.
Self-distillation as correction
The mechanism is simpler than it sounds.
First, you cache the dense model's output probabilities. The full distribution before you touch anything. Then you prune structurally, removing excess capacity. Then you distill from the original probabilities, training the pruned model to match what the dense model was producing. The student isn't learning new information. It's learning a corrected representation of the same distribution. The original model knew the answers. The pruned-and-distilled model knows them with less noise.
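A minimal sketch of that distillation step, assuming the cached teacher outputs are softened softmax distributions and the loss is a temperature-scaled KL divergence. The function names, temperature, and random stand-in logits are mine, not the paper's.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, cached_teacher_probs, T=2.0):
    """KL(teacher || student) at temperature T: the pruned student is
    trained to match the dense model's cached output distribution."""
    p = cached_teacher_probs
    q = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(kl.mean())

# Step 1: cache the dense model's output probabilities (random logits
# stand in for a real forward pass here).
rng = np.random.default_rng(0)
teacher_probs = softmax(rng.normal(size=(4, 10)), T=2.0)

# Steps 2-3: after pruning, train the student's logits to drive this
# loss toward zero, matching the original distribution with less capacity.
student_logits = rng.normal(size=(4, 10))
print(kd_loss(student_logits, teacher_probs))
```

Note the loss is zero exactly when the student reproduces the cached distribution: no new labels, no new information, just a corrected fit to what the dense model already produced.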
I keep thinking of it as a projection. The dense model lives in a high-dimensional space with a lot of room to wander. Self-distillation projects it onto a tighter manifold: fewer directions to drift, less variance in the outputs, more concentrated probability mass where it actually matters. At bottom, it's variance reduction. The model doesn't get smarter. It gets more precise about what it already knew.
After the dynamite at Mt. Rushmore, the fine carvers came in. They weren't adding anything. They were removing everything that wasn't a face. Same idea here. The distillation step isn't teaching the model new things. It's clearing away what shouldn't have been there.
I'm not claiming this is fully understood. But the empirical behavior is consistent, and it points in a direction that makes sense.
Why selectivity matters
So what does selectivity actually mean here? It's not just "use fewer parameters."
Selectivity means a few things at once. You keep the parameters that carry actual information and cut the ones that don't. You concentrate probability mass where it matters for generalization instead of smearing it across the whole distribution. And you suppress variance without killing the signal. Those three things together are what makes this work.
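A toy way to see "concentrate probability mass without killing the signal" (my own illustration, not the paper's method): truncate a distribution's tail and renormalize. The spread, measured by entropy, drops, while the top outcomes keep their relative order.

```python
import numpy as np

def concentrate(p, k):
    """Keep the k most probable entries, zero the rest, renormalize."""
    q = np.zeros_like(p)
    top = np.argsort(p)[-k:]   # indices of the k largest probabilities
    q[top] = p[top]
    return q / q.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring zero entries."""
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

p = np.array([0.40, 0.25, 0.15, 0.08, 0.05, 0.04, 0.03])
q = concentrate(p, k=3)

print(entropy(p), entropy(q))    # entropy drops: mass is concentrated
print(p.argmax() == q.argmax())  # the dominant outcome is unchanged
```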
The bias-variance framing makes this concrete. When you remove structure through pruning, you increase bias slightly. The model has less capacity, so it can't represent every nuance of the original distribution. But when you distill, you reduce variance significantly. The model stops wandering through redundant subspaces. If the variance reduction exceeds the bias increase, the model generalizes better with fewer parameters. That's what we're seeing. The net effect is positive because the slack we're removing was hurting more than helping.
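The arithmetic behind that trade can be seen in a textbook shrinkage simulation, a stand-in for the real experiments, not a reproduction of them: accept a small bias, buy a larger variance reduction, and mean squared error (the proxy for generalization here) goes down.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0                         # true quantity being estimated
n_trials, n_samples = 20000, 5

samples = rng.normal(mu, 1.0, size=(n_trials, n_samples))
unbiased = samples.mean(axis=1)  # "dense": unbiased, variance = 1/5 = 0.2
shrunk = 0.8 * unbiased          # "pruned+distilled": biased, lower variance

# MSE = bias^2 + variance.
# Unbiased: 0 + 0.2 = 0.2.  Shrunk: 0.04 + 0.64*0.2 = 0.168.
mse_unbiased = np.mean((unbiased - mu) ** 2)
mse_shrunk = np.mean((shrunk - mu) ** 2)

print(mse_unbiased, mse_shrunk)  # the biased estimator wins
```

The shrunk estimator is strictly worse on bias and still wins overall, because the 0.072 of variance it sheds exceeds the 0.04 of squared bias it takes on. That is the same inequality the text is claiming for the pruned-and-distilled model.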
Borglum didn't keep every cubic foot of granite on that mountain. He kept the cubic feet that held the faces.
A correction regime
I want to be careful about how far I push this.
I'm not claiming dense models are broken. Dense training is powerful. Scaling works. Larger models capture more, and there are good reasons the field went in that direction. What I am saying is that dense models aren't structurally optimal by default. They converge to something that works, not something that's tight. And compression combined with self-distillation reveals and corrects that slack in a way that's measurable and repeatable.
SparseKD isn't an optimization trick. It's closer to a selective correction regime. A structured compression pathway that exposes slack and corrects it. Prune to reveal. Distill to repair. The phenomenon holds across configurations. It's not fragile.
I don't want to overclaim. But "build it big and hope the structure is right" is starting to feel like piling more granite onto the mountain and hoping a face appears. It doesn't work that way. Someone has to carve.
What we're left with
Selectivity improves generalization when variance reduction exceeds bias increase. That's the core finding, and it holds up across everything we've tested. Self-referential distillation isn't just a training trick. It's a structural correction mechanism that tightens the probability manifold. Pruning reveals slack. Distillation repairs structure. The effect is measurable and reproducible.
Mt. Rushmore isn't impressive because of how much granite is on that mountain. It's impressive because of how much was removed, and what was left standing.
There's more work to do. But the direction is clear, and the math keeps agreeing with us.
For the full experimental results and analysis, read the paper: Sparse Knowledge Distillation: Experimental Results