What Happens if You Remove ReLU From a Deep Neural Network?

Written by emmimalpalexander | Published 2026/03/05
Tech Story Tags: deep-neural-networks | relu-functions | activation-functions | pytorch-debugging | deep-linear-networks | neural-network-nonlinearity | mnist-experiment | hackernoon-top-story

TL;DR: Removed ReLU from a 5-layer PyTorch MLP. The model trained without errors, loss decreased every epoch, and it still hit 91.8% on MNIST — matching single-layer logistic regression exactly. Four hidden layers with 575K parameters added zero expressive power. The gradient data was the unexpected part. Early layers didn't vanish — they just received proportionally less signal. By epoch 10, fc5 had a gradient norm of 1.37 while fc1 sat at 0.37. The optimizer was concentrating updates at the output end because that's where the matrix product chain is shortest. The network quietly specialized its final layer to do all the classification work and treated the rest as a passive linear projection. Depth made it worse, not neutral. A 10-layer linear network scored 90.93% vs 92.00% for a 1-layer one — same expressive ceiling, harder optimization landscape. Every non-linear activation tested (Sigmoid, Tanh, ReLU, Leaky ReLU, GELU) landed between 97.4% and 98.2%. The gap between any activation and none was ~6 points. The choice of which activation barely mattered. The choice of whether to use one was everything. The real job of ReLU isn't gradient management. It's making depth mean something at all. Full experiment: 5 scripts, 3 seeds per config, 21,523 seconds on CPU. Code at: https://github.com/Emmimal/relu-experiment

I spent two days debugging a model that trained without errors, printed loss every epoch, and still sat near random guessing — barely above the 10% floor for a ten-class problem.


I went through the usual checklist. Wrong loss function? No. Mismatched labels? No. Bad learning rate? No. I printed a batch of images. The digits looked correct. I printed the labels. They matched. Everything looked fine on the surface.


My friend finally spotted it. I had removed all the ReLU activation functions while refactoring the architecture. Not maliciously, not experimentally — just forgot to put them back. PyTorch ran the model without a single warning. No error, no NaN, no exploding gradients. Just silent, confident, useless training.


That stuck with me. I knew activation functions "introduced non-linearity," but that phrase had always felt hand-wavy. I didn't have a concrete feel for what actually breaks when you remove them, layer by layer, gradient by gradient. So I built a proper experiment to find out — five scripts, five experiments, one complete run that took over six hours on CPU.


This is what I found.

The Experiment Setup

Two five-layer MLPs, same architecture, same data, same optimizer, same number of training epochs. The only difference:

# Model A: ReLU between every hidden layer
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = F.relu(self.fc3(x))
x = F.relu(self.fc4(x))
return self.fc5(x)

# Model B: nothing between layers
x = self.fc1(x)
x = self.fc2(x)
x = self.fc3(x)
x = self.fc4(x)
return self.fc5(x)


Both models: 784 inputs, hidden layers [512, 256, 128, 64], 10 outputs. Both have exactly 575,050 parameters. Adam optimizer, lr=0.001, batch size 64, 20 epochs, CrossEntropyLoss. Kaiming initialization for the ReLU model (correct for ReLU variance), Xavier for the linear model (correct for symmetric activations).
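For concreteness, the two architectures can be sketched as below. The class names and loop structure are my own, not necessarily what the repo uses; the layer sizes, initializations, and parameter count follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SIZES = [784, 512, 256, 128, 64, 10]

class ReLUMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(a, b) for a, b in zip(SIZES, SIZES[1:])
        )
        for layer in self.layers:
            # Kaiming init: variance matched to ReLU's half-zeroed output
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = F.relu(layer(x))
        return self.layers[-1](x)  # raw logits for CrossEntropyLoss

class LinearMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(a, b) for a, b in zip(SIZES, SIZES[1:])
        )
        for layer in self.layers:
            # Xavier init: variance matched to identity/symmetric activations
            nn.init.xavier_uniform_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)  # no activation anywhere
        return x
```

Both classes produce exactly 575,050 parameters, so any performance gap comes from the activation alone, not capacity.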



The experiment suite runs with:

pip install -r requirements.txt
python run_all.py


Experiment 1: MNIST Training

The training curves diverge immediately and never converge:

Epoch    ReLU Loss    Linear Loss
1        0.2141       0.4532
5        0.0474       0.3339
10       0.0264       0.3018
20       0.0162       0.2772


The ReLU model converges cleanly. The linear model's loss keeps nudging downward but never settles — it's still at 0.2772 by epoch 20, more than 17x the ReLU model's final loss.


Model               Train Accuracy    Test Accuracy
ReLU (5-layer)      99.6%             98.1%
Linear (5-layer)    92.1%             91.8%


The 6.3-point gap looks significant on its own. But the real story is what 91.8% means in context: a single nn.Linear(784, 10) — one layer, no hidden state, pure logistic regression — achieves around 92% on MNIST. The five-layer linear network with half a million parameters matched a model with essentially no architecture at all. Four hidden layers, hundreds of thousands of weights, and they contributed zero additional expressive power.


That's not a performance problem. That's a mathematical inevitability.


One caveat worth naming: all experiments here use MNIST. MNIST has several linearly separable digit pairs (0 vs 1, 1 vs 7) that inflate the linear model's floor and make the gap look narrower than it would on harder data. Results on FashionMNIST — where classes like T-shirts and pullovers share far more visual structure — would likely show a wider gap. Extending the experiment is straightforward: swap datasets.MNIST for datasets.FashionMNIST in utils.py and rerun. The two-moons result (87.5% vs 94.6% on purely non-linear data) gives a better sense of what the ceiling looks like when linear separability isn't available at all.

Why This Is Mathematically Guaranteed

When you stack linear layers without activations:

h1 = W1 @ x + b1
h2 = W2 @ h1 + b2  =  (W2 @ W1) @ x + (W2 @ b1 + b2)
h3 = W3 @ h2 + b3  =  (W3 @ W2 @ W1) @ x + ...


No matter how many layers you add, the composition always reduces to:

f(x) = W_effective @ x + b_effective

A product of five matrices is still a single matrix. The 5-layer linear network doesn't learn more than logistic regression — it is logistic regression, just parameterized in a much more convoluted way.


ReLU breaks this. max(0, z) is not linear — it routes negative activations to zero and passes positive ones through unchanged. That single operation prevents the composition from collapsing. Each layer can now represent something genuinely new that the previous layer could not.
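The collapse — and the way a single ReLU breaks it — can be checked numerically in a few lines. This is a sketch with random weights standing in for trained layers:

```python
import torch

torch.manual_seed(0)
W1, b1 = torch.randn(512, 784), torch.randn(512)
W2, b2 = torch.randn(10, 512), torch.randn(10)
x = torch.randn(784)

# Two stacked linear layers...
deep = W2 @ (W1 @ x + b1) + b2

# ...collapse into one effective linear layer.
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
flat = W_eff @ x + b_eff
same = torch.allclose(deep, flat, rtol=1e-3)  # True: identical map

# One ReLU in between and the collapse no longer holds.
nonlinear = W2 @ torch.relu(W1 @ x + b1) + b2
still_same = torch.allclose(nonlinear, flat, rtol=1e-3)  # False
```

The relative tolerance only absorbs float32 rounding; the linear composition matches its one-matrix form exactly up to that, while the ReLU version is a genuinely different function.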

Experiment 2: The Gradient Norms (This Is the Interesting Part)

Before running this experiment, I expected the linear model to have dead early layers — the standard "vanishing gradient" story. What I actually got was more interesting.

Here are the mean gradient norms per layer at epoch 10, averaged across every batch in that epoch.

ReLU Model:

fc1.weight:  0.2914
fc2.weight:  0.2482
fc3.weight:  0.1851
fc4.weight:  0.1831
fc5.weight:  0.2197


Linear Model:

fc1.weight:  0.3676
fc2.weight:  0.3756
fc3.weight:  0.8999
fc4.weight:  1.1518
fc5.weight:  1.3699


The linear model's early layers are not dead. fc1 actually receives a larger gradient than in the ReLU model. What's broken is the distribution of gradient signal. The linear model's fc5 gets a gradient norm of 1.37 while fc1 gets 0.37 — a 3.7x difference within the same model. The ReLU model is almost flat across all layers (0.183 to 0.291).



This pattern holds and intensifies over training. By epoch 20, the linear model's fc5 gradient norm has grown to 1.77 while fc1 has shrunk to 0.30. The optimizer is concentrating its updates in the output layers.



Why does this happen? Because in a linear network, W_effective = W5 @ W4 @ ... @ W1. Adjusting W5 has the most direct path to the output — there's one fewer matrix multiplication between W5's gradient and the loss. Adam naturally exploits this: it piles signal into W5 and W4 because that's where the gradient is clearest, and lets W1 and W2 do relatively less.


The network hasn't vanished — it has concentrated. It's found a local solution where the final layers do classification and the earlier layers function as a passive (but trainable) linear projection. The result is mathematically equivalent to one linear transformation. And that's the ceiling.


This is not the same mechanism as vanishing gradients in sigmoid networks, where the bounded derivative causes exponential decay through the chain rule. In a linear network, gradients don't shrink — they concentrate toward the output end. Same practical effect on learning, different root cause.


Note: the gradient concentration pattern described above is an experimental observation from this specific setup — 5-layer MLP, Adam optimizer, MNIST. I am not aware of established literature that formally characterizes this exact behavior under these conditions. If you've seen a paper that does, I'd genuinely like to know.
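Per-layer norms like the ones above are cheap to log yourself. A minimal sketch — the helper name is mine, and the `nn.Sequential` below is a stand-in for whatever model you're training; the one rule is to read the norms after `loss.backward()` and before `optimizer.step()`:

```python
import torch
import torch.nn as nn

def grad_norms(model):
    # Per-parameter gradient L2 norms, keyed by parameter name.
    # Call after loss.backward() and before optimizer.step().
    return {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

# Tiny demo on a stand-in model:
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
loss = model(torch.randn(4, 8)).sum()
loss.backward()
norms = grad_norms(model)  # e.g. {"0.weight": ..., "0.bias": ..., ...}
```

Logging this dict once per epoch per layer is enough to reproduce the concentration pattern in the tables above.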

Experiment 3: Decision Boundaries

Numbers are one thing. To make this concrete, I trained both models on a 2D two-moons dataset — two interleaved crescents that cannot be linearly separated — and plotted the full decision surface.


  • Linear model: A straight line across the 2D plane. One side is class 0, the other is class 1. The crescent shapes get cut in half. Test accuracy: 87.5%.
  • ReLU model: A curved boundary that wraps around each moon independently. Test accuracy: 94.6%.


The gap on two-moons (7.1 points) is larger than on MNIST because two-moons is purely non-linear — there is no straight line that meaningfully separates the data. MNIST has some digit pairs that are nearly linearly separable (0 vs 1 is easy; 3 vs 8 is not), which is why the linear model still manages 91.8% there. Strip out those easy cases and the ceiling drops hard.
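The two-moons comparison is small enough to reproduce end to end in one file. This sketch uses scikit-learn's `make_moons` and a much smaller pair of models than the MNIST experiment (my own sizes, chosen for speed), but it shows the same gap:

```python
import torch
import torch.nn as nn
from sklearn.datasets import make_moons

torch.manual_seed(0)

# Two interleaved crescents: no straight line separates them.
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y)

def train(model, steps=500):
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return (model(X).argmax(1) == y).float().mean().item()

linear = nn.Sequential(nn.Linear(2, 16), nn.Linear(16, 2))
relu = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
acc_linear = train(linear)  # stuck near the best straight-line split
acc_relu = train(relu)      # wraps a curved boundary around each moon
```

Plot the decision surface of each trained model over a 2D grid and you get the two pictures described above: one straight cut, one pair of wrapped crescents.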

Experiment 4: More Depth Makes the Linear Model Worse

I expected adding layers to be neutral — same ceiling, same result. Instead, depth actively hurts:

Depth    Linear Test Acc (mean +/- std)    ReLU Test Acc (mean +/- std)
1        92.00% +/- 0.15%                  98.05% +/- 0.12%
2        91.93% +/- 0.13%                  97.53% +/- 0.39%
3        91.75% +/- 0.11%                  97.69% +/- 0.07%
5        91.53% +/- 0.09%                  97.77% +/- 0.23%
10       90.93% +/- 0.62%                  97.75% +/- 0.26%


Each configuration ran 3 random seeds. Every added layer costs the linear model 0.1–0.2 percentage points. By depth 10 it's down to 90.93% — worse than a single linear layer at 92%.


Two things are happening. First, longer chains of matrix multiplications accumulate floating-point error. The effective weight matrix W5 @ W4 @ ... @ W1 becomes increasingly ill-conditioned as the product grows longer. Second, the optimization landscape gets harder. A 10-layer linear network has identical expressive power to a 1-layer one, but far more saddle points and flat regions for Adam to navigate. You're making the optimizer's job harder without giving it a more capable model to work with. This is consistent with findings in Saxe et al. (2013), "Exact solutions to the nonlinear dynamics of learning in deep linear networks", which showed that deep linear networks present complex optimization dynamics despite their theoretical equivalence to single-layer models.
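The ill-conditioning claim is easy to check directly. This sketch multiplies Xavier-scaled random square matrices — stand-ins for trained layers, not the experiment's actual weights — and tracks the condition number of the running product:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

conds = {}
W_eff = np.eye(d)
for depth in range(1, 11):
    # One Xavier-scaled random layer, folded into the effective matrix
    W = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d))
    W_eff = W @ W_eff
    conds[depth] = np.linalg.cond(W_eff)

# conds[10] is orders of magnitude larger than conds[1]: the effective
# matrix grows steadily more ill-conditioned with depth, even though it
# is still "just" one linear map.
```

An ill-conditioned effective matrix means some input directions are amplified vastly more than others, which is exactly the kind of landscape where gradient-based optimizers struggle.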


Notice the standard deviation at depth 10 for the linear model (0.62%) is also the largest of any configuration. Deep linear networks are less stable across random seeds — another sign of numerical instability accumulating through the matrix chain.


Limitation: 3 seeds per configuration is enough to spot consistent trends but too few for tight confidence intervals — particularly at depth 10 where variance is highest. The directional finding (linear degrades with depth, ReLU doesn't) is robust, but the exact percentage values should be treated as estimates rather than precise measurements.


The ReLU model stays flat and strong across all depths. Depth doesn't hurt it because each layer is adding genuine non-linear capacity, not just complicating a linear transformation.

Experiment 5: Activation Function Comparison

Once you accept you need non-linearity, the question is which activation to use.

Activation       Test Accuracy (mean +/- std)
None (linear)    92.03% +/- 0.07%
Sigmoid          97.70% +/- 0.52%
Tanh             97.40% +/- 0.11%
ReLU             98.17% +/- 0.20%
Leaky ReLU       98.05% +/- 0.09%
GELU             97.82% +/- 0.29%


The first thing that jumps out: sigmoid at 97.70% is much stronger than I expected. The vanishing gradient problem is real, but for a shallow 5-layer network on a relatively straightforward task, the sigmoid still learns effectively. It becomes debilitating at depth (20+ layers), not at 5.


The three modern activations — ReLU, Leaky ReLU, GELU — cluster tightly between 97.82% and 98.17%. The differences are within noise at this scale. Leaky ReLU's fix for "dying ReLU" neurons doesn't meaningfully move the needle here.


GELU finished third at 97.82%, behind ReLU and Leaky ReLU. That's not a knock on GELU — it's the current default in transformer architectures (BERT, GPT-3/4, PaLM) because its smooth gradient signal pays dividends at 12–96 layers deep. For a 5-layer MLP on MNIST, the smooth approximation doesn't have enough depth to justify itself.


The result that matters most: the gap between any activation and none is roughly 6 percentage points. Whether you pick sigmoid or GELU, you land in a completely different performance regime from no activation at all.
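A sweep like this comes down to holding the linear skeleton fixed and swapping only the activation. A small factory function makes that explicit — the function and dict names here are mine, not necessarily the repo's:

```python
import torch.nn as nn

ACTIVATIONS = {
    "none": None,
    "sigmoid": nn.Sigmoid,
    "tanh": nn.Tanh,
    "relu": nn.ReLU,
    "leaky_relu": nn.LeakyReLU,
    "gelu": nn.GELU,
}

def make_mlp(activation="relu", sizes=(784, 512, 256, 128, 64, 10)):
    # Same linear skeleton for every run; only the activation varies.
    act = ACTIVATIONS[activation]
    layers = []
    for i, (a, b) in enumerate(zip(sizes, sizes[1:])):
        layers.append(nn.Linear(a, b))
        is_hidden = i < len(sizes) - 2  # no activation after the output layer
        if is_hidden and act is not None:
            layers.append(act())
    return nn.Sequential(*layers)
```

Because every model shares the same 575,050 linear parameters, any accuracy difference in the table above is attributable to the activation choice alone.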

The Numbers, Summarized

From six hours of actual training:

Finding                           Result
ReLU (5-layer) test accuracy      98.1%
Linear (5-layer) test accuracy    91.8% — identical to logistic regression
Linear loss plateau               0.2772 (17x higher than ReLU's 0.0162)
Linear gradient at fc5 vs fc1     3.7x larger at output — signal concentrates, not vanishes
Adding 9 more layers to linear    -1.1% accuracy (90.93% at depth 10 vs 92.00% at depth 1)
Two-moons: linear ceiling         87.5% (no linear boundary can separate crescents)
Any activation vs none            ~6 percentage point gap regardless of which activation

How This Changed How I Think About Activations

Before this experiment, I thought of activation functions primarily as gradient management tools. Keep them to prevent vanishing gradients, pick the right one for your architecture, and move on.


What I didn't have a clear picture of was the gradient concentration pattern in linear networks. I expected early layers to receive tiny gradients. Instead, fc1 and fc2 in the linear model receive larger absolute gradients than in the ReLU model — the optimizer just distributes them unevenly, stacking signal at the output end because that's where it's most efficient.


The network isn't broken in a way that's easy to see. Loss is decreasing. Weights are updating. Gradients are flowing. It's optimizing — just toward a locally best linear solution that has no path to the non-linear structure in the data.


The depth sweep was the other surprise. I assumed more linear layers would be neutral: same ceiling, same result, maybe slightly different optimization trajectory. The consistent degradation across 3 seeds at every depth from 1 to 10 shows that extra linear layers genuinely cost you — not because they make the model less expressive (the ceiling is already fixed) but because they make the path to that ceiling harder for Adam to navigate.


The practical takeaway: activation functions aren't scaffolding around the real architecture. They are a core part of why depth works at all. Without them, depth is overhead, not capacity.

Run It Yourself

git clone https://github.com/Emmimal/relu-experiment.git
cd relu-experiment
pip install -r requirements.txt
python run_all.py

If you don't want to commit to a 6-hour full run, there's a --fast flag that reduces epochs significantly for a quick sanity check.


The key patterns — gradient concentration, linear accuracy ceiling, depth degradation — all show up clearly even in the fast run:

python run_all.py --fast # reduced epochs, same structural findings


Individual experiments:

python mnist_experiment.py        # Experiment 1: ReLU vs Linear
python gradient_analysis.py       # Experiment 2: Gradient norm tracking
python two_moons.py               # Experiment 3: Decision boundaries
python depth_sweep.py             # Experiment 4: Depth sweep
python activations_comparison.py  # Experiment 5: Activation comparison


The gradient logging code in gradient_analysis.py is the part I'd most recommend borrowing. Plotting gradient norms per layer over training has caught more architectural bugs for me than loss curves alone.



Experiments run on Windows 11, Python 3.12, CPU only. Results verified across 3 random seeds per configuration. Full code at the linked repository.



Written by emmimalpalexander | Independent AI researcher and author writing about deep learning and neural networks.
Published by HackerNoon on 2026/03/05