What Happens if You Remove ReLU From a Deep Neural Network?

Written by emmimalpalexander | Published 2026/03/05
Tech Story Tags: deep-neural-networks | relu-functions | activation-functions | pytorch-debugging | deep-linear-networks | neural-network-nonlinearity | mnist-experiment | hackernoon-top-story

TL;DR: Removed ReLU from a 5-layer PyTorch MLP. The model trained without errors, loss decreased every epoch, and it still hit 91.8% on MNIST — matching single-layer logistic regression exactly. Four hidden layers with 575K parameters added zero expressive power. The gradient data was the unexpected part. Early layers didn't vanish — they just received proportionally less signal. By epoch 10, fc5 had a gradient norm of 1.37 while fc1 sat at 0.37. The optimizer was concentrating updates at the output end because that's where the matrix product chain is shortest. The network quietly specialized its final layer to do all the classification work and treated the rest as a passive linear projection. Depth made it worse, not neutral. A 10-layer linear network scored 90.93% vs 92.00% for a 1-layer one — same expressive ceiling, harder optimization landscape. Every non-linear activation tested (Sigmoid, Tanh, ReLU, Leaky ReLU, GELU) landed between 97.4% and 98.2%. The gap between any activation and none was ~6 points. The choice of which activation barely mattered. The choice of whether to use one was everything. The real job of ReLU isn't gradient management. It's making depth mean something at all. Full experiment: 5 scripts, 3 seeds per config, 21,523 seconds on CPU. Code at: https://github.com/Emmimal/relu-experiment

I spent two days debugging a model that trained without errors, printed loss every epoch, and still sat near random guessing — barely above the 10% floor for a ten-class problem.


I went through the usual checklist. Wrong loss function? No. Mismatched labels? No. Bad learning rate? No. I printed a batch of images. The digits looked correct. I printed the labels. They matched. Everything looked fine on the surface.


My friend finally spotted it. I had removed all the ReLU activation functions while refactoring the architecture. Not maliciously, not experimentally — just forgot to put them back. PyTorch ran the model without a single warning. No error, no NaN, no exploding gradients. Just silent, confident, useless training.


That stuck with me. I knew activation functions "introduced non-linearity," but that phrase had always felt hand-wavy. I didn't have a concrete feel for what actually breaks when you remove them, layer by layer, gradient by gradient. So I built a proper experiment to find out — five scripts, five experiments, one complete run that took over six hours on CPU.


This is what I found.

The Experiment Setup

Two five-layer MLPs, same architecture, same data, same optimizer, same number of training epochs. The only difference:

# Model A: ReLU between every hidden layer
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
x = F.relu(self.fc3(x))
x = F.relu(self.fc4(x))
return self.fc5(x)

# Model B: nothing between layers
x = self.fc1(x)
x = self.fc2(x)
x = self.fc3(x)
x = self.fc4(x)
return self.fc5(x)


Both models: 784 inputs, hidden layers [512, 256, 128, 64], 10 outputs. Both have exactly 575,050 parameters. Adam optimizer, lr=0.001, batch size 64, 20 epochs, CrossEntropyLoss. Kaiming initialization for the ReLU model (correct for ReLU variance), Xavier for the linear model (correct for symmetric activations).
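For concreteness, the two architectures can be sketched as below. The class names and loop structure are my own, not necessarily what the repo uses; the layer sizes, initializations, and parameter count follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SIZES = [784, 512, 256, 128, 64, 10]

class ReLUMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(a, b) for a, b in zip(SIZES, SIZES[1:])
        )
        for layer in self.layers:
            # Kaiming init: variance matched to ReLU's half-zeroed output
            nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = F.relu(layer(x))
        return self.layers[-1](x)  # raw logits for CrossEntropyLoss

class LinearMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(a, b) for a, b in zip(SIZES, SIZES[1:])
        )
        for layer in self.layers:
            # Xavier init: variance matched to identity/symmetric activations
            nn.init.xavier_uniform_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)  # no activation anywhere
        return x
```

Both classes produce exactly 575,050 parameters, so any performance gap comes from the activation alone, not capacity.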



The experiment suite runs with:

pip install -r requirements.txt
python run_all.py


Experiment 1: MNIST Training

The training curves diverge immediately and never converge:

Epoch    ReLU Loss    Linear Loss
1        0.2141       0.4532
5        0.0474       0.3339
10       0.0264       0.3018
20       0.0162       0.2772


The ReLU model converges cleanly. The linear model's loss keeps nudging downward but never settles — it's still at 0.2772 by epoch 20, more than 17x the ReLU model's final loss.


Model               Train Accuracy    Test Accuracy
ReLU (5-layer)      99.6%             98.1%
Linear (5-layer)    92.1%             91.8%


The 6.3-point gap looks significant on its own. But the real story is what 91.8% means in context: a single nn.Linear(784, 10) — one layer, no hidden state, pure logistic regression — achieves around 92% on MNIST. The five-layer linear network with half a million parameters matched a model with essentially no architecture at all. Four hidden layers, hundreds of thousands of weights, and they contributed zero additional expressive power.


That's not a performance problem. That's a mathematical inevitability.


One caveat worth naming: all experiments here use MNIST. MNIST has several linearly separable digit pairs (0 vs 1, 1 vs 7) that inflate the linear model's floor and make the gap look narrower than it would on harder data. Results on FashionMNIST — where classes like T-shirts and pullovers share far more visual structure — would likely show a wider gap. Extending the experiment is straightforward: swap datasets.MNIST for datasets.FashionMNIST in utils.py and rerun. The two-moons result (87.5% vs 94.6% on purely non-linear data) gives a better sense of what the ceiling looks like when linear separability isn't available at all.

Why This Is Mathematically Guaranteed

When you stack linear layers without activations:

h1 = W1 @ x + b1
h2 = W2 @ h1 + b2  =  (W2 @ W1) @ x + (W2 @ b1 + b2)
h3 = W3 @ h2 + b3  =  (W3 @ W2 @ W1) @ x + ...


No matter how many layers you add, the composition always reduces to:

f(x) = W_effective @ x + b_effective

A product of five matrices is still a single matrix. The 5-layer linear network doesn't learn more than logistic regression — it is logistic regression, just parameterized in a much more convoluted way.


ReLU breaks this. max(0, z) is not linear — it routes negative activations to zero and passes positive ones through unchanged. That single operation prevents the composition from collapsing. Each layer can now represent something genuinely new that the previous layer could not.
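The collapse — and the way a single ReLU breaks it — can be checked numerically in a few lines. This is a sketch with random weights standing in for trained layers:

```python
import torch

torch.manual_seed(0)
W1, b1 = torch.randn(512, 784), torch.randn(512)
W2, b2 = torch.randn(10, 512), torch.randn(10)
x = torch.randn(784)

# Two stacked linear layers...
deep = W2 @ (W1 @ x + b1) + b2

# ...collapse into one effective linear layer.
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
flat = W_eff @ x + b_eff
same = torch.allclose(deep, flat, rtol=1e-3)  # True: identical map

# One ReLU in between and the collapse no longer holds.
nonlinear = W2 @ torch.relu(W1 @ x + b1) + b2
still_same = torch.allclose(nonlinear, flat, rtol=1e-3)  # False
```

The relative tolerance only absorbs float32 rounding; the linear composition matches its one-matrix form exactly up to that, while the ReLU version is a genuinely different function.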

Experiment 2: The Gradient Norms (This Is the Interesting Part)

Before running this experiment, I expected the linear model to have dead early layers — the standard "vanishing gradient" story. What I actually got was more interesting.

Here are the mean gradient norms per layer at epoch 10, averaged across every batch in that epoch.

ReLU Model:

fc1.weight:  0.2914
fc2.weight:  0.2482
fc3.weight:  0.1851
fc4.weight:  0.1831
fc5.weight:  0.2197


Linear Model:

fc1.weight:  0.3676
fc2.weight:  0.3756
fc3.weight:  0.8999
fc4.weight:  1.1518
fc5.weight:  1.3699


The linear model's early layers are not dead. fc1 actually receives a larger gradient than in the ReLU model. What's broken is the distribution of gradient signal. The linear model's fc5 gets a gradient norm of 1.37 while fc1 gets 0.37 — a 3.7x difference within the same model. The ReLU model is almost flat across all layers (0.183 to 0.291).



This pattern holds and intensifies over training. By epoch 20, the linear model's fc5 gradient norm has grown to 1.77 while fc1 has shrunk to 0.30. The optimizer is concentrating its updates in the output layers.



Why does this happen? Because in a linear network, W_effective = W5 @ W4 @ ... @ W1. Adjusting W5 has the most direct path to the output — there's one fewer matrix multiplication between W5's gradient and the loss. Adam naturally exploits this: it piles signal into W5 and W4 because that's where the gradient is clearest, and lets W1 and W2 do relatively less.


The network hasn't vanished — it has concentrated. It's found a local solution where the final layers do classification and the earlier layers function as a passive (but trainable) linear projection. The result is mathematically equivalent to one linear transformation. And that's the ceiling.


This is not the same mechanism as vanishing gradients in sigmoid networks, where the bounded derivative causes exponential decay through the chain rule. In a linear network, gradients don't shrink — they concentrate toward the output end. Same practical effect on learning, different root cause.


Note: the gradient concentration pattern described above is an experimental observation from this specific setup — 5-layer MLP, Adam optimizer, MNIST. I am not aware of established literature that formally characterizes this exact behavior under these conditions. If you've seen a paper that does, I'd genuinely like to know.
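Per-layer norms like the ones above are cheap to log yourself. A minimal sketch — the helper name is mine, and the `nn.Sequential` below is a stand-in for whatever model you're training; the one rule is to read the norms after `loss.backward()` and before `optimizer.step()`:

```python
import torch
import torch.nn as nn

def grad_norms(model):
    # Per-parameter gradient L2 norms, keyed by parameter name.
    # Call after loss.backward() and before optimizer.step().
    return {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None
    }

# Tiny demo on a stand-in model:
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
loss = model(torch.randn(4, 8)).sum()
loss.backward()
norms = grad_norms(model)  # e.g. {"0.weight": ..., "0.bias": ..., ...}
```

Logging this dict once per epoch per layer is enough to reproduce the concentration pattern in the tables above.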

Experiment 3: Decision Boundaries

Numbers are one thing. To make this concrete, I trained both models on a 2D two-moons dataset — two interleaved crescents that cannot be linearly separated — and plotted the full decision surface.


  • Linear model: A straight line across the 2D plane. One side is class 0, the other is class 1. The crescent shapes get cut in half. Test accuracy: 87.5%.
  • ReLU model: A curved boundary that wraps around each moon independently. Test accuracy: 94.6%.


The gap on two-moons (7.1 points) is larger than on MNIST because two-moons is purely non-linear — there is no straight line that meaningfully separates the data. MNIST has some digit pairs that are nearly linearly separable (0 vs 1 is easy; 3 vs 8 is not), which is why the linear model still manages 91.8% there. Strip out those easy cases and the ceiling drops hard.
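The two-moons comparison is small enough to reproduce end to end in one file. This sketch uses scikit-learn's `make_moons` and a much smaller pair of models than the MNIST experiment (my own sizes, chosen for speed), but it shows the same gap:

```python
import torch
import torch.nn as nn
from sklearn.datasets import make_moons

torch.manual_seed(0)

# Two interleaved crescents: no straight line separates them.
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y)

def train(model, steps=500):
    opt = torch.optim.Adam(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return (model(X).argmax(1) == y).float().mean().item()

linear = nn.Sequential(nn.Linear(2, 16), nn.Linear(16, 2))
relu = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
acc_linear = train(linear)  # stuck near the best straight-line split
acc_relu = train(relu)      # wraps a curved boundary around each moon
```

Plot the decision surface of each trained model over a 2D grid and you get the two pictures described above: one straight cut, one pair of wrapped crescents.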

Experiment 4: More Depth Makes the Linear Model Worse

I expected adding layers to be neutral — same ceiling, same result. Instead, depth actively hurts:

Depth    Linear Test Acc (mean +/- std)    ReLU Test Acc (mean +/- std)
1        92.00% +/- 0.15%                  98.05% +/- 0.12%
2        91.93% +/- 0.13%                  97.53% +/- 0.39%
3        91.75% +/- 0.11%                  97.69% +/- 0.07%
5        91.53% +/- 0.09%                  97.77% +/- 0.23%
10       90.93% +/- 0.62%                  97.75% +/- 0.26%


Each configuration ran 3 random seeds. Every added layer costs the linear model 0.1–0.2 percentage points. By depth 10 it's down to 90.93% — worse than a single linear layer at 92%.


Two things are happening. First, longer chains of matrix multiplications accumulate floating-point error. The effective weight matrix W5 @ W4 @ ... @ W1 becomes increasingly ill-conditioned as the product grows longer. Second, the optimization landscape gets harder. A 10-layer linear network has identical expressive power to a 1-layer one, but far more saddle points and flat regions for Adam to navigate. You're making the optimizer's job harder without giving it a more capable model to work with. This is consistent with findings in Saxe et al. (2013), "Exact solutions to the nonlinear dynamics of learning in deep linear networks", which showed that deep linear networks present complex optimization dynamics despite their theoretical equivalence to single-layer models.
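The ill-conditioning claim is easy to check directly. This sketch multiplies Xavier-scaled random square matrices — stand-ins for trained layers, not the experiment's actual weights — and tracks the condition number of the running product:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

conds = {}
W_eff = np.eye(d)
for depth in range(1, 11):
    # One Xavier-scaled random layer, folded into the effective matrix
    W = rng.normal(0.0, np.sqrt(1.0 / d), size=(d, d))
    W_eff = W @ W_eff
    conds[depth] = np.linalg.cond(W_eff)

# conds[10] is orders of magnitude larger than conds[1]: the effective
# matrix grows steadily more ill-conditioned with depth, even though it
# is still "just" one linear map.
```

An ill-conditioned effective matrix means some input directions are amplified vastly more than others, which is exactly the kind of landscape where gradient-based optimizers struggle.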


Notice the standard deviation at depth 10 for the linear model (0.62%) is also the largest of any configuration. Deep linear networks are less stable across random seeds — another sign of numerical instability accumulating through the matrix chain.


Limitation: 3 seeds per configuration is enough to spot consistent trends but too few for tight confidence intervals — particularly at depth 10 where variance is highest. The directional finding (linear degrades with depth, ReLU doesn't) is robust, but the exact percentage values should be treated as estimates rather than precise measurements.


The ReLU model stays flat and strong across all depths. Depth doesn't hurt it because each layer is adding genuine non-linear capacity, not just complicating a linear transformation.

Experiment 5: Activation Function Comparison

Once you accept you need non-linearity, the question is which activation to use.

Activation       Test Accuracy (mean +/- std)
None (linear)    92.03% +/- 0.07%
Sigmoid          97.70% +/- 0.52%
Tanh             97.40% +/- 0.11%
ReLU             98.17% +/- 0.20%
Leaky ReLU       98.05% +/- 0.09%
GELU             97.82% +/- 0.29%


The first thing that jumps out: sigmoid at 97.70% is much stronger than I expected. The vanishing gradient problem is real, but for a shallow 5-layer network on a relatively straightforward task, the sigmoid still learns effectively. It becomes debilitating at depth (20+ layers), not at 5.


The three modern activations — ReLU, Leaky ReLU, GELU — cluster tightly between 97.82% and 98.17%. The differences are within noise at this scale. Leaky ReLU's fix for "dying ReLU" neurons doesn't meaningfully move the needle here.


GELU finished third at 97.82%, behind ReLU and Leaky ReLU. That's not a knock on GELU — it's the current default in transformer architectures (BERT, GPT-3/4, PaLM) because its smooth gradient signal pays dividends at 12–96 layers deep. For a 5-layer MLP on MNIST, the smooth approximation doesn't have enough depth to justify itself.


The result that matters most: the gap between any activation and none is roughly 6 percentage points. Whether you pick sigmoid or GELU, you land in a completely different performance regime from no activation at all.
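A sweep like this comes down to holding the linear skeleton fixed and swapping only the activation. A small factory function makes that explicit — the function and dict names here are mine, not necessarily the repo's:

```python
import torch.nn as nn

ACTIVATIONS = {
    "none": None,
    "sigmoid": nn.Sigmoid,
    "tanh": nn.Tanh,
    "relu": nn.ReLU,
    "leaky_relu": nn.LeakyReLU,
    "gelu": nn.GELU,
}

def make_mlp(activation="relu", sizes=(784, 512, 256, 128, 64, 10)):
    # Same linear skeleton for every run; only the activation varies.
    act = ACTIVATIONS[activation]
    layers = []
    for i, (a, b) in enumerate(zip(sizes, sizes[1:])):
        layers.append(nn.Linear(a, b))
        is_hidden = i < len(sizes) - 2  # no activation after the output layer
        if is_hidden and act is not None:
            layers.append(act())
    return nn.Sequential(*layers)
```

Because every model shares the same 575,050 linear parameters, any accuracy difference in the table above is attributable to the activation choice alone.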

The Numbers, Summarized

From six hours of actual training:

Finding                           Result
ReLU (5-layer) test accuracy      98.1%
Linear (5-layer) test accuracy    91.8% — identical to logistic regression
Linear loss plateau               0.2772 (17x higher than ReLU's 0.0162)
Linear gradient at fc5 vs fc1     3.7x larger at output — signal concentrates, not vanishes
Adding 9 more layers to linear    -1.1% accuracy (90.93% at depth 10 vs 92.00% at depth 1)
Two-moons: linear ceiling         87.5% (no linear boundary can separate crescents)
Any activation vs none            ~6 percentage point gap regardless of which activation

How This Changed How I Think About Activations

Before this experiment, I thought of activation functions primarily as gradient management tools. Keep them to prevent vanishing gradients, pick the right one for your architecture, and move on.


What I didn't have a clear picture of was the gradient concentration pattern in linear networks. I expected early layers to receive tiny gradients. Instead, fc1 and fc2 in the linear model receive larger absolute gradients than in the ReLU model — the optimizer just distributes them unevenly, stacking signal at the output end because that's where it's most efficient.


The network isn't broken in a way that's easy to see. Loss is decreasing. Weights are updating. Gradients are flowing. It's optimizing — just toward a locally best linear solution that has no path to the non-linear structure in the data.


The depth sweep was the other surprise. I assumed more linear layers would be neutral: same ceiling, same result, maybe slightly different optimization trajectory. The consistent degradation across 3 seeds at every depth from 1 to 10 shows that extra linear layers genuinely cost you — not because they make the model less expressive (the ceiling is already fixed) but because they make the path to that ceiling harder for Adam to navigate.


The practical takeaway: activation functions aren't scaffolding around the real architecture. They are a core part of why depth works at all. Without them, depth is overhead, not capacity.

Run It Yourself

git clone https://github.com/Emmimal/relu-experiment.git
cd relu-experiment
pip install -r requirements.txt
python run_all.py

If you don't want to commit to a 6-hour full run, there's a --fast flag that reduces epochs significantly for a quick sanity check.


The key patterns — gradient concentration, linear accuracy ceiling, depth degradation — all show up clearly even in the fast run:

python run_all.py --fast # reduced epochs, same structural findings


Individual experiments:

python mnist_experiment.py        # Experiment 1: ReLU vs Linear
python gradient_analysis.py       # Experiment 2: Gradient norm tracking
python two_moons.py               # Experiment 3: Decision boundaries
python depth_sweep.py             # Experiment 4: Depth sweep
python activations_comparison.py  # Experiment 5: Activation comparison


The gradient logging code in gradient_analysis.py is the part I'd most recommend borrowing. Plotting gradient norms per layer over training has caught more architectural bugs for me than loss curves alone.



Experiments run on Windows 11, Python 3.12, CPU only. Results verified across 3 random seeds per configuration. Full code at the linked repository.



Written by emmimalpalexander | Independent AI researcher and author writing about deep learning and neural networks.
Published by HackerNoon on 2026/03/05