How the f*** do activation functions work

07.04.25

So you’ve built a neural net. It’s got layers. It's got weights. Maybe it even learns stuff. But somewhere along the line, someone told you:

“Don't forget the activation function.”

And you were like,
“Okay???”
But deep down, you were thinking:
“How the f**** do activation functions actually work?”

Let’s break it down.


What are activation functions anyway?

In very blunt terms:
Activation functions are little mathematical bouncers at the club called "Your Neural Net." Every time a neuron computes its weighted sum, the activation function decides whether that neuron gets to party (i.e. "activate") or not.

More formally:
They take the raw output of a neuron (its weighted sum plus bias) and squash/transform it before it gets passed to the next layer. More importantly, that transform is non-linear, which is what lets the network as a whole learn something more interesting than a straight line. That's it.


But wait, why not just use raw values?

Because that would turn your entire network into one big linear expression.
Like this:

output = W3*(W2*(W1*x + b1) + b2) + b3

Chaining linear operations just gives... another linear operation. No matter how many layers you have, the net result is still just:

y = A*x + B

No curves, no bends, no interesting decision boundaries. Just a boring line.
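
If you don't believe it, here's a quick numpy sanity check (toy weights picked at random, biases left out for brevity): two stacked linear layers give exactly the same outputs as one merged linear layer.

import numpy as np

W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[0.5, -1.0], [2.0, 1.5]])
x = np.array([1.0, -2.0])

stacked = W2 @ (W1 @ x)       # forward pass through two "layers" with no activation
merged = (W2 @ W1) @ x        # one equivalent linear layer

print(np.allclose(stacked, merged))  # True: the stack collapsed into a single matrix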

Activation functions bring in the non-linearity. They give your network the ability to learn genuinely complex mappings, like the ones behind image recognition or speech.


The Usual Suspects

Here’s the lineup:

1. ReLU (Rectified Linear Unit)

f(x) = max(0, x)

Most popular kid in the class. Easy to compute, works well, doesn’t saturate in the positive direction.

But it can die (the "dying ReLU" problem): if a neuron's weighted sum stays negative, its output is zero and so is its gradient, so no updates flow through it and it gets stuck outputting zero forever.

2. Sigmoid

f(x) = 1 / (1 + exp(-x))

S-shaped, squashes everything between 0 and 1. Used to be everywhere, now mostly retired except in output layers (e.g. for binary classification).
Also: it suffers from vanishing gradients. Once the input drifts far from zero, the curve flattens out and the gradient gets close to zero, and because backprop multiplies gradients layer by layer (the chain rule), those tiny values shrink even further the deeper you go, so learning slows to a crawl.
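
To put rough numbers on that, here's a quick sketch (just a numpy sanity check, the evaluation points are arbitrary): the sigmoid's gradient is sigmoid(x) * (1 - sigmoid(x)), and it collapses fast as x moves away from zero.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)          # derivative of the sigmoid

print(sigmoid_grad(0.0))        # 0.25, the biggest it ever gets
print(sigmoid_grad(5.0))        # ~0.0066
print(sigmoid_grad(10.0))       # ~0.000045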

3. Tanh

f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

Same squiggly vibe as sigmoid, but ranges from -1 to 1. Its outputs are zero-centered, so the next layer doesn't only ever see positive values, which makes training slightly better behaved.
Still... not immune to vanishing gradients either.

4. Leaky ReLU

f(x) = x if x > 0 else 0.01*x

ReLU but with a safety net. Even negative values can pass through a little.
Helps with the dying neuron problem.

5. Softmax

f(x_i) = exp(x_i) / (exp(x_1) + ... + exp(x_n))

Used in the output layer of multi-class classification networks. Takes a whole vector of raw scores (logits) and turns them into positive values that sum to 1, so you can read them as class probabilities (how well-calibrated those "probabilities" actually are is another story).
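
Here's a minimal numpy sketch of it (subtracting the max is just the standard trick so exp() doesn't overflow; the logits are made-up numbers):

import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)   # doesn't change the result, keeps exp() stable
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # roughly [0.659 0.242 0.099]
print(probs.sum())  # 1.0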


Okay but what do they do in code?

Let’s take an example. Here's a bare-bones neuron (just a weighted sum plus a bias):

def neuron(x, w, b):
    return x*w + b   # weighted sum plus bias: purely linear

Let's say we add ReLU:

def relu(x):
    return max(0, x)   # negatives become 0, positives pass through unchanged

def neuron_with_relu(x, w, b):
    z = x*w + b        # the raw, linear output of the neuron
    return relu(z)     # squash it before it goes anywhere else

You can do this with any of the activation functions above. The key idea:

The activation function modifies the output of a layer before passing it to the next.
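
To see why that matters, here's a tiny sketch chaining two of those scalar neurons with relu in between (the weights, biases and inputs are arbitrary toy numbers):

def two_layer_net(x, w1, b1, w2, b2):
    h = relu(x*w1 + b1)   # first neuron, squashed by the activation
    return h*w2 + b2      # second neuron sees the activated value, not the raw sum

print(two_layer_net(2.0, w1=1.5, b1=-1.0, w2=0.5, b2=0.0))   # 1.0
print(two_layer_net(-2.0, w1=1.5, b1=-1.0, w2=0.5, b2=0.0))  # 0.0, ReLU clipped the negative sum

Flip the sign of the input and the output doesn't just flip sign with it. That's exactly the kind of bend a purely linear stack can never produce.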


Visual intuition (aka look at dem curves)

Here’s how a few of them look (you can reproduce the curves with the plotting sketch below):

The slopes of these curves are what get multiplied together when gradients flow back during backprop. So choosing the wrong one can totally ruin learning.
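
If you want to plot them yourself, here's a minimal matplotlib sketch (assuming numpy and matplotlib are installed):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 200)

activations = {
    "relu": np.maximum(0, x),
    "sigmoid": 1 / (1 + np.exp(-x)),
    "tanh": np.tanh(x),
    "leaky_relu": np.where(x > 0, x, 0.01 * x),
}

for name, y in activations.items():
    plt.plot(x, y, label=name)

plt.legend()
plt.title("Activation functions")
plt.show()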


TL;DR

Activation functions add the non-linearity between layers that keeps a deep network from collapsing into one big linear function. Without them, your net would just be... well, dumb.


Still confused? Honestly, just try plotting them and see how they bend the output space. It’s way cooler to see what they do than to read about it. You can also find some basic visualizations and implementations of the functions I described in this notebook here.