Activation Functions

Sourabh Gupta
6 min read · Feb 6, 2022


An activation function transforms the weighted sum of the inputs arriving at a node in a neural network into the node's output.

It helps the model decide whether a neuron should be activated and adds non-linearity to the neuron's output, which lets the network learn more complex patterns.

Uses of activation functions in a neural network:

  • Produces predictions when used in the output layer
  • Turns the linear mappings of hidden layers into non-linear mappings
  • Keeps gradient values within reasonable limits, helping to prevent exploding and vanishing gradients
  • Improves learning and generalization

Some popular activation functions are given below.

Sigmoid or Logistic

Sigmoid is a non-linear, monotonic function with an S-shaped curve. Its output lies in the range (0, 1), so each output node can be read as an independent probability. Thus, sigmoid is used for binary classification. It is also used for multi-label classification, i.e. when the outputs are not mutually exclusive. For example, we use a sigmoid in the output layer of a model that classifies diseases in a chest x-ray image: the image might contain an infection, emphysema, and/or cancer, or none of those findings.

Mathematically it can be represented as:
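
f(x) = 1 / (1 + e^(-x)), which squashes any real input into the range (0, 1).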

Advantages

  • The output can be interpreted as a probability

Disadvantages

  • Suffers from vanishing gradients problem

ReLU

ReLU is an abbreviation for the Rectified Linear Unit activation function. It is a piecewise linear function and one of the most commonly used activation functions in deep neural networks. A neuron is deactivated (its output is 0) only when its input is negative; positive inputs are passed through unchanged. Its range is [0, ∞).

The curve is also known as the ramp function, and it is similar to half-wave rectification in electrical engineering (only the positive part of the signal is passed).

Mathematically it can be represented as:
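
f(x) = max(0, x), i.e. f(x) = x for x > 0 and f(x) = 0 for x ≤ 0.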

Advantages

  • Mitigates the vanishing gradient problem, since the gradient is 1 for all positive inputs
  • Computationally cheap, as it only requires a comparison with zero
  • Speeds up the convergence of gradient descent

Disadvantages

  • The problem of dead neurons
  • Its outputs are not zero-centred, which causes a positive bias shift in the following layers

Dead neurons are neurons that always output zero because their inputs stay negative; their gradient is zero, so their weights stop updating during backpropagation.

Tanh

Tanh is very similar to the sigmoid function, but usually works better. Its output lies in the range (−1, 1), and its S-shaped curve is centred at zero. It is used in classifiers, but be careful when training for a large number of epochs, because the saturated regions of the curve slow learning down.

Mathematically it can be represented as:
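
f(x) = tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x)), with output in the range (−1, 1).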

Advantages

  • The output is zero-centred, unlike Sigmoid

Disadvantages

  • Suffers from vanishing gradients problem

Leaky ReLU

Leaky ReLU is a variant of ReLU that has a small, non-zero slope for negative inputs. It learns slightly faster and is better balanced than ReLU, and it is used when you want neurons to receive a gradient even for negative input values.

Mathematically it can be represented as:
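
f(x) = x for x > 0, and f(x) = 0.01·x for x ≤ 0 (the 0.01 slope is the commonly used default).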

Advantages

  • Helpful when gradients are sparse, since negative inputs still receive a small gradient

Disadvantages

  • The very small, fixed negative slope means the dead-neuron problem is reduced rather than fully solved

Parameterized ReLU

Parameterized ReLU (PReLU) is a version of ReLU in which the slope of the negative part is controlled by a parameter ‘a’ that is learned during training. It is used when you want neurons to be activated for negative input values and want the negative slope to be learned rather than fixed.

Mathematically it can be represented as:
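
f(x) = x for x > 0, and f(x) = a·x for x ≤ 0, where the slope a is learned during training.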

Advantages

  • Better at solving the problem of dead neurons than ReLU and Leaky ReLU

Disadvantages

  • The extra parameter ‘a’ has to be learned, and its value may vary between tasks and affect how well the model trains

Exponential Linear Units (ELUs)

ELU is similar to ReLU, except that the negative part of the function is replaced with a smooth exponential curve. It tends to converge the cost (loss) towards its minimum faster than ReLU.

Mathematically it can be represented as:
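
f(x) = x for x > 0, and f(x) = α·(e^x − 1) for x ≤ 0, where α is a small positive constant (often set to 1).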

Advantages

  • Converges faster than ReLU, Leaky ReLU and Parameterized ReLU
  • Has a smoother negative curve

Swish

The swish activation function is built from the sigmoid (it multiplies the input by its sigmoid) and is computationally similar to it. Its curve is smooth and, for negative inputs, dips slightly below zero instead of being clipped the way ReLU clips it. It tends to work better than ReLU on deep neural networks.

Mathematically it can be represented as:
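
f(x) = x · sigmoid(x) = x / (1 + e^(-x)).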

Softmax

The softmax function outputs a vector of values that sum to 1 and can be interpreted as probabilities of class membership. It is an extension of the sigmoid function to multi-class classification. Softmax is commonly used as the activation function of the last layer. For example, we can use softmax in the last layer of a model that classifies the manufacturer of a car, since a car can only belong to one specific manufacturer.

Mathematically it can be represented as:
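
softmax(x_i) = e^(x_i) / Σ_j e^(x_j), computed over all the values x_1 … x_n produced by the output layer.

A minimal NumPy sketch (the variable names and example logits are just illustrative) showing that the outputs sum to 1:

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; this does not change the result.
    exps = np.exp(x - np.max(x))
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw scores from the last layer
probs = softmax(logits)
print(probs)         # roughly [0.659 0.242 0.099]
print(probs.sum())   # 1.0
```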

Which activation function should I use for my neural network?

There is no general formula for choosing an activation function. There are many considerations, but the usual rule of thumb is to start with the most commonly recommended one and move on to others if it doesn’t give you the desired results (a minimal code sketch follows the list below).

  • Softmax is used in the output layer for multi-class classification
  • ReLU is the default choice for hidden layers and usually works well
  • ReLU is typically used in the hidden layers of CNNs, while Tanh and Sigmoid are used in the hidden layers of RNNs
  • Sigmoid functions and their combinations generally work well in the output layer for binary and multi-label classification problems
  • Sigmoid and Tanh are avoided in hidden layers because of the vanishing gradient problem
  • Swish is used when neural networks have more than about 40 layers
  • ReLU can suffer from the dead neuron problem; Leaky ReLU or Parameterized ReLU can be used instead
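
As a concrete illustration of these rules of thumb, here is a minimal Keras sketch (the layer sizes, input shape and number of classes are arbitrary placeholders, not taken from this article): ReLU in the hidden layers and softmax in the output layer of a multi-class classifier.

```python
import tensorflow as tf

# Hypothetical 3-class classifier on 20 input features:
# ReLU in the hidden layers, softmax in the output layer.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax'),
])

# For a binary or multi-label problem, the last layer would instead use
# a sigmoid activation, e.g. Dense(num_labels, activation='sigmoid').
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```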

Some links in case you want to explore more about the topic:

  1. https://www.nbshare.io/notebook/751082217/Activation-Functions-In-Python/
  2. https://www.v7labs.com/blog/neural-networks-activation-functions
  3. https://machinelearningmastery.com/choose-an-activation-function-for-deep-learning/
  4. https://www.analyticsvidhya.com/blog/2020/01/fundamentals-deep-learning-activation-functions-when-to-use-them/
  5. https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#elu

Note: this article was first published on our site, ml-concepts.com.
