
How artificial neural networks work

Artificial neural networks are a special area of machine learning that even has its own buzzword: deep learning.
But how does an artificial neural network actually work? And how is it implemented in Python? This is article 2 of 6 in the article series Getting Started with Deep Learning.

First of all, we limit ourselves here to artificial neural networks for supervised machine learning. For this it is important to understand the principle of training and testing supervised methods. Artificial neural networks can also be used for unsupervised dimension reduction and clustering; the best-known method is the AE-Net (Auto Encoder Network), which is not considered here.

Let's start with simple artificial neural networks, all of which are based on the perceptron as their core idea. Natural neural networks, like those found in the human brain, are the model for artificial ones.


The perceptron is a "classic" among artificial neural networks. When someone speaks of a neural network, what is usually meant is a perceptron or a variation of it. Perceptrons are multi-layer networks without feedback, with fixed input and output layers. There is no absolutely uniform definition of a perceptron, but as a rule it is a pure feed-forward network with an input layer (also called scanning layer or retina) with statically or dynamically weighted connections to the output layer, which (as a single-layer perceptron) consists of a single neuron. A neuron is made up of two mathematical functions: a calculation of the net input and an activation function that decides whether the calculated net input "fires" or not. Its output is therefore binary. You can also think of it as a small lamp: depending on the input values and the weights, a net input (sum) forms, and a step function decides whether the lamp lights up at the end or not. This concept of output generation is called forward propagation.
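The lamp analogy above can be sketched in a few lines of Python. This is a minimal sketch of forward propagation for a single neuron; the weights and inputs in the example call are arbitrary placeholders, not values from the article:

```python
# Forward propagation of a single perceptron neuron: net input, then step
# function. The weights here are arbitrary placeholders for illustration.

def net_input(x, w):
    """Weighted sum of the inputs; w[0] is the bias weight (bias input x_0 = 1)."""
    return w[0] + sum(x_i * w_i for x_i, w_i in zip(x, w[1:]))

def step(z, threshold=0.0):
    """Step (Heaviside) activation: does the neuron 'fire' or not?"""
    return 1 if z > threshold else -1

def predict(x, w):
    """One full forward pass: net input, then binary activation."""
    return step(net_input(x, w))

print(predict([1.0, 2.0], [-0.5, 0.3, 0.2]))  # -0.5 + 0.3 + 0.4 = 0.2 > 0 -> 1
```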

Single-layer perceptron

Even if “network” may seem a bit exaggerated for a single perceptron with its one neuron, it is the basis for many larger and multi-layered networks.

Let us now consider the mathematics of forward propagation.

We have a set of input values x_1 to x_n. For the bias input, x_0 = 1 always applies; the bias input is only a placeholder for the important bias weight w_0.


A weight variable w_i is required for each input value x_i:


Each product of input value and weight adds up to the net input z = w_0·x_0 + w_1·x_1 + … + w_n·x_n. This is a linear mathematical function which, in two dimensions, can easily be recognized as y = m·x + b, with the bias weight w_0 as the y-axis intercept (since x_0 = 1).


The linear function only becomes a binary class separation through the step function as a so-called activation function (see: Machine Learning - Regression vs Classification): if the net input z exceeds a threshold θ to be defined, the step function φ(z) outputs a different value than if this threshold is not exceeded.


The definition of this activation function is the core of the classification, and many extended artificial neural networks essentially differ from the perceptron in that the activation function is more complex than a pure step function, for example a sigmoid function (based on the logistic function) or the hyperbolic tangent (tanh) function. More about this in the next article in this series, so for now let's stick with the simple step function.
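The three activation functions mentioned can be compared side by side. A minimal sketch; the output conventions (1/-1 for the step function) follow the perceptron described above:

```python
import math

def step(z):
    """Step function of the perceptron: binary output only (threshold 0)."""
    return 1 if z > 0 else -1

def sigmoid(z):
    """Sigmoid (logistic) function: continuous output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent: continuous output in (-1, 1)."""
    return math.tanh(z)

print(step(0.7), sigmoid(0.0), tanh(0.0))  # 1 0.5 0.0
```

Unlike the step function, sigmoid and tanh are differentiable, which becomes important for training later.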

Artificial neural networks are basically nothing more than multi-dimensional mathematical functions that can capture enormous complexity by connecting neurons next to each other (neurons in one layer) and one behind the other (several layers). The weights are the adjustment screws that shape the form of the mathematical function, consisting of straight lines and curves, to describe a point cloud (regression) or to identify class boundaries (classification).

Another view of artificial neural networks is that of a filter: an artificial neural network accepts all input values (e.g. all pixels of an image), and the weights (the shape of the filter) are set in such a way that the filter always leads to the correct class (in the context of image classification: the object class).

Let's come back briefly to the calculation of the net input z. Since the notation z = w_0·x_0 + w_1·x_1 + … + w_n·x_n is quite tedious, those familiar with linear algebra prefer to write z = w^T x.


The superscript T stands for transpose. Transposing means that columns become rows, or vice versa.

For example, we can fill the two vectors x (the input values, with x_0 = 1 as bias input) and w (the weights) with exemplary content. The net input z can then be calculated, because transposing changes the weight vector w from a column vector into a row vector. In this way, represented mathematically correctly, each element of one vector is multiplied by the corresponding element of the other vector, and the resulting values are summed up.
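A minimal sketch of this computation with NumPy. The example values for x and w are assumptions for illustration, not the vectors from the original article:

```python
import numpy as np

x = np.array([1, 5, 0])  # input values (x_0 = 1 is the bias input)
w = np.array([2, 3, 4])  # weights (w_0 is the bias weight)

# z = w^T x: each element of w is multiplied by the corresponding element
# of x and the products are summed up. (For 1-D NumPy arrays the transpose
# is a no-op; dot() already performs the row-times-column multiplication.)
z = w.T.dot(x)
print(z)  # 1*2 + 5*3 + 0*4 = 17
```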


Back to the actual task of the artificial neural network: classification! (We will ignore regression, clustering and dimension reduction as tasks in this article. 🙂)

The perceptron is supposed to separate two classes. For this, all inputs must be weighted correctly so that the step function is activated by the resulting net input z for records of the one class, but not for records of the other class.

Since we are dealing with a linear function, convergence (= accuracy of fit of the model to reality) of a single-layer perceptron is only possible for linearly separable problems!

Training the perceptron network

The task now is to find the right weights - and not just any right ones, but exactly the optimal ones. The question that arises for any artificial neural network is that of the correct weightings. Training a perceptron is comparatively easy, precisely because it is binary. Because binary also means that if a wrong answer was given, the other possible result must be correct.

Training a perceptron works as follows:

  1. Set all weights to the value 0.00
  2. For every record of the training data:
    1. Calculate the output value ŷ
    2. Compare the output value ŷ with the actual result y
    3. Update the weights against the error: w_i := w_i + Δw_i

The weight adjustment happens against the error (or towards the other possible answer): Δw_i = η·(y − ŷ)·x_i

Note for the experts: we simply ignore the learning rate (step size) η here. Please just assume η = 1.

(y − ŷ) is the difference between the actual result (class) y and the prediction ŷ. All weights are updated simultaneously with each error. Once all weights have been updated, the next run follows (another comparison between y and ŷ), not forgetting, of course, the dependence on the input values x.
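The training procedure above can be sketched as follows. The toy data and the learning rate are assumptions for illustration, not taken from the article:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, epochs=10):
    """Perceptron learning rule: w_i := w_i + eta * (y - y_hat) * x_i.
    All weights start at 0.00; w[0] is the bias weight."""
    w = np.zeros(X.shape[1] + 1)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if (w[0] + np.dot(w[1:], x_i)) > 0 else -1
            error = y_i - y_hat            # 0 for correct predictions
            w[1:] += eta * error * x_i     # push the weights against the error
            w[0] += eta * error            # bias weight (x_0 = 1)
    return w

# Linearly separable toy data (assumed example):
X = np.array([[2.0, 1.0], [3.0, 4.0], [-1.0, -3.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
preds = [1 if (w[0] + np.dot(w[1:], x)) > 0 else -1 for x in X]
print(preds)  # [1, 1, -1, -1]
```

Note that correct predictions yield error = 0, so only misclassifications move the weights, exactly as described above.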

Training a perceptron

Training in supervised learning is always based on the idea of considering the output error (the difference between the prediction and the actually correct result) and adjusting the classification logic at the right adjustment screws (in neural networks these are the weights) against the error.

Correct classifications can be true positives and true negatives; neither should lead to any weight adjustment:

True-Positive -> Classification: 1 | correct class: 1

True-Negative -> Classification: -1 | correct class: -1

Incorrect classifications create an error that should lead to a weight adjustment contrary to the error:

False-Positive -> Classification: 1 | correct class: -1

False-Negative -> Classification: -1 | correct class: 1

Imaginary training example of a single-layer perceptron (SLP)

Let's assume an input value x_i and that the SLP mistakenly predicts the class ŷ = +1 even though y = −1 would be the correct class. (And we leave the learning rate unchanged at η = 1.)

Then the following happens:

The weight decreases accordingly (Δw_i = 1·(−1 − (+1))·x_i = −2·x_i for a positive input value x_i), and thus the probability increases that in the next iteration the net input falls below the threshold and the correct class is hit.

The weight update Δw_i is proportional to x_i. If, for example, a new, larger input value x_i were to lead to an erroneous classification again, the decision boundary would be moved even further in the same direction in the next run, towards the correct prediction of the class.

You can find out more about training artificial neural networks in the next article in this series of articles.

Single-Layer Perceptrons (SLP) - Example with the Boolean separation

Let's leave the training of the perceptron aside, simply assume that the ideal weights have already been found, and now look at what a perceptron can (and cannot) do. After all, don't forget: it is supposed to differentiate between classes, i.e. find the necessary decision boundaries.

Boolean operators differentiate cases according to Boolean values. They are a popular "Hello World" for getting familiar with the linear decision logic of a perceptron. There are three basic Boolean comparison operators: AND, OR, and XOR. Their truth tables are:

x1 | x2 | AND | OR | XOR
 0 |  0 |  0  |  0 |  0
 0 |  1 |  0  |  1 |  1
 1 |  0 |  0  |  1 |  1
 1 |  1 |  1  |  1 |  0
A perceptron to solve this problem would need two dimensions (plus bias): x_1 and x_2. And it would have to have weights that ensure the prediction works according to the AND, OR or XOR logic.

It is important that we also define φ as a step function. For example, it might jump to the value 1 if z > 0, but otherwise remain 0.

The network and the weights (w setup) could look like this for the AND and OR logic:

The weightings work with the SLP without any problems, because we are dealing with linearly separable problems:

Fancy a little test? So let's start with the AND logic:
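A minimal sketch of this test, assuming the example weights w0 = -1.5 and w1 = w2 = 1 for the AND logic (only when both inputs are active does the net input exceed the bias):

```python
def slp(x1, x2, w0, w1, w2):
    """Single-layer perceptron with step activation: 1 if z > 0, else 0."""
    z = w0 + w1 * x1 + w2 * x2
    return 1 if z > 0 else 0

# AND logic with assumed example weights:
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", slp(x1, x2, w0=-1.5, w1=1.0, w2=1.0))
# 0 0 -> 0 / 0 1 -> 0 / 1 0 -> 0 / 1 1 -> 1
```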

Seems to work!

And now the same test with the OR logic:
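Again a minimal sketch, now assuming the example weights w0 = -0.5 and w1 = w2 = 1 (a single active input is already enough to exceed the bias):

```python
def slp(x1, x2, w0, w1, w2):
    """Single-layer perceptron with step activation: 1 if z > 0, else 0."""
    z = w0 + w1 * x1 + w2 * x2
    return 1 if z > 0 else 0

# OR logic with assumed example weights:
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", slp(x1, x2, w0=-0.5, w1=1.0, w2=1.0))
# 0 0 -> 0 / 0 1 -> 1 / 1 0 -> 1 / 1 1 -> 1
```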

Excellent! However, the question now arises as to how the XOR problem can be solved, for XOR needs both the decision boundary of the AND and that of the OR operator and is therefore not linearly separable.

Multi-Layer-Perceptron (MLP) or (Deep) Feed Forward (FF) Net

Because an XOR can also be described mathematically correctly as a combination of the other operators:

XOR(x_1, x_2) = OR(x_1, x_2) AND NOT(AND(x_1, x_2))
Let's try it out!

It works!

Multiple classification with the perceptron

A perceptron network classifies binary; the output is limited to two values, e.g. 1 or -1 (or 0 or 1).

In practice, however, a One-vs-All (OvA) or One-vs-Rest (OvR) classification is often implemented. In this case, the 1 stands for the recognition of one specific class, while all other classes are considered negative.

In order to be able to recognize each of n classes, n classifiers (= n perceptron networks) are required. Each perceptron network is trained to recognize one particular class.
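A minimal sketch of OvA prediction with hypothetical, already-trained weight vectors (one row per class; the class whose perceptron produces the largest net input wins):

```python
import numpy as np

def predict_ova(x, W):
    """One-vs-All: one perceptron per class; pick the class with the
    largest net input. W[:, 0] holds the bias weights."""
    scores = W[:, 0] + W[:, 1:].dot(x)
    return int(np.argmax(scores))

# Three hypothetical classifiers for a 2-dimensional input (assumed weights):
W = np.array([[0.0,  1.0,  0.0],   # class 0: responds to large x1
              [0.0,  0.0,  1.0],   # class 1: responds to large x2
              [0.0, -1.0, -1.0]])  # class 2: responds when both are small
print(predict_ova(np.array([2.0, 0.5]), W))  # 0
```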

Adaline - Or: the limitation of the perceptron

The perceptron is only activated via a step function, which enormously limits the fine-tuning of the training. Activation via continuous functions, which are then differentiable, is better: it yields a convex error function with a clear minimum. The Adaline algorithm (ADAptive LInear NEuron) extends the perceptron by precisely this idea. The essential advance of the Adaline rule compared to that of the perceptron is that the updating of the weights is not based on a simple step function, as with the perceptron, but on a linear, continuous activation function.

Single-layer Adaline

How an artificial neural network can be trained with the Adaline category is explained in the next article in this series of articles.

Advanced network concepts (CNN and RNN)

Anyone who has already entered deep learning with frameworks such as TensorFlow may have already encountered advanced concepts of artificial neural networks. CNNs (Convolutional Neural Networks) are currently the method of choice for processing high-dimensional tasks such as image recognition (computer vision) and text recognition (NLP). The CNN significantly expands the possibilities of neural networks by prepending a network part for dimension reduction, but the core of the concept is still the MLP. When used in image recognition, CNNs work, put simply, in such a way that the upstream network part reads the millions of image pixels sector by sector (convolution: reading out overlapping sectors), condenses them (pooling, for example using non-linear functions such as max()), and then classifies the result in an MLP as described above.


Another extended form are RNNs (Recurrent Neural Networks), which are also based on the idea of the MLP but turn this concept on its head thanks to backward connections (neurons send to previous layers) and self-connections (neurons send to themselves).


However, for a deeper understanding of CNNs and RNNs, it is essential that the concept of the MLP is understood beforehand. It is the simplest form of the most widely used and very powerful network topologies that are still used today.

In 2016, Fjodor van Veen from asimovinstitute.org thankfully created a compilation of network topologies that I still look at today:

Artificial Neural Networks - Topology Overview by Fjodor van Veen

Book recommendations

I use the following books for my self-study of machine learning and deep learning; some of them also served as templates for this article:



Benjamin Aunkofer

Benjamin Aunkofer is lead data scientist at DATANOMIQ and university lecturer for data science and data strategy. In addition, he works as Interim Head of Business Intelligence and gives seminars / workshops on BI, data science and machine learning for companies.

Tags:Artificial Intelligence, artificial neural network, data science, deep learning, linear algebra, math