What is the full form of CNN

Convolutional Neural Networks - structure, function and areas of application

A Convolutional Neural Network (“CNN” for short) is a deep learning architecture that was originally developed for processing images. It has since turned out that convolutional neural networks also work extremely well in many other areas, e.g. in text processing.

Structure of convolutional neural networks (CNN)

A convolutional neural network (also called "ConvNet") can process input in the form of a matrix. This makes it possible to use images, represented as a matrix (width x height x color channels), directly as input. A normal neural network, e.g. a multi-layer perceptron (MLP), requires a vector as input instead: to use an image as input, its pixels have to be unrolled one after the other into one long chain (flattening). As a result, normal neural networks struggle, for example, to recognize an object regardless of its position in the image, because the same object at a different position produces a completely different input vector.
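The effect of flattening can be illustrated with a minimal sketch. The 4 x 4 images and the helper name `flatten` are made up for this illustration; the point is only that the same 2 x 2 object at two positions yields input vectors that share no active position.

```python
def flatten(image):
    """Unroll a 2D image (list of rows) into one long vector, row by row."""
    return [pixel for row in image for pixel in row]


# A 2 x 2 "object" (the 1s) in the top-left corner of a 4 x 4 image ...
top_left = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# ... and the same object in the bottom-right corner.
bottom_right = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

vec_a = flatten(top_left)
vec_b = flatten(bottom_right)

# Count the vector positions at which both inputs are active.
overlap = sum(a * b for a, b in zip(vec_a, vec_b))

print(vec_a)
print(vec_b)
print("shared active positions:", overlap)
```

A network that only sees these vectors has to learn the object separately for every position.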

A CNN essentially consists of filters (convolutional layers) and aggregation layers (pooling layers), which alternate repeatedly, followed at the end by one or more layers of "normal" fully connected neurons (dense / fully connected layers).

Filter - The Convolutional Layer

The matrix input is first analyzed by a fixed number of so-called filters. Each filter has a fixed pixel size (kernel size), e.g. 2 x 2 or 3 x 3, and scans the pixel matrix of the input like a window with a constant step size (stride). The filters move from left to right across the input matrix and jump to the next lower row after each pass. The so-called padding defines how the filter should behave when it reaches the edge of the matrix.

The filter has a fixed weight for each point in its viewing window. At every position it multiplies the pixel values in the current viewing window by these weights and sums the products into one entry of a result matrix. The size of this result matrix depends on the size of the filter (kernel size), the padding and, above all, the step size.
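The sliding-window computation can be sketched in plain Python. This is a minimal illustration without padding, not an optimized implementation; the function name `conv2d`, the example image and the kernel weights are assumptions chosen for readability. As in most deep learning libraries, the kernel is applied without flipping it.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the result matrix."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    result = []
    for i in range(out_h):          # top to bottom
        row = []
        for j in range(out_w):      # left to right
            total = 0
            # Weighted sum of the pixels inside the current window.
            for di in range(k_h):
                for dj in range(k_w):
                    pixel = image[i * stride + di][j * stride + dj]
                    total += pixel * kernel[di][dj]
            row.append(total)
        result.append(row)
    return result


image = [
    [1, 2, 0, 1],
    [0, 1, 3, 2],
    [2, 0, 1, 1],
    [1, 3, 0, 2],
]
kernel = [
    [1, 0],
    [0, 1],
]

step_1 = conv2d(image, kernel, stride=1)
step_2 = conv2d(image, kernel, stride=2)

print(step_1)  # [[2, 5, 2], [0, 2, 4], [5, 0, 3]]
print(step_2)  # [[2, 2], [5, 3]]
```

With a step size of 1 the 4 x 4 input yields a 3 x 3 result; with a step size of 2 the windows no longer overlap and the result shrinks to 2 x 2.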

A step size of 2 with a kernel size of 2 x 2, for example, halves the width and height of the result matrix per filter compared to the input matrix. Each pixel is no longer connected to the filter individually; instead, 4 pixels are connected to the filter at the same time (local connectivity). The input has thus been "convolved" (convolution).
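The relationship between input size, kernel size, padding and step size can be written as a small helper. The formula is the standard one for one spatial dimension; the function name `conv_output_size` and the 28-pixel input are hypothetical values for illustration.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Width (or height) of the result matrix along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# 2 x 2 kernel with step size 2: each dimension is halved.
halved = conv_output_size(28, kernel_size=2, stride=2)

# 3 x 3 kernel with step size 1 and no padding: one pixel is lost per border.
shrunk = conv_output_size(28, kernel_size=3, stride=1)

# 3 x 3 kernel with one pixel of padding: the size is preserved.
preserved = conv_output_size(28, kernel_size=3, stride=1, padding=1)

print(halved, shrunk, preserved)  # 14 26 28
```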

In the first level of a convolutional neural network, a convolutional layer with 16 or 32 filters is typically used; each filter produces its own result matrix as convolved output. This first layer is usually followed by a second convolutional layer with the same structure, which uses the matrices produced by the first layer as input. This is followed by a pooling layer.

The pooling layer

A pooling layer aggregates the results of convolutional layers by passing on only the strongest signal. In a max pooling layer, for example, only the highest value within each kernel window is kept and all others are discarded. The four values covered by a 2 x 2 kernel are thus reduced to a single number (the highest of the four). Pooling is used to pass on only the most relevant signals to the next layers, to achieve a more abstract representation of the content and to reduce the number of parameters in the network.
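Max pooling with non-overlapping windows can be sketched as follows. The function name `max_pool2d` and the example values are assumptions made for this illustration.

```python
def max_pool2d(matrix, size=2):
    """Keep only the highest value of each non-overlapping size x size window."""
    pooled = []
    for i in range(0, len(matrix) - size + 1, size):
        row = []
        for j in range(0, len(matrix[0]) - size + 1, size):
            window = [
                matrix[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 2],
    [2, 0, 3, 4],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 2], [2, 5]]
```

Each group of four values is reduced to one, so the 4 x 4 matrix becomes a 2 x 2 matrix.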

Many CNNs consist of a sequence of two convolutional layers with the same number of filters, followed by a pooling layer, which in turn is followed by two more convolutional layers and another pooling layer. While the size of the input is steadily reduced by the convolutions and pooling, the number of filters for recognizing higher-level signals increases. The last pooling layer is followed by one or more fully connected layers.
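How the matrix dimensions develop in such a stack can be traced with a short sketch. All concrete numbers are assumptions for illustration: a 32 x 32 RGB input, 3 x 3 kernels with padding that preserves width and height, 2 x 2 max pooling, and 32 filters in the first block followed by 64 in the second.

```python
def trace_shapes(height, width, channels, blocks):
    """Return (layer name, (height, width, channels)) after each layer.

    Each entry in `blocks` is the filter count of one block made of
    two size-preserving convolutions followed by 2 x 2 max pooling.
    """
    shapes = [("input", (height, width, channels))]
    for filters in blocks:
        for _ in range(2):
            # Size-preserving convolution: only the channel count changes.
            channels = filters
            shapes.append((f"conv 3x3, {filters} filters", (height, width, channels)))
        # 2 x 2 pooling halves width and height, channels stay the same.
        height, width = height // 2, width // 2
        shapes.append(("max pool 2x2", (height, width, channels)))
    return shapes


shapes = trace_shapes(32, 32, 3, blocks=[32, 64])
for name, shape in shapes:
    print(f"{name:24} {shape}")

final_h, final_w, final_c = shapes[-1][1]
flattened = final_h * final_w * final_c
print("inputs to the first fully connected layer:", flattened)
```

Under these assumptions the spatial size shrinks from 32 x 32 to 8 x 8 while the number of channels grows from 3 to 64.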

The fully connected / dense layer

The fully connected layer or dense layer is a normal neural network structure in which every neuron is connected to all inputs and all outputs. To feed the matrix output of the convolutional and pooling layers into a dense layer, it must first be unrolled (flattened). The output signals of the filter layers are largely independent of the position of an object, so they no longer carry position features but location-independent object information.

This object information is fed into one or more fully connected layers and connected to an output layer which, for example, has exactly the number of neurons that corresponds to the number of different classes to be recognized.
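Flattening followed by a fully connected output layer can be sketched in a few lines. The feature maps, weights and biases below are made-up numbers rather than trained values, and the names `flatten` and `dense` are chosen for this illustration.

```python
def flatten(feature_maps):
    """Unroll a list of 2D feature maps into one vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def dense(inputs, weights, biases):
    """One fully connected layer: every output neuron sees every input."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        for neuron_weights, bias in zip(weights, biases)
    ]


# Two 2 x 2 feature maps coming out of the last pooling layer.
feature_maps = [
    [[1, 0], [2, 1]],
    [[0, 3], [1, 0]],
]
inputs = flatten(feature_maps)  # 8 values

# Output layer with 3 neurons, one per class to be recognized.
weights = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [0.5] * 8,
]
biases = [0, -1, 0.5]

scores = dense(inputs, weights, biases)
print(inputs)  # [1, 0, 2, 1, 0, 3, 1, 0]
print(scores)  # [1, 4, 4.5]
```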

Activation functions and optimization

In a convolutional neural network, the results of each layer are usually passed through a ReLU activation function. The ReLU function sets all values less than zero to zero and passes all values greater than zero through unchanged. For classification problems, the last layer receives a softmax activation, i.e. the outputs of all output neurons sum to 1 and each indicates the probability of the corresponding class.
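Both activation functions are short enough to write out. This is a minimal sketch with arbitrary example values; subtracting the highest score before exponentiating is a common numerical-stability step and does not change the result.

```python
import math


def relu(values):
    """Negative values become zero, positive values pass through unchanged."""
    return [max(0.0, v) for v in values]


def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


activated = relu([-2.0, 0.0, 3.5])
probabilities = softmax([2.0, 1.0, 0.1])

print(activated)  # [0.0, 0.0, 3.5]
print(probabilities)
print(sum(probabilities))
```

The largest score receives the largest probability, and the ordering of the scores is preserved.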

The weights of the filters and the fully connected layers are initialized randomly and then optimized step by step during training using backpropagation. For classification problems (e.g. which object can be seen in the picture), categorical cross-entropy is used as the loss function (the negative natural log of the probability calculated for the correct category).
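For a single training example, this loss reduces to one line. The probabilities below are arbitrary example values and the function name `categorical_cross_entropy` is chosen for this sketch.

```python
import math


def categorical_cross_entropy(probabilities, true_class):
    """Negative natural log of the probability assigned to the correct class."""
    return -math.log(probabilities[true_class])


# Hypothetical softmax output for three classes.
probabilities = [0.7, 0.2, 0.1]

loss_if_class_0 = categorical_cross_entropy(probabilities, true_class=0)
loss_if_class_2 = categorical_cross_entropy(probabilities, true_class=2)

print(round(loss_if_class_0, 3))  # 0.357
print(round(loss_if_class_2, 3))  # 2.303
```

The loss is small when the network assigns a high probability to the correct class and grows quickly when that probability is low, which is the signal backpropagation uses to adjust the weights.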

This is how convolutional neural networks work

With its filters, a convolutional neural network recognizes structures in the input data regardless of their location. On the first level, the filters are activated by simple structures such as lines, edges and spots of color. The type of filter is not specified in advance but learned by the network. On the next level, structures are learned that consist of combinations of these basic structures, e.g. curves, simple shapes, etc.
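The location independence of a filter can be illustrated with a hand-written edge detector. Note the assumption: in a real CNN the weights are learned, whereas the `vertical_edge` kernel here is fixed by hand purely for illustration, and the helper names are made up for this sketch.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` with step size 1 and no padding."""
    k_h, k_w = len(kernel), len(kernel[0])
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(k_h)
                for dj in range(k_w)
            )
            for j in range(len(image[0]) - k_w + 1)
        ]
        for i in range(len(image) - k_h + 1)
    ]


def image_with_edge(edge_column, size=6):
    """Dark (0) on the left, bright (1) from `edge_column` onwards."""
    return [[1 if col >= edge_column else 0 for col in range(size)] for _ in range(size)]


def peak(response):
    """Strongest activation in the first row and the column where it occurs."""
    first_row = response[0]
    strongest = max(first_row)
    return strongest, first_row.index(strongest)


# Responds where a dark pixel is directly left of a bright pixel.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

peak_left = peak(conv2d(image_with_edge(2), vertical_edge))
peak_right = peak(conv2d(image_with_edge(4), vertical_edge))

print(peak_left)   # (2, 1)
print(peak_right)  # (2, 3)
```

The same filter fires with the same strength in both images; only the position of the activation in the result matrix moves along with the edge.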

The abstraction level of the network increases with each filter level. Which abstractions ultimately activate the final layers depends on the characteristic features of the classes that are to be recognized. It is very interesting to visualize the patterns that activate the filters on the different levels.

How exactly a ConvNet achieves these astonishing results has not yet been fully explained in mathematical theory. It is clear, however, that convolutional neural networks are currently achieving the best results in the field of image processing.

Areas of application of convolutional networks

When it comes to recognizing objects in images, the performance of convolutional neural networks on benchmarks such as the LSVRC competition is already better than that of humans.

Google’s DeepMind also used CNNs in the famous AlphaGo system to evaluate the current position on the Go board. This points to a further strength of convolutional neural networks, which is to condense information available as a matrix into a meaningful vector.

This property can also be used, for example, in the field of text processing. A few months ago, the well-known open source NLU library spaCy switched from a Word2Vec-based solution for encoding words and texts to a CNN. As a result, the performance of spaCy improved significantly in all areas.

The current areas of application of convolutional neural networks range from autoencoders to object recognition in images and videos to the synthetic generation of images and texts from vectors. In many other areas, research is under way to determine whether CNNs can also achieve better results there than normal neural networks. Because matrix inputs can be used, there are hardly any limits to creativity.


Graphic "Typical CNN": By Aphex34 CC BY-SA 4.0

Graphic "Visualization of Features in a Fully Trained Model": Screenshot from Visualizing and Understanding Convolutional Networks

Graphic "Object Detection, LSVRC Competition": Screenshot from the AI Index Report 2017

By Roland Becker