The intuition of convolutional neural networks

Moumita Jana
Jun 20, 2021

With the development of machine learning, and especially deep learning, people are interested in including technology more and more in their daily lives. Image recognition is one of the most important tasks toward this goal, and it is done by a convolutional neural network (CNN). CNNs are not limited to image recognition; they cover many other computer vision-related tasks as well.

A convolutional neural network is nothing but a neural network that uses the convolution operation instead of the general matrix multiplication we use in an ordinary neural network. So, to understand a CNN, we first have to understand what the convolution operation is and how it deals with an image.

Before answering these questions, let me introduce the main ingredients of a convolutional neural network:

· Convolution layer

· Pooling layer

· Flattening layer

· Fully connected layer

Figure 1: All layers of a very simplified convolutional neural network

Convolution layer:

The convolution layer is where the convolution operation is performed.

Convolution operation:

Convolution is an integral transform that combines a function with a specific operator. It re-expresses the original function as a measure of the similarity of two functions within a certain window. In CNNs it is mainly used for feature extraction from images without destroying the spatial associations between pixels.

So how does the convolution operation work on an image?

An image can be considered as a matrix of pixel values. Let's consider a 5x5 image whose pixel values lie between 0 and 1 (a grayscale image's values can lie between 0 and 255, but here we take pixel values between 0 and 1 only), and another 3x3 matrix called the feature extractor or filter (a filter is nothing but another matrix with some numerical values, which are learned by the machine during training, specifically in backpropagation). The convolution operation between the 5x5 matrix and the 3x3 matrix is shown in the animated figure below.

Figure 2: Convolution operation between a 5x5 input image and a 3x3 filter

Let's understand how this convolution is done. The orange-colored filter is placed over an area of the green-colored input image at the upper-left corner. Element-wise multiplication is calculated between the pixel values of the input image within this area and the filter's values, and the products are summed; this gives one element of the output matrix (pink). Next, we repeat the same process with the orange filter at the upper-middle position, i.e. we shift the filter by one pixel (taking the stride value as 1). In this way the orange filter covers the whole first row; after traveling the whole row it moves to the next row and repeats the same process, so the filter traverses the whole image (left to right, top to bottom). The output matrix is called a feature map.
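
To make this concrete, here is a minimal NumPy sketch of the sliding-window ("valid") convolution described above; the 5x5 image and 3x3 filter values are made up purely for illustration.

```python
import numpy as np

# Made-up 5x5 image (pixel values between 0 and 1) and 3x3 filter.
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
], dtype=float)

kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
], dtype=float)

def conv2d(img, k, stride=1):
    """'Valid' convolution (really cross-correlation, as used in CNNs)."""
    kh, kw = k.shape
    out_h = (img.shape[0] - kh) // stride + 1
    out_w = (img.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # element-wise multiply the image patch with the filter, then sum
            patch = img[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * k)
    return out

feature_map = conv2d(image, kernel)
print(feature_map.shape)  # (3, 3): a 5x5 image, 3x3 filter and stride 1 give a 3x3 feature map
```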

Different filter values produce different feature maps for the same input image. Each filter represents some pattern (like an edge, a vertical line, a horizontal line, etc.) of the input image, and by traveling over the whole image, the filter tries to find out where this pattern is present in the input image.

In this figure, the filter with the red boundary traverses the whole image, tries to find out where a left-slanted line is present, and produces a feature map; another filter with the green boundary also traverses the whole image, tries to find out where a right-slanted line is present, and produces another feature map.

To extract features from the image, only one or two filters are not sufficient; we have to use several filters. After the convolution operation, the feature maps are stacked channel-wise to produce the output.

Figure 3: Channel-wise stacking of the feature maps

In the convolution operation, instead of every input pixel being connected to every neuron of the hidden layer (as in a dense neural network), only a small localized region of the input image (called the local receptive field) feeds each point of the feature map. So, a CNN has sparse connections. This is made possible by making the filter smaller than the input image.

Let's consider a 10x10 image. If I connect this image to a dense neural network, then every pixel value would be connected to every neuron of the next hidden layer. But in a CNN, only a region of the input image (the region covered by the filter/kernel during the convolution operation at that place) contributes to one point in the feature map, so many of the connections become sparse. This sparse connectivity reduces the number of parameters, so the memory requirement of the model is also reduced, and it improves the model's statistical efficiency.

Figure 4: The local region (gray shaded region) of one hidden neuron (green pixel) in the feature map

Let's understand this sparse connection with a visualization:

Figure 5 (upper)
Figure 6 (lower)

In these figures, the highlighted s represents one output, and the highlighted x's represent the inputs that affect this output; these units form the receptive field of this output. In the upper figure, s is formed by convolution with a filter of width 3, i.e. only 3 inputs affect this output point. In the lower figure, every input is connected to the output, i.e. there is no sparse connection.

Parameter sharing is another important property of the convolution operation. We saw in the previous section that the filter traverses the whole image to find out whether some feature is present, and during this traversal the filter's values remain the same. That means the filter's parameters (numerical values) are shared among all the receptive regions. Due to this property, the number of parameters is reduced a lot compared to a dense neural network, because in a dense neural network each input is connected to each output with its own parameter, so no sharing occurs.

Let's understand what I mean:

If I have a 10x10 image,

For a CNN: if I choose a 3x3 filter, then the number of parameters would be only 9 (3 × 3).

For a DNN: if I choose 3 neurons in the next layer, then the number of parameters would be 300 (10 × 10 × 3 = 300).

So, from this explanation, we get a sense of how effective parameter sharing is at reducing the number of parameters.
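
As a quick sanity check, here is a small sketch that reproduces these counts with Keras (assuming TensorFlow/Keras is available; biases are disabled so the numbers match the hand calculation above).

```python
import tensorflow as tf

# CNN: one 3x3 filter shared across the whole 10x10 image -> 9 parameters.
cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10, 10, 1)),
    tf.keras.layers.Conv2D(filters=1, kernel_size=(3, 3), use_bias=False),
])

# DNN: every one of the 100 pixels connected to each of 3 neurons -> 300 parameters.
dnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10, 10, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(3, use_bias=False),
])

print(cnn.count_params())  # 9
print(dnn.count_params())  # 300
```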

Activation function:

We use an activation function just after the convolution operation, and it belongs to the convolution layer. So, what is an activation function, and why do we use it?

The activation function is a function that decides whether a neuron of a hidden layer will be activated or not. We use the activation function to introduce non-linear relations into our model.

Now you might be wondering why we need this non-linearity for our images.

Because images themselves are very non-linear in nature: there might be multiple objects present in an image, and the boundaries between those objects might not be linear. When we apply the convolution operation on an input image to extract features and build a feature map, the feature map might still be linear in nature, so we need to break this linearity.

And to do so, we use the activation function.

For image recognition, most of the time we use ReLU as the activation function.

The mathematical form of ReLU is:

y = x for x ≥ 0

y = 0 for x < 0

Figure 7: The ReLU function

So why ReLU instead of other well-known activation functions like sigmoid or tanh? Because of its several advantages over those activation functions:

· Computationally very simple

· Faster than other activation functions

· It is sparsely activated

· It helps overcome the vanishing gradient problem

We may also use parametric ReLU, leaky ReLU, ELU, etc. instead of ReLU.
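
As a small illustration, here is a minimal NumPy sketch of ReLU and the leaky variant mentioned above; the feature-map values are made up.

```python
import numpy as np

def relu(x):
    # keep positive values, set negative values to zero
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # small slope alpha for negative values instead of a hard zero
    return np.where(x >= 0, x, alpha * x)

feature_map = np.array([[-2.0, 1.5],
                        [ 0.0, -0.5]])
print(relu(feature_map))        # [[0.  1.5] [0.  0. ]]
print(leaky_relu(feature_map))  # [[-0.02   1.5  ] [ 0.    -0.005]]
```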

Figure 8: A feature map without ReLU (left) and with ReLU (right)

Pooling layer:

We use the pooling layer immediately after the convolution layer to downsample the feature map.

But the question is: why do we need downsampling?

So far we have seen that the convolution layer creates a feature map, but the feature map has a limitation: the convolution layer records the precise location of each feature in the input image. That means a very small change in the input image (like a rotation, shift, or crop) might create a completely new feature map. To avoid this problem, we usually use max pooling or average pooling.

How the pooling operation is done:

First, we choose a filter (kernel) size and a stride value, just as in the convolution operation, but here there are no learnable parameters, unlike a convolution filter.

In the case of max pooling, we pick only the maximum pixel value within the filter window.

Figure 9: Max pooling operation

Here we show max pooling with a filter size of 2x2 and a stride of 2. The 2x2 filter first goes to the upper-left corner (blue) and finds the maximum value within the filter (here 9); then the filter shifts by 2 pixels according to the stride value, comes to the upper-right side, and picks the maximum value (i.e. 7). Similarly, the filter comes to the lower-left corner, follows the same process, and then moves to the lower-right corner. If the feature map is bigger than the one shown here, the filter traverses the whole feature map following the same rule.

In the case of average pooling, we calculate the average pixel value within the filter window instead of taking the maximum.
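
Here is a minimal NumPy sketch of both pooling variants with a 2x2 window and stride 2; the 4x4 feature-map values are made up for illustration.

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # take the max (or mean) of each non-overlapping window
            window = fmap[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

feature_map = np.array([
    [1, 9, 3, 7],
    [4, 2, 5, 1],
    [8, 0, 2, 6],
    [3, 1, 4, 2],
], dtype=float)

print(pool2d(feature_map, mode="max"))      # [[9. 7.] [8. 6.]]
print(pool2d(feature_map, mode="average"))  # [[4.  4. ] [3.  3.5]]
```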

So, the convolution and pooling layers together are able to extract the features of the input image. But only one convolution layer and one pooling layer are not sufficient to extract the patterns from an image; we need several convolution and pooling layers for that, and the number of convolution and pooling layers and their positions are decided by the user. The very first convolution and pooling layers extract very simple patterns (like lines and edges). As we go deeper into the network, the model can extract more complex patterns.

Figure 10: The very first convolution layer learns simple patterns like edges, the second convolution layer learns more complex features built from the first layer's output, and this process continues through several combinations of convolution and pooling layers to extract features from the image.

Flattening layer:

Once we have extracted the features properly, the feature maps are converted into a 1-D array for further processing through a dense neural network.

Figure 11: Flattening operation on the feature maps
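
Flattening itself is just a reshape; here is a tiny sketch with made-up shapes (four 3x3 feature maps).

```python
import numpy as np

feature_maps = np.random.rand(3, 3, 4)  # height x width x channels (4 stacked feature maps)
flattened = feature_maps.reshape(-1)    # 1-D array of length 3 * 3 * 4 = 36
print(flattened.shape)                  # (36,)
```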

Fully-connected layer:

This layer is nothing but a dense neural network. Neural networks are composed of an input layer (here, the flattening layer), an output layer, and one or more hidden layers. The neurons of each layer are connected to the neurons of the next layer with associated weights and a threshold. If the output of a neuron (except the input layer's neurons) is greater than the specified threshold (also called the bias), then the neuron is activated (as decided by the activation function) and passes its value to the next layer.

This fully connected layer is used for learning. So, one can think of the whole process as two parts: the first part, before the flattening layer, where the convolution layers provide a low-dimensional, meaningful, somewhat invariant feature space, and the second part, beyond the flattening layer, which is used for learning the model.

Figure 12: Fully connected layer; here x1, x2, x3, x4 (left side) are the inputs of this layer, coming from the flattening layer, and x1, x2, x3 (right side) represent the output layer.

So far we have seen all the layers of a CNN and their importance; now it is time to put all these layers together.

For better understanding, let's take an example of how a CNN is used in real life.

Take a dog vs. cat classification example using a CNN.

First of all, we have to collect a set of dog and cat images for training the model. Let's consider that I have 5000 dog-class images and 5000 cat-class images. After getting the data, I feed it to the CNN. The data passes through a combination of convolution and pooling layers, which extract features from it; then the feature maps are flattened and finally fed to a dense neural network. Up to this point everything is done in forward propagation. Since at the very first pass the parameters are initialized randomly, my prediction results would obviously not be good. So we calculate the loss function and update the model according to it (i.e. we start backpropagation, adjusting all the parameters, weights, and biases of the neural network as well as the filter parameters of the convolution layers), and we keep updating the model until the loss function converges.

Once the model is trained, i.e. it reaches its minimum loss, we can use it for dog vs. cat class prediction. The output of the CNN is represented in probabilistic form.
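
Here is a minimal Keras sketch of such a dog-vs-cat classifier, assuming 64x64 RGB inputs; the layer sizes and the `train_images`/`train_labels` placeholders are illustrative choices, not a prescribed architecture.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),  # convolution + ReLU
    tf.keras.layers.MaxPooling2D((2, 2)),                   # downsampling
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                               # flattening layer
    tf.keras.layers.Dense(128, activation="relu"),           # fully connected layer
    tf.keras.layers.Dense(1, activation="sigmoid"),          # probability of the "dog" class
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# Forward propagation + backpropagation until the loss converges:
# model.fit(train_images, train_labels, epochs=10)
```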

Figure 13

In this figure, the model predicts that the image belongs to the dog class with 95% probability and to the cat class with 5% probability. We take the final class to be the one with the higher probability.

The drawbacks of CNN:

CNNs have proven their effectiveness on image processing problems, but they still have some limitations.

A CNN cannot distinguish between images of the same object taken from different angles, with different backgrounds, or under different lighting conditions.

Figure 14

In this figure, each picture will be considered a different image by the CNN model unless such pictures are included in the dataset.

Also, the two figures shown below will be considered almost the same by the CNN model, because both contain the same contents (two eyes, one nose, and one mouth). That means the spatial positions of the features in the image are not considered by a CNN.

Figure 15

These problems of CNNs are addressed by data augmentation, where we create images artificially by rotating them by a small amount, flipping them, etc., so that the model sees a wide variety of images. But data augmentation does not resolve worst-case scenarios (like crumpled clothes or inverted chairs).
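
For example, a minimal data-augmentation sketch with Keras preprocessing layers might look like this (the rotation, flip, and zoom settings are illustrative choices):

```python
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),  # mirror the image left-right
    tf.keras.layers.RandomRotation(0.1),       # rotate by up to ~36 degrees
    tf.keras.layers.RandomZoom(0.1),           # small random zoom in/out
])

# images = ...  # a batch of training images, shape (batch, height, width, channels)
# augmented = augment(images, training=True)
```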

· A CNN does not have a coordinate frame the way human vision does.

Figure 16

The coordinate frame is mainly a mental model. Let's understand what I mean with this figure. A human can easily understand that if the left-side image is rotated 180 degrees clockwise, we will get the right-side image. We do this just by mentally adjusting the coordinate frame in our brain, and we can recognize both faces regardless of the rotation.

So we may conclude that human vision can process multiple percepts, whereas a CNN has only one percept, which is not sufficient to capture such a coordinate frame.

· A CNN becomes slower due to the max-pooling layer.

· If a CNN has a large number of layers, then the model will take a lot of time to train. This problem can be mitigated by using GPUs.

· A CNN requires a large dataset for training.

To overcome these drawbacks of CNNs, researchers have pursued the idea of the "capsule neural network".

Though CNNs have these drawbacks, they are still in great demand for image processing problems like image classification, object detection, object localization, and verification from images, i.e. computer vision-related tasks.

So, we can consider a CNN as a time-consuming but effective feature-extraction method.


Moumita Jana

Machine-learning aspirant, postgraduate in physics