#Convolution
 Each position in $z$ is the result of a convolution over the previous map
 Ways of shrinking the maps (see the numpy sketch after this list):
 Stride greater than 1
 Downsampling (not strictly necessary)
 Typically performed with strides > 1
 Pooling
 Max pooling
 Note: keep track of the location of the max (needed during backprop)
 Mean pooling
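A minimal numpy sketch of both ideas: a single-map convolution with a stride greater than 1, and a max-pooling pass that records the argmax locations needed later for backprop. The single-channel layout and function names are illustrative assumptions, not code from the lecture.

```python
import numpy as np

def conv2d_forward(Y, w, stride=1):
    """Valid convolution (correlation form) of one map with one K x K filter."""
    H, W = Y.shape
    K = w.shape[0]
    Hout = (H - K) // stride + 1   # stride > 1 shrinks the output map
    Wout = (W - K) // stride + 1
    z = np.zeros((Hout, Wout))
    for x in range(Hout):
        for y in range(Wout):
            z[x, y] = np.sum(w * Y[x*stride:x*stride+K, y*stride:y*stride+K])
    return z

def maxpool_forward(Y, K):
    """Non-overlapping K x K max pooling; stores the argmax positions."""
    H, W = Y.shape
    Hout, Wout = H // K, W // K
    U = np.zeros((Hout, Wout))
    pos = np.zeros((Hout, Wout, 2), dtype=int)   # location of each max
    for i in range(Hout):
        for j in range(Wout):
            patch = Y[i*K:(i+1)*K, j*K:(j+1)*K]
            k, n = np.unravel_index(np.argmax(patch), patch.shape)
            U[i, j] = patch[k, n]
            pos[i, j] = (i*K + k, j*K + n)       # needed during backprop
    return U, pos
```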
#Learning the CNN
 Training is as in the case of the regular MLP
 The only difference is in the structure of the network
 Define a divergence between the desired output and the actual output of the network in response to any input
 Network parameters are trained through variants of gradient descent
 Gradients are computed through backpropagation
#Final flat layers
 Backpropagation continues in the usual manner until we have computed the derivative of the divergence w.r.t. the output of the final convolutional (or pooling) layer
 Recall the two steps of backpropagation:
 Step 1: compute $\frac{\partial Div}{\partial z^{n}}$ and $\frac{\partial Div}{\partial y^{n}}$
 Step 2: compute $\frac{\partial Div}{\partial w^{n}}$ from the result of Step 1
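A toy numpy illustration of these two steps for a single flat layer; the ReLU activation and all variable names are assumptions made for the example.

```python
import numpy as np

# Forward pass of one flat layer: z = W y_prev + b, y = f(z) with f = ReLU
y_prev = np.random.randn(4, 1)
W, b = np.random.randn(3, 4), np.random.randn(3, 1)
z = W @ y_prev + b
y = np.maximum(z, 0)

dDiv_dy = np.random.randn(3, 1)   # derivative arriving from the layer above
# Step 1: dDiv/dz = dDiv/dy * f'(z); for ReLU, f'(z) = 1[z > 0]
dDiv_dz = dDiv_dy * (z > 0)
# Step 2: dDiv/dW follows from Step 1
dDiv_dW = dDiv_dz @ y_prev.T
dDiv_dy_prev = W.T @ dDiv_dz      # passed on down to the previous layer
```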
#Convolutional layer
#Computing $\nabla_{Z(l)} D i v$

$$ \frac{d D i v}{d z(l, m, x, y)}=\frac{d D i v}{d Y(l, m, x, y)} f^{\prime}(z(l, m, x, y)) $$

A simple componentwise computation
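In code this is a single elementwise multiply; the snippet below assumes a ReLU activation, so $f'(z)$ is just an indicator.

```python
import numpy as np

z = np.random.randn(2, 5, 5)        # activations z(l, m, x, y), 2 maps
dDiv_dY = np.random.randn(2, 5, 5)  # derivative w.r.t. Y(l, m, x, y)
# dDiv/dz = dDiv/dY * f'(z), componentwise; f'(z) = 1[z > 0] for ReLU
dDiv_dz = dDiv_dY * (z > 0)
```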
#Computing $\nabla_{Y(l-1)} D i v$

 Each $Y(l-1,m,x,y)$ affects several $z(l,n,x',y')$ terms, for every $n$ (map)
 Through $w_l(m,n,x-x',y-y')$
 Affects terms in all $l^{th}$-layer maps
 All of them contribute to the derivative of the divergence w.r.t. $Y(l-1,m,x,y)$

Derivative w.r.t. a specific $Y$ term
$$ \frac{d D i v}{d Y(l-1, m, x, y)}=\sum_{n} \sum_{x^{\prime}, y^{\prime}} \frac{d D i v}{d z\left(l, n, x^{\prime}, y^{\prime}\right)} \frac{d z\left(l, n, x^{\prime}, y^{\prime}\right)}{d Y(l-1, m, x, y)} $$
$$ \frac{d D i v}{d Y(l-1, m, x, y)}=\sum_{n} \sum_{x^{\prime}, y^{\prime}} \frac{d D i v}{d z\left(l, n, x^{\prime}, y^{\prime}\right)} w_{l}\left(m, n, x-x^{\prime}, y-y^{\prime}\right) $$
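A direct, unoptimized numpy transcription of the last formula; the maps-first array layout and the function name are assumptions for illustration.

```python
import numpy as np

def backprop_to_input(dDiv_dz, w, Hin, Win):
    """dDiv/dY(l-1,m,x,y) = sum_n sum_{x',y'} dDiv/dz(l,n,x',y') w(m,n,x-x',y-y').
    dDiv_dz: (N, Hout, Wout); w: (M, N, K, K); returns (M, Hin, Win)."""
    N, Hout, Wout = dDiv_dz.shape
    M, _, K, _ = w.shape
    dDiv_dY = np.zeros((M, Hin, Win))
    for m in range(M):
        for x in range(Hin):
            for y in range(Win):
                for n in range(N):
                    for xp in range(Hout):
                        for yp in range(Wout):
                            # w is only defined for offsets inside the filter
                            if 0 <= x - xp < K and 0 <= y - yp < K:
                                dDiv_dY[m, x, y] += (dDiv_dz[n, xp, yp]
                                                     * w[m, n, x - xp, y - yp])
    return dDiv_dY
```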
#Computing $\nabla_{w(l)} D i v$
 Each weight $w_l(m,n,x,y)$ affects several $z(l,n,x',y')$ terms (one for each $(x',y')$)
 Affects terms in only one $Z$ map (the $n^{th}$ map)
 All entries in that map contribute to the derivative of the divergence w.r.t. $w_l(m,n,x,y)$
 Derivative w.r.t. a specific $w$ term
$$ \frac{d D i v}{d w_{l}(m, n, x, y)}=\sum_{x^{\prime}, y^{\prime}} \frac{d D i v}{d z\left(l, n, x^{\prime}, y^{\prime}\right)} \frac{d z\left(l, n, x^{\prime}, y^{\prime}\right)}{d w_{l}(m, n, x, y)} $$
$$ \frac{d D i v}{d w_{l}(m, n, x, y)}=\sum_{x^{\prime}, y^{\prime}} \frac{d D i v}{d z\left(l, n, x^{\prime}, y^{\prime}\right)} Y\left(l-1, m, x^{\prime}+x, y^{\prime}+y\right) $$
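The weight gradient admits the same kind of direct transcription; as before, the array layout and the function name are illustrative assumptions. Note that the result is itself a correlation of the previous layer's maps with the derivative maps.

```python
import numpy as np

def backprop_to_weights(dDiv_dz, Y_prev, K):
    """dDiv/dw(m,n,x,y) = sum_{x',y'} dDiv/dz(l,n,x',y') Y(l-1,m,x'+x,y'+y).
    dDiv_dz: (N, Hout, Wout); Y_prev: (M, Hin, Win); returns (M, N, K, K)."""
    N, Hout, Wout = dDiv_dz.shape
    M = Y_prev.shape[0]
    dDiv_dw = np.zeros((M, N, K, K))
    for m in range(M):
        for n in range(N):
            for x in range(K):
                for y in range(K):
                    for xp in range(Hout):
                        for yp in range(Wout):
                            dDiv_dw[m, n, x, y] += (dDiv_dz[n, xp, yp]
                                                    * Y_prev[m, xp + x, yp + y])
    return dDiv_dw
```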
#Summary
#In practice
$$ \frac{d D i v}{d Y(l-1, m, x, y)}=\sum_{n} \sum_{x^{\prime}, y^{\prime}} \frac{d D i v}{d z\left(l, n, x^{\prime}, y^{\prime}\right)} w_{l}\left(m, n, x-x^{\prime}, y-y^{\prime}\right) $$
 This is a convolution, but with the filter indices in reversed order
 Using the mirror image of the filter (flipped up-down and left-right) turns it into a normal convolution
 In practice, the derivative at each $(x,y)$ location is obtained from all $Z$ maps
 This is just a convolution of $\frac{\partial Div}{\partial z(l,n,x,y)}$ with the inverted filter
 After zero-padding it first with $K-1$ zeros on every side
 Note: $x^{\prime}, y^{\prime}$ refer to locations in the filter
 Shifting down and right by $K-1$, such that $(0,0)$ becomes $(K-1,K-1)$
$$ z_{\text{shift}}(l, n, x, y)=z(l, n, x-K+1, y-K+1) $$
$$ \frac{\partial D i v}{\partial Y(l-1, m, x, y)}=\sum_{n} \sum_{x^{\prime}, y^{\prime}} \widehat{w}\left(l, n, m, x^{\prime}, y^{\prime}\right) \frac{\partial D i v}{\partial z_{\text{shift}}\left(l, n, x+x^{\prime}, y+y^{\prime}\right)} $$
 Regular convolution running on shifted derivative maps using flipped filter
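This equivalence is easy to check numerically. The sketch below evaluates $\nabla_{Y(l-1)}Div$ once by the direct double sum and once as a regular convolution of the zero-padded derivative maps with the flipped filter; all sizes are arbitrary choices for the test.

```python
import numpy as np

M, N, K, Hout = 2, 3, 3, 4                 # arbitrary test sizes
Hin = Hout + K - 1                         # valid stride-1 convolution
w = np.random.randn(M, N, K, K)
dZ = np.random.randn(N, Hout, Hout)        # dDiv/dz maps of layer l

# Direct sum: dY[m,x,y] = sum_n sum_{x',y'} dZ[n,x',y'] * w[m,n,x-x',y-y']
dY_direct = np.zeros((M, Hin, Hin))
for m in range(M):
    for n in range(N):
        for xp in range(Hout):
            for yp in range(Hout):
                dY_direct[m, xp:xp+K, yp:yp+K] += dZ[n, xp, yp] * w[m, n]

# Same result as a regular convolution: zero-pad dZ with K-1 on every
# side, then slide the up-down / left-right flipped filter over it.
dZ_pad = np.pad(dZ, ((0, 0), (K-1, K-1), (K-1, K-1)))
w_flip = w[:, :, ::-1, ::-1]
dY_conv = np.zeros((M, Hin, Hin))
for m in range(M):
    for x in range(Hin):
        for y in range(Hin):
            dY_conv[m, x, y] = np.sum(w_flip[m] * dZ_pad[:, x:x+K, y:y+K])

assert np.allclose(dY_direct, dY_conv)
```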
#Pooling
 Pooling is typically performed with strides > 1
 Results in shrinking of the map
 Downsampling
#Derivative of Max pooling
$$
\frac{d D i v}{d Y(l, m, k, n)}=\left\{\begin{array}{cl}
\frac{d D i v}{d U(l, m, i, j)} & \text{if }(k, n)=P(l, m, i, j) \\
0 & \text{otherwise}
\end{array}\right.
$$
 Max pooling selects the largest value from a pool of elements; $P(l,m,i,j)$ is the position of that max, recorded during the forward pass ^{[1]}
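A sketch of the corresponding backward routine, assuming the forward pass stored the argmax positions (the hypothetical `pos` array below plays the role of $P$):

```python
import numpy as np

def maxpool_backward(dDiv_dU, pos, Hin, Win):
    """Route each pooled derivative back to the position that won the max;
    every other position receives 0."""
    dDiv_dY = np.zeros((Hin, Win))
    Hout, Wout = dDiv_dU.shape
    for i in range(Hout):
        for j in range(Wout):
            k, n = pos[i, j]                  # saved during the forward pass
            dDiv_dY[k, n] += dDiv_dU[i, j]    # += handles overlapping pools
    return dDiv_dY
```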
#Derivative of Mean pooling
 The derivative of mean pooling is distributed over the pool
$$ d y(l, m, k, n)=\frac{1}{K_{l p o o l}^{2}} d u(l, m, k, n) $$
#Transposed Convolution
 We've always assumed that subsequent steps shrink the size of the maps
 Can subsequent maps increase in size?^{[2]}
 Output size is typically an integer multiple of the input size
 +1 if the filter width is odd
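A minimal sketch of a transposed convolution as a scatter operation, under the same single-map assumptions as the earlier sketches. With stride $s$ and filter width $K$ the output side is $(H-1)s + K$: a multiple of the input size when $K = s$, and one more when $K = s + 1$ (odd filter width for even stride).

```python
import numpy as np

def transposed_conv2d(Y, w, stride=2):
    """Each input pixel scatters a scaled K x K copy of the filter
    into the larger output map; overlaps are summed."""
    H, W = Y.shape
    K = w.shape[0]
    Hout = (H - 1) * stride + K
    Wout = (W - 1) * stride + K
    out = np.zeros((Hout, Wout))
    for x in range(H):
        for y in range(W):
            out[x*stride:x*stride+K, y*stride:y*stride+K] += Y[x, y] * w
    return out
```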
#Model variations
 Very deep networks
 100 or more layers in the MLP
 A formalism called "ResNet"
 Depthwise convolutions
 Instead of multiple independent filters with independent parameters, use common layerwise weights and combine the layers differently for each filter
#Depthwise convolutions
 In depthwise convolution the convolution step is performed only once
 The simple summation is replaced by a weighted sum across channels
 Different weights (for summation) produce different output channels
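A sketch of this idea in a depthwise-separable form: one shared spatial filter per input channel (the convolution step performed only once per channel), followed by per-output-channel mixing weights in place of the plain summation. All shapes and names are illustrative assumptions.

```python
import numpy as np

def depthwise_then_mix(Y, w_spatial, mix):
    """Y: (M, H, W) input maps; w_spatial: (M, K, K) one filter per channel;
    mix: (N, M) summation weights, one row per output channel."""
    M, H, W = Y.shape
    K = w_spatial.shape[1]
    Hout, Wout = H - K + 1, W - K + 1
    # The spatial convolution step is performed only once per input channel
    per_channel = np.zeros((M, Hout, Wout))
    for m in range(M):
        for x in range(Hout):
            for y in range(Wout):
                per_channel[m, x, y] = np.sum(w_spatial[m] * Y[m, x:x+K, y:y+K])
    # Weighted sum across channels; different rows of `mix` give
    # different output channels
    return np.einsum('nm,mxy->nxy', mix, per_channel)
```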
#Models

For CIFAR-10
 LeNet-5^{[3]}

For ILSVRC (ImageNet Large Scale Visual Recognition Challenge)
 AlexNet
 The network contains 60 million parameters and 650,000 neurons
 5 convolutional layers, some of which are followed by max-pooling layers
 3 fully-connected layers
 VGGNet
 Only used 3x3 filters, stride 1, pad 1
 Only used 2x2 pooling filters, stride 2
 ~140 million parameters in all
 GoogLeNet
 Multiple filter sizes simultaneously

For ImageNet
 ResNet
 Last layer before addition must have the same number of filters as the input to the module
 Batch normalization after each convolution
 DenseNet
 All convolutional
 Each layer looks at the union of maps from all previous layers
 Instead of just the set of maps from the immediately previous layer