11 Convolutional Neural Networks

Neural Networks

CNNs

Image Processing

Deep Learning

This lecture discusses the architecture and functioning of Convolutional Neural Networks (CNNs), including their layers, operations, and applications in image processing and computer vision. It also covers the concept of pooling layers and their role in reducing dimensionality.

Author

Jiuru Lyu

Published

April 1, 2025

Introduction to the MNIST Dataset and CNNs

Each image is a \(32\times32\) grayscale image with integer pixel intensities from \(0\) to \(255\).

A flat representation of the image: \[\va x=\mqty[x_1,x_2,\dots,x_{1024}]\]
Problems with the flat representation:
- Ignores spatial structure
- Susceptible to translation errors (a small shift of the image changes many coordinates of \(\va x\))
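Both problems can be seen in a small NumPy sketch. The image here is hypothetical (a one-pixel-wide vertical line on a black background), and row-major flattening is an assumption about how \(\va x\) is built.

```python
import numpy as np

# Hypothetical 32x32 grayscale image: a vertical line at column 10.
image = np.zeros((32, 32), dtype=np.uint8)
image[:, 10] = 255

# The same line shifted one pixel to the right (column 11).
shifted = np.roll(image, shift=1, axis=1)

# Flat representations: vectors with 1024 entries each.
x = image.reshape(-1)
x_shifted = shifted.reshape(-1)

# Row-major flattening puts pixel (r, c) at index 32 * r + c, so pixels that
# are vertical neighbors in the image end up 32 positions apart in x.
lit = np.flatnonzero(x)
gap = int(lit[1] - lit[0])

# Number of coordinates that are lit in both vectors.
overlap = int(np.sum((x > 0) & (x_shifted > 0)))

print(x.shape, gap, overlap)
```

The two images look almost identical, yet their flat vectors share no lit coordinate, and neighboring pixels of the line are scattered across the vector.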
Goal: preserve the spatial structure of the image by capturing relationships among neighboring pixels.
Ideas:
- Learn feature representations based on small patches.
- Apply patch-based feature representations across the entire image.
Building blocks of CNN:
- Convolutional layers
- Activation layers
- Pooling layers
- Fully connected layers

Convolutional Layer

Each filter acts as a neuron that maps a \(3\times 3\) patch to a scalar value: \[\va x\cdot\va w+b,\] where \(\va x\) is the image patch, and \(\va w\) and \(b\) are the filter parameters (weights and bias).
Convolution operation:
- Slide the filter over the image spatially
- Compute the dot product with different patches of the image.
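The sliding dot product can be sketched in plain NumPy. This is a minimal illustration assuming a single-channel image, stride 1, and no padding; the function name `conv2d_single` and the averaging filter are made up for the example. Like most deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_single(image, weights, bias=0.0):
    """Slide one filter over a 2D image (stride 1, no padding)."""
    H, W = image.shape
    h, w = weights.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + h, j:j + w]               # image patch x
            out[i, j] = np.sum(patch * weights) + bias    # x . w + b
    return out

# Hypothetical 5x5 image and a 3x3 averaging filter.
image = np.arange(25, dtype=float).reshape(5, 5)
weights = np.ones((3, 3)) / 9.0
feature_map = conv2d_single(image, weights)

print(feature_map.shape)
print(feature_map)
```

The same nine weights are reused at every position, which is what makes the layer cheap in parameters.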

Example 1 (Padding)

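It is often beneficial to preserve the original image size. This can be done with padding: extend the image past its boundary so that the filter can overlap the border, filling the extra pixels either with zeros (zero padding) or with copies of the nearest edge pixels.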
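A minimal sketch of the two padding styles using `np.pad`, assuming a hypothetical \(3\times3\) image and a pad width of one pixel (the amount a \(3\times3\) filter with stride 1 needs to keep the output the same size as the input):

```python
import numpy as np

# Hypothetical 3x3 image with values 1..9.
image = np.arange(1, 10).reshape(3, 3)

# Zero padding: surround the image with a one-pixel border of zeros.
zero_padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

# Edge padding: fill the border by copying the nearest edge pixel.
edge_padded = np.pad(image, pad_width=1, mode="edge")

# A 3x3 filter with stride 1 fits (5 - 3 + 1) = 3 times along each axis of
# the padded image, so the output keeps the original 3x3 size.
output_side = zero_padded.shape[0] - 3 + 1

print(zero_padded)
print(edge_padded)
print(output_side)
```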

Activation Layers

Each filter produces a feature map; applying a nonlinear activation function to it element-wise gives the corresponding activation map.
Multiple filters \(\longrightarrow\) multiple feature maps and activation maps (or channels).
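As an illustration, the sketch below turns a stack of feature maps into activation maps. The notes do not name a specific activation function; ReLU is assumed here because it is a common choice, and the feature-map values are made up.

```python
import numpy as np

def relu(z):
    """Element-wise ReLU: negative responses are set to zero."""
    return np.maximum(z, 0.0)

# Hypothetical output of two filters: 2 feature maps (channels) of size 2x2.
feature_maps = np.array([[[-1.0, 2.0], [0.5, -3.0]],
                         [[4.0, -0.5], [-2.0, 1.0]]])

# The activation is applied to every entry of every channel independently.
activation_maps = relu(feature_maps)

print(activation_maps.shape)
print(activation_maps)
```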

Pooling Layers

Downsamples the previous layer's activation map.
Consolidates the features learned at the previous stage.
Why?
- Compress/Smooth
- Spatial invariance
- Prevent overfitting
Pooling often uses simple functions: max or average.
Pooling operates over each activation map independently.
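A minimal sketch of max and average pooling, assuming non-overlapping \(2\times2\) windows (window size equal to stride). The helper `pool2d` and the input values are hypothetical.

```python
import numpy as np

def pool2d(activation_maps, size=2, reduce=np.max):
    """Pool each channel independently over non-overlapping size x size windows."""
    C, H, W = activation_maps.shape
    Hp, Wp = H // size, W // size
    # Drop rows/columns that do not fill a complete window.
    trimmed = activation_maps[:, :Hp * size, :Wp * size]
    # Split each spatial axis into (window index, position inside window).
    windows = trimmed.reshape(C, Hp, size, Wp, size)
    # Reduce over the two "position inside window" axes only.
    return reduce(windows, axis=(2, 4))

# Hypothetical input: 2 activation maps of size 4x4.
maps = np.arange(32, dtype=float).reshape(2, 4, 4)
max_pooled = pool2d(maps, reduce=np.max)
avg_pooled = pool2d(maps, reduce=np.mean)

print(max_pooled.shape)
print(max_pooled[0])
print(avg_pooled[0])
```

The channel axis is never reduced, so the number of activation maps stays the same while each map shrinks from \(4\times4\) to \(2\times2\).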

Fully Connected Layers

Flatten the output of the previous layer into a vector.
Pass this vector through a standard dense (fully connected) layer.
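The two steps can be sketched as follows. The shapes (8 channels of size \(4\times4\), 10 output units) and the random weights are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output of the last pooling layer: C x H x W = 8 x 4 x 4.
pooled = rng.standard_normal((8, 4, 4))

# Step 1: flatten into a single vector of length 8 * 4 * 4 = 128.
flat = pooled.reshape(-1)

# Step 2: an ordinary dense layer with one weight per (output, input) pair.
num_outputs = 10  # assumed, e.g. one score per class
weights = rng.standard_normal((num_outputs, flat.size)) * 0.01
bias = np.zeros(num_outputs)
scores = weights @ flat + bias

print(flat.shape, scores.shape)
```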

Architecture Details:

Input to a convolutional layer: \(C_\text{in}\times H\times W\)
- \(C_\text{in}\): number of input channels
- \(H\): height of the input
- \(W\): width of the input
The layer has \(C_\text{out}\) filters (one per output channel), each of spatial size \(h\times w\) and spanning all \(C_\text{in}\) input channels, where \(h<H\) and \(w<W\) (typically \(h=w\)).
Output: \(C_\text{out}\times H'\times W'\), where \(H'\) and \(W'\) depend on the filter size, padding, and stride.
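The notes do not spell out how \(H'\) and \(W'\) are obtained. A commonly used rule, assumed here, is \(H'=\lfloor (H+2p-h)/s\rfloor+1\) for padding \(p\) on each side and stride \(s\) (and likewise for \(W'\)). The helper name `conv_output_size` is hypothetical.

```python
def conv_output_size(n, k, padding=0, stride=1):
    """Output length along one axis for input length n and filter length k.

    Assumes the same padding on both sides and no dilation.
    """
    return (n + 2 * padding - k) // stride + 1

# 32x32 input, 3x3 filter, no padding: the output shrinks to 30x30.
no_pad = conv_output_size(32, 3)

# Padding of 1 preserves the 32x32 size for a 3x3 filter with stride 1.
same = conv_output_size(32, 3, padding=1)

# Stride 2 roughly halves the spatial size.
strided = conv_output_size(32, 3, padding=1, stride=2)

print(no_pad, same, strided)
```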
Parameter sharing makes convolutional layers efficient:
- Suppose input image \(100\times100\longrightarrow10,000\) input pixels.
- Fully-connected layer with \(100\) neurons (no bias): \(10,000\times100=1,000,000\) parameters.
- Convolutional layer with \(100\) filters of size \(3\times 3\) on a single-channel input (no bias): \(3\times3\times100=900\) parameters.
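The arithmetic above can be checked with a short sketch. The helper names are hypothetical, biases are ignored as in the example, and the three-channel case is an added illustration rather than part of the original comparison.

```python
def dense_params(n_inputs, n_neurons):
    """Weights in a fully connected layer (no bias)."""
    return n_inputs * n_neurons

def conv_params(c_in, c_out, h, w):
    """Weights in a convolutional layer (no bias): each of the c_out filters
    holds c_in * h * w weights, independent of the image size."""
    return c_in * c_out * h * w

dense = dense_params(100 * 100, 100)       # 100x100 image, 100 neurons
conv = conv_params(1, 100, 3, 3)           # single channel, 100 filters of 3x3
conv_rgb = conv_params(3, 100, 3, 3)       # assumed 3-channel input

print(dense, conv, conv_rgb)
```

The convolutional count does not depend on \(H\) or \(W\), so a larger image adds no parameters, while the fully connected count grows with every extra pixel.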