Vision
Train convolutional neural networks on image data with PyTorch and torchvision — from FashionMNIST warm-up to an MNIST capstone.
Goal of the lesson
By the end of this 3-hour session you should be able to:
- read images as [C, H, W] tensors and feed them to a model in batches,
- explain what a convolution does and why CNNs beat plain MLPs on images,
- compute the spatial output size of a Conv2d/MaxPool2d layer,
- assemble a small CNN with Conv2d, MaxPool2d, BatchNorm2d, Dropout,
- train it on FashionMNIST with batched loaders,
- visualize predictions and a confusion matrix,
- build an MNIST digit recognizer as a capstone and test it on hand-drawn digits.
Suggested timing
| Duration | Topic |
|---|---|
| 20 min | Image tensors, transforms, DataLoader |
| 25 min | Convolution intuition: kernels, strides, padding |
| 25 min | Build a CNN, trace shapes through it |
| 35 min | Train and evaluate on FashionMNIST |
| 25 min | Visualize predictions, confusion matrix |
| 50 min | Capstone — MNIST digit recognizer |
Setup
uv init --python 3.12 vision
cd vision
uv add torch torchvision matplotlib
Image tensors
PyTorch images are [C, H, W] tensors:
- C = channels (1 for grayscale, 3 for RGB),
- H = height in pixels,
- W = width in pixels.
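To make the layout concrete, here is a minimal sketch that uses zero-filled placeholder tensors rather than real images. The variable names (gray, rgb, batch) and the 32 x 48 size are illustrative assumptions, not part of the lesson's own code.

```python
import torch

# A placeholder grayscale image: 1 channel, 28 x 28 pixels (FashionMNIST-sized).
gray = torch.zeros(1, 28, 28)

# A placeholder RGB image: 3 channels, height 32, width 48 (height comes first).
rgb = torch.zeros(3, 32, 48)

channels, height, width = rgb.shape
print(gray.shape)               # torch.Size([1, 28, 28])
print(channels, height, width)  # 3 32 48

# Stacking equally sized images adds a leading batch dimension: [N, C, H, W].
batch = torch.stack([gray, gray, gray, gray])
print(batch.shape)              # torch.Size([4, 1, 28, 28])
```

Models consume the batched [N, C, H, W] form, which is also the shape a batched loader yields for image data.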
torchvision.datasets ships ready-to-use datasets that download themselves on first run.
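from torchvision import datasets, transforms

# train_data, test_data, and root="data" are placeholder choices; use your own names and path.
train_data = datasets.FashionMNIST(
    root="data",
    train=True,       # training split
    download=True,    # fetch the files on first run
    transform=transforms.ToTensor(),
)
test_data = datasets.FashionMNIST(
    root="data",
    train=False,      # test split
    download=True,
    transform=transforms.ToTensor(),
)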
transforms.ToTensor() converts a PIL image into a [C, H, W] tensor with values in [0, 1].
image, label = train_data[0]  # first training sample: (image tensor, class index)
print(image.shape)
# torch.Size([1, 28, 28])