AN INTERACTIVE GUIDE

PyTorch &
Deep Learning

From a single neuron to transformers, LLMs, and vision-language-action models. Learn by doing — every concept has an interactive demo you can play with.

Neural Networks · PyTorch · Transformers · LLMs · VLAs
01

What is a Neural Network?

The fundamental building block of modern AI

At its core, a neural network is a function approximator. You give it inputs, it produces outputs, and through training it learns to map inputs to the correct outputs. The magic is that it can learn any continuous function, given enough capacity.

The simplest unit is a single neuron (also called a perceptron). It takes some inputs, multiplies each by a weight, adds a bias, and passes the result through an activation function. That's it. Everything else in deep learning is just clever arrangements of this basic operation.

The Neuron Equation

output = activation(w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b)

Where w are learnable weights, b is a learnable bias, and the activation function introduces non-linearity.

Interactive Neuron

x₁, x₂ → Σ(w·x) + b → ReLU

With the demo's example values (x₁ = 1.0, x₂ = 0.5, w₁ = 0.50, w₂ = −0.30, b = 0.10):

raw = 1.0 × 0.50 + 0.5 × (−0.30) + 0.10 = 0.4500
output = relu(0.4500) = 0.4500

Play with the sliders above. Notice how changing the weights changes which inputs matter more. The bias shifts the decision boundary. And the activation function shapes the output range. A neural network is just layers of these neurons, where each layer's outputs become the next layer's inputs.

In PyTorch, you'd write a single neuron like this:

python
import torch
import torch.nn as nn

# A single neuron: 2 inputs → 1 output
neuron = nn.Linear(in_features=2, out_features=1)

# Input tensor
x = torch.tensor([1.0, 0.5])

# Forward pass: weight · input + bias
output = torch.relu(neuron(x))
print(output)  # e.g. tensor([0.3842]) — exact value depends on the random init
Why 'Linear'?

PyTorch calls it nn.Linear because the neuron computes a linear transformation (y = Wx + b) before the activation. The activation function is what makes the overall computation non-linear.

02

Tensors

The data structure that powers everything

Before we go further, we need to talk about tensors. If you've used NumPy arrays, you already know tensors — they're the same concept, but with GPU acceleration and automatic differentiation built in.

A tensor is simply an n-dimensional array of numbers. A scalar is a 0D tensor. A vector is 1D. A matrix is 2D. An image is typically a 3D tensor (channels × height × width). A batch of images is 4D. The shape tells you the size of each dimension.

Tensor Explorer

 0.1   0.5  -0.2   1.3
-0.4   0.8   0.3  -0.1
 1.2  -0.6   0.7   0.4

(rows = data points, columns = features)
shape: [3, 4] | dtype: float32

A 2D grid of numbers. Shape: (3, 4) — 3 rows, 4 columns. Think: a batch of 3 data points, each with 4 features.

Click through the tabs above to see how tensors grow in dimensionality. The key insight is that everything in deep learning is a tensor operation. Images, text, audio — they all get converted to tensors before the network sees them.

python
import torch

# Creating tensors
scalar = torch.tensor(42)               # shape: ()
vector = torch.tensor([1.0, 2.0, 3.0])  # shape: (3,)
matrix = torch.randn(3, 4)              # shape: (3, 4)
batch = torch.randn(32, 3, 224, 224)    # batch of images

# Key operations
print(matrix.shape)   # torch.Size([3, 4])
print(matrix.dtype)   # torch.float32
print(matrix.device)  # cpu (or cuda:0)

# Move to GPU (if available)
if torch.cuda.is_available():
    matrix = matrix.cuda()  # now on GPU!
Shape Intuition

When you see a shape like (32, 3, 224, 224), read it as: "32 images, each with 3 color channels (RGB), each 224 pixels tall and 224 pixels wide." The first dimension is almost always the batch size.
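One way to make that reading concrete is to index into a batch tensor and watch one dimension disappear at a time (a minimal sketch):

```python
import torch

# A batch of 32 RGB images, each 224 pixels tall and wide
batch = torch.randn(32, 3, 224, 224)

print(batch.shape)           # torch.Size([32, 3, 224, 224]) — the whole batch
print(batch[0].shape)        # torch.Size([3, 224, 224])     — one image
print(batch[0, 0].shape)     # torch.Size([224, 224])        — one color channel
print(batch[0, 0, 0].shape)  # torch.Size([224])             — one row of pixels
```

Each index peels off the leading dimension, which is why shapes are read left to right.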

Tensors: from scalars to multi-dimensional arrays

03

The Forward Pass

Data flows in, predictions come out

When data enters a neural network, it flows through each layer sequentially — this is the forward pass. Each layer transforms the data: multiplying by weights, adding biases, and applying activation functions. The output of one layer becomes the input to the next.

Think of it like an assembly line. Raw materials (input data) enter one end, and each station (layer) adds something. By the end, you have a finished product (prediction). The network starts with random weights, so its first predictions are garbage. That's okay — training will fix that.

Forward Pass Visualization

Input → Hidden 1 → Hidden 2 → Output

Click "Forward Pass" above to watch data propagate through the network. Notice how each layer transforms the values. In PyTorch, the forward pass is just calling your model like a function:

python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(3, 4)  # 3 inputs → 4 hidden
        self.layer2 = nn.Linear(4, 4)  # 4 → 4
        self.layer3 = nn.Linear(4, 2)  # 4 → 2 outputs

    def forward(self, x):
        x = torch.relu(self.layer1(x))  # activate!
        x = torch.relu(self.layer2(x))
        x = self.layer3(x)  # no activation on output
        return x

model = SimpleNet()
prediction = model(torch.randn(1, 3))  # forward pass!
03.1

Activation Functions

The non-linearity that makes deep learning work

Without activation functions, stacking layers would be pointless — a stack of linear transformations is just one big linear transformation. Activation functions introduce non-linearity, allowing the network to learn complex patterns.
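You can verify that claim directly: two stacked Linear layers with no activation between them collapse to a single linear map (a quick sketch with small random layers):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer1 = nn.Linear(4, 8)
layer2 = nn.Linear(8, 3)

x = torch.randn(5, 4)

# Two linear layers, no activation in between...
stacked = layer2(layer1(x))

# ...equal ONE linear layer with merged weight and bias:
# y = W2(W1·x + b1) + b2 = (W2·W1)·x + (W2·b1 + b2)
W = layer2.weight @ layer1.weight
b = layer2.weight @ layer1.bias + layer2.bias
merged = x @ W.T + b

print(torch.allclose(stacked, merged, atol=1e-5))  # True
```

Insert a ReLU between the layers and the equivalence breaks — that is exactly the non-linearity doing its job.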

Activation Functions

relu(x) = max(0, x)

Zeroes out negatives. Simple, fast, and the default choice for hidden layers.

Which Activation to Use?

ReLU for hidden layers (fast, simple, works well). GELU for transformers (smoother, used in GPT/BERT). Sigmoid for binary output. Softmax for multi-class output (we'll see this in the Transformer section).

04

The Backward Pass

How the network learns from its mistakes

The forward pass gives us a prediction. But how does the network improve? This is where the backward pass (backpropagation) comes in. The idea is beautifully simple:

  1. Compute the loss — how wrong is our prediction?
  2. Compute the gradient of the loss with respect to each weight
  3. Update each weight in the direction that reduces the loss

The gradient tells us: "If I increase this weight by a tiny amount, how much does the loss change?" If the gradient is positive, we decrease the weight. If negative, we increase it. The learning rate controls how big each step is.
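The three steps can be sketched as hand-rolled gradient descent on a toy loss, here loss(w) = (w − 3)², whose gradient is 2(w − 3):

```python
# Gradient descent by hand on loss(w) = (w - 3)**2
w = 0.0    # start far from the optimum at w = 3
lr = 0.1   # learning rate: step size

for step in range(50):
    grad = 2 * (w - 3)  # d(loss)/dw
    w = w - lr * grad   # step AGAINST the gradient

print(round(w, 4))  # 3.0 — converged to the minimum
```

Each step multiplies the distance to the optimum by (1 − 2·lr), so with lr = 0.1 the error shrinks by 20% per step; with lr > 1 it would grow instead, which is the divergence described below.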

The loss landscape: gradient descent finds the valley

Gradient Descent Visualizer

(The demo plots loss against parameter value. A slider sets the learning rate from 0.001 (slow) to 0.1 (fast); readouts show the current position, loss, gradient, and step count.)

Try different learning rates in the demo above. Too small and it barely moves. Too large and it overshoots, bouncing around wildly. Finding the right learning rate is one of the most important hyperparameters in deep learning.

The Learning Rate Dilemma

Too high → training diverges (loss explodes). Too low → training is painfully slow. Modern optimizers like Adam adapt the learning rate per-parameter, which is why it's the default choice. But you still need to set the initial learning rate.

python
import torch

# The magic of autograd: PyTorch tracks operations
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 1  # some function

y.backward()   # compute gradients!
print(x.grad)  # tensor([7.]) → dy/dx = 2x + 3 = 7

# In practice, you never call backward() on individual ops.
# You call it on the loss:
loss = criterion(model(inputs), targets)
loss.backward()        # computes ALL gradients
optimizer.step()       # updates ALL weights
optimizer.zero_grad()  # reset for next iteration
05

The Training Loop

Forward, loss, backward, update — repeat

Now we put it all together. The training loop is the heartbeat of deep learning. Every iteration: push data forward, compute the loss, propagate gradients backward, and update the weights. Do this thousands (or millions) of times, and the network learns.

An epoch is one complete pass through the entire training dataset. You typically train for many epochs, watching the loss decrease over time. When the loss stops improving on a held-out validation set, you stop — that's your trained model.

Training Loop Simulator

1. Forward Pass
2. Compute Loss
3. Backward Pass
4. Update Weights

for epoch in range(20):
    pred = model(x)            # forward
    loss = criterion(pred, y)  # compute loss
    loss.backward()            # backward
    optimizer.step()           # update

Watch the loss curve in the demo above. It drops quickly at first (the "easy" patterns), then slows down as the model fine-tunes on harder examples. This is the typical training dynamic. Here's the complete training loop in PyTorch:

python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# 1. Define model
model = SimpleNet()

# 2. Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# 3. Training loop
loader = DataLoader(dataset, batch_size=32)
for epoch in range(20):
    total_loss = 0
    for batch_x, batch_y in loader:
        # Forward pass
        predictions = model(batch_x)
        loss = criterion(predictions, batch_y)

        # Backward pass
        optimizer.zero_grad()  # clear old gradients
        loss.backward()        # compute new gradients
        optimizer.step()       # update weights

        total_loss += loss.item()

    print(f"Epoch {epoch}: loss = {total_loss:.4f}")
Key Vocabulary

Epoch: one pass through all training data. Batch: a subset of data processed together (e.g., 32 samples). Iteration: one weight update (one batch). Learning rate: step size for weight updates.
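The held-out validation check mentioned above can be sketched end to end with synthetic data (everything here — the data, the 80/20 split, the learning rate — is a made-up illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data: y = X·w + noise (hypothetical stand-in for a real dataset)
X = torch.randn(200, 3)
y = X @ torch.tensor([[1.0], [-2.0], [0.5]]) + 0.1 * torch.randn(200, 1)
train_X, val_X = X[:160], X[160:]   # 80/20 train/validation split
train_y, val_y = y[:160], y[160:]

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

best_val = float("inf")
for epoch in range(200):
    # Train step (full batch, for simplicity)
    model.train()
    loss = criterion(model(train_X), train_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Validation: no gradients, no weight updates
    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(val_X), val_y).item()
    best_val = min(best_val, val_loss)

print(best_val)  # approaches the noise floor of the synthetic data
```

The key discipline is that validation data never touches backward() or step() — it only measures how well the model generalizes, which is what tells you when to stop.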

06

Building Blocks

nn.Module and the PyTorch way

In PyTorch, every model is built from nn.Module. It's the base class that gives you parameter tracking, GPU movement, saving/loading, and the forward pass interface. You compose modules like LEGO blocks to build complex architectures.

python
1class ResidualBlock(nn.Module):
2 """A block with a skip connection."""
3 def __init__(self, dim):
4 super().__init__()
5 self.net = nn.Sequential(
6 nn.Linear(dim, dim),
7 nn.GELU(),
8 nn.Linear(dim, dim),
9 nn.Dropout(0.1),
10 )
11 self.norm = nn.LayerNorm(dim)
12
13 def forward(self, x):
14 # The residual connection: add input to output
15 return x + self.net(self.norm(x))
16
17
18class DeepModel(nn.Module):
19 def __init__(self, input_dim, hidden_dim, n_layers):
20 super().__init__()
21 self.embed = nn.Linear(input_dim, hidden_dim)
22 self.blocks = nn.ModuleList([
23 ResidualBlock(hidden_dim)
24 for _ in range(n_layers)
25 ])
26 self.head = nn.Linear(hidden_dim, 10)
27
28 def forward(self, x):
29 x = self.embed(x)
30 for block in self.blocks:
31 x = block(x)
32 return self.head(x)

Notice two patterns that appear everywhere in modern deep learning:

Residual Connections

Add the input to the output: y = x + f(x). This lets gradients flow directly through skip connections, enabling much deeper networks.

Layer Normalization

Normalize activations to have zero mean and unit variance. Stabilizes training and allows higher learning rates. Applied before or after each sublayer.
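It takes three lines to check what LayerNorm actually does to a batch of activations (a minimal sketch):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
norm = nn.LayerNorm(8)

x = torch.randn(4, 8) * 5 + 3  # activations with mean ≈ 3, std ≈ 5
out = norm(x)

# Each row is normalized INDEPENDENTLY to zero mean, unit variance
print(out.mean(dim=-1))                  # ≈ [0, 0, 0, 0]
print(out.std(dim=-1, unbiased=False))   # ≈ [1, 1, 1, 1]
```

The learnable scale and shift parameters (initialized to 1 and 0) then let the network undo the normalization where it helps.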

The nn.ModuleList Pattern

Use nn.ModuleList (not a plain Python list) to hold sub-modules. This ensures PyTorch can find all parameters for optimization, GPU transfer, and saving. It's a common beginner mistake to use [] instead.
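You can see the failure mode directly: parameters held in a plain Python list are invisible to PyTorch (sketch):

```python
import torch.nn as nn

class GoodModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

class BrokenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = [nn.Linear(4, 4) for _ in range(3)]  # plain list: bug!

good = sum(p.numel() for p in GoodModel().parameters())
broken = sum(p.numel() for p in BrokenModel().parameters())
print(good)    # 60 — 3 × (4×4 weights + 4 biases)
print(broken)  # 0  — the optimizer would never see these weights
```

The broken model still runs forward, which is what makes this bug so easy to miss: training silently does nothing to those layers.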

07

From CNNs to Transformers

The architecture that changed everything

For years, Convolutional Neural Networks (CNNs) dominated computer vision and Recurrent Neural Networks (RNNs) dominated language. Then in 2017, the paper "Attention Is All You Need" introduced the Transformer, and everything changed.

The key innovation is self-attention: instead of processing tokens one by one (like RNNs) or looking at local patches (like CNNs), every token can directly attend to every other token. This allows the model to capture long-range dependencies in a single step.

The Transformer: multi-head attention + feed-forward networks + residual connections

Self-Attention in Plain English

For each token, self-attention asks: "Which other tokens in this sequence should I pay attention to?" It computes a weighted average of all tokens' values, where the weights are determined by how "relevant" each token is to the current one. This is computed using three learned projections: Query, Key, and Value.

Self-Attention Visualizer

Hover over a token to see what it "attends to" — which other tokens influence its representation.

Attention Matrix (row = query, col = key; values in %)

        The  cat  sat   on  the  mat
The      53   17    9    9    6    5
cat      11   42   21    8    6   13
sat       5   20   45   10    7   13
on        4   11   13   47   15   10
the       5    8   12   12   51   12
mat       5   14   17    6   10   48

Hover over tokens in the demo above to see attention patterns. Notice how content words (nouns, verbs) tend to attend strongly to each other, while function words ("the", "on") have more diffuse attention. This is how the model builds contextual understanding.

python
import torch

# Self-attention in ~10 lines of PyTorch.
# W_q, W_k, W_v are learned projection matrices of shape (d_model, d_k).
def self_attention(x, W_q, W_k, W_v):
    """x: (batch, seq_len, d_model)"""
    Q = x @ W_q  # queries: what am I looking for?
    K = x @ W_k  # keys: what do I contain?
    V = x @ W_v  # values: what do I offer?

    # Attention scores, scaled by √d_k to keep the softmax well-behaved
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = torch.softmax(scores, dim=-1)

    # Weighted combination of values
    return weights @ V

Multi-head attention runs multiple attention operations in parallel, each with different learned projections. This lets the model attend to different types of relationships simultaneously — one head might capture syntax, another semantics, another positional patterns.
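In practice you rarely wire the heads by hand: nn.MultiheadAttention bundles the per-head projections and the output projection. A minimal self-attention call (untrained weights, shapes only):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8  # each head works in 512 / 8 = 64 dimensions
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(2, 10, d_model)  # (batch, seq_len, d_model)
out, weights = attn(x, x, x)     # self-attention: query = key = value = x

print(out.shape)      # torch.Size([2, 10, 512]) — same shape as the input
print(weights.shape)  # torch.Size([2, 10, 10]) — attention averaged over heads
```

Passing the same tensor as query, key, and value is exactly what makes it *self*-attention; cross-attention passes a different sequence as key and value.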

08

Large Language Models

Transformers at scale

An LLM is, at its core, a very large Transformer trained on a very large amount of text. The training objective is deceptively simple: predict the next token. Given "The cat sat on the", predict "mat". Do this across trillions of tokens from the internet, and something remarkable emerges — the model learns language, reasoning, code, and world knowledge.

But first, text needs to be converted to numbers. This is where tokenization comes in. Modern LLMs use Byte-Pair Encoding (BPE), which breaks text into subword units. Common words stay whole, rare words get split into pieces.

Tokenizer Explorer

"The neural network is a deep learning model"
→ [The] [·neural] [·network] [·is] [·a] [·deep] [·learning] [·model]

Characters: 43 | Tokens: 8 | Ratio: 5.4 chars/token

Type different text in the demo above to see how tokenization works. Notice how common words are single tokens, but unusual words get broken into subword pieces. This is why LLMs can handle any text — even made-up words — by decomposing them into known subwords.
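That splitting behavior can be imitated with a toy greedy longest-match tokenizer. This is not real BPE (which learns merges from corpus statistics) and the vocabulary is made up, but it shows how unknown words decompose into known pieces:

```python
# Toy subword tokenizer: greedy longest-match against a tiny, made-up vocabulary.
VOCAB = {"the", "neur", "al", "net", "work", "un", "believ", "able"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: fall back to one char
            i += 1
    return tokens

print(tokenize("the"))           # ['the'] — common word, one token
print(tokenize("neural"))        # ['neur', 'al'] — split into subwords
print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

The character-level fallback is why such tokenizers can never fail on out-of-vocabulary input — worst case, a word becomes one token per character (or per byte, in byte-level BPE).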

Scaling Laws

One of the most important discoveries in modern AI is that model performance improves predictably with scale. More parameters, more data, more compute — each gives a reliable improvement following a power law. This is why the field has been racing to train ever-larger models.

The Scaling Timeline

Model | Parameters | Year
GPT-2 | 1.5B | 2019
GPT-3 | 175B | 2020
PaLM | 540B | 2022
LLaMA 2 | 70B | 2023
GPT-4 | ~1.8T (unconfirmed estimate) | 2023
LLaMA 3 | 405B | 2024
DeepSeek V3 | 671B | 2024

Scaling laws show that model performance improves predictably with more parameters, more data, and more compute. The relationship follows a power law: doubling parameters gives a consistent improvement in loss, but with diminishing returns.
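The power-law claim is easy to make concrete. Under a hypothetical law L(N) = A·N^(−α) (the constants below are illustrative, not fitted values), doubling parameters always shrinks the loss by the same factor 2^(−α), no matter the starting size:

```python
# Hypothetical scaling law: loss(N) = A * N ** (-alpha)
# (A and alpha here are made-up illustrative constants)
A, alpha = 400.0, 0.076

def loss(n_params):
    return A * n_params ** (-alpha)

ratio_small = loss(2e9) / loss(1e9)    # doubling a 1B-parameter model
ratio_large = loss(2e12) / loss(1e12)  # doubling a 1T-parameter model
print(round(ratio_small, 4), round(ratio_large, 4))  # same factor: 2**-alpha
```

The constant ratio is the "predictable" part; the diminishing returns come from the fact that each doubling removes a fixed *fraction* of an ever-smaller loss.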

From Pre-training to Alignment

Raw pre-trained LLMs are powerful but unruly — they'll happily generate toxic content or confidently state falsehoods. The solution is alignment: fine-tuning the model to be helpful, harmless, and honest. This typically involves:

1. Supervised Fine-Tuning (SFT) — Train on high-quality human-written conversations. The model learns the format and style of helpful responses.

2. Reward Modeling — Train a separate model to judge response quality based on human preferences (which response is better?).

3. RLHF / DPO — Use reinforcement learning (or direct preference optimization) to fine-tune the model to maximize the reward model's score.
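As one concrete instance of step 3, the DPO objective fits in a few lines: given log-probabilities of a preferred (chosen) and dispreferred (rejected) response under the policy and a frozen reference model, the loss pushes the policy to widen the preference margin. A sketch (the toy numbers are made up):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss sketch. Each argument is the summed log-probability
    of a whole response under the policy or the frozen reference model."""
    # How much more the policy prefers chosen over rejected,
    # relative to the reference model
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(margin).mean()

# Toy numbers: a policy that already prefers the chosen response gets low loss
good = dpo_loss(torch.tensor([-5.0]), torch.tensor([-20.0]),
                torch.tensor([-10.0]), torch.tensor([-10.0]))
bad = dpo_loss(torch.tensor([-20.0]), torch.tensor([-5.0]),
               torch.tensor([-10.0]), torch.tensor([-10.0]))
print(good.item() < bad.item())  # True
```

The appeal of DPO is that this is a plain supervised loss on preference pairs — no reward model rollout loop, no RL machinery.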

The Bitter Lesson

Rich Sutton's "Bitter Lesson" argues that general methods that leverage computation (like scaling up transformers) ultimately beat methods that leverage human knowledge (like hand-crafted features). The history of LLMs is a powerful example of this principle.

09

Vision-Language-Action Models

When AI learns to see, speak, and act

Three modalities converging: vision, language, and action

The frontier of AI research is multimodal models — systems that can process and generate across multiple modalities: text, images, video, audio, and even physical actions. Vision-Language-Action (VLA) models represent the cutting edge, combining perception, reasoning, and embodied action in a single architecture.

The Architecture of Multimodality

The key insight is that different modalities can be tokenized into a shared representation space, then processed by the same Transformer backbone:

Vision

Images are split into patches (e.g., 16×16 pixels), each encoded by a vision encoder (like ViT or SigLIP) into embedding vectors.

Language

Text is tokenized via BPE into subword tokens, each mapped to an embedding vector through the standard embedding table.

Action

Robot actions (joint angles, gripper commands) are discretized into tokens or predicted as continuous values from the Transformer's output.
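The "actions as tokens" option is just uniform binning. A self-contained sketch (the bin count and the normalized action range are illustrative choices, not from any particular model):

```python
# Discretize a continuous robot action into token IDs via uniform binning.
N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assume actions normalized to [-1, 1]

def action_to_tokens(action):
    tokens = []
    for a in action:
        a = min(max(a, LOW), HIGH)       # clamp to the valid range
        frac = (a - LOW) / (HIGH - LOW)  # map to [0, 1]
        tokens.append(min(int(frac * N_BINS), N_BINS - 1))
    return tokens

def tokens_to_action(tokens):
    # Decode each token to the center of its bin
    return [LOW + (t + 0.5) / N_BINS * (HIGH - LOW) for t in tokens]

action = [0.25, -0.8, 0.0, 1.0]  # e.g., [dx, dy, dz, grip]
tokens = action_to_tokens(action)
decoded = tokens_to_action(tokens)
print(tokens)   # [160, 25, 128, 255]
print(decoded)  # within half a bin width (≈ 0.004) of the original
```

Once actions are integers in [0, 255], the Transformer can predict them with the exact same next-token machinery it uses for text.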

python
# Simplified VLA architecture (a sketch — SigLIPEncoder stands in
# for a real pretrained vision encoder)
class VisionLanguageAction(nn.Module):
    def __init__(self):
        super().__init__()
        # Vision encoder (frozen or fine-tuned)
        self.vision_encoder = SigLIPEncoder()
        self.vision_proj = nn.Linear(1152, 4096)

        # Language model backbone
        self.llm = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-3-8B"
        )

        # Action head
        self.action_head = nn.Linear(4096, 7)  # 7-DoF robot

    def forward(self, image, text_tokens):
        # Encode image into tokens
        vision_tokens = self.vision_proj(
            self.vision_encoder(image)
        )

        # Concatenate with text token embeddings
        text_embeds = self.llm.get_input_embeddings()(text_tokens)
        combined = torch.cat([vision_tokens, text_embeds], dim=1)

        # Process through the LLM, keeping hidden states
        outputs = self.llm(inputs_embeds=combined,
                           output_hidden_states=True)
        hidden = outputs.hidden_states[-1]

        # Predict actions from the last hidden state
        actions = self.action_head(hidden[:, -1, :])
        return actions  # e.g., [dx, dy, dz, rx, ry, rz, grip]

Notable VLA Models

Model | Key Innovation | Year
RT-2 | Actions as text tokens in a VLM | 2023
OpenVLA | Open-source VLA with Llama backbone | 2024
π₀ | Flow matching for continuous action generation | 2024
Octo | Generalist policy via diffusion action heads | 2024
Why VLAs Matter

VLAs represent the convergence of language understanding, visual perception, and physical action into a single model. Instead of separate systems for seeing, thinking, and acting, a VLA processes all modalities through one Transformer — enabling robots to follow natural language instructions while adapting to what they see in real time.

10

Let's Build a Transformer Block

Putting it all together, step by step

We've covered all the pieces. Now let's assemble them into a complete Transformer block — the fundamental unit that powers GPT, BERT, LLaMA, and every modern LLM. We'll build it step by step in PyTorch, and you'll see that it's surprisingly simple once you understand the components.

Each Transformer block contains: multi-head self-attention, layer normalization, a feed-forward network, and residual connections. Stack N of these blocks, add an embedding layer at the bottom and a prediction head at the top, and you have a complete Transformer.

Build a Transformer Block — Step by Step

Step 1: Input Embedding

Tokens are integers. We convert them into dense vectors using an embedding table. Each token ID maps to a learned vector of size d_model.

python
class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Step 1: Token embeddings
        self.embedding = nn.Embedding(
            num_embeddings=50257,  # vocab size
            embedding_dim=d_model
        )
        # Positional encoding
        self.pos_encoding = nn.Embedding(1024, d_model)
["The", "cat", "sat"] → embedding lookup → three vectors of d_model = 512 dims each

Step through the builder above to see each component come together. By the end, you'll have a complete, working Transformer block. The remarkable thing is that this same architecture — with different sizes and training data — powers everything from GPT-4 to robot controllers.
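Putting the remaining steps together, a complete pre-norm Transformer block looks like this (a sketch assembled from the components above, using the same d_model = 512, n_heads = 8, d_ff = 2048 defaults):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Pre-norm + residual around attention, then around the FFN
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = x + self.ff(self.norm2(x))
        return x

block = Block()
x = torch.randn(2, 16, 512)  # (batch, seq_len, d_model)
print(block(x).shape)        # torch.Size([2, 16, 512]) — shape is preserved
```

Because the block maps (batch, seq_len, d_model) to the same shape, you can stack as many as you like; a GPT-style model is an embedding layer, N of these blocks (with a causal mask in the attention), and a Linear head back to vocabulary size.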

The Unreasonable Effectiveness of Transformers

The same architecture works for text (GPT), images (ViT), audio (Whisper), protein folding (AlphaFold 2), weather prediction (Pangu-Weather), and robot control (RT-2). The Transformer is the closest thing we have to a universal computation architecture.

Where to Go From Here

You now have a solid mental model of how deep learning works, from individual neurons to Transformers to LLMs and VLAs. Here are some paths forward:

Build something

Fine-tune a small LLM on your own data. Train a classifier. Build a chatbot. The best way to learn is by doing.

Read the papers

"Attention Is All You Need", "Language Models are Few-Shot Learners" (GPT-3), "Training Compute-Optimal LLMs" (Chinchilla).

Go deeper on CUDA

Understanding GPU programming will give you superpowers. Learn how PyTorch operations map to GPU kernels and how to write custom CUDA code.