
From a single neuron to transformers, LLMs, and vision-language-action models. Learn by doing — every concept has an interactive demo you can play with.
The fundamental building block of modern AI
At its core, a neural network is a function approximator. You give it inputs, it produces outputs, and through training it learns to map inputs to the correct outputs. The magic is that, given enough capacity, it can approximate any continuous function to arbitrary accuracy (the universal approximation theorem).
The simplest unit is a single neuron (also called a perceptron). It takes some inputs, multiplies each by a weight, adds a bias, and passes the result through an activation function. That's it. Everything else in deep learning is just clever arrangements of this basic operation.
output = activation(w₁·x₁ + w₂·x₂ + ... + wₙ·xₙ + b)
Where w are learnable weights, b is a learnable bias, and the activation function introduces non-linearity.
Play with the sliders above. Notice how changing the weights changes which inputs matter more. The bias shifts the decision boundary. And the activation function shapes the output range. A neural network is just layers of these neurons, where each layer's outputs become the next layer's inputs.
In PyTorch, you'd write a single neuron like this:
```python
import torch
import torch.nn as nn

# A single neuron: 2 inputs → 1 output
neuron = nn.Linear(in_features=2, out_features=1)

# Input tensor
x = torch.tensor([1.0, 0.5])

# Forward pass: weight · input + bias, then activation
output = torch.relu(neuron(x))
print(output)  # e.g. tensor([0.3842], grad_fn=<ReluBackward0>); varies with random init
```

PyTorch calls it nn.Linear because the neuron computes a linear transformation (y = Wx + b) before the activation. The activation function is what makes the overall computation non-linear.
The data structure that powers everything
Before we go further, we need to talk about tensors. If you've used NumPy arrays, you already know tensors — they're the same concept, but with GPU acceleration and automatic differentiation built in.
A tensor is simply an n-dimensional array of numbers. A scalar is a 0D tensor. A vector is 1D. A matrix is 2D. An image is typically a 3D tensor (channels × height × width). A batch of images is 4D. The shape tells you the size of each dimension.
A 2D grid of numbers. Shape: (3, 4) — 3 rows, 4 columns. Think: a batch of 3 data points, each with 4 features.
Click through the tabs above to see how tensors grow in dimensionality. The key insight is that everything in deep learning is a tensor operation. Images, text, audio — they all get converted to tensors before the network sees them.
```python
import torch

# Creating tensors
scalar = torch.tensor(42)               # shape: ()
vector = torch.tensor([1.0, 2.0, 3.0])  # shape: (3,)
matrix = torch.randn(3, 4)              # shape: (3, 4)
batch = torch.randn(32, 3, 224, 224)    # batch of images

# Key operations
print(matrix.shape)   # torch.Size([3, 4])
print(matrix.dtype)   # torch.float32
print(matrix.device)  # cpu (or cuda:0)

# Move to GPU (if available)
if torch.cuda.is_available():
    matrix = matrix.cuda()  # now on GPU!
```

When you see a shape like (32, 3, 224, 224), read it as: "32 images, each with 3 color channels (RGB), each 224 pixels tall and 224 pixels wide." The first dimension is almost always the batch size.

Tensors: from scalars to multi-dimensional arrays
Data flows in, predictions come out
When data enters a neural network, it flows through each layer sequentially — this is the forward pass. Each layer transforms the data: multiplying by weights, adding biases, and applying activation functions. The output of one layer becomes the input to the next.
Think of it like an assembly line. Raw materials (input data) enter one end, and each station (layer) adds something. By the end, you have a finished product (prediction). The network starts with random weights, so its first predictions are garbage. That's okay — training will fix that.
Click "Forward Pass" above to watch data propagate through the network. Notice how each layer transforms the values. In PyTorch, the forward pass is just calling your model like a function:
```python
import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(3, 4)  # 3 inputs → 4 hidden
        self.layer2 = nn.Linear(4, 4)  # 4 → 4
        self.layer3 = nn.Linear(4, 2)  # 4 → 2 outputs

    def forward(self, x):
        x = torch.relu(self.layer1(x))  # activate!
        x = torch.relu(self.layer2(x))
        x = self.layer3(x)  # no activation on output
        return x

model = SimpleNet()
prediction = model(torch.randn(1, 3))  # forward pass!
```

The non-linearity that makes deep learning work
Without activation functions, stacking layers would be pointless — a stack of linear transformations is just one big linear transformation. Activation functions introduce non-linearity, allowing the network to learn complex patterns.
Zeroes out negatives. Simple, fast, and the default choice for hidden layers.
ReLU for hidden layers (fast, simple, works well). GELU for transformers (smoother, used in GPT/BERT). Sigmoid for binary output. Softmax for multi-class output (we'll see this in the Transformer section).
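These behaviors are easy to see directly. A quick sketch using standard PyTorch calls (the sample inputs are made up for illustration):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)  # sample inputs from -3 to 3

print(torch.relu(x))     # negatives clipped to 0
print(torch.sigmoid(x))  # squashed into (0, 1)
print(F.gelu(x))         # smooth curve, slightly negative near 0

# Softmax turns a vector of scores into a probability distribution
logits = torch.tensor([2.0, 1.0, 0.1])
probs = torch.softmax(logits, dim=-1)
print(probs)  # all entries positive, summing to 1
```

Note that softmax operates on a whole vector at once, unlike the element-wise activations above; that is why it shows up at classification outputs rather than in hidden layers.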
How the network learns from its mistakes
The forward pass gives us a prediction. But how does the network improve? This is where the backward pass (backpropagation) comes in. The idea is beautifully simple:
The gradient tells us: "If I increase this weight by a tiny amount, how much does the loss change?" If the gradient is positive, we decrease the weight. If negative, we increase it. The learning rate controls how big each step is.

The loss landscape: gradient descent finds the valley
Try different learning rates in the demo above. Too small and it barely moves. Too large and it overshoots, bouncing around wildly. Finding the right learning rate is one of the most important hyperparameters in deep learning.
Too high → training diverges (loss explodes). Too low → training is painfully slow. Modern optimizers like Adam adapt the learning rate per-parameter, which is why it's the default choice. But you still need to set the initial learning rate.
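The update rule itself fits in one line. Here is a toy sketch on a hypothetical one-parameter loss, L(w) = (w - 3)^2, whose minimum sits at w = 3:

```python
# Gradient descent by hand on a toy loss: L(w) = (w - 3)^2
# (hypothetical 1-parameter "network"; the true minimum is w = 3)

def loss(w):
    return (w - 3) ** 2

def grad(w):
    return 2 * (w - 3)  # dL/dw

w = 0.0
lr = 0.1  # learning rate: the step size
for step in range(50):
    w = w - lr * grad(w)  # step against the gradient

print(w)  # converges toward 3.0
```

Set lr to 1.1 in this sketch and the iterates bounce to ever-larger values instead of converging, which is exactly the "too large" failure mode described above.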
```python
import torch

# The magic of autograd: PyTorch tracks operations
x = torch.tensor([2.0], requires_grad=True)
y = x ** 2 + 3 * x + 1  # some function

y.backward()   # compute gradients!
print(x.grad)  # tensor([7.]) → dy/dx = 2x + 3 = 7

# In practice, you never call backward() on individual ops.
# You call it on the loss:
loss = criterion(model(inputs), targets)
loss.backward()        # computes ALL gradients
optimizer.step()       # updates ALL weights
optimizer.zero_grad()  # reset for next iteration
```

Forward, loss, backward, update — repeat
Now we put it all together. The training loop is the heartbeat of deep learning. Every iteration: push data forward, compute the loss, propagate gradients backward, and update the weights. Do this thousands (or millions) of times, and the network learns.
An epoch is one complete pass through the entire training dataset. You typically train for many epochs, watching the loss decrease over time. When the loss stops improving on a held-out validation set, you stop — that's your trained model.
Watch the loss curve in the demo above. It drops quickly at first (the "easy" patterns), then slows down as the model fine-tunes on harder examples. This is the typical training dynamic. Here's the complete training loop in PyTorch:
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# 1. Define model
model = SimpleNet()

# 2. Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# 3. Training loop (dataset: your torch Dataset)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for epoch in range(20):
    total_loss = 0
    for batch_x, batch_y in loader:
        # Forward pass
        predictions = model(batch_x)
        loss = criterion(predictions, batch_y)

        # Backward pass
        optimizer.zero_grad()  # clear old gradients
        loss.backward()        # compute new gradients
        optimizer.step()       # update weights

        total_loss += loss.item()

    print(f"Epoch {epoch}: loss = {total_loss:.4f}")
```

Epoch: one pass through all training data. Batch: a subset of data processed together (e.g., 32 samples). Iteration: one weight update (one batch). Learning rate: step size for weight updates.
nn.Module and the PyTorch way
In PyTorch, every model is built from nn.Module. It's the base class that gives you parameter tracking, GPU movement, saving/loading, and the forward pass interface. You compose modules like LEGO blocks to build complex architectures.
```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A block with a skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
            nn.Dropout(0.1),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # The residual connection: add input to output
        return x + self.net(self.norm(x))


class DeepModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, n_layers):
        super().__init__()
        self.embed = nn.Linear(input_dim, hidden_dim)
        self.blocks = nn.ModuleList([
            ResidualBlock(hidden_dim)
            for _ in range(n_layers)
        ])
        self.head = nn.Linear(hidden_dim, 10)

    def forward(self, x):
        x = self.embed(x)
        for block in self.blocks:
            x = block(x)
        return self.head(x)
```

Notice two patterns that appear everywhere in modern deep learning:
Add the input to the output: y = x + f(x). This lets gradients flow directly through skip connections, enabling much deeper networks.
Normalize activations to have zero mean and unit variance. Stabilizes training and allows higher learning rates. Applied before or after each sublayer.
Use nn.ModuleList (not a plain Python list) to hold sub-modules. This ensures PyTorch can find all parameters for optimization, GPU transfer, and saving. It's a common beginner mistake to use [] instead.
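The difference is easy to demonstrate. A minimal sketch comparing two hypothetical models, one registering its sub-modules correctly and one not:

```python
import torch.nn as nn

class GoodModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

class BadModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = [nn.Linear(4, 4) for _ in range(3)]  # plain list: invisible to PyTorch!

good = GoodModel()
bad = BadModel()
print(sum(p.numel() for p in good.parameters()))  # 60: 3 × (4·4 weights + 4 biases)
print(sum(p.numel() for p in bad.parameters()))   # 0: the optimizer would see nothing
```

The BadModel still runs forward passes fine, which is what makes this bug so sneaky: nothing fails until you notice training has no effect.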
The architecture that changed everything
For years, Convolutional Neural Networks (CNNs) dominated computer vision and Recurrent Neural Networks (RNNs) dominated language. Then in 2017, the paper "Attention Is All You Need" introduced the Transformer, and everything changed.
The key innovation is self-attention: instead of processing tokens one by one (like RNNs) or looking at local patches (like CNNs), every token can directly attend to every other token. This allows the model to capture long-range dependencies in a single step.

The Transformer: multi-head attention + feed-forward networks + residual connections
For each token, self-attention asks: "Which other tokens in this sequence should I pay attention to?" It computes a weighted average of all tokens' values, where the weights are determined by how "relevant" each token is to the current one. This is computed using three learned projections: Query, Key, and Value.
Hover over a token to see what it "attends to" — which other tokens influence its representation.
Hover over tokens in the demo above to see attention patterns. Notice how content words (nouns, verbs) tend to attend strongly to each other, while function words ("the", "on") have more diffuse attention. This is how the model builds contextual understanding.
```python
import torch

# Self-attention in ~10 lines of PyTorch
def self_attention(x, W_q, W_k, W_v):
    """x: (batch, seq_len, d_model); W_*: learned (d_model, d_k) projections."""
    Q = x @ W_q  # queries: what am I looking for?
    K = x @ W_k  # keys: what do I contain?
    V = x @ W_v  # values: what do I offer?

    # Attention scores, scaled by sqrt(d_k) to keep the softmax well-behaved
    d_k = Q.size(-1)
    scores = (Q @ K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = torch.softmax(scores, dim=-1)

    # Weighted combination of values
    return weights @ V
```

Multi-head attention runs multiple attention operations in parallel, each with different learned projections. This lets the model attend to different types of relationships simultaneously — one head might capture syntax, another semantics, another positional patterns.
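In practice you rarely write the head-splitting yourself; PyTorch ships a built-in. A minimal sketch using nn.MultiheadAttention on made-up data:

```python
import torch
import torch.nn as nn

# 8 heads over a 512-dim model; batch_first=True means (batch, seq, dim)
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 512)  # batch of 2 sequences, 10 tokens each

# Self-attention: queries, keys, and values all come from the same x
out, attn_weights = mha(x, x, x)
print(out.shape)           # torch.Size([2, 10, 512])
print(attn_weights.shape)  # torch.Size([2, 10, 10]), averaged over heads
```

Each head internally works on a 512/8 = 64-dim slice; the outputs of all heads are concatenated and projected back to 512 dims, which is why the output shape matches the input.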
Transformers at scale
An LLM is, at its core, a very large Transformer trained on a very large amount of text. The training objective is deceptively simple: predict the next token. Given "The cat sat on the", predict "mat". Do this across trillions of tokens from the internet, and something remarkable emerges — the model learns language, reasoning, code, and world knowledge.
But first, text needs to be converted to numbers. This is where tokenization comes in. Modern LLMs use Byte-Pair Encoding (BPE), which breaks text into subword units. Common words stay whole, rare words get split into pieces.
Type different text in the demo above to see how tokenization works. Notice how common words are single tokens, but unusual words get broken into subword pieces. This is why LLMs can handle any text — even made-up words — by decomposing them into known subwords.
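The core of BPE training is a simple loop: count adjacent token pairs, merge the most frequent pair into a new token, repeat. A toy sketch in pure Python (not a production tokenizer; the corpus string is made up):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Start from characters; repeatedly merge the most frequent adjacent pair
tokens = list("low lower lowest")
for _ in range(4):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After a few merges the shared stem "low" becomes a single token, which is exactly the behavior described above: frequent substrings stay whole, rare suffixes remain as smaller pieces.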
One of the most important discoveries in modern AI is that model performance improves predictably with scale. More parameters, more data, more compute — each gives a reliable improvement following a power law. This is why the field has been racing to train ever-larger models.
Scaling laws show that model performance improves predictably with more parameters, more data, and more compute. The relationship follows a power law: doubling parameters gives a consistent improvement in loss, but with diminishing returns.
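In symbols, a common form is L(N) = E + A/N^α, where E is the irreducible loss. A sketch with made-up constants (illustrative only, not fitted values from any paper):

```python
# Scaling-law sketch: loss as a power law in parameter count N.
# L(N) = E + A / N**alpha; the constants below are invented for illustration.
E, A, alpha = 1.7, 400.0, 0.34

def loss(n_params):
    return E + A / n_params ** alpha

for n in [1e8, 1e9, 1e10]:
    print(f"{n:.0e} params -> loss {loss(n):.3f}")

# Doubling parameters shrinks the gap above E by the same factor every time:
ratio = (loss(2e9) - E) / (loss(1e9) - E)
print(ratio)  # 2**-alpha, independent of N
```

That constant ratio is the "predictable improvement" and the shrinking absolute gap is the "diminishing returns"; both fall out of the same power law.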
Raw pre-trained LLMs are powerful but unruly — they'll happily generate toxic content or confidently state falsehoods. The solution is alignment: fine-tuning the model to be helpful, harmless, and honest. This typically involves:
Train on high-quality human-written conversations. The model learns the format and style of helpful responses.
Train a separate model to judge response quality based on human preferences (which response is better?).
Use reinforcement learning (or direct preference optimization) to fine-tune the model to maximize the reward model's score.
Rich Sutton's "Bitter Lesson" argues that general methods that leverage computation (like scaling up transformers) ultimately beat methods that leverage human knowledge (like hand-crafted features). The history of LLMs is a powerful example of this principle.
When AI learns to see, speak, and act

Three modalities converging: vision, language, and action
The frontier of AI research is multimodal models — systems that can process and generate across multiple modalities: text, images, video, audio, and even physical actions. Vision-Language-Action (VLA) models represent the cutting edge, combining perception, reasoning, and embodied action in a single architecture.
The key insight is that different modalities can be tokenized into a shared representation space, then processed by the same Transformer backbone:
Images are split into patches (e.g., 16×16 pixels), each encoded by a vision encoder (like ViT or SigLIP) into embedding vectors.
Text is tokenized via BPE into subword tokens, each mapped to an embedding vector through the standard embedding table.
Robot actions (joint angles, gripper commands) are discretized into tokens or predicted as continuous values from the Transformer's output.
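The image half of that tokenization can be sketched with plain tensor ops. A minimal sketch of ViT-style patchifying (16×16 patches of a 224×224 image; the learned projection to the model dimension is omitted):

```python
import torch

# ViT-style patchify: split an image into 16×16 patches, flatten each one
image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)
P = 16

patches = image.unfold(2, P, P).unfold(3, P, P)   # (1, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5)       # (1, 14, 14, 3, 16, 16)
patches = patches.reshape(1, 14 * 14, 3 * P * P)  # (1, 196, 768)

print(patches.shape)  # 196 patch tokens, each a 768-dim vector
# A linear layer would then project each 768-dim patch to the model dimension
```

After this step, the 196 patch vectors are treated exactly like text token embeddings, which is what lets one Transformer backbone consume both modalities.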
```python
import torch
import torch.nn as nn

# Simplified VLA architecture (sketch: SigLIPEncoder stands in for a real
# vision backbone, and the HF model calls are illustrative)
class VisionLanguageAction(nn.Module):
    def __init__(self):
        super().__init__()
        # Vision encoder (frozen or fine-tuned)
        self.vision_encoder = SigLIPEncoder()
        self.vision_proj = nn.Linear(1152, 4096)

        # Language model backbone
        self.llm = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-3-8B"
        )

        # Action head
        self.action_head = nn.Linear(4096, 7)  # 7-DOF robot

    def forward(self, image, text_tokens):
        # Encode image into tokens
        vision_tokens = self.vision_proj(
            self.vision_encoder(image)
        )

        # Concatenate with text tokens
        text_embeds = self.llm.get_input_embeddings()(text_tokens)
        combined = torch.cat([vision_tokens, text_embeds], dim=1)

        # Process through LLM, keeping the hidden states
        hidden = self.llm(
            inputs_embeds=combined, output_hidden_states=True
        ).hidden_states[-1]

        # Predict actions from last hidden state
        actions = self.action_head(hidden[:, -1, :])
        return actions  # e.g., [dx, dy, dz, rx, ry, rz, grip]
```

| Model | Key Innovation | Year |
|---|---|---|
| RT-2 | Actions as text tokens in a VLM | 2023 |
| OpenVLA | Open-source VLA with Llama backbone | 2024 |
| π₀ | Flow matching for continuous action generation | 2024 |
| Octo | Generalist policy via diffusion action heads | 2024 |
VLAs represent the convergence of language understanding, visual perception, and physical action into a single model. Instead of separate systems for seeing, thinking, and acting, a VLA processes all modalities through one Transformer — enabling robots to follow natural language instructions while adapting to what they see in real time.
Putting it all together, step by step
We've covered all the pieces. Now let's assemble them into a complete Transformer block — the fundamental unit that powers GPT, BERT, LLaMA, and every modern LLM. We'll build it step by step in PyTorch, and you'll see that it's surprisingly simple once you understand the components.
Each Transformer block contains: multi-head self-attention, layer normalization, a feed-forward network, and residual connections. Stack N of these blocks, add an embedding layer at the bottom and a prediction head at the top, and you have a complete Transformer.
Tokens are integers. We convert them into dense vectors using an embedding table. Each token ID maps to a learned vector of size d_model.
```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # Step 1: Token embeddings
        self.embedding = nn.Embedding(
            num_embeddings=50257,  # vocab size
            embedding_dim=d_model
        )
        # Positional encoding
        self.pos_encoding = nn.Embedding(1024, d_model)
```

Step through the builder above to see each component come together. By the end, you'll have a complete, working Transformer block. The remarkable thing is that this same architecture — with different sizes and training data — powers everything from GPT-4 to robot controllers.
The same architecture works for text (GPT), images (ViT), audio (Whisper), protein folding (AlphaFold 2), weather prediction (Pangu-Weather), and robot control (RT-2). The Transformer is the closest thing we have to a universal computation architecture.
You now have a solid mental model of how deep learning works, from individual neurons to Transformers to LLMs and VLAs. Here are some paths forward:
Fine-tune a small LLM on your own data. Train a classifier. Build a chatbot. The best way to learn is by doing.
"Attention Is All You Need", "Language Models are Few-Shot Learners" (GPT-3), "Training Compute-Optimal Large Language Models" (Chinchilla).
Understanding GPU programming will give you superpowers. Learn how PyTorch operations map to GPU kernels and how to write custom CUDA code.