Lesson 1: Neural Networks with PyTorch¶
📚 Prerequisites (Your Starting Level)¶
Based on your background:
- Python: Very comfortable
- NumPy/Scientific Computing: Not very comfortable (we'll build this up)
- Math: Familiar with matrices, derivatives, and basic calculus
- Neural Networks: Basic knowledge - you understand:
  - What a neural network is
  - Forward propagation
  - Backward propagation
  - Gradient descent: needs a refresher (covered below)
🎯 Learning Objectives¶
By the end of this lesson, you will be able to:
- Understand gradient descent - how weights are updated to minimize loss
- Explain backpropagation - how PyTorch computes gradients using the chain rule
- Use PyTorch basics - create tensors with `requires_grad=True`, compute a forward pass, call `backward()`
- Build a training loop from scratch - implement the complete cycle: forward → loss → backward → update
- Understand automatic differentiation - how PyTorch performs calculus automatically
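To make these concrete before the Q&A, here is a minimal sketch of the PyTorch pieces named above, on a toy one-parameter computation (the numbers are made up for illustration):

```python
import torch

# A tensor that autograd will track.
w = torch.tensor(2.0, requires_grad=True)

# Forward pass: a tiny computation involving w.
loss = (w * 3.0 - 5.0) ** 2

# Backward pass: autograd fills in w.grad = dLoss/dw.
loss.backward()

print(loss.item())  # 1.0, since (2*3 - 5)^2 = 1
print(w.grad)       # tensor(6.), since dLoss/dw = 2*(3w - 5)*3 = 6 at w = 2
```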
Q: What's a loss function?¶
A: A loss function quantifies the error in your prediction. For an input x, if your model predicts f(x) but the actual value is y, the loss function tells you how far f(x) is from y.
Q: Examples of loss functions¶
A: There are many answers here, but one common choice is mean squared error: the average of (y - f(x))^2 over the n predictions made. Another is cross-entropy loss.
Q: When might you use the different loss functions mentioned?¶
A: The former is usually used for regression (predicting a numeric value), and the latter for classification (choosing one correct answer out of several classes).
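As a sketch of how both losses look in PyTorch, using the built-in nn.MSELoss and nn.CrossEntropyLoss (all numbers here are made up):

```python
import torch
import torch.nn as nn

# Regression: mean squared error between predictions and targets.
pred = torch.tensor([2.5, 0.0, 2.0])
target = torch.tensor([3.0, -0.5, 2.0])
mse = nn.MSELoss()
print(mse(pred, target))  # (0.5^2 + 0.5^2 + 0^2) / 3 ≈ 0.1667

# Classification: cross-entropy over raw scores (logits), one row per example.
logits = torch.tensor([[2.0, 0.5, 0.1]])  # scores for 3 classes
label = torch.tensor([0])                 # index of the correct class
ce = nn.CrossEntropyLoss()
print(ce(logits, label))  # negative log-probability assigned to class 0
```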
Q: What does the gradient at a point on a function give you?¶
A: It gives you the rate and direction of change of the function at that point (with respect to the input). In other words, it answers the question: if you change the input in a certain direction, how does the function's output change (does it increase or decrease, and by how much)?
Q: Using this, how might you utilize the gradient to improve your prediction?¶
A: We should move the inputs of the loss function in the direction that decreases the loss, i.e. opposite to the gradient.
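A tiny autograd sketch of this idea, using the made-up one-parameter loss (w - 3)^2:

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
loss = (w - 3.0) ** 2  # loss at w = 0 is 9.0
loss.backward()
print(w.grad)  # tensor(-6.): the loss decreases if we increase w

with torch.no_grad():
    w_new = w - 0.1 * w.grad  # a small step opposite to the gradient
print(((w_new - 3.0) ** 2).item())  # 5.76 < 9.0: the loss went down
```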
Q: What should we take the gradient with respect to (in other words, what are the inputs that we change, i.e. what plays the role of x if the loss is f(x))?¶
A: The weights of the neural network. NOT the input data!
Side-note: We need to do partial differentiation (find the gradient with respect to each weight) to understand how to update each weight.
Q: If we have: Input → Weight1 → Layer1Output → Activation → Weight2 → FinalOutput → Loss, then how could we compute the partial derivative of loss wrt Weight1 (Hint 1: chain rule. Hint 2: to simplify it, just give the first step)?¶
A: We'd take the partial derivative of the loss wrt FinalOutput first.
Q: What would be the rest of the steps in the chain rule to get the partial derivative wrt Weight1?¶
A: d(Loss)/d(FinalOutput) * d(FinalOutput)/d(Activation) * d(Activation)/d(Layer1Output) * d(Layer1Output)/d(Weight1)
Q: Assume loss is MSE, so (y - f(x))^2. Now, what is the partial derivative of this wrt f(x) (the output)?¶
A: Treat y as a constant and f(x) as the variable. Expand: (y - f(x))^2 = y^2 - 2y*f(x) + f(x)^2. Take the derivative wrt f(x) term by term: 0 - 2y + 2f(x) = 2f(x) - 2y. Alternate A: Apply the chain rule to (y - u)^2 with u = f(x): the derivative is 2*(y - u) * (-1) = -2(y - u). Therefore, ∂L/∂f(x) = -2(y - f(x)) = 2f(x) - 2y.
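A quick autograd check of this result (the values y = 3 and f(x) = 5 are arbitrary):

```python
import torch

y = torch.tensor(3.0)                      # true value
f = torch.tensor(5.0, requires_grad=True)  # stands in for the model output f(x)

loss = (y - f) ** 2
loss.backward()

print(f.grad.item())           # 4.0
print((2 * f - 2 * y).item())  # 4.0, matches 2f(x) - 2y
```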
Q: If FinalOutput is Weight2 * A + b, where A is the Activation, what is the partial derivative of FinalOutput wrt Activation?¶
A: d(FinalOutput)/d(Activation) = d(Weight2 * A + b)/d(A) = Weight2
Side-note: The activation function is often just f(x) = x if x > 0 else 0. This activation function is called ReLU (Rectified Linear Unit).
Q: What would the partial derivative of ReLU wrt x be? In other words, what is d(A)/d(x) where A = relu(x)?¶
A: 1 if x > 0 else 0. Note: ReLU is not strictly differentiable because of the kink at zero, but this doesn't matter in practice: we simply define the derivative at 0 to be 0 (PyTorch's choice) or 1, and the input is exactly zero only rarely.
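A small sketch confirming this with autograd; note that PyTorch picks 0 as the derivative at the kink:

```python
import torch

for val in [-2.0, 0.0, 3.0]:
    x = torch.tensor(val, requires_grad=True)
    torch.relu(x).backward()
    print(val, x.grad.item())
# -2.0 -> 0.0, 3.0 -> 1.0, and 0.0 -> 0.0 at the kink
```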
Q: What would d(Activation)/d(Weight1) be?¶
A: d(Activation)/d(Weight1) = d(relu(Weight1 * x + b))/d(Weight1) = relu'(Weight1 * x + b) * x
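Putting the whole chain together, here is a sketch that checks the manual chain-rule product against autograd, using a bias-free scalar version of Input → Weight1 → ReLU → Weight2 → FinalOutput → Loss (all numbers made up):

```python
import torch

x = torch.tensor(2.0)    # Input
y = torch.tensor(1.0)    # target
w1 = torch.tensor(0.5, requires_grad=True)   # Weight1
w2 = torch.tensor(-1.5, requires_grad=True)  # Weight2

z = w1 * x         # Layer1Output
a = torch.relu(z)  # Activation
out = w2 * a       # FinalOutput
loss = (y - out) ** 2
loss.backward()

# Manual chain rule, term by term:
dL_dout = 2 * out - 2 * y      # ∂L/∂FinalOutput
dout_da = w2                   # ∂FinalOutput/∂Activation
da_dz = 1.0 if z > 0 else 0.0  # ReLU derivative
dz_dw1 = x                     # ∂Layer1Output/∂Weight1

print(w1.grad.item(), (dL_dout * dout_da * da_dz * dz_dw1).item())  # both 15.0
print(w2.grad.item(), (dL_dout * a).item())  # ∂L/∂w2 = ∂L/∂out * a; both -5.0
```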
Side-note: We need to compute the gradient of the loss with respect to each of the weight matrices (one per layer).
Q: After computing ∂L/∂output, how do we find ∂L/∂weight2?¶
A: Chain rule: ∂L/∂weight2 = ∂L/∂output × ∂output/∂weight2. Since output = Weight2 * Activation + b, the second factor is just the Activation.
Q: Once we have the gradient ∂L/∂weight, how exactly do we update the weight?¶
A: weight_new = weight_old - learning_rate × ∂L/∂weight (We subtract because we want to move opposite to the gradient to minimize loss)
Side-note: The learning rate controls how big of a step we take in the direction opposite the gradient.
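A sketch of this update applied repeatedly to the toy loss (w - 3)^2 from earlier (the learning rate of 0.1 is an arbitrary choice):

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1

for step in range(25):
    loss = (w - 3.0) ** 2            # minimized at w = 3
    loss.backward()                  # fills w.grad with ∂loss/∂w
    with torch.no_grad():
        w -= learning_rate * w.grad  # weight_new = weight_old - lr * gradient
    w.grad.zero_()                   # PyTorch accumulates gradients, so reset

print(w.item())  # close to 3.0
```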
Q: Why do we need the activation function between layers?¶
A: Without it, stacking linear layers (weights) just creates another linear transformation. Activations add non-linearity, enabling the network to approximate any continuous function! You can see Michael Nielsen's excellent tutorial to understand this better: http://neuralnetworksanddeeplearning.com/chap4.html
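A quick sketch of the collapse: two bias-free nn.Linear layers stacked with no activation in between compute exactly the same map as a single combined matrix (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin1 = nn.Linear(4, 8, bias=False)
lin2 = nn.Linear(8, 3, bias=False)

x = torch.randn(5, 4)
stacked = lin2(lin1(x))  # two linear layers, no activation

# The same map as one matrix: W2 @ W1.
combined = x @ (lin2.weight @ lin1.weight).T
print(torch.allclose(stacked, combined, atol=1e-6))  # True
```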
Q: What happens if the learning rate is too large or too small?¶
A: Too large: it may overshoot the minimum of the loss function and diverge (i.e. the loss grows instead of shrinking). Too small: training becomes very slow and may get stuck in local minima (the steps are too small to escape them).
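Tying it all together, here is a minimal training loop from scratch in the spirit of the learning objectives: forward → loss → backward → update. The toy data, layer sizes, learning rate, and epoch count are all illustrative choices:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data: learn y = 2x - 1.
x = torch.linspace(-1, 1, 32).unsqueeze(1)
y = 2 * x - 1

model = nn.Sequential(
    nn.Linear(1, 8),  # Weight1 (and bias)
    nn.ReLU(),        # Activation
    nn.Linear(8, 1),  # Weight2 (and bias)
)
loss_fn = nn.MSELoss()
lr = 0.1

for epoch in range(200):
    pred = model(x)          # forward
    loss = loss_fn(pred, y)  # loss
    loss.backward()          # backward: compute every ∂loss/∂weight
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad  # update: step opposite to each gradient
            p.grad.zero_()    # reset for the next epoch

print(loss.item())  # should be small: the network has fit the line
```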