Lesson 1: Neural Networks with PyTorch¶
📚 Prerequisites (Your Starting Level)¶
Based on your background:
- Python: Very comfortable
- NumPy/Scientific Computing: Not very comfortable (we'll build this up)
- Math: Familiar with matrices, derivatives, and basic calculus
- Neural Networks: Basic knowledge - you understand:
  - What a neural network is
  - Forward propagation
  - Backward propagation
  - Gradient descent: needs a refresher (covered below)
🎯 Learning Objectives¶
By the end of this lesson, you will be able to:
- Understand gradient descent - how weights are updated to minimize loss
- Explain backpropagation - how PyTorch computes gradients using the chain rule
- Use PyTorch basics - create tensors with `requires_grad=True`, compute a forward pass, call `backward()`
- Build a training loop from scratch - implement the complete cycle: forward → loss → backward → update
- Understand automatic differentiation - how PyTorch performs calculus automatically
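To make these concrete before the Q&A, here is a minimal sketch of the PyTorch pieces named above, on a toy one-parameter computation (the numbers are made up for illustration):

```python
import torch

# A tensor that autograd will track.
w = torch.tensor(2.0, requires_grad=True)

# Forward pass: a tiny computation involving w.
loss = (w * 3.0 - 5.0) ** 2

# Backward pass: autograd fills in w.grad = dLoss/dw.
loss.backward()

print(loss.item())  # 1.0, since (2*3 - 5)^2 = 1
print(w.grad)       # tensor(6.), since dLoss/dw = 2*(3w - 5)*3 = 6 at w = 2
```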
Q: What's a loss function?¶
A: A loss function quantifies the error in your prediction. For an input x, if your model predicts f(x) but the actual value is y, the loss function tells you how far f(x) is from y.
Q: Examples of loss functions¶
A: There are many answers here, but one common choice is mean squared error: the average of (y - f(x))^2 over the n predictions made. Another is cross-entropy loss.
Q: When might you use the different loss functions mentioned?¶
A: The former is usually used for regression (predicting a numeric value), and the latter for classification (choosing one correct answer out of several classes).
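As a sketch of how both losses look in PyTorch, using the built-in nn.MSELoss and nn.CrossEntropyLoss (all numbers here are made up):

```python
import torch
import torch.nn as nn

# Regression: mean squared error between predictions and targets.
pred = torch.tensor([2.5, 0.0, 2.0])
target = torch.tensor([3.0, -0.5, 2.0])
mse = nn.MSELoss()
print(mse(pred, target))  # (0.5^2 + 0.5^2 + 0^2) / 3 ≈ 0.1667

# Classification: cross-entropy over raw scores (logits), one row per example.
logits = torch.tensor([[2.0, 0.5, 0.1]])  # scores for 3 classes
label = torch.tensor([0])                 # index of the correct class
ce = nn.CrossEntropyLoss()
print(ce(logits, label))  # negative log-probability assigned to class 0
```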
Q: What does the gradient at a point on a function give you?¶
A: It gives you the rate and direction of change of the function at that point (with respect to the input). In other words, it answers the question: if you change the input in a certain direction, how does the function's output change (does it increase or decrease, and by how much)?
Q: Using this, how might you utilize the gradient to improve your prediction?¶
A: We should move the inputs of the loss function in the direction that decreases the loss, i.e. opposite to the gradient.
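A tiny autograd sketch of this idea, using the made-up one-parameter loss (w - 3)^2:

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
loss = (w - 3.0) ** 2  # loss at w = 0 is 9.0
loss.backward()
print(w.grad)  # tensor(-6.): the loss decreases if we increase w

with torch.no_grad():
    w_new = w - 0.1 * w.grad  # a small step opposite to the gradient
print(((w_new - 3.0) ** 2).item())  # 5.76 < 9.0: the loss went down
```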
Q: What should we take the gradient with respect to (in other words, what are the inputs that we change, i.e. what plays the role of x if the loss is f(x))?¶
A: The weights of the neural network. NOT the input data!
Side-note: We need to do partial differentiation (find the gradient with respect to each weight) to understand how to update each weight.
Q: If we have: Input → Weight1 → Layer1Output → Activation → Weight2 → FinalOutput → Loss, then how could we compute the partial derivative of loss wrt Weight1 (Hint 1: chain rule. Hint 2: to simplify it, just give the first step)?¶
A: We'd take the partial derivative of the loss wrt FinalOutput first.
Q: What would be the rest of the steps in the chain rule to get the partial derivative wrt Weight1?¶
A: d(Loss)/d(FinalOutput) * d(FinalOutput)/d(Activation) * d(Activation)/d(Layer1Output) * d(Layer1Output)/d(Weight1)
Q: Assume loss is MSE, so (y - f(x))^2. Now, what is the partial derivative of this wrt f(x) (the output)?¶
A: Treat y as a constant and f(x) as the variable. Expand: (y - f(x))^2 = y^2 - 2y*f(x) + f(x)^2. Take the derivative wrt f(x) term by term: 0 - 2y + 2f(x) = 2f(x) - 2y. Alternate A: Apply the chain rule to (y - u)^2 with u = f(x): the derivative is 2*(y - u) * (-1) = -2(y - u). Therefore, ∂L/∂f(x) = -2(y - f(x)) = 2f(x) - 2y.
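A quick autograd check of this result (the values y = 3 and f(x) = 5 are arbitrary):

```python
import torch

y = torch.tensor(3.0)                      # true value
f = torch.tensor(5.0, requires_grad=True)  # stands in for the model output f(x)

loss = (y - f) ** 2
loss.backward()

print(f.grad.item())           # 4.0
print((2 * f - 2 * y).item())  # 4.0, matches 2f(x) - 2y
```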
Q: If FinalOutput is Weight2 * A + b, where A is the Activation, what is the partial derivative of FinalOutput wrt Activation?¶
A: d(FinalOutput)/d(Activation) = d(Weight2 * A + b)/d(A) = Weight2
Side-note: The activation function is often just f(x) = x if x > 0 else 0. This activation function is called ReLU (Rectified Linear Unit).
Q: What would the partial derivative of ReLU wrt x be? In other words, what is d(A)/d(x) where A = relu(x)?¶
A: 1 if x > 0 else 0. Note: ReLU is not strictly differentiable because of the kink at zero, but this doesn't matter in practice: we simply define the derivative at 0 to be 0 (PyTorch's choice) or 1, and the input is exactly zero only rarely.
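A small sketch confirming this with autograd; note that PyTorch picks 0 as the derivative at the kink:

```python
import torch

for val in [-2.0, 0.0, 3.0]:
    x = torch.tensor(val, requires_grad=True)
    torch.relu(x).backward()
    print(val, x.grad.item())
# -2.0 -> 0.0, 3.0 -> 1.0, and 0.0 -> 0.0 at the kink
```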
Q: What would d(Activation)/d(Weight1) be?¶
A: d(Activation)/d(Weight1) = d(relu(Weight1 * x + b))/d(Weight1) = relu'(Weight1 * x + b) * x
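Putting the whole chain together, here is a sketch that checks the manual chain-rule product against autograd, using a bias-free scalar version of Input → Weight1 → ReLU → Weight2 → FinalOutput → Loss (all numbers made up):

```python
import torch

x = torch.tensor(2.0)    # Input
y = torch.tensor(1.0)    # target
w1 = torch.tensor(0.5, requires_grad=True)   # Weight1
w2 = torch.tensor(-1.5, requires_grad=True)  # Weight2

z = w1 * x         # Layer1Output
a = torch.relu(z)  # Activation
out = w2 * a       # FinalOutput
loss = (y - out) ** 2
loss.backward()

# Manual chain rule, term by term:
dL_dout = 2 * out - 2 * y      # ∂L/∂FinalOutput
dout_da = w2                   # ∂FinalOutput/∂Activation
da_dz = 1.0 if z > 0 else 0.0  # ReLU derivative
dz_dw1 = x                     # ∂Layer1Output/∂Weight1

print(w1.grad.item(), (dL_dout * dout_da * da_dz * dz_dw1).item())  # both 15.0
print(w2.grad.item(), (dL_dout * a).item())  # ∂L/∂w2 = ∂L/∂out * a; both -5.0
```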
Side-note: We need to compute the gradient of the loss with respect to each of the weight matrices (one per layer).
Q: After computing ∂L/∂output, how do we find ∂L/∂weight2?¶
A: Chain rule: ∂L/∂weight2 = ∂L/∂output × ∂output/∂weight2. Since output = Weight2 * Activation + b, the second factor is just the Activation.
Q: Once we have the gradient ∂L/∂weight, how exactly do we update the weight?¶
A: weight_new = weight_old - learning_rate × ∂L/∂weight (We subtract because we want to move opposite to the gradient to minimize loss)
Side-note: The learning rate controls how big of a step we take in the direction opposite the gradient.
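A sketch of this update applied repeatedly to the toy loss (w - 3)^2 from earlier (the learning rate of 0.1 is an arbitrary choice):

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1

for step in range(25):
    loss = (w - 3.0) ** 2            # minimized at w = 3
    loss.backward()                  # fills w.grad with ∂loss/∂w
    with torch.no_grad():
        w -= learning_rate * w.grad  # weight_new = weight_old - lr * gradient
    w.grad.zero_()                   # PyTorch accumulates gradients, so reset

print(w.item())  # close to 3.0
```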
Q: Why do we need the activation function between layers?¶
A: Without it, stacking linear layers (weights) just creates another linear transformation. Activations add non-linearity, enabling the network to approximate any continuous function! You can see Michael Nielsen's excellent tutorial to understand this better: http://neuralnetworksanddeeplearning.com/chap4.html
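A quick sketch of the collapse: two bias-free nn.Linear layers stacked with no activation in between compute exactly the same map as a single combined matrix (the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin1 = nn.Linear(4, 8, bias=False)
lin2 = nn.Linear(8, 3, bias=False)

x = torch.randn(5, 4)
stacked = lin2(lin1(x))  # two linear layers, no activation

# The same map as one matrix: W2 @ W1.
combined = x @ (lin2.weight @ lin1.weight).T
print(torch.allclose(stacked, combined, atol=1e-6))  # True
```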
Q: What happens if the learning rate is too large or too small?¶
A: Too large: it may overshoot the minimum of the loss function and diverge (i.e. the loss grows instead of shrinking). Too small: training becomes very slow and may get stuck in local minima (the steps are too small to escape them).
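Tying it all together, here is a minimal training loop from scratch in the spirit of the learning objectives: forward → loss → backward → update. The toy data, layer sizes, learning rate, and epoch count are all illustrative choices:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data: learn y = 2x - 1.
x = torch.linspace(-1, 1, 32).unsqueeze(1)
y = 2 * x - 1

model = nn.Sequential(
    nn.Linear(1, 8),  # Weight1 (and bias)
    nn.ReLU(),        # Activation
    nn.Linear(8, 1),  # Weight2 (and bias)
)
loss_fn = nn.MSELoss()
lr = 0.1

for epoch in range(200):
    pred = model(x)          # forward
    loss = loss_fn(pred, y)  # loss
    loss.backward()          # backward: compute every ∂loss/∂weight
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad  # update: step opposite to each gradient
            p.grad.zero_()    # reset for the next epoch

print(loss.item())  # should be small: the network has fit the line
```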