Gradient descent

(or how the best algorithms are the simplest and most intuitive)

Param Singh
Mar 12
Let’s talk about gradient descent. This is the topic discussed in Lecture 3 of the FastAI Deep Learning course.

In short, gradient descent is a method to fit a function to data. The steps involved are pretty simple. The fastai deep learning book explains it succinctly and clearly in Chapter 4:

  1. Initialize the weights. We can take random weights for now.

  2. Predict the value of the function based on these weights.

  3. Based on these predictions, calculate how good the model is (its loss).

  4. Calculate the gradient, which measures for each weight, how changing that weight would change the loss.

  5. Step (that is, change) all the weights based on that calculation.

  6. Go back to step 2 and repeat the process.

  7. Iterate until you decide to stop the training process (for instance, because the model is good enough or you don't want to wait any longer).
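
To make these steps concrete, here’s a tiny toy sketch of my own (not from the book or the course): fitting a single weight w so that w * x matches some made-up data, with the gradient worked out by hand instead of by a library.

# Toy example: fit y ≈ w * x with gradient descent, gradient computed by hand.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # the "right" answer is w = 2

w = 0.5    # step 1: initialize the weight (arbitrarily)
lr = 0.01  # learning rate

for _ in range(100):  # steps 6-7: repeat for a fixed number of iterations
    preds = [w * x for x in xs]                                    # step 2: predict
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys))            # step 3: the loss
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs))  # step 4: d(loss)/dw
    w -= lr * grad                                                 # step 5: step the weight

print(w)  # ends up very close to 2.0

With more than one weight, the only step that changes in spirit is step 4: you need one partial derivative per weight, and that bookkeeping is exactly what autograd libraries do for you.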

The most complicated part of this process is actually calculating the gradient. However, the good news is that libraries like PyTorch handle the math for us. I was very surprised to see that the code for gradient descent is a very simple function.
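
One note before the snippet: it relies on a few names defined earlier in the notebook (f, mse, actual and lr). Roughly, in the book’s quadratic-fitting example they look something like the following (this is my reconstruction, so the exact notebook code may differ a bit).

import torch

def f(params):
    a, b, c = params
    t = torch.arange(0, 20).float()  # the independent variable
    return a * (t ** 2) + b * t + c  # a quadratic parameterized by the three weights

def mse(preds, targets):
    return ((preds - targets) ** 2).mean()  # mean squared error

actual = f(torch.tensor([0.75, -9.0, 10.0])) + torch.randn(20)  # noisy data to fit
lr = 1e-5                                                       # learning rate

With those in place, the gradient descent step itself is just: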

def apply_step(params):
    preds = f(params)                      # step 2: predict with the current weights
    loss = mse(preds, actual)              # step 3: measure how good the predictions are
    loss.backward()                        # step 4: compute the gradient
    params.data -= lr * params.grad.data   # step 5: step the weights
    params.grad = None                     # reset the gradient for the next iteration
    return preds

weights = torch.randn(3).requires_grad_()  # step 1: random initial weights
for epoch in range(10):                    # steps 6-7: repeat
    apply_step(weights)                    # the weights are updated in place

Here, the loss.backward() call calculates the gradient and stores it in params.grad for us. After that, everything is pretty easy.
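
To see that mechanism in isolation, here’s a tiny standalone example (mine, not from the course) of autograd populating .grad:

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2     # PyTorch records the operations applied to x
y.backward()   # computes dy/dx and stores it in x.grad
print(x.grad)  # tensor(6.), since dy/dx = 2x = 6 at x = 3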

The results of just this simple gradient descent aren’t the most impressive, even for stuff like Kaggle’s Titanic problem, but it’s pretty cool to see how simple and intuitive basic machine learning algorithms can be.

I’m always amazed by the intuitiveness of a lot of these algorithms in computer science. The thing that really differentiates a good algorithm from a great algorithm is simplicity. A great algorithm gives you the feeling that you could easily have come up with it, given time. I felt the same about Dijkstra’s algorithm or Binary Search Trees the first time I came across them.

I hope this small note about gradient descent makes some sense. If you’ve done this stuff before, I’m looking for ideas about things to build as the deep learning course goes on. Right now, I’m not too deep in the course, so Kaggle problems are fun, but I’m hoping to put this knowledge to actual use somewhere. Let me know if you want to chat!

Things of interest

  • People are now able to run GPT-3 level LLMs on consumer-grade MacBooks. Lawrence Chen (@lawrencecchen) showed a 65B-parameter LLaMA model running on an M1 Max with 64 GB of RAM (Mar 11, 2023). The future is going to be amazing, I can’t wait.
  • Aiku has generated ~200 poems since February 4. I had just assumed that no one used it, but that wasn’t the case. Most of the users seem to be coming from Google, so I tweaked the site a bit (adding sitemaps, etc.) to make it more user-friendly. The main thing I wanted to share is that it’s easy to assume your project is dead if you don’t have any metrics. Having metrics is a good thing. :)
