Let’s talk about embeddings, because they’re cool. The first time I came across embeddings was when I was building AskYC. To build AskYC, I converted the transcripts of the videos on the YC YouTube channel into vectors called embeddings, using OpenAI’s API.
The magical thing here is that the distance between two of these vectors corresponds to the semantic difference between the sentences they represent. This is quite amazing. At that point, I didn’t really understand how any of this worked. However, now I think I have some intuition for it, thanks to this chapter of Deep Learning for Coders.
An embedding is a mapping of a discrete variable to a point in a vector space. Let’s think about this in the context of collaborative filtering. Collaborative filtering is a technique used to create recommender systems. On a high level, here’s how it works:
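In code, an embedding is just a lookup table: each value of the discrete variable gets its own row in a matrix. Here’s a minimal sketch in NumPy (the movie titles and the choice of 3 factors are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# A discrete variable: movie titles (hypothetical examples).
movies = ["Alien", "Amelie", "Casablanca", "Heat"]
movie_to_idx = {title: i for i, title in enumerate(movies)}

# The embedding: one learnable row of n_factors values per movie.
n_factors = 3
embedding_matrix = rng.normal(size=(len(movies), n_factors))

def embed(title: str) -> np.ndarray:
    """Looking up an embedding is just indexing a row of the matrix."""
    return embedding_matrix[movie_to_idx[title]]

print(embed("Alien"))  # a point in 3-dimensional vector space
```

During training, the rows of `embedding_matrix` are the parameters that gradient descent adjusts.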
We have a dataset that contains users and their ratings for particular items (for the sake of this post, we’ll assume items are movies).
From this dataset, we try to learn which users are similar to each other and then use that to recommend new items to users.
In this case, we create vectors for each user and each item. We get to choose how many parameters we want in these vectors. The process of creating these vectors (or training) is simple gradient descent.
1. Initialize the vectors with random values.
2. Calculate a predicted rating for each user and item.
3. Compare these predictions with the actual ratings to calculate a loss.
4. Use this loss to improve the vectors, and repeat from step 2.
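The steps above can be sketched with plain NumPy as matrix factorisation; the toy ratings matrix, learning rate, and factor count below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, n_factors = 4, 5, 3

# Toy ratings (1-5); 0 means the user hasn't rated that movie.
ratings = np.array([
    [5, 3, 0, 1, 2],
    [4, 0, 0, 1, 0],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
], dtype=float)
observed = ratings > 0  # only rated entries contribute to the loss

# Step 1: initialize vectors with random values.
user_vecs = rng.normal(scale=0.1, size=(n_users, n_factors))
movie_vecs = rng.normal(scale=0.1, size=(n_movies, n_factors))

lr = 0.01
for _ in range(5000):
    # Step 2: predicted rating = dot product of user and movie vectors.
    preds = user_vecs @ movie_vecs.T
    # Step 3: error over the observed ratings only.
    err = (preds - ratings) * observed
    # Step 4: gradient descent step on both sets of vectors (simultaneously).
    user_vecs, movie_vecs = (
        user_vecs - lr * err @ movie_vecs,
        movie_vecs - lr * err.T @ user_vecs,
    )

loss = (err ** 2).sum() / observed.sum()
print(f"mean squared error on observed ratings: {loss:.4f}")
```

After training, the rows of `user_vecs` and `movie_vecs` are exactly the embeddings the post describes.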
Each value in these vectors will eventually represent a particular latent factor. For example, the second value in each movie vector might be directly proportional to how much action there is in the movie. These latent factors are picked up by our model during the training process, which is amazing. At this point, we can calculate the distance between two movie vectors, and it serves as a good estimate of how similar the two movies are.
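For instance, with three made-up latent factors (say action, romance, and sci-fi), cosine similarity between two movie vectors estimates how alike the movies are. The titles and numbers below are invented for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means the vectors point the same way; near 0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical learned vectors: [action, romance, sci-fi].
die_hard   = np.array([0.90, 0.10, 0.20])
terminator = np.array([0.80, 0.20, 0.90])
notebook   = np.array([0.05, 0.95, 0.10])

print(cosine_similarity(die_hard, terminator))  # high: both action-heavy
print(cosine_similarity(die_hard, notebook))    # low: little overlap
```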
So on a high level, this is how embeddings work. As a software engineer who’s mostly experienced with concrete problems and solutions, I find it a little hard to internalise that we know the process by which these embeddings are created, yet the latent factors just emerge on their own.
Nice article. I would like to add something from my understanding, though. The latent factors don't emerge automatically. It's just that recent SoTA work doesn't seem to focus on improving embedding performance through anything other than increasing the size and quality of the data. In the past, we had many improvements in how best to represent textual information in vector space: Word2Vec --> GloVe --> fastText is an example of a line of research that aimed to capture semantic relationships better than previous generations. Lately, open research has been missing such improvements; we know that OpenAI's embeddings have the best performance for their context length on unseen data, but we don't know exactly what improvements they've made. Likewise, we may yet get something this year (hoping 🤞) that captures semantic relationships better for the same dataset quality and size. That would be an example of research where a new generation of embedding approaches aims to improve the underlying language model's understanding.
I like thinking about connections between groups of embeddings, since lots of points in high-dimensional space can be contained in a manifold.