4 Comments

Nice article. I would like to add something from my understanding, though. The latent factors don't emerge automatically; it's just that recent SoTA work doesn't seem to focus on improving embedding performance through anything other than increasing the size and quality of the data. In the past, we had many improvements in how to best represent textual information in vector space. Word2Vec --> GloVe --> fastText is an example of research that aimed to capture semantic relationships better than the previous generation. Lately, open research has been missing that kind of improvement in embedding performance, even though we know OpenAI's embeddings perform best for their context length on unseen data; we just don't know exactly what improvements they've made. Likewise, we may get something this year (hoping 🤞) that captures semantic relationships better for the same dataset quality and size. That would be an example of research where the change in how embeddings are generated is aimed at improving the underlying language model's understanding.


So if I had to do my own embedding search kind of thing, what would I use? Is fastText the best right now?


I'll use the word nobody likes - "depends" - but most likely no. SoTA embeddings already incorporate a multitude of strategies; you'll have to check which one is best for your task.

You can do that via MTEB:

https://huggingface.co/blog/mteb

https://huggingface.co/spaces/mteb/leaderboard

In general, Instructor embeddings are the best for average performance, but OpenAI embeddings are the best if you need high-dimensional embeddings for longer documents.
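To make that concrete, here's a minimal sketch of an embedding search, assuming the sentence-transformers library and a model picked from the MTEB leaderboard. The model name and toy corpus below are placeholders for illustration, not a recommendation:

```python
# Minimal semantic-search sketch with sentence-transformers.
# Swap the model for whichever one the MTEB leaderboard suggests for your task.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

corpus = [
    "Word2Vec learns word vectors from local context windows.",
    "GloVe factorizes a global word co-occurrence matrix.",
    "fastText builds word vectors from character n-grams.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query = "Which method uses subword information?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine-similarity search over the corpus; returns the top-k closest documents.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], round(hit["score"], 3))
```

The same pattern works with any model on the leaderboard; only the model name (and, for hosted APIs like OpenAI's, the encoding call) changes.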


I like thinking about connections between groups of embeddings, since lots of points in high-dimensional space can be contained in a manifold.
