Discussion about this post

TokenBender:

Nice article. I would like to add something from my understanding, though. The latent factors don't emerge automatically; it is just that recent SoTA work does not seem to focus on improving embedding performance through anything other than increasing the size and quality of the data. In the past we had many improvements in how textual information is best represented in vector space: Word2Vec --> GloVe --> fastText is an example of successive generations of research, each aiming to capture semantic relationships better than the previous one. Lately, open research has been missing that kind of improvement in embedding performance. We know that OpenAI's embeddings have the best performance for their context length on unseen data, but we don't know exactly what improvements they have made. Likewise, we may get something this year (hoping 🤞) that captures semantic relationships better for the same dataset quality and size. That research would be an example of a generational change in embedding approaches aimed at improving the underlying language model's understanding.
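A minimal sketch of that generational difference (assuming gensim is installed; the toy corpus below is invented purely for illustration): Word2Vec only learns vectors for words it saw during training, while fastText composes vectors from character n-grams, so it can still produce an embedding for an unseen word. That subword trick is one concrete example of how a newer embedding generation captured semantic relationships differently from the previous one.

```python
# Minimal sketch (assumes gensim >= 4.x; the tiny corpus is invented for illustration).
from gensim.models import Word2Vec, FastText

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# Train both models on the same data so the only difference is the embedding approach.
w2v = Word2Vec(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=50)
ft = FastText(corpus, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# Both can score similarity between words seen during training.
print("word2vec cat~dog:", w2v.wv.similarity("cat", "dog"))
print("fasttext cat~dog:", ft.wv.similarity("cat", "dog"))

# Only fastText can build a vector for an out-of-vocabulary word from its subwords.
print("'kitten' in word2vec vocab:", "kitten" in w2v.wv)   # False: no vector available
print("fastText vector for 'kitten' (first 5 dims):", ft.wv["kitten"][:5])
```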

Sheikh Abdur Raheem Ali:

I like thinking about connections between groups of embeddings, since lots of points in high-dimensional space can be contained in a manifold.
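A rough sketch of that intuition (NumPy only; the synthetic setup is invented for illustration): points that look scattered across a high-dimensional space can actually lie on a low-dimensional structure, which shows up as only a handful of large singular values when you decompose the centered data.

```python
# Sketch (NumPy only; the synthetic "embeddings" are invented for illustration):
# sample points along a 1-D curve, embed them smoothly in 50 dimensions, and check
# that the singular values of the data reveal its low intrinsic dimensionality.
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, size=500)              # 1-D latent coordinate

# Smooth map from the 1-D parameter into 50-D space, plus a little noise.
directions = rng.normal(size=(3, 50))
X = (np.stack([np.cos(t), np.sin(t), 0.1 * t], axis=1) @ directions
     + 0.01 * rng.normal(size=(500, 50)))

# Singular values of the centered data: only a few are large, the rest are ~noise,
# i.e. the 50-D points effectively live near a much lower-dimensional structure.
singular_values = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print("top 5 singular values:", np.round(singular_values[:5], 2))
print("next 5 singular values:", np.round(singular_values[5:10], 4))
```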
