Creating Embeddings with Huggingface

A Crash Course

Yesterday we talked about semantic search and why we need embeddings.

Today, we are taking it one step further.

We’ll create embeddings using the Sentence-Transformers library. By the end of this episode, we will have a Lex Fridman Podcast vector database ready to be searched through!

This is where we are in the process of creating LexGPT!

Why not use OpenAI embeddings?

  • They don’t work offline - we would have to call the API both to build the vector database and to embed every user question. That would require OpenAI API keys, and we want to avoid the dependency if possible - we don’t want to rely on OpenAI just to run the search.

  • In my experience, Sentence-Transformers models actually perform better! I tested this empirically when creating PodcastGPT, and others have reached the same conclusion - e.g. this blog post compares the performance of different embedding models:

    Embedding performance comparison.

Sentence-Transformers - Which model to pick?

There are two types of semantic search, and they determine which model we should pick:

  • symmetric semantic search - the query and the entries in the corpus are of about the same length and carry a similar amount of content. The query could, for example, be “How to learn Python online?” and a matching entry could be “How to learn Python on the web?”

  • asymmetric semantic search - we have a short query (like a question or some keywords) and we want to find a longer paragraph answering it. An example would be the query “What is Python?” paired with the paragraph “Python is an interpreted, high-level and general-purpose programming language. Python’s design philosophy …”

Try to guess which type of semantic search we need…

Yes, you got it right.

It’s an asymmetric semantic search!

Per the Sentence-Transformers documentation, the best models for this task are the pre-trained MS MARCO models. This is why we’ll be using msmarco-distilbert-base-tas-b, the best-performing MS MARCO embedding model.

Let’s get to the nitty-gritty!

Embedding summaries

Embedding texts is very straightforward with Sentence-Transformers.

This is the code snippet that lets us embed the whole directory of summaries:

import os

from sentence_transformers import SentenceTransformer

# Load the MS MARCO model trained for asymmetric semantic search.
model = SentenceTransformer('msmarco-distilbert-base-tas-b')

# Collect paths to all summary files, skipping hidden files like .DS_Store.
input_dir = "summaries"
paths = [
    os.path.join(input_dir, filename)
    for filename in os.listdir(input_dir)
    if not filename.startswith('.')
]

# Read each summary into memory.
summaries = []
for path in paths:
    with open(path, 'r') as f:
        summaries.append(f.read())

# Encode all summaries in one batch - one 768-dimensional vector per summary.
embeddings = model.encode(summaries)

It took me around three and a half minutes to embed all the summaries on an M1 Mac.

That’s it! Each summary has been embedded into a 768-dimensional vector that we can use for further analytics!
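And since the episode promised a searchable vector database, here is how the search itself will look. This sketch uses random stand-ins for the real vectors so it runs on its own; in the actual app, embeddings comes from model.encode(summaries) and query_emb from model.encode(query). Because TAS-B is trained for dot-product similarity, scoring the whole corpus is a single matrix-vector product:

```python
import numpy as np

# Stand-ins for the real vectors: in the app, `embeddings` comes from
# model.encode(summaries) and `query_emb` from model.encode(query).
embeddings = np.random.rand(300, 768).astype(np.float32)
query_emb = np.random.rand(768).astype(np.float32)

# Dot-product score of the query against every summary at once.
scores = embeddings @ query_emb

# Indices of the top-3 summaries, best first; map them back via `paths`.
top_k = np.argsort(-scores)[:3]
print(top_k, scores[top_k])
```

With real embeddings, top_k indexes into the paths list to tell us which episodes best answer the question.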

Remember to save your embeddings!
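The simplest way to do that is NumPy’s native format, since model.encode returns a NumPy array. A minimal sketch (the file name is my own choice, and a random array of the same shape stands in for the real embeddings so the snippet runs on its own):

```python
import numpy as np

# Stand-in for the array returned by model.encode(summaries):
# one 768-dimensional vector per summary.
embeddings = np.random.rand(300, 768).astype(np.float32)

# Save to disk so we never have to re-run the (slow) encoding step.
np.save("embeddings.npy", embeddings)

# Load the matrix back later, e.g. in tomorrow's analysis script.
loaded = np.load("embeddings.npy")
assert np.array_equal(loaded, embeddings)
```

It’s also worth saving the paths list (e.g. as a plain text file), so each row of the matrix can be mapped back to the summary it came from.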

The code for the day can be accessed here.

Tomorrow we will further analyze the embeddings! Stay tuned!

This is the eleventh day of the 30-day AI challenge.

Over the next month, I will be building the Lex Fridman AI engine with you!

If you're reading this, I assume you'd like to build things. If you stick with this newsletter, you will have a running project after a month and will know the technology needed to build AI apps.

I've recently built PodcastGPT and want to share the process with the community. If you haven't seen the app yet, you can get access here: PodcastGPT

This is all for now! See you tomorrow.

Stay focused!

Luke