TIL: Creating `sentence-transformers` Embeddings From Bumblebee

One amazing thing about the Elixir machine learning ecosystem is that you can run state-of-the-art ML models right from your Elixir code. You can go pick your favorite Hugging Face model and get it running without any ceremony.

I was recently creating text embeddings using Bumblebee, and realized that some of the defaults Bumblebee picks are not the same as what the sentence-transformers library uses.

The sentence-transformers library produces this embedding (I’m using Simon Willison’s llm tool with the llm-sentence-transformers plugin):

❯ llm embed -m sentence-transformers/all-MiniLM-L6-v2 -c 'Hello'
[-0.0627717524766922, 0.054958831518888474, 0.05216477811336517, 0.08578992635011673, -0.08274892717599869, -0.07457292079925537, 0.06855466216802597, ...

While Bumblebee produces something entirely different:

{:ok, model_info} = Bumblebee.load_model({:hf, "sentence-transformers/all-MiniLM-L6-v2"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "sentence-transformers/all-MiniLM-L6-v2"})

serving = Bumblebee.Text.TextEmbedding.text_embedding(model_info, tokenizer)

text = "Hello"
Nx.Serving.run(serving, text)

%{
  embedding: #Nx.Tensor<
    f32[384]
    [-0.03655402362346649, 0.013817708939313889, 0.020922871306538582, 0.05819102004170418, -0.04180845618247986 ...

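The difference comes down to which model output the serving reads the embedding from. If you run the bare model, you can see it returns both a token-level hidden state and a pooled state, and as far as I can tell Bumblebee's text embedding serving defaults to the pooled one. A quick way to look at both (a sketch, following the same setup as above):

{:ok, model_info} = Bumblebee.load_model({:hf, "sentence-transformers/all-MiniLM-L6-v2"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "sentence-transformers/all-MiniLM-L6-v2"})

inputs = Bumblebee.apply_tokenizer(tokenizer, "Hello")
outputs = Axon.predict(model_info.model, model_info.params, inputs)

# The default serving embedding above comes from this {batch, 384} output
outputs.pooled_state

# One vector per token, shaped {batch, tokens, 384}
outputs.hidden_state
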
sentence-transformers, on the other hand, seems to default to mean-pooling the token-level hidden state and L2-normalizing the result. So, we need to tell Bumblebee to do the same:

{:ok, model_info} = Bumblebee.load_model({:hf, "sentence-transformers/all-MiniLM-L6-v2"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "sentence-transformers/all-MiniLM-L6-v2"})

serving = Bumblebee.Text.TextEmbedding.text_embedding(model_info, tokenizer,
  output_pool: :mean_pooling,
  output_attribute: :hidden_state,
  embedding_processor: :l2_norm)

text = "Hello"
Nx.Serving.run(serving, text)

And we get the same embedding from Bumblebee as we did from sentence-transformers (up to tiny floating-point differences):

%{
  embedding: #Nx.Tensor<
    f32[384]
    [-0.06277184188365936, 0.05495888739824295, 0.052164819091558456, 0.08579001575708389, -0.08274886757135391, ...
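
If you're curious what those three options correspond to, here's a rough manual version, continuing from the bare-model run above — my own sketch, not Bumblebee's actual implementation. Mean pooling averages the token vectors in the hidden state (weighted by the attention mask, so padding tokens don't count), and the :l2_norm processor scales the result to unit length:

# {batch, tokens, 1} mask: 1 for real tokens, 0 for padding
mask = Nx.new_axis(inputs["attention_mask"], -1)

# Average the token vectors, ignoring padded positions
mean_pooled =
  outputs.hidden_state
  |> Nx.multiply(mask)
  |> Nx.sum(axes: [1])
  |> Nx.divide(Nx.sum(mask, axes: [1]))

# Scale each embedding to unit length
norm = Nx.LinAlg.norm(mean_pooled, axes: [1], keep_axes: true)
Nx.divide(mean_pooled, norm)

This should land on (roughly) the same vector as the configured serving above.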