Embedding models for Kannada

Embedding models transform words, sentences, documents, or digital items into vectors of numbers (usually floats) so that machines can understand and compare them semantically1. A good embedding model has enough information to capture the meaning and relationships in dense vector form.

I have been using them to replace simple FTS, to implement sophisticated RAG applications, etc. Most embedding models are English-only; however, some multilingual models are available, but their quality varies significantly. So, I conducted a small exercise to determine the optimal embedding model for Kannada. You can find the marimo notebook here. 2 This is inspired by Anand’s Embedding similarity threshold post.

As expected, most models do well in English compared to Kannada. You can come to your conclusions by looking at the graph. In Open Weight models, I found nomic-embed-text-v2-moe and snowflake-arctic-embed2 close to my taste as far as Kannada is concerned. In closed models OpenAIs text-embedding-3-large does well. But its the only closed model I tried.


I am yet to try the closed models by Voyage, Cohere, and Open Weight model – intfloat/multilingual-e5-large-instruct. I have added them to the list. I will try them in the next post.

Open Weight Models

nomic-embed-text-v2-moe

nomic-embed-text-v2-moe is a state-of-the-art multilingual Mixture-of-Experts Embedding Model text embedding model that excels in multilingual retrieval.

nomic-embed-text-v2-moe model for Kannada text
nomic-embed-text-v2-moe model for English text
nomic-embed-text-v2-moe model for English text

paraphrase-multilingual

paraphrase-multilingual is a sentence-transformers model: It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

paraphrase-multilingual embedding model for Kannada Text
paraphrase-multilingual embedding model for English Text

bge-m3

BGE-M3 is based on the XLM-RoBERTa and is known for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.

bge-m3 embedding model for Kannada text
bge-m3 embedding model for Kannada text
bge-m3 embedding model for English text

snowflake-arctic-embed2

Arctic Embed 2.0 by Snowflake is their latest frontier embedding model, which has multilingual capabilities.

snowflake-arctic-embed2 model for Kannada text
snowflake-arctic-embed2 model for English text

OpenAI Model

text-embedding-3-large

OpenAI’s text-embedding-3-large model for Kannada Text
OpenAI’s text-embedding-3-large model for English Text

You can read this blog using RSS Feed. But if you are the person who loves getting emails, then you can join my readers by signing up.

Join 2,146 other subscribers

  1. In a more meaningful way than just text compare
  2. If you didn’t know marimo, you should, it’s an useful tool.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.