Embedding models for Kannada

by Thejesh GN · June 18, 2025

Embedding models transform words, sentences, documents, or digital items into vectors of numbers (usually floats) so that machines can understand and compare them semantically¹. A good embedding model has enough information to capture the meaning and relationships in dense vector form.

I have been using them to replace simple FTS, to implement sophisticated RAG applications, etc. Most embedding models are English-only; however, some multilingual models are available, but their quality varies significantly. So, I conducted a small exercise to determine the optimal embedding model for Kannada. You can find the marimo notebook here. ² This is inspired by Anand’s Embedding similarity threshold post.

As expected, most models do well in English compared to Kannada. You can come to your conclusions by looking at the graph. In Open Weight models, I found nomic-embed-text-v2-moe and snowflake-arctic-embed2 close to my taste as far as Kannada is concerned. In closed models OpenAIs text-embedding-3-large does well. But its the only closed model I tried.

I am yet to try the closed models by Voyage, Cohere, and Open Weight model – intfloat/multilingual-e5-large-instruct. I have added them to the list. I will try them in the next post.

Table of Contents

1 Open Weight Models
2 OpenAI Model
- 2.1 text-embedding-3-large

Open Weight Models

nomic-embed-text-v2-moe

nomic-embed-text-v2-moe is a state-of-the-art multilingual Mixture-of-Experts Embedding Model text embedding model that excels in multilingual retrieval.

nomic-embed-text-v2-moe model for English text

paraphrase-multilingual

paraphrase-multilingual is a sentence-transformers model: It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.