ARCANADA
Blog April 18, 2026

How I Picked an Embedding Model for Arcanada

I have a server: AMD Ryzen 5 3600, 64 GB RAM, no GPU. And a goal: stop paying OpenRouter for embeddings and run my own.

Why? A couple of days earlier I was benchmarking LTM frameworks — Hindsight, Cognee, Graphiti — and OpenRouter was the #1 source of failures. 422 errors on the embeddings endpoint, broken LiteLLM routing, encoding_format incompatibility. I burned $20 on Sonnet before figuring out the bugs weren't in the frameworks — they were in my provider configuration. After that it was clear: a self-hosted embedding API on my own hardware isn't a luxury. It's a necessity.

Candidates

Started with seven models. Killed two before running any tests:

  • jina-embeddings-v3 — CC-BY-NC-4.0 license. Commercial use requires buying a license from Jina AI. No.
  • multilingual-e5-large-instruct — max 512 tokens. My chunks are 800-1500 characters. Won't fit.

Five remained:

Model                                    Parameters                 Dimension  License
BAAI/bge-m3                              568M                       1024       MIT
Alibaba-NLP/gte-multilingual-base        305M                       768        Apache-2.0
nomic-ai/nomic-embed-text-v1.5           137M                       768        Apache-2.0
nomic-ai/nomic-embed-text-v2-moe         305M active (475M total)   768        Apache-2.0
ai-sage/Giga-Embeddings-instruct (Sber)  ~600M                      1024       MIT

How I tested

Python script. 100 text chunks: 50 Russian, 50 English. For each model I measured:

  • Speed: milliseconds per chunk when embedding one chunk at a time (p50), and chunks per second in batches of 32.
  • Cross-lingual quality: cosine similarity for 5 pairs of the same text in Russian and English. If the model understands both languages well, embeddings of the same meaning should be close.
  • Memory: how much RAM the loaded model consumes.
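
The cross-lingual check can be sketched roughly like this. It is a minimal illustration, not the original script: the helper names are made up, `encode` stands for any function that maps a list of texts to a list of vectors, and averaging over the pairs is an assumption about how the single score per model was produced.

```python
import math
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical encoder signature: list of texts in, list of vectors out.
Encoder = Callable[[List[str]], Sequence[Vector]]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_lingual_score(pairs: List[Tuple[str, str]], encode: Encoder) -> float:
    """Mean cosine similarity over (russian_text, english_text) pairs.

    Assumption: the per-model score is the plain average over all pairs.
    """
    ru_vectors = encode([ru for ru, _ in pairs])
    en_vectors = encode([en for _, en in pairs])
    return mean(cosine_similarity(r, e) for r, e in zip(ru_vectors, en_vectors))
```

A score near 1.0 means the Russian and English versions land almost on top of each other in embedding space. A score near 0 means the model treats them as unrelated.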

Everything ran on the server, no GPU, pure CPU inference via sentence-transformers.
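
The two speed numbers can be measured with something like the sketch below. The function names are hypothetical and `encode` is again any callable that embeds a list of texts, so the timing logic does not depend on a particular model.

```python
import time
from statistics import median
from typing import Callable, List, Sequence

# Hypothetical encoder signature: list of texts in, list of vectors out.
Encoder = Callable[[List[str]], Sequence[Sequence[float]]]


def p50_single_ms(chunks: List[str], encode: Encoder) -> float:
    """Median wall-clock time to embed one chunk at a time, in milliseconds."""
    timings_ms = []
    for chunk in chunks:
        start = time.perf_counter()
        encode([chunk])
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return median(timings_ms)


def batch_chunks_per_sec(chunks: List[str], encode: Encoder, batch_size: int = 32) -> float:
    """Throughput when the chunks are embedded in fixed-size batches."""
    start = time.perf_counter()
    for i in range(0, len(chunks), batch_size):
        encode(chunks[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(chunks) / elapsed
```

With sentence-transformers, `encode` could be the `encode` method of `SentenceTransformer("BAAI/bge-m3", device="cpu")`. Running one throwaway call before timing is probably a good idea, so that lazy initialization does not skew the first measurement.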

Results

Two models (GTE and GigaEmbeddings) wouldn't start: library compatibility issues. GTE crashed with an indexing error, GigaEmbeddings with a RoPE initialization failure. Could've fixed them, but the three that worked gave a clear picture.

Model         p50 (ms per chunk)  Chunks/sec (batch of 32)  Cross-lingual similarity  RAM
bge-m3        118                 23.7                      0.887                     914 MB
nomic-v1.5    60                  16.8                      0.610                     893 MB
nomic-v2-moe  86                  45.8                      0.782                     3286 MB

nomic-v1.5 has the lowest single-chunk latency (60 ms), but also the lowest batch throughput (16.8 chunks/sec), and it's weak on Russian: cosine similarity of 0.61 between Russian and English versions of the same text. For a bilingual project, that's not enough.

nomic-v2-moe is better on quality (0.78), but eats 3.3 GB RAM — on a server also running PostgreSQL and Vault, that's too much.

bge-m3 won. Cosine similarity 0.887 — 45% higher than nomic-v1.5. Speed: 118 ms per chunk. RAM: 914 MB, within budget. MIT license. 1024 dimensions — compatible with every LTM framework I'm testing.
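
The decision boils down to a filter and a maximum, which can be written out as a small sketch. The measurements are copied from the results table. The 1024 MB RAM budget is a hypothetical figure chosen for illustration, since the post only says that 914 MB fits and 3.3 GB does not.

```python
# Measured values from the results table (three models that actually ran).
results = {
    "bge-m3":       {"p50_ms": 118, "similarity": 0.887, "ram_mb": 914},
    "nomic-v1.5":   {"p50_ms": 60,  "similarity": 0.610, "ram_mb": 893},
    "nomic-v2-moe": {"p50_ms": 86,  "similarity": 0.782, "ram_mb": 3286},
}

# Hypothetical budget: not stated in the post, picked only for this example.
RAM_BUDGET_MB = 1024

# Keep the models that fit in memory, then take the best cross-lingual score.
eligible = {name: r for name, r in results.items() if r["ram_mb"] <= RAM_BUDGET_MB}
winner = max(eligible, key=lambda name: eligible[name]["similarity"])

# Relative gain of bge-m3 over nomic-v1.5: 0.887 / 0.610 - 1.
relative_gain = results["bge-m3"]["similarity"] / results["nomic-v1.5"]["similarity"] - 1

print(winner, f"{relative_gain:.0%}")
```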

What I deployed

A 90-line FastAPI server. Three files: main.py (API), model.py (model loader), config.py (settings). OpenAI-compatible /v1/embeddings endpoint — any framework that speaks the OpenAI API plugs in with zero changes.
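
For a sense of what "OpenAI-compatible" involves, here is a framework-free sketch of the request-to-response mapping. It is not the deployed main.py. The function name is hypothetical, `encode` is any callable that embeds a list of texts, and the token counts are a crude whitespace-based stand-in because no tokenizer is wired in. It accepts `input` as a string or a list of strings, and handles both `encoding_format` values, where base64 is assumed to mean little-endian float32 bytes.

```python
import base64
import struct
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical encoder signature: list of texts in, list of vectors out.
Encoder = Callable[[List[str]], Sequence[Sequence[float]]]


def build_embeddings_response(payload: Dict[str, Any], encode: Encoder,
                              model_name: str = "bge-m3") -> Dict[str, Any]:
    """Turn an OpenAI-style /v1/embeddings request body into a response body."""
    raw_input = payload["input"]
    # The API allows a single string or a list of strings.
    texts = [raw_input] if isinstance(raw_input, str) else list(raw_input)

    encoding_format = payload.get("encoding_format", "float")
    if encoding_format not in ("float", "base64"):
        raise ValueError(f"unsupported encoding_format: {encoding_format!r}")

    data = []
    for index, vector in enumerate(encode(texts)):
        values = [float(v) for v in vector]
        if encoding_format == "base64":
            # Assumption: clients expect little-endian float32 bytes, base64-encoded.
            packed = struct.pack(f"<{len(values)}f", *values)
            embedding: Any = base64.b64encode(packed).decode("ascii")
        else:
            embedding = values
        data.append({"object": "embedding", "index": index, "embedding": embedding})

    # Rough stand-in for token usage: whitespace-separated words, not real tokens.
    token_estimate = sum(len(text.split()) for text in texts)
    return {
        "object": "list",
        "data": data,
        "model": payload.get("model", model_name),
        "usage": {"prompt_tokens": token_estimate, "total_tokens": token_estimate},
    }
```

In a FastAPI app, a function like this would sit behind an `@app.post("/v1/embeddings")` route, with `encode` bound to the loaded model.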

Bound to Tailscale IP — the service is only visible inside the mesh network, invisible from the public internet. Systemd unit with MemoryMax=8G and auto-restart. The whole deploy took about fifteen minutes.
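
A unit file along these lines would give that behavior. Only `MemoryMax=8G` and the auto-restart come from the actual setup. The user name, paths, port, the 100.64.0.10 address, and the choice of uvicorn as the server are placeholders for illustration.

```ini
[Unit]
Description=Self-hosted embedding API (bge-m3)
# Wait for the network and the Tailscale daemon before binding.
After=network-online.target tailscaled.service
Wants=network-online.target

[Service]
# Hypothetical user and install path.
User=embed
WorkingDirectory=/opt/embedding-api
# Hypothetical Tailscale address and port; binding here keeps the API off the public interface.
ExecStart=/opt/embedding-api/.venv/bin/uvicorn main:app --host 100.64.0.10 --port 8000
Restart=always
RestartSec=5
# Hard memory cap enforced by systemd.
MemoryMax=8G

[Install]
WantedBy=multi-user.target
```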

Final smoke test: 100 chunks (50 RU + 50 EN) in batches of 10. All passed. 135 ms per chunk on average. A request to the public IP timed out, exactly as designed.
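
A smoke test of this shape can be sketched with the standard library alone. The URL is a hypothetical Tailscale address, the function names are made up, and the check for 1024 dimensions assumes bge-m3 is the model being served.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, List

# Hypothetical address of the service inside the mesh network.
API_URL = "http://100.64.0.10:8000/v1/embeddings"


def post_json(url: str, payload: Dict[str, Any], timeout: float = 30.0) -> Dict[str, Any]:
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def smoke_test(chunks: List[str], send: Callable[[Dict[str, Any]], Dict[str, Any]],
               batch_size: int = 10, expected_dim: int = 1024) -> float:
    """Send all chunks in batches; return the average milliseconds per chunk."""
    start = time.perf_counter()
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        body = send({"model": "bge-m3", "input": batch})
        vectors = [item["embedding"] for item in body["data"]]
        if len(vectors) != len(batch):
            raise RuntimeError("expected one embedding per input chunk")
        if any(len(vector) != expected_dim for vector in vectors):
            raise RuntimeError("unexpected embedding dimension")
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms / len(chunks)
```

Against a live service the call would look like `smoke_test(chunks, lambda payload: post_json(API_URL, payload))`. Pointing the same call at the public IP should end in a timeout error if the bind is correct.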

Bottom line

45 minutes from start to working service. No per-request API charges — the model runs on hardware I'm already paying for. 135 ms per chunk — faster than OpenRouter gave me on paid models. And most importantly — no 422 errors, no LiteLLM routing headaches, no dependency on someone else's infrastructure.

If you have a server with 1+ GB of free RAM and Python, bge-m3 via sentence-transformers works out of the box. Fifteen minutes to set up, no GPU needed. For projects with Russian and English content, it's the best open-source option I've found.

What embedding models are you running? If you've benchmarked anything on CPU under heavy load, I'd be curious to compare notes.