I have a server: AMD Ryzen 5 3600, 64 GB RAM, no GPU. And a goal: stop paying OpenRouter for embeddings and run my own.
Why? A couple of days earlier I was benchmarking long-term memory (LTM) frameworks (Hindsight, Cognee, Graphiti), and OpenRouter was the #1 source of failures: 422 errors on the embeddings endpoint, broken LiteLLM routing, encoding_format incompatibility. I burned $20 on Sonnet before figuring out the bugs weren't in the frameworks; they were in my provider configuration. After that it was clear: a self-hosted embedding API on my own hardware isn't a luxury. It's a necessity.
Candidates
Started with seven models. Killed two before running any tests:
- jina-embeddings-v3 — CC-BY-NC-4.0 license. Commercial use requires buying a license from Jina AI. No.
- multilingual-e5-large-instruct — max 512 tokens. My chunks are 800-1500 characters. Won't fit.
Five remained:
| Model | Parameters | Dimension | License |
|---|---|---|---|
| BAAI/bge-m3 | 568M | 1024 | MIT |
| Alibaba-NLP/gte-multilingual-base | 305M | 768 | Apache-2.0 |
| nomic-ai/nomic-embed-text-v1.5 | 137M | 768 | Apache-2.0 |
| nomic-ai/nomic-embed-text-v2-moe | 305M active (475M total) | 768 | Apache-2.0 |
| ai-sage/Giga-Embeddings-instruct (Sber) | ~600M | 1024 | MIT |
How I tested
Python script. 100 text chunks: 50 Russian, 50 English. For each model I measured:
- Speed — milliseconds per chunk (single) and chunks per second in batches of 32.
- Cross-lingual quality — 5 pairs of the same text in Russian and English, measured cosine similarity. If the model understands both languages well, embeddings of the same meaning should be close.
- Memory — how much RAM the loaded model consumes.
Everything ran on the server, no GPU, pure CPU inference via sentence-transformers.
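A minimal sketch of what such a measurement loop could look like, not the actual script. The function names (`cosine`, `benchmark`) and the `encode` callable are assumptions for illustration; memory measurement is left out. The harness takes any function that maps a list of strings to a list of vectors, so it can be exercised without downloading a model.

```python
import math
import statistics
import time


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def benchmark(encode, chunks, pairs, batch_size=32):
    """Measure single-chunk p50 latency, batched throughput, and mean pair similarity.

    encode: callable taking a list of strings, returning one vector per string.
    pairs:  list of (russian_text, english_text) tuples with the same meaning.
    """
    # Single-chunk latency: one call per chunk, report the median in milliseconds.
    latencies_ms = []
    for chunk in chunks:
        start = time.perf_counter()
        encode([chunk])
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    # Batched throughput: chunks per second over the whole set.
    start = time.perf_counter()
    for i in range(0, len(chunks), batch_size):
        encode(chunks[i:i + batch_size])
    elapsed = time.perf_counter() - start

    # Cross-lingual quality: mean cosine similarity over the translation pairs.
    sims = []
    for ru, en in pairs:
        v_ru, v_en = encode([ru, en])
        sims.append(cosine(v_ru, v_en))

    return {
        "p50_ms": statistics.median(latencies_ms),
        "chunks_per_sec": len(chunks) / elapsed if elapsed > 0 else float("inf"),
        "cross_lingual_sim": statistics.mean(sims),
    }


# Assumed wiring with sentence-transformers (not executed here):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("BAAI/bge-m3", device="cpu")
#   report = benchmark(lambda texts: model.encode(texts, batch_size=32), chunks, pairs)
```

Passing the encoder in as a callable keeps the timing logic identical across models, so only the model name changes between runs.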
Results
Two models (GTE and GigaEmbeddings) wouldn't start because of library compatibility issues. GTE crashed with an indexing error, GigaEmbeddings with a RoPE initialization failure. Could've fixed them, but the three that worked gave a clear picture.
| Model | p50 latency, single chunk (ms) | Chunks/sec (batch of 32) | Cross-lingual cosine similarity | RAM |
|---|---|---|---|---|
| bge-m3 | 118 | 23.7 | 0.887 | 914 MB |
| nomic-v1.5 | 60 | 16.8 | 0.610 | 893 MB |
| nomic-v2-moe | 86 | 45.8 | 0.782 | 3286 MB |
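nomic-v1.5 has the lowest single-chunk latency (60 ms), although its batch throughput (16.8 chunks/sec) is the lowest of the three. The bigger problem is that it's weak on Russian: cosine similarity of 0.61 between Russian and English versions of the same text. For a bilingual project, that's not enough.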
nomic-v2-moe is better on quality (0.78), but eats 3.3 GB RAM — on a server also running PostgreSQL and Vault, that's too much.
bge-m3 won. Cosine similarity 0.887 — 45% higher than nomic-v1.5. Speed: 118 ms per chunk. RAM: 914 MB, within budget. MIT license. 1024 dimensions — compatible with every LTM framework I'm testing.
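The comparison is simple arithmetic on the table figures, and it can be checked with a few lines. This sketch only re-derives numbers from the results table; the dictionary layout and the `relative_gain` helper are illustrative names, not part of the benchmark script.

```python
# Figures copied from the results table.
results = {
    "bge-m3":       {"p50_ms": 118, "batch_cps": 23.7, "sim": 0.887, "ram_mb": 914},
    "nomic-v1.5":   {"p50_ms": 60,  "batch_cps": 16.8, "sim": 0.610, "ram_mb": 893},
    "nomic-v2-moe": {"p50_ms": 86,  "batch_cps": 45.8, "sim": 0.782, "ram_mb": 3286},
}


def relative_gain(a, b):
    """How much higher a is than b, as a fraction of b."""
    return a / b - 1.0


baseline = results["nomic-v1.5"]["sim"]
for name, r in results.items():
    gain = relative_gain(r["sim"], baseline)
    single_cps = 1000.0 / r["p50_ms"]  # throughput implied by the single-chunk p50
    print(f"{name}: {gain:+.1%} similarity vs nomic-v1.5, "
          f"{single_cps:.1f} chunks/s single vs {r['batch_cps']} batched")
```

The same loop also shows how much each model gains from batching, which is worth knowing before choosing a batch size for bulk indexing.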
What I deployed
A 90-line FastAPI server. Three files: main.py (API), model.py (model loader), config.py (settings). OpenAI-compatible /v1/embeddings endpoint — any framework that speaks the OpenAI API plugs in with zero changes.
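For readers who want the shape of such an endpoint, here is a hedged sketch. It is not the actual main.py: `normalize_input`, `encode_vector`, `build_response`, and `create_app` are hypothetical names, the `usage` counts are placeholders, and the `embed` callable stands in for whatever the model loader exposes. The base64 branch assumes the little-endian float32 packing that OpenAI clients decode when they request `encoding_format="base64"`.

```python
import base64
import struct


def normalize_input(value):
    """OpenAI clients send `input` as either one string or a list of strings."""
    return [value] if isinstance(value, str) else list(value)


def encode_vector(vector, encoding_format="float"):
    """Return a plain float list, or base64 of little-endian float32 when requested."""
    floats = [float(x) for x in vector]
    if encoding_format == "base64":
        packed = struct.pack(f"<{len(floats)}f", *floats)
        return base64.b64encode(packed).decode("ascii")
    return floats


def build_response(vectors, model_name, encoding_format="float"):
    """Shape vectors into an OpenAI-style embeddings response body."""
    return {
        "object": "list",
        "data": [
            {"object": "embedding", "index": i,
             "embedding": encode_vector(v, encoding_format)}
            for i, v in enumerate(vectors)
        ],
        "model": model_name,
        # Placeholder: real token counts would have to come from the tokenizer.
        "usage": {"prompt_tokens": 0, "total_tokens": 0},
    }


def create_app(embed, model_name="BAAI/bge-m3"):
    """Build the FastAPI app; `embed` maps a list of strings to a list of vectors."""
    from fastapi import Body, FastAPI  # lazy import: the helpers above need only the stdlib

    app = FastAPI()

    @app.post("/v1/embeddings")
    def embeddings(payload: dict = Body(...)):
        texts = normalize_input(payload["input"])
        fmt = payload.get("encoding_format") or "float"
        return build_response(embed(texts), model_name, fmt)

    return app
```

Handling both `encoding_format` values on the server side is one way to avoid the kind of client/provider mismatch that caused trouble earlier.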
Bound to the Tailscale IP, so the service is only visible inside the mesh network and invisible from the public internet. Systemd unit with MemoryMax=8G and auto-restart. The whole deploy took about fifteen minutes.
Final smoke test: 100 chunks (50 RU + 50 EN) in batches of 10. All passed. 135 ms per chunk on average. Public IP — timeout. Exactly as designed.
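A smoke test along these lines can be sketched with the standard library alone. This is an assumed reconstruction, not the script that was run: `post_json` and `smoke_test` are hypothetical names, and the URL is whatever address the service is bound to. The `post` parameter is injectable so the checks can be tried against a stub before pointing them at a live server.

```python
import json
import urllib.request


def post_json(url, payload, timeout=30):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def smoke_test(chunks, url, model="BAAI/bge-m3", batch_size=10,
               expected_dim=1024, post=post_json):
    """Send chunks in batches and check every returned vector.

    Returns the number of chunks verified; raises AssertionError on a mismatch.
    """
    verified = 0
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        body = post(url, {"model": model, "input": batch})
        data = body["data"]
        assert len(data) == len(batch), f"batch at {i}: got {len(data)} vectors"
        for item in data:
            assert len(item["embedding"]) == expected_dim, "unexpected dimension"
        verified += len(batch)
    return verified
```

Running the same function against the public IP should end in a timeout raised by `urlopen`, which is the expected result for a service bound only to the mesh address.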
Bottom line
45 minutes from start to working service. No per-request API charges — the model runs on hardware I'm already paying for. 135 ms per chunk — faster than OpenRouter gave me on paid models. And most importantly — no 422 errors, no LiteLLM routing headaches, no dependency on someone else's infrastructure.
If you have a server with 1+ GB free RAM and Python — bge-m3 via sentence-transformers works out of the box. Fifteen minutes to set up, no GPU needed. For projects with Russian and English content, it's the best open-source option I've found.
What embedding models are you running? If you've benchmarked anything on CPU under heavy load, I'd be curious to compare notes.