Arcanada — Scrutator: Building the Knowledge Engine Behind Arcanada

Every knowledge base looks simple at first. You write notes, store documents, search with keywords. It works — until it doesn't. The moment you have five thousand documents in two languages, keyword search stops finding things you know are there. Worse — it finds things that match lexically but miss the point entirely.

This is the problem Scrutator solves.

What Scrutator Is

The name comes from Latin: scrutator — "one who thoroughly investigates." Scrutator is the foundational Knowledge Retrieval & Meaning Engine for the Arcanada ecosystem. It provides unified search, retrieval, and meaning extraction across all knowledge sources: wiki, project docs, agent memories, conversations.

It is not a wrapper around a vector database. It is a complete retrieval pipeline: chunking, embedding, indexing, hybrid search, ranking, and a Dreaming module that periodically reorganizes and strengthens connections in the knowledge base.

Scrutator is open source under the MIT license. No secret sauce, just solid engineering that anyone can study, fork, and improve.

Architecture

The system has five core layers, all live in production:

Chunking Engine — Adaptive semantic document splitting with four strategies: markdown headers, code boundaries, sliding window, and single-document mode. The chunker respects document structure instead of blindly cutting at token limits. Parent-child hierarchy preserves context. Content limit: 1 MB per document, zero external dependencies.
Embedding Server — BAAI/bge-m3 model producing three types of vectors simultaneously: dense (1024-dim), sparse (lexical weights), and ColBERT (multi-vector token-level). Three workers, fp16 quantization, running on our own hardware with no external API dependencies.
Hybrid Search — Three-way retrieval combining dense vector similarity, sparse lexical matching, and PostgreSQL full-text search. Results fused via Reciprocal Rank Fusion (RRF, k=60). Three signals catch what any single method would miss.
Storage — PostgreSQL 16 with pgvector 0.8.2 (HNSW indexes, m=16, ef_construction=64) for vectors, tsvector with dual-language generated columns for full-text search (Russian + English). Six tables, 22 indexes. One database, no external dependencies.
Dreaming — A periodic process that reorganizes the knowledge base: builds cross-references, strengthens semantic links, identifies contradictions, and removes redundancy. Integrated with Agent Dreamer for autonomous knowledge maintenance cycles.
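
The Storage layer in the list above might translate into DDL along these lines. This is a minimal sketch, not the project's actual schema: the table name chunks, the column names, and the index names are all hypothetical. Only the vector dimension (1024), the HNSW parameters (m=16, ef_construction=64), the two full-text languages, and the (source_path, chunk_index) conflict key come from this post.

```python
# Hypothetical schema sketch: table, column, and index names are illustrative only.
SCHEMA_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS chunks (
        id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        source_path text NOT NULL,
        chunk_index integer NOT NULL,
        content text NOT NULL,
        embedding vector(1024),
        -- one generated tsvector column per language
        fts_ru tsvector GENERATED ALWAYS AS (to_tsvector('russian', content)) STORED,
        fts_en tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
        -- required for ON CONFLICT (source_path, chunk_index) DO UPDATE
        UNIQUE (source_path, chunk_index)
    )
    """,
    """
    CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
        ON chunks USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    """,
    "CREATE INDEX IF NOT EXISTS chunks_fts_ru_gin ON chunks USING gin (fts_ru)",
    "CREATE INDEX IF NOT EXISTS chunks_fts_en_gin ON chunks USING gin (fts_en)",
]


def apply_schema(conn):
    """Run the statements on any DB-API 2.0 connection (for example psycopg)."""
    cur = conn.cursor()
    try:
        for statement in SCHEMA_STATEMENTS:
            cur.execute(statement)
        conn.commit()
    finally:
        cur.close()
```

Keeping vectors and both tsvector columns in one table is what lets a single database serve all three search signals.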

What Is Working in Production

All five layers are deployed and running on our infrastructure (Tailscale-only access).

Embedding Server (v2.1)

Uses BAAI/bge-m3 via FlagEmbedding's BGEM3FlagModel. Five API endpoints:

POST /v1/embeddings — dense vectors (OpenAI-compatible)
POST /v1/embeddings/sparse — lexical sparse weights
POST /v1/embeddings/colbert — ColBERT multi-vectors
POST /v1/embeddings/hybrid — all three in one call
GET /health — server status with RAM usage and Prometheus metrics
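
A call to the dense endpoint could look like the sketch below. It assumes the request and response follow the usual OpenAI embeddings shape ("model" and "input" in, data[i].embedding out), which is what "OpenAI-compatible" suggests but is not confirmed here. The base URL and the helper names are hypothetical.

```python
import json
import urllib.request

# Hypothetical address; the real server is reachable over Tailscale only.
BASE_URL = "http://embedding-server.example:8000"


def build_dense_request(texts, model="BAAI/bge-m3"):
    """Build an OpenAI-style embeddings request body (assumed shape)."""
    return {"model": model, "input": list(texts)}


def embed_dense(texts, base_url=BASE_URL, timeout=10.0):
    """POST texts to /v1/embeddings and return one dense vector per text."""
    body = json.dumps(build_dense_request(texts)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumed OpenAI-compatible response: vectors live under data[i].embedding.
    return [item["embedding"] for item in payload["data"]]
```

With a reachable server, embed_dense(["hello", "привет"]) would be expected to return two 1024-dimensional vectors.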

Three workers, fp16 quantization, CPU-only (no GPU required). Cross-lingual similarity: 0.887 between Russian and English translations of the same text — 45% higher than the nearest competitor we benchmarked.

Chunking Engine

Four splitting strategies: markdown_headers for structured docs, code_boundaries for source files, sliding_window for flat text, and single for short documents. Language auto-detection for Russian and English. Deduplication on ingest via ON CONFLICT (source_path, chunk_index) DO UPDATE.
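
To illustrate what structure-aware splitting with a parent-child hierarchy can mean, here is a minimal sketch of a markdown header splitter. It is not Scrutator's chunker: the Chunk fields and the function name are hypothetical, and it ignores edge cases such as "#" lines inside fenced code.

```python
import re
from dataclasses import dataclass
from typing import Optional

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    index: int
    heading: str
    text: str
    parent_index: Optional[int]  # index of the enclosing section, if any


def split_markdown_headers(doc: str) -> list[Chunk]:
    """Split a markdown document into one chunk per heading."""
    chunks: list[Chunk] = []
    stack: list[tuple[int, int]] = []  # (heading level, chunk index)
    current: Optional[Chunk] = None
    for line in doc.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # Drop headings at the same or a deeper level; what remains
            # on top of the stack is the parent section.
            while stack and stack[-1][0] >= level:
                stack.pop()
            parent = stack[-1][1] if stack else None
            current = Chunk(len(chunks), match.group(2).strip(), "", parent)
            chunks.append(current)
            stack.append((level, current.index))
        elif current is not None:
            current.text += line + "\n"
        elif line.strip():
            # Text before the first heading becomes its own root chunk.
            current = Chunk(len(chunks), "", line + "\n", None)
            chunks.append(current)
    return chunks
```

Each chunk carries a pointer to its enclosing section, so a retriever can pull in the parent's text when a child chunk matches.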

Hybrid Search Pipeline

The retrieval core. Three-way search: dense cosine similarity over pgvector HNSW indexes, sparse lexical matching, and PostgreSQL full-text search with dual-language tsvector columns. Results fused through RRF (k=60).
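
Reciprocal Rank Fusion itself is small enough to sketch. Each result list contributes 1 / (k + rank) per item, and the contributions are summed per chunk. The function below is a generic illustration with k=60, not Scrutator's code, and the chunk IDs are made up.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical top-3 results from each of the three retrievers.
dense = ["c3", "c1", "c7"]
sparse = ["c1", "c9", "c3"]
fts = ["c1", "c3", "c4"]

fused = rrf_fuse([dense, sparse, fts])
print([chunk_id for chunk_id, _ in fused])  # ['c1', 'c3', 'c9', 'c4', 'c7']
```

Because RRF only looks at ranks, it needs no score normalization between cosine similarity, sparse weights, and full-text rank values.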

Dreaming Module

Semantic analysis of the entire knowledge base: 1,148 chunks indexed, 20 semantic duplicates detected, 50 cross-references built, 50 orphan chunks identified. Analysis completes in 5.5 seconds. The module integrates with Agent Dreamer for autonomous dream cycles — periodic knowledge maintenance that runs without human intervention.

Memory Layer

LTM (Long Term Memory) integration provides AI agents with persistent memory backed by Scrutator's retrieval. Chunk-to-page mapping enables edge write-back from dream analysis directly to source documents.

Benchmarks

Live production measurements on arcana-db (20 iterations per metric):

Metric	Value
2-way search (dense + FTS), p50	383 ms
2-way search, p95	399 ms
3-way search (dense + sparse + FTS), p50	749 ms
3-way search, p95	768 ms
Embedding API round-trip	~350 ms
DB query	<50 ms
API warmup	238 ms
Dream analysis (1,148 chunks)	5.5 s

The 3-way search adds ~366 ms over 2-way (749 ms vs. 383 ms at p50) due to the additional sparse embedding round-trip. In both modes the dominant cost is the embedding API call (~350 ms per round-trip), not the database query (<50 ms).
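
For readers who want to reproduce this kind of measurement, p50 and p95 over a fixed number of iterations can be computed as sketched below. This is a generic harness under assumed names (measure_latency_ms, summarize), not the script that produced the table.

```python
import statistics
import time


def measure_latency_ms(fn, iterations=20):
    """Call fn repeatedly and return wall-clock latencies in milliseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples):
    """Return the median (p50) and the 95th percentile of the samples."""
    # quantiles(n=100) yields 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": statistics.median(samples), "p95": cuts[94]}
```

Passing a function that issues one search request, for example summarize(measure_latency_ms(run_query)), gives numbers comparable in kind to the table; run_query is a placeholder for whatever client call is being timed.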

Testing and Problems We Solved

The project has 174 automated tests across all components — unit, integration, API, and real-file tests. Zero regressions across all build stages. Every component was tested against real documents (Python source, wiki pages, Datarim workflow files), not just synthetic data.

Problems We Hit

RAM surprise. The original plan predicted 450 MB for the Embedding Server with fp16 quantization. Actual usage: roughly 2,400 MB per worker, 6.9 GB in total across three workers. BGE-M3 loads additional components (sparse_linear, colbert_linear) that the base dense model footprint does not account for. We documented it, adjusted server specs, and moved on.

Transformers 5.x breakage. A routine dependency update to transformers 5.x broke the embedding pipeline — the function is_torch_fx_available was removed upstream. Fix: pin transformers>=4.45,<5.0 until the ecosystem catches up.
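
A pin in the dependency list can be backed by a startup check so that a bad environment fails loudly. The sketch below is one possible guard, assuming the >=4.45,<5.0 range from the fix above. The function names are hypothetical and the version parsing is deliberately simple (major.minor only).

```python
import re
from importlib import metadata

# Supported range taken from the pin: >=4.45,<5.0
MIN_VERSION = (4, 45)
MAX_VERSION_EXCLUSIVE = (5, 0)


def is_supported(version: str) -> bool:
    """Return True when the major.minor part of version is inside the range."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        return False
    major_minor = (int(match.group(1)), int(match.group(2)))
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


def check_transformers() -> None:
    """Raise RuntimeError if transformers is missing or outside the range."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError as exc:
        raise RuntimeError("transformers is not installed") from exc
    if not is_supported(installed):
        raise RuntimeError(
            f"transformers {installed} is outside the supported range >=4.45,<5.0"
        )
```

Calling check_transformers() before loading the model turns an obscure import error into a clear message about the version range.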

Deploy vs. install confusion. First production deploy failed with ModuleNotFoundError: No module named 'scrutator'. The deployment plan used pip install -r requirements.txt, but the project uses pyproject.toml. Fix: pip install -e . — a 30-second fix after 10 minutes of confusion.

Database permissions. Schema tables were created by the postgres superuser instead of the application user scrutator_app. The API worked in development but failed silently in production. Fix: ALTER TABLE ... OWNER TO scrutator_app for all six tables.

Edge write-back architecture. The Dreaming module initially attempted to write edges using page paths, but the database stores chunk UUIDs, so each path had to be resolved in a separate request. This caused 100–200 extra HTTP round-trips per dream cycle. Solution: a dedicated batch lookup endpoint with server-side path-to-UUID resolution, replacing hundreds of API calls with one.
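
The idea behind the batch lookup can be sketched as follows: the server runs one query for all requested paths and groups the resulting chunk IDs by path. Everything here is an assumption for illustration, including the function name, the row shape, and the query in the docstring.

```python
from collections import defaultdict


def resolve_paths(rows, paths):
    """Group chunk IDs by source path.

    rows is an iterable of (source_path, chunk_id) pairs, as a single query
    such as the following might return (hypothetical table and columns):
        SELECT source_path, id FROM chunks WHERE source_path = ANY(%s)
    """
    wanted = set(paths)
    by_path = defaultdict(list)
    for source_path, chunk_id in rows:
        if source_path in wanted:
            by_path[source_path].append(chunk_id)
    # Paths with no chunks map to an empty list instead of being dropped.
    return {path: by_path.get(path, []) for path in paths}


# Made-up rows standing in for a database result.
rows = [("a.md", "u1"), ("a.md", "u2"), ("b.md", "u3")]
print(resolve_paths(rows, ["a.md", "c.md"]))  # {'a.md': ['u1', 'u2'], 'c.md': []}
```

Returning an explicit empty list for unknown paths lets the caller spot pages that have not been indexed yet without a second request.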

Input length limit. Documents exceeding 32K characters silently hit BGE-M3's 8,192-token input limit. We added a conservative cap of 24,000 characters with clear error messages.
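
A cap like this is a few lines of validation in front of the embedding call. The sketch below uses the 24,000-character limit from above; the exception class, the function name, and the wording of the message are hypothetical.

```python
MAX_INPUT_CHARS = 24_000  # conservative cap described above


class InputTooLongError(ValueError):
    """Raised when a text exceeds the embedding input cap."""


def validate_input(text: str, limit: int = MAX_INPUT_CHARS) -> str:
    """Return text unchanged, or raise with an explicit message if too long."""
    if len(text) > limit:
        raise InputTooLongError(
            f"input is {len(text)} characters; the limit is {limit}. "
            "Split the document into smaller chunks before embedding."
        )
    return text
```

Rejecting the input up front replaces a silent failure with an error the caller can act on.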

How It Connects

Scrutator is the retrieval backend for the entire Arcanada ecosystem:

Long Term Memory — Scrutator is the search layer. When an agent needs to remember something from past conversations or documents, it queries Scrutator.
Agent Dreamer — The Dreaming module plugs into Dreamer's autonomous pipeline. Knowledge maintenance runs on a schedule — not just passive storage, but active reorganization.
Model Connector — LLM integration for semantic query understanding. We're working on using Model Connector as the LLM backend for Scrutator's analysis, with Cursor as the primary connector and Claude as fallback.

Every AI agent in the ecosystem gets access to a unified, multilingual, hybrid search over the entire knowledge base. Not just keyword matching — semantic understanding.

Why Open Source

Knowledge retrieval is infrastructure. Like databases and web servers, it should be transparent, auditable, and improvable by the community. We publish everything: architecture decisions, benchmark results, even our mistakes (the RAM prediction being off by about 5x is documented in the repo). The GitHub repository is MIT-licensed.

What Comes Next

The core engine is built and running. The next phase is about making it smarter:

LLM-powered analysis — Using Model Connector to give Scrutator access to language models for deeper semantic analysis during dream cycles.
Long Term Memory benchmarks — Running production-scale benchmarks to measure retrieval quality and memory persistence across agent sessions.
Self-hosted embeddings for external consumers — Making the embedding API available to other projects in the ecosystem without external API dependencies.

The Series

This post covers the full picture after all core components shipped. Technical deep-dives are planned:

Embedding Server: BGE-M3 sparse + ColBERT + fp16 — how we migrated from SentenceTransformer to BGEM3FlagModel, the RAM surprise, and what we learned.
Chunking Engine: How to Split Knowledge into Meanings — four strategies, real-file testing, and why structure-aware splitting matters.
Hybrid Search: Dense + Sparse + FTS + RRF — why three signals beat one, with production benchmarks.
Dreaming: When Knowledge Starts Thinking — periodic reorganization, the Agent Dreamer integration, and edge write-back architecture.

Follow the blog or star the GitHub repo to stay updated.