Benchmarks
at a glance
- 98.1% R@5 on LongMemEval
- 1.5 points above MemPalace
- 12.1 points above Emergence AI
- 377x faster than brute-force dense search at 1M vectors with HNSW
LongMemEval (ICLR 2025)
LongMemEval is a benchmark of 500 questions that test five long-term memory abilities. Each question comes with a history of ~40 conversation sessions (~115k tokens).
results
| system | R@5 | method |
|---|---|---|
| engram v2 | 98.1% | HNSW + BM25 + assistant BM25 + temporal boost + cross-encoder |
| MemPalace (raw) | 96.6% | ChromaDB cosine, verbatim storage |
| engram v1 | 94.7% | HNSW + BM25 + RRF |
| Emergence AI | 86.0% | RAG |
| MemPalace (AAAK) | 84.2% | compressed storage |
| EverMemOS | 83.0% | — |
| TiMem | 76.9% | temporal hierarchical |
per question type
| type | n | R@5 | R@10 |
|---|---|---|---|
| knowledge-update | 72 | 100.0% | 100.0% |
| single-session-user | 64 | 100.0% | 100.0% |
| multi-session | 121 | 99.2% | 99.2% |
| temporal-reasoning | 127 | 96.9% | 97.6% |
| single-session-assistant | 56 | 96.4% | 96.4% |
| single-session-preference | 30 | 93.3% | 96.7% |
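As a consistency check, the per-type figures can be micro-averaged back into the headline number. The sketch below assumes R@5 counts a question as a hit when any gold session appears among the top 5 retrieved sessions. The helper names are illustrative and not part of engram, and per-type hit counts are reconstructed by rounding the published percentages.

```python
from typing import Iterable, List, Sequence, Tuple


def recall_at_k(retrieved: Sequence[str], gold: Iterable[str], k: int) -> float:
    """Return 1.0 if any gold id appears in the top-k retrieved ids, else 0.0.

    Assumption: "any gold session in the top k" is the hit criterion.
    """
    top_k = set(retrieved[:k])
    return 1.0 if any(g in top_k for g in gold) else 0.0


# (question type, n, R@5 in percent), copied from the table above
PER_TYPE: List[Tuple[str, int, float]] = [
    ("knowledge-update", 72, 100.0),
    ("single-session-user", 64, 100.0),
    ("multi-session", 121, 99.2),
    ("temporal-reasoning", 127, 96.9),
    ("single-session-assistant", 56, 96.4),
    ("single-session-preference", 30, 93.3),
]


def micro_average(rows: List[Tuple[str, int, float]]) -> float:
    """Weight each type by its question count and return the overall percentage."""
    hits = sum(round(n * pct / 100) for _, n, pct in rows)
    total = sum(n for _, n, _ in rows)
    return 100.0 * hits / total


if __name__ == "__main__":
    print(f"{micro_average(PER_TYPE):.1f}%")
```

The six types in the table sum to 470 questions, and the micro-average over them comes to 98.1%, matching the headline figure.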
what makes it work
v2 adds three improvements over v1:
- assistant-turn BM25 (weight 0.5) — catches answers in assistant responses without polluting the dense index
- timestamp proximity boost — favors sessions closer to the question date
- cross-encoder reranking — jointly scores top-20 candidates against the query
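One way the channels above could be combined is weighted reciprocal rank fusion, followed by a date-proximity boost and a rerank of the top 20 candidates. This is a sketch under assumptions, not engram's actual code. Only the 0.5 assistant weight and the top-20 cutoff come from the list above. The RRF constant `k=60`, the boost size and half-life, and all function names are hypothetical, and `score_pair` stands in for a real cross-encoder.

```python
from collections import defaultdict
from datetime import date
from typing import Callable, Dict, List, Sequence


def weighted_rrf(channels: Dict[str, Sequence[str]],
                 weights: Dict[str, float],
                 k: int = 60) -> Dict[str, float]:
    """Fuse ranked id lists: each channel contributes weight / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for name, ranking in channels.items():
        weight = weights.get(name, 1.0)  # channels without a weight count fully
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    return dict(scores)


def temporal_boost(scores: Dict[str, float],
                   session_dates: Dict[str, date],
                   question_date: date,
                   max_boost: float = 0.1,          # hypothetical value
                   half_life_days: float = 30.0     # hypothetical value
                   ) -> Dict[str, float]:
    """Scale each score by up to (1 + max_boost), halving the bonus every half-life."""
    boosted: Dict[str, float] = {}
    for doc_id, score in scores.items():
        gap_days = abs((question_date - session_dates[doc_id]).days)
        bonus = max_boost * 0.5 ** (gap_days / half_life_days)
        boosted[doc_id] = score * (1.0 + bonus)
    return boosted


def rerank(query: str,
           scores: Dict[str, float],
           texts: Dict[str, str],
           score_pair: Callable[[str, str], float],
           top_n: int = 20) -> List[str]:
    """Keep the top_n fused candidates and reorder them by a joint query-document score."""
    candidates = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return sorted(candidates, key=lambda d: score_pair(query, texts[d]), reverse=True)


# The assistant-turn channel gets half weight, as in the list above:
# fused = weighted_rrf(
#     {"dense": dense_ids, "bm25": bm25_ids, "assistant_bm25": assistant_ids},
#     weights={"assistant_bm25": 0.5},
# )
```

Keeping the assistant turns in a separate, down-weighted keyword channel lets them contribute candidates without changing what the dense index stores.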
run the benchmark: `python benchmarks/longmemeval/run_engram.py data/longmemeval_s_cleaned.json --rerank`
benchmark takeaway
The headline score is only part of the result.
Engram reaches it while staying operationally simple:
- one SQLite file
- local retrieval stack
- no external vector DB
- no graph database
- production-friendly latency (15.5ms average for the full pipeline without reranking, 252ms with the cross-encoder)
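To illustrate the single-file design, here is a minimal sketch of BM25 keyword search using SQLite's FTS5 extension through Python's standard library. The table name `memories`, the column name `content`, and the helper functions are assumptions for illustration rather than engram's schema, and the sketch requires a SQLite build with FTS5 enabled.

```python
import sqlite3
from typing import List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a single SQLite file holding a full-text index."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS memories USING fts5(content)")
    return conn


def add_memory(conn: sqlite3.Connection, text: str) -> None:
    """Store one memory; FTS5 indexes it on insert."""
    conn.execute("INSERT INTO memories(content) VALUES (?)", (text,))
    conn.commit()


def search(conn: sqlite3.Connection, query: str, limit: int = 5) -> List[Tuple[int, str]]:
    """Return (rowid, content) pairs ranked by BM25, best match first."""
    # Quote each term so punctuation is not parsed as FTS5 query syntax,
    # then OR the terms so any matching term qualifies a row.
    terms = " OR ".join('"' + t.replace('"', '""') + '"' for t in query.split())
    if not terms:
        return []
    rows = conn.execute(
        "SELECT rowid, content FROM memories WHERE memories MATCH ? "
        "ORDER BY bm25(memories) LIMIT ?",  # bm25() is lower for better matches
        (terms, limit),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = open_store()
    add_memory(conn, "User prefers dark roast coffee in the morning")
    add_memory(conn, "User moved to Lisbon last spring")
    print(search(conn, "coffee"))
```

Passing a file path instead of `":memory:"` keeps the whole index in one file on disk, with no separate search service to run.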
system benchmark (72 tests)
43 subsystem tests across 20 modules:
| subsystem | tests | result |
|---|---|---|
| embedding | 3/3 | dim=384, norm=1.0, avg 5.1ms |
| ANN index (HNSW) | 7/7 | 0.09ms search, 100% recall@10, 5,304 inserts/sec |
| brute-force dense | 2/2 | 0.016ms avg |
| intent classification | 1/1 | 6/6 correct |
| full pipeline (no rerank) | 3/3 | 15.5ms avg |
| full pipeline (+ cross-encoder) | 1/1 | 252ms avg |
| cross-encoder | 2/2 | 2.9ms/doc |
| surprise gate | 4/4 | 0.10ms avg |
| Hopfield channel | 1/1 | <1ms |
| BM25 / FTS5 | 2/2 | 3.5ms avg |
| entity graph | 4/4 | 2-hop traversal |
| memory CRUD | 2/2 | write → ANN → forget |
| layers (L0-L3) | 1/1 | 248ms |
plus 72 pytest tests across store, embeddings, ann_index, retrieval, surprise, and config.
latency
| operation | time |
|---|---|
| ANN dense search | 0.09ms |
| full pipeline (no rerank) | 15.5ms |
| full pipeline (+ cross-encoder) | 252ms |
| embedding | 5.1ms |
| surprise gate | 0.10ms |
| ANN insert | 0.19ms |
| BM25 / FTS5 | 3.5ms |
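Averages like those in the table can be collected with a small timing harness. The sketch below is a generic pattern, not engram's benchmark script. `measure` and its parameters are hypothetical names, and the numbers it yields depend on hardware.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def measure(fn: Callable[[], object], warmup: int = 5, runs: int = 100) -> Dict[str, float]:
    """Time fn over several runs and report average latency and throughput."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialisation before timing

    samples_ms: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    avg_ms = mean(samples_ms)
    ops_per_sec = 1000.0 / avg_ms if avg_ms > 0 else float("inf")
    return {"avg_ms": avg_ms, "ops_per_sec": ops_per_sec}


if __name__ == "__main__":
    stats = measure(lambda: sorted(range(10_000), reverse=True))
    print(f"{stats['avg_ms']:.3f} ms avg, {stats['ops_per_sec']:.0f} ops/sec")
```

The same conversion links latency and throughput: 0.19ms per ANN insert corresponds to roughly 5,300 inserts per second when operations run one at a time.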
ANN scaling
| vectors | brute-force | HNSW | speedup |
|---|---|---|---|
| 1k | 0.1ms | 0.12ms | 1x |
| 10k | 0.9ms | 0.16ms | 5x |
| 100k | 8.7ms | 0.20ms | 45x |
| 500k | 43.7ms | 0.22ms | 198x |
| 1M | 87.3ms | 0.23ms | 377x |
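The speedup column is the ratio of the two latency columns. The snippet below recomputes it from the rounded values in the table. Results differ slightly from the published column (about 380x rather than 377x at 1M), presumably because the published figures were derived from unrounded timings. It also shows the brute-force cost per vector staying roughly constant, which is what linear scaling looks like.

```python
from typing import List, Tuple

# (vectors, brute-force ms, HNSW ms), copied from the table above
ROWS: List[Tuple[int, float, float]] = [
    (1_000, 0.1, 0.12),
    (10_000, 0.9, 0.16),
    (100_000, 8.7, 0.20),
    (500_000, 43.7, 0.22),
    (1_000_000, 87.3, 0.23),
]


def speedup(brute_ms: float, hnsw_ms: float) -> float:
    """How many times faster HNSW is than brute force at one index size."""
    return brute_ms / hnsw_ms


def brute_ns_per_vector(vectors: int, brute_ms: float) -> float:
    """Brute-force cost per stored vector, in nanoseconds."""
    return brute_ms * 1e6 / vectors


for vectors, brute_ms, hnsw_ms in ROWS:
    print(
        f"{vectors:>9,} vectors: "
        f"{speedup(brute_ms, hnsw_ms):6.1f}x speedup, "
        f"{brute_ns_per_vector(vectors, brute_ms):5.1f} ns/vector brute force"
    )
```

Across a 1,000-fold increase in index size, brute-force latency grows by nearly three orders of magnitude while HNSW latency roughly doubles, from 0.12ms to 0.23ms.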
throughput
| operation | rate |
|---|---|
| embedding (MLX GPU) | 1,879 texts/sec |
| embedding (CPU) | 176 texts/sec |
| SQLite bulk insert | 51,000 rows/sec |
| ANN insert | 5,304 ops/sec |