Benchmarks

LongMemEval (ICLR 2025)

LongMemEval is a benchmark of 500 questions that test five long-term memory abilities. Each question comes with roughly 40 conversation sessions (~115k tokens) of history to search.

system	R@5	method
engram v2	98.1%	HNSW + BM25 + assistant BM25 + temporal boost + cross-encoder
MemPalace (raw)	96.6%	ChromaDB cosine, verbatim storage
engram v1	94.7%	HNSW + BM25 + RRF
Emergence AI	86.0%	RAG
MemPalace (AAAK)	84.2%	compressed storage
EverMemOS	83.0%	—
TiMem	76.9%	temporal hierarchical
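For readers unfamiliar with the metric, R@5 can be sketched as follows. This is a minimal illustration, not the benchmark's own scorer: it assumes a question counts as a hit when any gold session appears among the top five retrieved sessions, and the names `recall_at_k`, `mean_recall_at_k`, and the session ids are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], gold: Iterable[str], k: int = 5) -> float:
    """Return 1.0 if any gold session id is in the top-k ranked ids, else 0.0.

    Assumption: "any hit" scoring. The source does not say whether R@5
    requires one gold session or all of them to be retrieved.
    """
    top_k = set(ranked[:k])
    return 1.0 if any(g in top_k for g in gold) else 0.0


def mean_recall_at_k(results: List[Tuple[Sequence[str], Set[str]]], k: int = 5) -> float:
    """Average the per-question hit indicator over all questions."""
    if not results:
        return 0.0
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in results) / len(results)


# Two toy questions: (ranked session ids, gold session ids).
results = [
    (["s3", "s7", "s1", "s9", "s2", "s4"], {"s1"}),  # gold at rank 3: hit
    (["s5", "s6", "s8", "s0", "s2", "s4"], {"s4"}),  # gold at rank 6: miss
]
print(mean_recall_at_k(results))
```

Under this reading, 98.1% on 500 questions means the gold session was missing from the top five for roughly 9 or 10 questions.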

v2 adds three improvements over v1:

assistant-turn BM25 (weight 0.5) — catches answers in assistant responses without polluting the dense index
timestamp proximity boost — favors sessions closer to the question date
cross-encoder reranking — jointly scores top-20 candidates against the query
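The three stages above could be combined roughly as in the sketch below. It is an illustration under stated assumptions, not engram's code: it assumes the 0.5 weight is applied as a weight inside reciprocal rank fusion (the fusion method listed for v1), the boost formula and its `strength` and `scale_days` parameters are invented for the example, and a toy word-overlap scorer stands in for the cross-encoder. All function names are hypothetical.

```python
from datetime import date
from typing import Callable, Dict, List


def weighted_rrf(rankings: List[List[str]], weights: List[float], k: int = 60) -> Dict[str, float]:
    """Weighted reciprocal rank fusion: score = sum of weight / (k + rank)."""
    fused: Dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, sid in enumerate(ranking, start=1):
            fused[sid] = fused.get(sid, 0.0) + weight / (k + rank)
    return fused


def temporal_boost(scores: Dict[str, float], session_dates: Dict[str, date],
                   question_date: date, strength: float = 0.1,
                   scale_days: float = 30.0) -> Dict[str, float]:
    """Hypothetical boost: multiply by 1 + strength / (1 + gap_days / scale_days)."""
    boosted = {}
    for sid, score in scores.items():
        gap_days = abs((question_date - session_dates[sid]).days)
        boosted[sid] = score * (1.0 + strength / (1.0 + gap_days / scale_days))
    return boosted


def rerank(query: str, scores: Dict[str, float], texts: Dict[str, str],
           score_pair: Callable[[str, str], float], top_n: int = 20) -> List[str]:
    """Re-score only the top_n candidates jointly with the query; keep the tail order."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    head, tail = ordered[:top_n], ordered[top_n:]
    head.sort(key=lambda sid: score_pair(query, texts[sid]), reverse=True)
    return head + tail


def overlap(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: count of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


# Three rankings of the same sessions: dense, user-turn BM25, assistant-turn BM25.
fused = weighted_rrf([["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]], [1.0, 1.0, 0.5])
dates = {"a": date(2024, 1, 1), "b": date(2023, 1, 1), "c": date(2024, 1, 10)}
boosted = temporal_boost(fused, dates, question_date=date(2024, 1, 11))
texts = {"a": "we booked flights to Lisbon", "b": "the cat likes tuna",
         "c": "your Lisbon hotel is near the river"}
final = rerank("Lisbon hotel", boosted, texts, overlap)
print(final)
```

Keeping the assistant-turn ranking as a separate, down-weighted list means assistant text can contribute to the fused score without being embedded alongside user turns.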

run the benchmark: python benchmarks/longmemeval/run_engram.py data/longmemeval_s_cleaned.json --rerank

43 subsystem tests across 20 modules, plus 72 pytest tests across store, embeddings, ann_index, retrieval, surprise, and config.

vectors	brute-force	HNSW	speedup
1k	0.1ms	0.12ms	1x
10k	0.9ms	0.16ms	5x
100k	8.7ms	0.20ms	45x
500k	43.7ms	0.22ms	198x
1M	87.3ms	0.23ms	377x
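The brute-force column grows roughly linearly with the number of vectors (about 100x more time for 100x more vectors between 10k and 1M), while the HNSW column barely moves. A minimal sketch of what an exhaustive baseline does is shown below. It assumes cosine similarity and uses pure Python for clarity; it is not the code behind the timings above, and `brute_force_top_k` is a hypothetical name.

```python
import heapq
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def brute_force_top_k(query: Sequence[float], vectors: Sequence[Sequence[float]],
                      k: int = 5) -> List[int]:
    """Score the query against every stored vector and return the k best indices.

    Every query touches every vector, so the cost is O(n * d), which is why
    the brute-force timings scale with the collection size.
    """
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(v))), i)
        for i, v in enumerate(vectors)
    )
    return [i for _, i in heapq.nlargest(k, scored)]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]]
print(brute_force_top_k([1.0, 0.1], vectors, k=2))
```

An HNSW index avoids the full scan by walking a layered neighbor graph, trading exactness for a search cost that grows much more slowly with collection size.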