Mihai Cosma
Emergence through chaos and order.
Agent Harnesses
Orchestration, memory, communication architectures
- panetone single-file Python bridge for 2-way communication with Claude/Codex/Gemini/OpenCode interactive sessions in wezterm-mux-server via Telegram/Signal/Slack. Launch a multi-agent collaborative session via /collab.
- swarmhost multi-agent orchestration with dependency graphs, built by reverse-engineering Claude Code's /batch in <12 hours. Stateless planner/worker/merger architecture with continuous task refill. Used to ship 15K LOC to prod in 26 hours.
- longmem ablation study on agent memory retrieval that beats Mastra-AI's LongMemEval SOTA with just embeddings and terse prompts. Novel application of ColBERT-Zero for long-context retrieval. Conclusion: complexity hurts.
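The scheduling idea behind swarmhost (a dependency graph of tasks, with workers refilled as soon as prerequisites complete) can be illustrated with a minimal sketch. This is not swarmhost's code: `run_graph`, the `deps` mapping, and the `work` callback are hypothetical names, and threads stand in for agent workers. It relies only on the standard-library `graphlib` and `concurrent.futures` modules.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_graph(deps, work, max_workers=4):
    """Run every task in `deps` ({task: set of prerequisites}) through `work`.

    Hypothetical helper for illustration; returns {task: result}.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    results = {}
    in_flight = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Refill: submit every task whose prerequisites are all done.
            for task in sorter.get_ready():
                in_flight[pool.submit(work, task)] = task
            # Wait for at least one worker to finish, then loop to refill.
            finished, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for fut in finished:
                task = in_flight.pop(fut)
                results[task] = fut.result()
                sorter.done(task)  # unlocks tasks that depend on this one
    return results


if __name__ == "__main__":
    # Assumed example graph: plan first, two parallel workers, then a merge.
    deps = {"plan": set(), "api": {"plan"}, "ui": {"plan"}, "merge": {"api", "ui"}}
    print(run_graph(deps, lambda task: f"{task} done"))
```

Because ready tasks are submitted after every completion rather than in fixed waves, a slow task never blocks unrelated branches of the graph.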
Evaluation & Affordances
Quantitative benchmarks, agent capabilities, automated evaluation
- mad novel eval benchmark for LLM agent memory and multi-step planning over long game sequences. Agents reach 91% of ceiling with memory, and only match the random baseline without it.
- human3090 maps the efficient frontier of local coding LLMs at the trailing edge of hardware on my 3090.
- search-claude-history lets Claude efficiently search session history through a simple CLI tool.
- redvsblue.fyi reproducible analysis of how 83 economic metrics vary with party control of the US presidency.
ML/RL Research
Novel architectures, reinforcement learning, empirical results
- reservoir computing for LLMs adds an echo state network (ESN) sidecar to Qwen3.5-0.8B. Triples average exact-match across 23 benchmarks (0.12→0.36) at negligible perplexity cost (+0.3%). Perfect on passkey retrieval, near-perfect on multi-digit arithmetic, strong gains on variable tracking and program tracing. Kicked off with /collab.
- mnk-game AlphaZero-style agent in Rust shows modest gains from transfer learning across board sizes. Hyperparameter sweeps and a 75x speedup over the reference Python implementation, but the improvements don't stack.
- energy spillage attempt to repurpose an ICLR 2026 hallucination metric for test-time tree search. Used /collab for research design. Found that energy spillage is a confidence proxy and underperforms min(log_prob). Killed via predefined go/no-go gates.
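For readers unfamiliar with the reservoir computing item above, here is a minimal sketch of the state update of a leaky echo state network: fixed random weights, never trained, driven by an input sequence. It makes no claim about how the sidecar is wired into Qwen3.5-0.8B. All names, sizes, and the row-normalization used to keep the reservoir stable are assumptions chosen for illustration, and the trained linear readout is omitted.

```python
import math
import random


def make_reservoir(n_in, n_res, scale=0.9, seed=0):
    """Build fixed random input and recurrent weights (hypothetical helper)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_res)]
    w = []
    for _ in range(n_res):
        row = [rng.uniform(-1, 1) for _ in range(n_res)]
        # Normalize each row so its absolute sum equals `scale` (< 1). This
        # bounds the infinity norm, and so the spectral radius, below 1.
        norm = sum(abs(v) for v in row) or 1.0
        w.append([scale * v / norm for v in row])
    return w_in, w


def step(state, u, w_in, w, leak=0.5):
    """One leaky update: x <- (1 - leak) * x + leak * tanh(W_in u + W x)."""
    new_state = []
    for i, x in enumerate(state):
        pre = sum(a * b for a, b in zip(w_in[i], u))
        pre += sum(a * b for a, b in zip(w[i], state))
        new_state.append((1 - leak) * x + leak * math.tanh(pre))
    return new_state


def run(inputs, w_in, w, state=None, leak=0.5):
    """Drive the reservoir with a sequence of input vectors; return all states."""
    if state is None:
        state = [0.0] * len(w)
    states = []
    for u in inputs:
        state = step(state, u, w_in, w, leak)
        states.append(state)
    return states
```

With the recurrent weights scaled this way, the influence of the initial state fades over time, so the reservoir state depends on the input history rather than on where it started. In a full ESN, only a linear readout on these states would be trained.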
Things I Built
- gitrep.fyi star-based reputation scores for GitHub contributors. Sort issues and PRs by who filed them, not just when. Plus star histories and agent-friendly unlimited API.
- clanker-analytics token usage analytics for AI coding tools. Reads local session logs from Claude Code, Codex, and Gemini CLI with per-project breakdowns using DuckDB. Install with uv tool install clanker-analytics.
- wakterm fork of WezTerm that makes Claude/Codex/Gemini/OpenCode first-class terminal citizens. Agent lifecycle CLI, turn-state tracking, session persistence, and 26 upstream bug fixes. The terminal layer underneath panetone.