/ AIRMY Research
Pushing the frontier
of production AI.
AIRMY Research publishes work on inference optimisation, agent safety, multi-agent systems, and the infrastructure challenges of deploying AI at scale.
18 Published Papers
4 Research Areas
3 Open-Source Repos
12 Full-Time Researchers
/ Research Areas
Where we focus.
Our research is grounded in the hard problems we encounter running AI infrastructure at production scale.
Inference Optimisation
Speculative decoding, KV-cache management, and batching strategies for sub-50ms P99 latency at scale.
Agent Safety & Alignment
Formal safety constraints for autonomous agents, detection of goal misgeneralisation, and sandboxed execution environments.
Multi-Agent Orchestration
Coordination protocols for heterogeneous agent networks, task decomposition, and emergent behaviour in large agent graphs.
Context & Memory
Efficient long-context architectures, persistent memory systems, and retrieval-augmented generation for production agents.
Human-Agent Collaboration
Handoff patterns, oversight mechanisms, and trust calibration between human operators and autonomous systems.
Evaluation & Benchmarking
Reproducible evaluation frameworks, domain-specific benchmarks, and capability elicitation methodologies.
/ Publications
Recent papers.
Sub-50ms inference through speculative decoding in production agent networks
Marcus Thorn, Yuki Tanaka, Priya Nair
We present a production-validated approach combining speculative decoding with dynamic KV-cache eviction that achieves median inference latency of 48ms across heterogeneous agent workloads at AIRMY scale (10B+ monthly calls). Our method introduces an adaptive speculation budget that adjusts based on real-time token acceptance rates, reducing wasted compute by 34% compared to fixed-budget baselines.
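To make the idea of an adaptive speculation budget concrete, here is a minimal Python sketch. It is not the method from the paper: the class name AdaptiveBudget, the moving-average rule, and every threshold below are assumptions chosen for illustration. It shows only the general shape of the technique, where the number of draft tokens proposed per step grows while the target model keeps accepting them and shrinks when most are rejected.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveBudget:
    """Hypothetical controller for the number of draft tokens per step.

    All defaults are illustrative assumptions, not values from the paper.
    """
    min_tokens: int = 1
    max_tokens: int = 8
    alpha: float = 0.2           # smoothing factor for the moving average
    grow_above: float = 0.8      # grow the budget above this acceptance rate
    shrink_below: float = 0.5    # shrink the budget below this acceptance rate
    budget: int = 4              # current number of draft tokens to propose
    acceptance_ema: float = 0.65 # smoothed acceptance rate, neutral start

    def update(self, proposed: int, accepted: int) -> int:
        """Record one verification step and return the next budget."""
        if proposed <= 0:
            return self.budget
        rate = accepted / proposed
        # Exponential moving average of the observed acceptance rate.
        self.acceptance_ema = (
            (1 - self.alpha) * self.acceptance_ema + self.alpha * rate
        )
        if self.acceptance_ema > self.grow_above:
            # Drafts are mostly accepted: speculate further ahead.
            self.budget = min(self.max_tokens, self.budget + 1)
        elif self.acceptance_ema < self.shrink_below:
            # Drafts are mostly rejected: stop wasting draft compute.
            self.budget = max(self.min_tokens, self.budget - 1)
        return self.budget
```

In a serving loop, update would be called once per verification step with the number of draft tokens proposed and the number the target model accepted; the returned value is the draft length for the next step. A fixed-budget baseline corresponds to never calling update.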
Immutable audit logs as a compliance primitive in agentic systems
James Osei, Fatima Al-Rashid, Sophie Marchand
We argue that append-only, cryptographically verifiable audit logs should be treated as a first-class primitive in the design of agentic AI systems, not a post-hoc compliance feature. We demonstrate that agents instrumented with immutable logging exhibit better-aligned behaviour under adversarial prompts and provide a reference architecture implemented in production at AIRMY.
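One common way to build an append-only, verifiable log is a hash chain, where each entry commits to the hash of the entry before it. The Python sketch below illustrates that general pattern using only the standard library. It is not the AIRMY reference architecture: the names AuditLog, verify, and GENESIS, and the entry layout, are hypothetical.

```python
import copy
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding of (previous hash, payload)."""
    body = json.dumps(
        {"prev": prev_hash, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class AuditLog:
    """Hypothetical in-memory log that only exposes append and read."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, payload: dict) -> str:
        """Add an entry chained to the previous one and return its hash."""
        prev = self._entries[-1]["hash"] if self._entries else GENESIS
        entry_hash = _digest(prev, payload)
        self._entries.append(
            {"prev": prev, "payload": payload, "hash": entry_hash}
        )
        return entry_hash

    def entries(self) -> list[dict]:
        """Return a deep copy so callers cannot mutate stored entries."""
        return copy.deepcopy(self._entries)


def verify(entries: list[dict]) -> bool:
    """Recompute the chain and report whether every link is intact."""
    prev = GENESIS
    for entry in entries:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != _digest(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

An agent runtime could call append for each tool call or decision and run verify during an audit. A chain like this is tamper-evident rather than tamper-proof: anyone able to rewrite the whole log can recompute every hash, so a real deployment would also need to anchor the latest hash somewhere the agent cannot modify.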
Emergent coordination in distributed multi-agent task graphs
Yuki Tanaka, Remi Okonkwo, Priya Nair
We study coordination behaviour in large networks of specialised agents assigned to decomposed subtasks of complex enterprise workflows. We identify three distinct coordination regimes (sequential, parallel, and negotiated) that emerge without explicit inter-agent communication protocols, and characterise the workload properties that favour each regime.
/ Open Source
Open source tools.
We open-source our SDK along with the evaluation and benchmarking infrastructure we use internally. Built for reproducibility.
airmy/airmy-sdk
The official AIRMY Python and TypeScript SDK. Includes typed clients for all API endpoints, retry logic, and streaming support.
airmy/airmy-evals
Open evaluation framework for benchmarking agent accuracy, latency, and safety across standardised task suites.
airmy/airmy-bench
Reproducible benchmarks for agent inference infrastructure. Includes datasets, scoring harnesses, and leaderboard tooling.
/ Research Team
Who we are.
We're a team of 12 researchers with backgrounds spanning systems ML, alignment research, and distributed systems.
Alumni of: Anthropic, DeepMind, MIT CSAIL, Stanford AI Lab, CMU.
Join the Research Team
/ Stay Updated
Research, shipped to your inbox.
Get new papers, benchmarks, and open-source releases as they happen.