/ AIRMY Research

Pushing the frontier of production AI.

AIRMY Research publishes work on inference optimisation, agent safety, multi-agent systems, and the infrastructure challenges of deploying AI at scale.

/ Research Areas

Where we focus.

Our research is grounded in the hard problems we encounter running AI infrastructure at production scale.

Inference Optimisation

Speculative decoding, KV-cache management, and batching strategies for sub-50ms P99 latency at scale.

Agent Safety & Alignment

Formalising safety constraints for autonomous agents, detection of goal misgeneralisation, and sandboxed execution environments.

Multi-Agent Orchestration

Coordination protocols for heterogeneous agent networks, task decomposition, and emergent behaviour in large agent graphs.

Context & Memory

Efficient long-context architectures, persistent memory systems, and retrieval-augmented generation for production agents.

Human-Agent Collaboration

Handoff patterns, oversight mechanisms, and trust calibration between human operators and autonomous systems.

Evaluation & Benchmarking

Reproducible evaluation frameworks, domain-specific benchmarks, and capability elicitation methodologies.

/ Publications

Recent papers.

NeurIPS 2025 Workshop on Efficient AI Systems · December 2025

Sub-50ms inference through speculative decoding in production agent networks

Marcus Thorn, Yuki Tanaka, Priya Nair

We present a production-validated approach combining speculative decoding with dynamic KV-cache eviction that achieves median inference latency of 48ms across heterogeneous agent workloads at AIRMY scale (10B+ monthly calls). Our method introduces an adaptive speculation budget that adjusts based on real-time token acceptance rates, reducing wasted compute by 34% compared to fixed-budget baselines.
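To make the idea of an adaptive speculation budget concrete, here is a minimal sketch, not the method from the paper. It assumes the controller tracks an exponential moving average of the draft-token acceptance rate and moves the budget up or down by one token when that average crosses a threshold. The class name, thresholds, and smoothing factor are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveSpeculationBudget:
    """Hypothetical controller for how many draft tokens to speculate per step.

    All defaults below are illustrative assumptions, not values from the paper.
    """

    min_budget: int = 1
    max_budget: int = 8
    budget: int = 4
    smoothing: float = 0.2       # EMA weight given to the newest observation
    raise_above: float = 0.8     # grow the budget when the EMA exceeds this
    lower_below: float = 0.5     # shrink the budget when the EMA falls below this
    acceptance_ema: float = 0.65  # neutral starting estimate between thresholds

    def update(self, proposed: int, accepted: int) -> int:
        """Record one verification step and return the budget for the next step."""
        if proposed <= 0 or not 0 <= accepted <= proposed:
            raise ValueError("need proposed > 0 and 0 <= accepted <= proposed")

        # Blend the latest acceptance rate into the running average.
        rate = accepted / proposed
        self.acceptance_ema += self.smoothing * (rate - self.acceptance_ema)

        # High acceptance: speculation is paying off, so draft more tokens.
        if self.acceptance_ema > self.raise_above:
            self.budget = min(self.budget + 1, self.max_budget)
        # Low acceptance: drafted tokens are mostly wasted, so draft fewer.
        elif self.acceptance_ema < self.lower_below:
            self.budget = max(self.budget - 1, self.min_budget)
        return self.budget
```

In a serving loop, one would call `update(proposed, accepted)` after each verification pass and use the returned value as the draft length for the next pass. A fixed-budget baseline corresponds to never calling `update`.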

Inference Optimisation · Production Systems

Read paper

ICLR 2026 Workshop on Trustworthy AI · January 2026

Immutable audit logs as a compliance primitive in agentic systems

James Osei, Fatima Al-Rashid, Sophie Marchand

We argue that append-only, cryptographically verifiable audit logs should be treated as a first-class primitive in the design of agentic AI systems rather than as a post-hoc compliance feature. We demonstrate that agents instrumented with immutable logging exhibit better-aligned behaviour under adversarial prompts, and we provide a reference architecture implemented in production at AIRMY.
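One common way to make an append-only log cryptographically verifiable is a hash chain, where each entry commits to the hash of the one before it. The sketch below illustrates that general technique using only the Python standard library. It is not the reference architecture from the paper, and the `AuditLog` class and its fields are hypothetical names chosen for this example.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, body: str) -> str:
    """SHA-256 over the previous entry's hash followed by this entry's body."""
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    """Hypothetical in-memory, hash-chained audit log (tamper-evident only)."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, payload: dict) -> str:
        """Append a JSON-serialisable event and return its chained hash."""
        # Canonical serialisation so the same payload always hashes the same.
        body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        entry_hash = _digest(prev_hash, body)
        self._entries.append({"prev": prev_hash, "body": body, "hash": entry_hash})
        return entry_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited, dropped, or reordered entry fails."""
        prev_hash = GENESIS_HASH
        for entry in self._entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != _digest(prev_hash, entry["body"]):
                return False
            prev_hash = entry["hash"]
        return True
```

A chain like this only detects tampering. It does not prevent it, and an attacker who can rewrite the whole log can rebuild every hash. A production design would typically also sign or externally anchor the latest hash, which this sketch leaves out.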

Agent Safety · Compliance

Read paper

ICML 2026 (to appear) · March 2026

Emergent coordination in distributed multi-agent task graphs

Yuki Tanaka, Remi Okonkwo, Priya Nair

We study coordination behaviour in large networks of specialised agents assigned to decomposed subtasks of complex enterprise workflows. We identify three distinct coordination regimes (sequential, parallel, and negotiated) that emerge without explicit inter-agent communication protocols, and characterise the workload properties that favour each regime.

Multi-Agent Orchestration · Emergent Behaviour

Read paper

/ Open Source

Open source tools.

We open-source the evaluation and benchmarking infrastructure we use internally. Built for reproducibility.

airmy/airmy-sdk

The official AIRMY Python and TypeScript SDK. Includes typed clients for all API endpoints, retry logic, and streaming support.

Python / TypeScript · ★ 4.2k · 2 days ago

airmy/airmy-evals

Open evaluation framework for benchmarking agent accuracy, latency, and safety across standardised task suites.

Python · ★ 1.8k · 1 week ago

airmy/airmy-bench

Reproducible benchmarks for agent inference infrastructure. Includes datasets, scoring harnesses, and leaderboard tooling.

Python · ★ 890 · 3 weeks ago

/ Research Team

Who we are.

We're a team of 12 researchers with backgrounds spanning systems ML, alignment research, and distributed systems.

Alumni of: Anthropic, DeepMind, MIT CSAIL, Stanford AI Lab, CMU.

Join the Research Team

/ Stay Updated

Research, shipped to your inbox.

Get new papers, benchmarks, and open-source releases as they happen.