CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

Sugiura, Issa; Hattori, Daichi; Araragi, Kazuo; Ogawa, Keita; Onose, Shota; Makino, Taro; Usuki, Teppei; Ishida, Takashi

CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

Sakana AI, in collaboration with KPMG AZSA LLC, introduces CoffeeBench, a benchmark for evaluating how much net income an LLM agent can generate as a coffee roaster over 90 days in a multi-agent economy with two farmers, two roasters, and two retailers.

Authors

Issa Sugiura

Sakana AI

Daichi Hattori

KPMG AZSA LLC

Kazuo Araragi

KPMG AZSA LLC

Keita Ogawa

KPMG AZSA LLC

Shota Onose

KPMG AZSA LLC

Taro Makino

Sakana AI

Teppei Usuki

KPMG AZSA LLC

Takashi Ishida

Sakana AI

Introduction

As large language models (LLMs) continue to advance, their applications have expanded to long-horizon tasks requiring sequential decision-making, long-term planning, and interaction with environments. Coding agents and web-use agents are prominent examples that iteratively interact with external environments to solve complex tasks.

Business management is a promising domain for LLM agents, requiring long-horizon decision-making that may span months or even years. To push in this direction, prior work such as Vending-Bench evaluates whether an LLM agent can operate a vending-machine business over an extended period and generate profit. Subsequent work has moved toward increasingly realistic settings: Vending-Bench Arena introduces competition between multiple vending-machine agents, while Project Vend deploys a physical vending machine in an office environment to conduct real-world experiments.

However, these settings are limited to a single type of firm, whereas real-world economies involve multiple firms of different roles operating simultaneously, each acting autonomously to optimize its own objectives, forming complex structures such as supply chains. Evaluating LLM agents in such multi-agent environments is therefore a natural and important next step.

To this end, we introduce CoffeeBench, a benchmark for evaluating how much net income an LLM agent can generate as a coffee roaster over 90 days in a multi-agent economy with two farmers, two roasters, and two retailers. Each agent operates autonomously to maximize cumulative net income through communication and transactions with other agents, while managing cash, inventory, and pricing. Each run requires hundreds to thousands of tool calls, demanding long-horizon planning and decision-making.

CoffeeBench

CoffeeBench is designed as a multi-agent extension of Vending-Bench, simulating a B2B coffee supply chain with two farmers, two roasters, and two retailers, for six firms in total. Each firm is operated by an LLM agent, and the simulation runs for 90 simulated days with the objective of maximizing net income.

Diagram of the daily timeline in CoffeeBench: agents act during business hours, wait for the next day, and consumer sales are simulated at retailers each morning. — **Figure 2.** Timeline of the simulation. Agents take actions during business hours. Between the end of the day and the next morning, the environment simulates updates such as consumer sales at retailers, and agents start the next day with the results reflected.

Every agent operates as a ReAct agent and interacts with the environment through tool calls. The shared toolset includes sending messages to counterparties, placing orders, and paying invoices. In addition, each role is provided with role-specific tools as follows.

Farmers: produce_item() - produce coffee beans.
Roasters: roast() - roast green beans into roasted product.
Retailers: set_retail_price() - set retail prices.

When an agent has nothing left to do, it calls wait_for_next_day() to wait until the next day. Once every agent is in this waiting state, the simulation transitions to the next day. Between the end of the day and the next morning, the environment simulates consumer sales at the retailers, along with other updates such as operating costs and spoilage, and each agent starts the next morning with these results reflected.

To bring the simulated environment closer to real-world business, we also include features such as invoice-based deferred payment that is common in B2B trade.

In this environment, agents must maximize profit while taking into account factors such as cash balance, inventory levels, relationships with counterparties, and future demand. Because fixed costs are incurred every day, agents will eventually become unprofitable if they remain inactive. As a result, they must continuously make business decisions that balance cash flow management with profit generation.

Experiments

We evaluated a range of LLMs on CoffeeBench, including GPT-5.5, Claude Opus 4.7, Claude Haiku 4.5, Gemini 3.1 Pro, and Kimi K2.6. In each run, the evaluated model operates Roaster A, and we compare its net income at the end of the 90-day simulation. To keep conditions comparable, the five firms other than Roaster A are all operated by Claude Sonnet 4.6 as a fixed baseline. To account for run-to-run variance, we ran each model three times and averaged the results. We additionally include a passive baseline agent that only calls wait_for_next_day().

Results

Business performance varies across models

Animated time series of cumulative net income for Roaster A across evaluated models over the 90-day horizon. — **Figure 3.** Cumulative net income of Roaster A under evaluation, across models, over the 90-day horizon.

Open the viewer

The experimental results showed that all models outperformed the passive baseline, which continuously called wait_for_next_day() without taking any other actions, demonstrating that LLM agents can actively make business decisions in this environment. At the same time, we observed substantial performance differences across models. While most models steadily increased their profits over time, Claude Haiku 4.5 ultimately operated at a loss.

Top-performing models act proactively

Time series of business KPIs (cash, inventory, AR/AP, sales, etc.) for Roaster A across evaluated models. — **Figure 4.** Time-series of business KPIs for Roaster A under evaluation.

Distribution of tool calls issued by Roaster A under evaluation across the 90-day horizon, broken down by tool type. — **Figure 5.** Distribution of tool calls issued by Roaster A under evaluation, broken down by tool type.

Looking at the behavioral patterns, the top-performing models share a common signature.

Distribution of messages sent by Roaster A under evaluation, broken down by recipient role. — **Figure 6.** Distribution of messages sent by Roaster A under evaluation, broken down by recipient.

High-performing models actively communicated with both farmers and retailers, frequently engaging in profit-oriented actions such as price negotiations and promotional activities. In contrast, lower-performing models tended to behave more passively and communicated less frequently with other agents. These results suggest that proactive behavior and inter-agent communication contribute to profitability.

However, simply taking more actions was not sufficient for success. For example, although Kimi K2.6 made a comparable number of tool calls to the top-performing models, it failed to achieve strong profits. This suggests that what matters is not the sheer number of tool invocations, but whether they are directed toward profit-relevant actions such as executing transactions (make_offer() / accept_offer()) and negotiating prices.

Interestingly, Gemini 3.1 Pro exhibited a different behavioral pattern. While it sent fewer messages than GPT-5.5 or Claude Opus 4.7, it frequently read incoming messages, indicating a more reactive management style that responded to counterparties’ actions rather than proactively initiating interactions. These results demonstrate that CoffeeBench enables analysis not only of final business performance, but also of the behavioral characteristics underlying agents’ decision-making processes.

The idle-drift failure mode in Claude Haiku 4.5

Action log of Claude Haiku 4.5 showing the idle-drift failure mode: from a certain day onward the model issues only wait_for_next_day(). — **Figure 7.** Action log of Claude Haiku 4.5. From a few dozen days into the run, the model stops taking real economic actions and only issues `wait_for_next_day()`.

One interesting phenomenon was observed with Claude Haiku 4.5, which achieved the worst overall performance. Midway through the simulation, the agent stopped participating in economic activity, repeatedly incurring fixed costs and eventually operating at a loss. Inspection of the reasoning traces revealed that Claude Haiku 4.5 continued to analyze the situation and formulate plans, such as recognizing opportunities to procure beans cheaply from farmers or anticipating increased demand from retailers. However, despite these internal deliberations, the agent repeatedly chose to call wait_for_next_day() rather than executing any concrete actions.

This phenomenon was consistently observed across all three runs of Claude Haiku 4.5, while no similar behavior was observed in the other models. Potential explanations of this phenomenon include behavioral shifts induced by long-context accumulation and overly conservative action selection, possibly driven by implicit concerns about token budget consumption. Understanding the mechanisms underlying this phenomenon is an important direction for future work.

Exploratory stress test with revenue-maximizing incentives

So far, we have focused on how agents behave when optimizing for long-term profit. In real-world business settings, however, strong performance pressure such as the need to meet aggressive revenue targets can distort decision-making and sometimes lead to accounting misconduct, including practices such as circular transactions or channel stuffing that artificially inflate revenue. We believe CoffeeBench could also serve as an experimental framework for studying the mechanisms through which such undesirable behaviors emerge.

As a first step in this direction, we conducted a preliminary experiment in which the agents’ KPI was changed from net profit to revenue, and agents were assigned revenue targets that would normally be difficult to achieve. We additionally prompted the agents with strong instructions to achieve the revenue target at all costs.

Despite this pressure, we did not observe any collusive behaviors resembling circular transactions in which agents coordinated to artificially inflate revenue. One possible explanation is that the agents simply did not recognize that such strategies were available to them. However, as models continue to improve in long-horizon planning and multi-agent coordination, future agents may eventually discover and exploit such undesirable strategies on their own. Understanding what kinds of behaviors emerge in such settings, and how to mitigate them, is one of the long-term research directions enabled by CoffeeBench.

Conclusion

We introduced CoffeeBench, a new benchmark for evaluating the long-horizon business capabilities of LLM agents in multi-agent economic environments. Our experiments revealed substantial differences across models in both business performance and behavioral patterns. In particular, high-performing models actively engaged in negotiation and communication, while some models exhibited long-horizon failure modes in which they continued reasoning about the situation but failed to take concrete actions. These findings highlight the importance of evaluating LLM agents in realistic multi-agent environments that involve long-term decision-making and inter-agent communication. We hope CoffeeBench will serve as a step toward understanding and designing future societies in which AI agents participate in economic activity.

Citation

For attribution in academic contexts, please cite this work as:

Sugiura, Hattori, Araragi, Ogawa, Onose, Makino, Usuki, & Ishida.
"CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies", 2026.

BibTeX:

@misc{sugiura2026coffeebenchbenchmarkinglonghorizonllm,
  title  = {CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies},
  author = {Issa Sugiura and Daichi Hattori and Kazuo Araragi and Keita Ogawa and Shota Onose and Taro Makino and Teppei Usuki and Takashi Ishida},
  year={2026},
  eprint={2606.16613},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2606.16613},
}