AppWorld-UL

Benchmarking Diverse Agent-User Interactions for Tool-Use

ICML'26

Junzhi Chen*, Harsh Trivedi*, Jane Pan,
Michael Zhang, Tejas Srinivasan, Niranjan Balasubramanian, Ashish Sabharwal

Contact: szjiozi1130@gmail.com, harshjtrivedi94@gmail.com

Visit AppWorld for the original AppWorld project.

About

📌 TLDR

We introduce AppWorld-UL, a challenging benchmark of 516 user-in-the-loop tool-use tasks that require agents to ask clarifying questions, seek confirmation, or handle infeasible instructions in realistic app environments.

✨ Abstract

Tool-use agents that address day-to-day digital tasks such as ordering groceries must not only operate applications but also interact with the user, e.g., to ask clarifying questions, prompt for confirmation, and inform the user when an instruction is infeasible. However, current benchmarks for evaluating agent-user interactions do not capture the diversity of such interactions. Further, they operate in small environments with few, often non-state-changing, APIs. To address this gap, we introduce AppWorld-UL, a "user-in-the-loop" benchmark of 516 challenging tasks requiring diverse agent-user interactions. Building upon the AppWorld framework, which simulates 9 popular apps such as Amazon and Spotify, we systematically modify the original tasks to introduce ambiguities and constraints that necessitate various types of agent-user interaction. User behavior is simulated by an LLM prompted to respond within carefully designed knowledge boundaries, offering more reliable simulation than the unconstrained or overly rigid alternatives used in prior work. Our evaluation reveals that a state-of-the-art LLM, Claude Opus 4.7, achieves only 48.6% success on AppWorld-UL and 35.7% on the harder, compositional subset. On the stricter, scenario-level metric, compositional task performance drops to 21.3%. Our analysis further shows that correct user interaction is crucial for success. Together, these results demonstrate the benchmark's difficulty and its potential to advance research on user-in-the-loop tool-use agents.

Citation

@inproceedings{appworld-ul-icml26,
  title={App{W}orld-{UL}: Benchmarking Diverse Agent-User Interactions for Tool-Use},
  author={Junzhi Chen and Harsh Trivedi and Jane Pan and Michael JQ Zhang and Tejas Srinivasan and Niranjan Balasubramanian and Ashish Sabharwal},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=cUXV9vtDXd}
}