
AppWorld

A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

🏆 ACL'24 Best Resource Paper 🏆

Harsh Trivedi, Tushar Khot, Mareike Hartmann,
Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta,
Ashish Sabharwal, Niranjan Balasubramanian

Contact: hjtrivedi@cs.stonybrook.edu

🚨 Harsh is giving talks on AppWorld at several venues. If you are nearby, consider joining, and if your group would like a talk as well, please reach out!
🎉 We have added the ACL videos and poster to the website.

About

📌 TLDR

We introduce the AppWorld Engine, a high-fidelity execution environment of 9 day-to-day apps, operable via 457 APIs and populated with the digital activities of 106 people living in a simulated world, together with an associated benchmark of natural, diverse, and challenging autonomous agent tasks requiring rich and interactive coding. The tasks support robust programmatic evaluation via state- and execution-based unit tests.
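For a concrete feel of the interaction loop, here is a minimal, hypothetical sketch of driving the environment from Python. The names used here (AppWorld, load_task_ids, world.execute, world.evaluate, and the "dev" split) are illustrative assumptions, not confirmed API:

# Hypothetical sketch of the agent-environment loop; the names below
# (AppWorld, load_task_ids, world.execute, world.evaluate) are assumptions.
from appworld import AppWorld, load_task_ids

for task_id in load_task_ids("dev"):  # assumed split name
    with AppWorld(task_id=task_id, experiment_name="my_agent") as world:
        print(world.task.instruction)  # natural-language task description
        # The agent iteratively writes code that calls app APIs and reads
        # the execution output before deciding what to write next.
        output = world.execute("print(apis.api_docs.show_app_descriptions())")
        # ... the agent keeps executing code until it deems the task done ...
        report = world.evaluate()  # runs the state/execution-based unit tests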

✨ Abstract

[Figures: AppWorld Engine overview and AppWorld Benchmark overview]

Autonomous agents that handle day-to-day digital tasks (e.g., ordering groceries for a household) must not only operate multiple apps (e.g., notes, messaging, and shopping apps) via APIs, but also generate rich code with complex control flow iteratively, based on their interaction with the environment. However, existing benchmarks for tool use are inadequate because they only cover tasks requiring a simple sequence of API calls.

To remedy this gap, we built AppWorld Engine, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created AppWorld Benchmark (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes, i.e., collateral damage. The state-of-the-art LLM, GPT-4o, solves only ~49% of our "normal" tasks and ~30% of "challenge" tasks, while other models solve at least 16% fewer. This highlights the benchmark's difficulty and AppWorld's potential to push the frontiers of interactive coding agents.
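To make "rich and interactive code generation" concrete, a grocery task like the one above could require agent-written code along these lines. The apps and API signatures below are invented purely for illustration; real AppWorld apps and APIs differ:

# Hypothetical agent-written task code: order the groceries listed in a
# note, skipping anything already bought. The `apis` object is assumed to
# be provided by the execution environment; all apps and API signatures
# here are invented for illustration only.
note = apis.simple_note.search_notes(query="grocery list")[0]
already_bought = {order["item"] for order in apis.shopping.list_recent_orders()}

for line in note["content"].splitlines():
    item = line.strip("- ").lower()
    if not item or item in already_bought:
        continue  # control flow like this is what simple API-chain benchmarks miss
    matches = apis.shopping.search_products(query=item)
    if matches:
        apis.shopping.add_to_cart(product_id=matches[0]["id"], quantity=1)

apis.shopping.checkout(payment_method="default")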

⚡ Lightning (3-minute) Talk

Here is the short talk we gave at the ACL'24 award session (available on YouTube). Longer talks are also available.

Citation

@inproceedings{appworld-acl24,
  title={App{W}orld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents},
  author={Harsh Trivedi and Tushar Khot and Mareike Hartmann and Ruskin Manku and Vinty Dong and Edward Li and Shashank Gupta and Ashish Sabharwal and Niranjan Balasubramanian},
  booktitle={ACL},
  year={2024}
}