AppWorld Talks

AppWorld: Reliable Evaluation of Interactive Agents in a World of Apps and People

✨ Abstract

We envision a world where AI agents (assistants) are widely used for complex tasks in our digital and physical worlds and are broadly integrated into our society. To move towards such a future, we need an environment for robust evaluation of agents' capability, reliability, and trustworthiness.

In this talk, I'll introduce AppWorld, which is a step towards this goal in the context of day-to-day digital tasks. AppWorld is a high-fidelity simulated world of people and their digital activities on nine apps like Amazon, Gmail, and Venmo. On top of this fully controllable world, we build a benchmark of complex day-to-day tasks such as splitting Venmo bills with roommates, which agents have to solve via interactive coding and API calls. One of the fundamental challenges with complex tasks lies in accounting for different ways in which the tasks can be completed. I will describe how we address this challenge using a reliable and programmatic evaluation framework. Our benchmarking evaluations show that even the best LLMs, like GPT-4o, can only solve ~30% of such tasks, highlighting the challenging nature of the AppWorld benchmark.

I will conclude by laying out exciting future research that can be conducted on the foundation of AppWorld, such as benchmarks and playgrounds for developing multimodal, collaborative, safe, socially intelligent, resourceful, and fail-tolerant agents.

🗣️ Speaker

Harsh Trivedi (bio, picture)

📅 Schedule

#	Date	Time	Institute	Location
1	27 September	2:30 pm EST	Reading Group at University of Waterloo	remote
2	30 September	12:00 pm EST	CLunch Seminar at University of Pennsylvania	remote
3	4 October	1:00 pm PST	University of Southern California	in-person
4	7 October	1:00 pm PST	University of California Irvine	in-person
5	10 October	11:00 am PST	Stanford University	in-person
6	11 October	10:30 am PST	University of California Berkeley	in-person
7	15 October	4:00 pm PST	University of California Santa Barbara	in-person
8	17 October	2:00 pm PST	AI Seminar at University of California San Diego	in-person
9	18 October	1:00 pm PST	Salesforce	remote
10	23 October	3:30 pm EST	Johns Hopkins University	remote
11	24 October	3:00 pm EST	NLP Seminar at Columbia University	in-person
12	25 October	12:30 EST	PLI Seminar at Princeton	in-person
13	29 October	1:30 PST	University of California Santa Cruz	remote
14	1 November	11:15 PST	Allen Institute for AI	in-person
15	6 November	12:00 pm CST	Apple	remote
16	7 November	9:00 am PST	Google	remote
17	7 November	3:00 pm EST	University of North Carolina Chapel Hill	remote
18	8 November	1:00 pm EST	New York University	in-person
19	15 November	1:00 pm EST	Semantic Machines	remote
20	2 December	9:00 pm GMT+5	Cohere for AI: community talks	remote
21	10 December	7:00 pm BST	Camel AI	remote
22	11 December	10:00 am CET	Lamarr NLP colloquium at University of Bonn	remote
23	7 May	12:00 pm PST	Stanford University (Ludwig Schmidt's group)	remote
