
Talks

🚨 If you would like a Zoom/remote link to any of the talks, email me (Harsh) and I will see whether it's possible.

AppWorld: Reliable Evaluation of Interactive Agents in a World of Apps and People

✨ Abstract

We envision a world where AI agents (assistants) are widely used for complex tasks in our digital and physical worlds and are broadly integrated into our society. To move towards such a future, we need an environment for a robust evaluation of agents' capability, reliability, and trustworthiness.

In this talk, I'll introduce AppWorld, which is a step towards this goal in the context of day-to-day digital tasks. AppWorld is a high-fidelity simulated world of people and their digital activities on nine apps like Amazon, Gmail, and Venmo. On top of this fully controllable world, we build a benchmark of complex day-to-day tasks such as splitting Venmo bills with roommates, which agents have to solve via interactive coding and API calls. One of the fundamental challenges with complex tasks lies in accounting for different ways in which the tasks can be completed. I will describe how we address this challenge using a reliable and programmatic evaluation framework. Our benchmarking evaluations show that even the best LLMs, like GPT-4o, can only solve ~30% of such tasks, highlighting the challenging nature of the AppWorld benchmark.

I will conclude by laying out exciting future research that can be built on the foundation of AppWorld, such as benchmarks and playgrounds for developing multimodal, collaborative, safe, socially intelligent, resourceful, and fail-tolerant agents.

🗣️ Speaker

Harsh Trivedi (bio, picture)

📅 Schedule

| # | Date | Time | Institute | Location |
|---|------|------|-----------|----------|
| 1 | 27 September | 2:30 pm EST | Reading Group at University of Waterloo | remote |
| 2 | 30 September | 12:00 pm EST | CLunch Seminar at University of Pennsylvania | remote |
| 3 | 4 October | 1:00 pm PST | University of Southern California | in-person |
| 4 | 7 October | 1:00 pm PST | University of California Irvine | in-person |
| 5 | 10 October | 11:00 am PST | Stanford University | in-person |
| 6 | 11 October | 10:30 am PST | University of California Berkeley | in-person |
| 7 | 15 October | 4:00 pm PST | University of California Santa Barbara | in-person |
| 8 | 17 October | 2:00 pm PST | AI Seminar at UCSD | in-person |
| 9 | 18 October | 1:00 pm PST | Salesforce | remote |
| 10 | 23 October | 3:30 pm EST | Johns Hopkins | remote |
| 11 | 24 October | 3:00 pm EST | NLP Seminar at Columbia University | in-person |
| 12 | 25 October | 12:30 EST | PLI Seminar at Princeton | in-person |
| 13 | 29 October | 1:30 PST | University of California Santa Cruz | in-person |
| 14 | 1 November | 11:15 PST | Allen Institute for AI | in-person |
| 15 | 6 November | 12:00 pm CST | Apple | remote |
| 16 | 7 November | 9:00 am PST | Google | remote |
| 17 | 7 November | 3:00 pm EST | University of North Carolina Chapel Hill | remote |
| 18 | 8 November | 1:00 pm EST | New York University | in-person |
| 19 | 15 November | 1:00 pm EST | Semantic Machines | remote |
| 20 | 10 December | 7:00 pm BST | Camel AI | remote |