We envision a world where AI agents (assistants) are widely used for complex tasks in our digital and physical worlds and are broadly integrated into our society. To move toward such a future, we need environments for robust evaluation of agents' capabilities, reliability, and trustworthiness.
In this talk, I'll introduce AppWorld, a step toward this goal in the context of day-to-day digital tasks. AppWorld is a high-fidelity simulated world of people and their digital activities on nine apps like Amazon, Gmail, and Venmo. On top of this fully controllable world, we build a benchmark of complex day-to-day tasks, such as splitting Venmo bills with roommates, which agents must solve via interactive coding and API calls. A fundamental challenge with such complex tasks is accounting for the many different ways in which they can be completed. I will describe how we address this challenge with a reliable and programmatic evaluation framework. Our benchmarking shows that even the best LLMs, like GPT-4o, can solve only ~30% of such tasks, highlighting the challenging nature of the AppWorld benchmark.
I will conclude by laying out exciting future research that can be built on the foundation of AppWorld, such as benchmarks and playgrounds for developing multimodal, collaborative, safe, socially intelligent, resourceful, and fault-tolerant agents.
Harsh Trivedi
| # | Date | Time | Institute | Location |
|---|------|------|-----------|----------|
| 1 | 27 September | 2:30 pm EST | Reading Group at University of Waterloo | remote |
| 2 | 30 September | 12:00 pm EST | CLunch Seminar at University of Pennsylvania | remote |
| 3 | 4 October | 1:00 pm PST | University of Southern California | in-person |
| 4 | 7 October | 1:00 pm PST | University of California Irvine | in-person |
| 5 | 10 October | 11:00 am PST | Stanford University | in-person |
| 6 | 11 October | 10:30 am PST | University of California Berkeley | in-person |
| 7 | 15 October | 4:00 pm PST | University of California Santa Barbara | in-person |
| 8 | 17 October | 2:00 pm PST | AI Seminar at University of California San Diego | in-person |
| 9 | 18 October | 1:00 pm PST | Salesforce | remote |
| 10 | 23 October | 3:30 pm EST | Johns Hopkins University | remote |
| 11 | 24 October | 3:00 pm EST | NLP Seminar at Columbia University | in-person |
| 12 | 25 October | 12:30 pm EST | PLI Seminar at Princeton | in-person |
| 13 | 29 October | 1:30 pm PST | University of California Santa Cruz | remote |
| 14 | 1 November | 11:15 am PST | Allen Institute for AI | in-person |
| 15 | 6 November | 12:00 pm CST | Apple | remote |
| 16 | 7 November | 9:00 am PST | | remote |
| 17 | 7 November | 3:00 pm EST | University of North Carolina Chapel Hill | remote |
| 18 | 8 November | 1:00 pm EST | New York University | in-person |
| 19 | 15 November | 1:00 pm EST | Semantic Machines | remote |
| 20 | 2 December | 9:00 pm GMT+5 | Cohere for AI: community talks | remote |
| 21 | 10 December | 7:00 pm BST | Camel AI | remote |
| 22 | 11 December | 10:00 am CET | Lamarr NLP colloquium at University of Bonn | remote |