WebArena

The shift from hand-crafted benchmarks to auto-generated simulation environments is collapsing the cost of agent evaluation, and it is exposing how far even the strongest models still lag behind humans.