A new benchmark runs AI agents on 153 real websites. The best model scores 33%. GPT-5.4 scores 6.5%. The gap between sandbox results and the live web is brutal.