OpenAI releases GPT-5.4 — surpasses human baseline on OS computer use benchmark, 1M context
OpenAI released GPT-5.4 on March 4. The model scores 75.0% on OSWorld-Verified, a benchmark that measures how effectively an AI navigates a real desktop OS; the human baseline on the same benchmark is 72.4%, a margin of 2.6 percentage points. By this measure, it is the first time an AI model has exceeded average human performance at operating a computer. GPT-5.4 supports native OS-level computer use, which lets it interact with any desktop application through a visual interface without custom integrations. The API version supports a 1 million token context window. The release also includes write actions for Google Workspace and Microsoft 365 apps: scheduling meetings, drafting documents, and creating spreadsheets directly from chat.
Surpassing the human baseline on a desktop computer use benchmark is a concrete capability threshold, not a result on a benchmark constructed to look favorable. OS-level computer use without custom integrations means GPT-5.4 can operate legacy enterprise software, internal tools, and any other GUI application. The need for a custom integration layer is what has kept AI agents out of real enterprise workflows. The 1M context and the Workspace and Microsoft 365 write actions are each significant on their own; combined with computer use, they position GPT-5.4 as the broadest-capability model available. OpenAI crossed $25B in annualized revenue the same week and is pursuing an IPO as early as late 2026.