Researchers from Zhejiang University and OPPO AI Center have published the most comprehensive survey to date of “OS Agents”—AI systems that can autonomously control computers, mobile phones, and web browsers by directly interacting with their interfaces. The 30-page academic review, accepted for publication at the Association for Computational Linguistics conference, comes as major tech companies including OpenAI, Anthropic, Apple, and Google race to deploy AI agents capable of performing complex digital tasks, while highlighting significant security vulnerabilities that most organizations aren’t prepared to address.
The big picture: This technology represents a fundamental shift toward AI systems that can genuinely understand and manipulate the digital world like humans do, moving beyond simple chatbots to agents that can execute multi-step workflows across different applications.
- Over 60 foundation models and 50 agent frameworks have been developed specifically for computer control, with publication rates accelerating dramatically since 2023.
- Current systems work by taking screenshots of computer screens, using computer vision to understand what’s displayed, then executing precise actions like clicking buttons, filling forms, and navigating between applications.
- The most sophisticated systems can handle complex workflows that span different applications—booking a restaurant reservation, adding it to your calendar, then setting a traffic reminder.
Major players racing to market: Tech giants have moved with unprecedented speed to transform academic research into consumer-ready products.
- OpenAI recently launched “Operator,” while Anthropic released “Computer Use” capabilities.
- Apple introduced enhanced AI capabilities in “Apple Intelligence,” and Google unveiled “Project Mariner.”
- All these systems are designed to automate computer interactions by observing screens and executing actions across mobile, desktop, and web platforms.
Security nightmare scenario: The researchers document alarming attack vectors that could compromise enterprise systems in ways traditional security models aren’t designed to handle.
- “Web Indirect Prompt Injection” allows malicious actors to embed hidden instructions in web pages that can hijack an AI agent’s behavior.
- “Environmental injection attacks” use seemingly innocuous web content to trick agents into stealing user data or performing unauthorized actions.
- An AI agent with access to corporate email, financial systems, and customer databases could be manipulated by a carefully crafted web page to exfiltrate sensitive information.
- “Studies on defenses specific to OS Agents remain limited,” creating an immediate challenge for organizations considering deployment.
Performance reality check: Despite the hype, current systems show significant limitations that temper expectations for immediate widespread adoption.
- Success rates vary dramatically across different tasks and platforms, with some commercial systems achieving above 50% success on certain benchmarks while struggling with others.
- Systems excel at simple, well-defined tasks but falter with complex, context-dependent workflows that define much of modern knowledge work.
- They can reliably click buttons or fill standard forms but struggle with tasks requiring sustained reasoning or adaptation to unexpected interface changes.
The personalization challenge: Future OS agents will need to learn from user interactions and adapt to individual preferences over time, presenting both enormous opportunities and privacy risks.
- “A personal assistant is expected to continuously adapt and provide enhanced experiences based on individual user preferences,” the researchers write.
- This capability could create AI agents that learn your email writing style, understand your calendar preferences, and make increasingly sophisticated decisions on your behalf.
- The technical challenges include developing multimodal memory systems that can handle text, images, and voice while avoiding comprehensive surveillance of users’ digital lives.
What they’re saying: The researchers emphasize both the transformative potential and the urgent need for better security frameworks.
- “The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations,” they write. “With the evolution of (multimodal) large language models ((M)LLMs), this dream is closer to reality.”
- “OS Agents can complete tasks autonomously and have the potential to significantly enhance the lives of billions of users worldwide. Imagine a world where tasks such as online shopping, travel arrangements booking, and other daily activities could be seamlessly performed by these agents.”
- The researchers acknowledge that “OS Agents are still in their early stages of development” with “rapid advancements that continue to introduce novel methodologies and applications.”
Study warns of security risks as ‘OS agents’ gain control of computers and phones