A new academic survey has examined the emergence of AI systems that can directly operate computers, smartphones, and web browsers, warning that the same capabilities driving productivity could also expose users and businesses to new security risks.
The 36-page review, produced by Zhejiang University in collaboration with the OPPO AI Center and other institutions, outlines the design, training, and evaluation of so-called “OS agents”: large-language-model-driven assistants capable of controlling devices by interacting with their graphical interfaces. Unlike traditional voice assistants, these agents can observe the screen, interpret interface elements, plan a sequence of actions, and execute them without step-by-step human intervention.
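In outline, that behavior reduces to an observe-plan-act loop. The sketch below is a minimal illustration of the pattern the survey describes, not code from any surveyed system; the screen and planner interfaces, and the Action type, are assumptions made for the example.

```python
# Minimal sketch of the observe-plan-act loop an OS agent runs.
# All names here (Action, screen, planner) are hypothetical
# illustrations, not APIs from any of the surveyed systems.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "scroll"
    target: str        # interface element the action applies to
    text: str = ""     # payload for "type" actions

def run_agent(instruction: str, screen, planner, max_steps: int = 20) -> bool:
    """Drive the GUI until the planner reports the task is done."""
    for _ in range(max_steps):
        observation = screen.capture()                      # screenshot / UI tree
        action = planner.next_action(instruction, observation)
        if action is None:                                  # planner signals completion
            return True
        screen.execute(action)                              # click, type, scroll, ...
    return False                                            # gave up after max_steps
```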

Researchers describe a surge of activity since 2023, with more than 60 foundation models and over 50 agent frameworks now targeting computer control. Major technology firms have begun moving these concepts into commercial products, such as OpenAI’s Operator, Anthropic’s Computer Use, Apple’s enhancements to Apple Intelligence, and Google’s Project Mariner.
How OS Agents Work
The survey details how these agents are built, often combining pre-trained vision-language models with custom components that handle high-resolution interface images and HTML structures. Training pipelines use public datasets, synthetic interaction records, and simulated environments to improve both grounding (the mapping between instructions and on-screen actions) and planning skills.
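As a rough illustration of what grounding involves, the sketch below matches a natural-language step to a candidate on-screen element. The element schema and the token-overlap scoring are stand-ins chosen to keep the example self-contained; the systems the survey covers score candidates with vision-language models instead.

```python
# Hedged sketch of GUI grounding: mapping a natural-language step to a
# concrete on-screen element. Schema and scoring are hypothetical.
from dataclasses import dataclass

@dataclass
class UIElement:
    label: str    # accessibility text or OCR'd caption
    bbox: tuple   # (x, y, width, height) on screen

def ground(step: str, elements: list[UIElement]) -> UIElement:
    """Pick the element whose label best overlaps the instruction step.
    Token overlap stands in for the vision-language scoring real
    systems use; it only keeps this example runnable on its own."""
    step_tokens = set(step.lower().split())
    def overlap(el: UIElement) -> int:
        return len(step_tokens & set(el.label.lower().split()))
    return max(elements, key=overlap)

# e.g. ground("click the Submit button", parsed_elements)
#      -> the UIElement labeled "Submit"
```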
Developers adopt a range of strategies to boost performance, including supervised fine-tuning with curated task sequences and reinforcement learning to improve reliability and error recovery. Frameworks usually include modules for perception, planning, memory, and action execution, with some designs incorporating personalization so the agent can adapt to a user’s habits over time.
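To make the supervised fine-tuning idea concrete, the sketch below shows one plausible way a curated task sequence could be flattened into instruction-to-action training examples. The record fields are illustrative assumptions, not the survey's actual data schema.

```python
# Turning a recorded task trajectory into supervised fine-tuning pairs:
# (task goal + screen state) -> demonstrated action. Field names are
# assumptions for illustration only.
import json

def to_sft_examples(trajectory: list[dict]) -> list[dict]:
    """Each recorded step carries the goal, a screen description, and
    the demonstrated action; the model learns to predict the action."""
    examples = []
    for step in trajectory:
        prompt = (
            f"Task: {step['goal']}\n"
            f"Screen: {step['screen_description']}\n"
            "Next action:"
        )
        examples.append({"prompt": prompt, "completion": step["action"]})
    return examples

if __name__ == "__main__":
    demo = [{"goal": "Archive the newest email",
             "screen_description": "Inbox list; first row selected",
             "action": "click(element='Archive button')"}]
    print(json.dumps(to_sft_examples(demo), indent=2))
```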
Security and Privacy Risks
Because OS agents operate with the access level of their host user, a compromised agent could move through corporate email, databases, and financial records without triggering the same warning signs that might alert a human. Existing AI security guidelines offer only partial coverage, and defenses tailored specifically to OS agents are still limited.
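One commonly discussed mitigation, sketched here as an illustration rather than a defense proposed in the survey, is an allowlist gate between the agent's planner and its executor, so that high-impact actions require explicit user approval. The policy format and action fields below are assumptions.

```python
# Hypothetical allowlist gate between planner and executor: actions
# outside approved apps, or touching sensitive targets, need a
# human-in-the-loop decision before they run.
ALLOWED_APPS = {"calculator", "text_editor"}
SENSITIVE_KEYWORDS = {"password", "bank", "payroll", "database"}

def is_permitted(action: dict) -> bool:
    """Allow only actions in allowlisted apps with non-sensitive targets."""
    if action.get("app") not in ALLOWED_APPS:
        return False
    target = action.get("target", "").lower()
    return not any(word in target for word in SENSITIVE_KEYWORDS)

def execute_with_guard(action: dict, executor, ask_user) -> None:
    if is_permitted(action):
        executor.run(action)
    elif ask_user(f"Agent wants to perform {action}. Allow?"):
        executor.run(action)   # explicit human override
    else:
        raise PermissionError(f"Blocked agent action: {action}")
```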
Looking Ahead
The team maintains an open-source repository to track new models, frameworks, and benchmarks, reflecting a field that is expanding at a pace unusual even for the technology sector. The technology is approaching the point where it can interact with digital environments much as a human user would, and that, the authors suggest, means the window for building adequate safeguards is already narrowing.
Notes: This post was edited/created using GenAI tools.