In a significant advancement in the field of artificial intelligence, Google has officially launched its Gemini 2.5 Computer Use model, a sophisticated AI agent capable of autonomously interacting with websites and applications much like a human user. This development marks Google’s entry into the competitive landscape of AI agents, where companies like OpenAI and Anthropic have already made substantial strides with their respective models. Built on Gemini 2.5 Pro, the Computer Use model is designed to perform a variety of tasks, including clicking buttons, scrolling through pages, filling out forms, and even navigating complex dropdown menus, all initiated from a single text prompt.
The introduction of this model is a pivotal moment in the evolution of large language models (LLMs), which have traditionally been limited to processing and generating text based on structured inputs or APIs. Unlike these conventional systems, Gemini 2.5 Pro Computer Use leverages a virtual browser to visually interpret user interfaces, enabling it to take actions that mimic human behavior. This capability is particularly noteworthy as it represents a shift towards creating more general-purpose AI agents that can operate across various digital environments without requiring explicit programming for each task.
Google DeepMind has fine-tuned Gemini 2.5 Pro specifically for user interface interactions to produce the new model. According to Sundar Pichai, CEO of Google, the ability of this model to engage with the web—such as scrolling, filling out forms, and navigating dropdowns—is an essential step toward building versatile AI agents that can handle a wide range of tasks autonomously. While the model is not available directly to consumers, Google has partnered with Browserbase, a company founded by former Twilio engineer Paul Klein, to provide a platform where users can demo the capabilities of Gemini 2.5 Computer Use.
Browserbase offers a “headless” web browser designed for AI agents, allowing Gemini to interact with web pages without the need for a graphical user interface. Users can access the Gemini 2.5 Computer Use model through Browserbase and even compare its performance against competing models from OpenAI and Anthropic in a newly launched feature called the “Browser Arena.” This side-by-side comparison allows developers and users to evaluate the strengths and weaknesses of different AI agents in real-time.
For developers and AI builders, the Gemini 2.5 Computer Use model is accessible via the Gemini API in Google AI Studio, facilitating rapid prototyping and integration into various applications. Additionally, it can be utilized within Google Cloud’s Vertex AI platform, which provides tools for building and deploying AI applications. This accessibility is crucial for fostering innovation and encouraging the development of new AI-driven solutions across industries.
The capabilities of Gemini 2.5 Pro Computer Use build upon the foundation laid by its predecessor, Gemini 2.5 Pro, which was released earlier in 2025. Since then, the model has undergone several updates aimed at enhancing its ability to perform direct interactions with user interfaces, including both web browsers and mobile applications. The focus on enabling AI agents to complete interface-driven tasks autonomously is evident in the design of the model, which allows for actions such as clicking, typing, scrolling, and filling out forms.
In initial hands-on tests conducted on the Browserbase platform, Gemini 2.5 Computer Use demonstrated its potential by successfully navigating to Taylor Swift’s official website and summarizing the promotional content displayed there. In another test, the model was tasked with searching for highly rated solar lights on Amazon. It adeptly completed a Google Search CAPTCHA, showcasing its ability to handle challenges typically designed to differentiate human users from bots. However, despite its impressive start, the model encountered difficulties in completing the search task, highlighting the ongoing challenges faced by AI agents in fully autonomous operation.
One notable distinction between Gemini 2.5 Computer Use and other AI agents, such as OpenAI’s ChatGPT Agent and Anthropic’s Claude, is the lack of support for local file creation or editing. While those models can generate and modify documents, spreadsheets, and presentations on behalf of users, Gemini 2.5 Computer Use is primarily focused on controlling and navigating web and mobile user interfaces. Its output is limited to suggested UI actions or chatbot-style text responses, necessitating that developers handle any structured output, such as documents or files, separately through custom code or third-party integrations.
Performance benchmarks indicate that Gemini 2.5 Computer Use has achieved leading results in various interface control evaluations, outperforming major competitors like Claude Sonnet and OpenAI’s agent-based models. Evaluations conducted via Browserbase and Google’s internal testing revealed that Gemini 2.5 scored 65.7% on the Online-Mind2Web benchmark, compared to 61.0% for Claude Sonnet 4 and 44.3% for the OpenAI Agent. Similarly, in the WebVoyager benchmark, Gemini 2.5 achieved a score of 79.9%, surpassing Claude Sonnet 4’s 69.4% and the OpenAI Agent’s 61.0%. These results underscore the model’s effectiveness in executing interface-driven tasks with high accuracy.
In addition to its strong performance metrics, Google reports that Gemini 2.5 Computer Use operates with lower latency than other browser control solutions, a critical factor for production use cases such as UI automation and testing. The model functions within an interaction loop, receiving a user task prompt, a screenshot of the interface, and a history of past actions. It analyzes this input to produce a recommended UI action, such as clicking a button or typing into a field. If necessary, it can request confirmation from the user for riskier tasks, ensuring that actions are executed safely and responsibly.
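Conceptually, the interaction loop described above can be sketched in a few lines of Python. This is an illustrative skeleton only, not Google’s SDK: `propose_action` is a deterministic stub standing in for a real call to the model, and the action dictionaries merely imitate the prompt–screenshot–action cycle described here.

```python
# Minimal sketch of an agent interaction loop (illustrative only;
# `propose_action` stands in for a real call to the Computer Use model).

def propose_action(task, screenshot, history):
    """Stub model: return the next UI action for the task."""
    if not history:
        return {"name": "click_at", "args": {"x": 320, "y": 48}, "risky": False}
    if len(history) == 1:
        return {"name": "type_text_at",
                "args": {"x": 320, "y": 48, "text": task}, "risky": False}
    return {"name": "done", "args": {}, "risky": False}

def run_agent(task, execute, confirm, max_steps=10):
    """Loop: propose an action, gate risky ones, execute, re-screenshot."""
    history = []
    screenshot = b""  # in practice, a fresh screenshot each iteration
    for _ in range(max_steps):
        action = propose_action(task, screenshot, history)
        if action["name"] == "done":
            break
        if action["risky"] and not confirm(action):
            break  # the user declined a flagged action
        screenshot = execute(action)  # perform the action, capture new state
        history.append(action)
    return history
```

With a trivial executor and an always-yes confirmer, the stub agent clicks a search box and types the task before reporting completion; a real integration would replace both stubs with model and browser calls.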
Safety measures are a top priority for Google, given that the model directly controls software interfaces. A multi-layered approach to safety includes a per-step safety service that inspects every proposed action before execution. Developers can define system-level instructions to block or require confirmation for specific actions, and the model incorporates built-in safeguards to prevent actions that could compromise security or violate Google’s prohibited use policies. For instance, if the model encounters a CAPTCHA, it will generate an action to click the checkbox but flag it as requiring user confirmation, ensuring that the system does not proceed without human oversight.
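A developer-side version of that per-step gate might look like the following sketch. The rule sets and action names here are hypothetical placeholders, not Google’s actual safety service; the point is simply that every proposed action is classified before execution.

```python
# Hypothetical per-step safety gate. The action names and rule sets are
# illustrative stand-ins for developer-defined system-level instructions.

BLOCKED = {"delete_account"}  # actions a developer refuses outright
NEEDS_CONFIRMATION = {"submit_payment", "click_captcha_checkbox"}

def review_action(action_name):
    """Classify a proposed action as 'allow', 'confirm', or 'block'."""
    if action_name in BLOCKED:
        return "block"
    if action_name in NEEDS_CONFIRMATION:
        return "confirm"
    return "allow"
```

Under this scheme, a routine click proceeds unimpeded, a CAPTCHA checkbox is paused for human confirmation, and a blocked action never reaches the browser.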
The technical capabilities of Gemini 2.5 Computer Use are extensive, supporting a wide array of built-in UI actions such as click_at, type_text_at, scroll_document, and drag_and_drop. User-defined functions can also be added to extend its reach to mobile or custom environments. The model accepts both image and text input and outputs either text responses or function calls to perform tasks. For optimal results, a screen resolution of 1440×900 is recommended, although the model can function with other sizes as well.
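In practice, the model’s function calls have to be routed to some automation backend by the developer. The sketch below shows one way to do that; the action names match those listed above, but `FakeBrowser` and its methods are stand-ins for a real browser automation layer, not part of Google’s API.

```python
# Sketch of routing the model's function calls to a browser backend.
# FakeBrowser is a stand-in that records actions instead of driving a browser.

class FakeBrowser:
    def __init__(self):
        self.log = []
    def click(self, x, y):
        self.log.append(("click", x, y))
    def type_text(self, text):
        self.log.append(("type", text))
    def scroll(self, direction):
        self.log.append(("scroll", direction))
    def drag(self, x, y, dest_x, dest_y):
        self.log.append(("drag", x, y, dest_x, dest_y))

def make_dispatcher(browser):
    """Map function-call names emitted by the model to backend methods."""
    handlers = {
        "click_at": lambda a: browser.click(a["x"], a["y"]),
        "type_text_at": lambda a: (browser.click(a["x"], a["y"]),
                                   browser.type_text(a["text"]))[-1],
        "scroll_document": lambda a: browser.scroll(a["direction"]),
        "drag_and_drop": lambda a: browser.drag(a["x"], a["y"],
                                                a["dest_x"], a["dest_y"]),
    }
    def dispatch(call):
        name = call["name"]
        if name not in handlers:
            raise ValueError(f"unsupported action: {name}")
        return handlers[name](call.get("args", {}))
    return dispatch
```

Swapping `FakeBrowser` for a headless-browser client such as the one Browserbase provides is where a production integration would diverge from this sketch.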
When it comes to pricing, Gemini 2.5 Computer Use aligns closely with the standard Gemini 2.5 Pro model, following a per-token billing structure. Input tokens are priced at $1.25 per one million tokens for prompts under 200,000 tokens, and $2.50 per million tokens for longer prompts. Output tokens follow the same prompt-length tiers: $10.00 per million tokens for prompts under 200,000 tokens, and $15.00 per million for longer prompts. However, a key difference lies in availability and additional features. While Gemini 2.5 Pro includes a free tier that allows developers to use the model at no cost, Gemini 2.5 Computer Use is exclusively available through a paid tier, with no free access currently offered.
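A short worked example makes the billing arithmetic concrete. This calculator assumes both input and output rates switch at the 200,000-token prompt threshold, as with standard Gemini 2.5 Pro pricing; actual invoices may differ.

```python
# Worked example of the per-token billing described above, assuming the
# rate tier is determined by prompt length (the 200,000-token threshold).

def cost_usd(input_tokens, output_tokens):
    """Estimate the charge for one request, in US dollars."""
    long_prompt = input_tokens > 200_000
    input_rate = 2.50 if long_prompt else 1.25     # $ per 1M input tokens
    output_rate = 15.00 if long_prompt else 10.00  # $ per 1M output tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
```

Under these assumptions, a 100,000-token prompt with a 10,000-token response would cost about $0.225 ($0.125 for input plus $0.10 for output).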
This distinction in access is significant, as it means that all usage of the Computer Use model incurs token-based charges from the outset. Furthermore, while Gemini 2.5 Pro supports optional capabilities like context caching and grounding with Google Search, these features are not available for the Computer Use model at this time. Another important consideration is data handling; output from the Computer Use model is not used to improve Google products in the paid tier, whereas free-tier usage of Gemini 2.5 Pro contributes to model improvement unless the user explicitly opts out.
As organizations begin to adopt Gemini 2.5 Computer Use, early reports highlight its effectiveness across various domains. For instance, Google’s payments platform team has reported that the model successfully recovers over 60% of failed test executions, addressing a significant source of engineering inefficiencies. Similarly, Autotab, a third-party AI agent platform, noted that the model outperformed others on complex data parsing tasks, boosting performance by up to 18% in their most challenging evaluations. Poke.com, a proactive AI assistant provider, has also observed that the Gemini model often operates 50% faster than competing solutions during interface interactions.
The implications of Gemini 2.5 Computer Use extend beyond individual use cases; they signal a broader trend toward the development of autonomous digital workers capable of performing a wide range of tasks across the web. As AI agents become increasingly sophisticated, they will not only understand language but also act on it, transforming how individuals and organizations interact with technology. This evolution raises important questions about the future of work, the role of AI in society, and the ethical considerations surrounding the deployment of such powerful tools.
In conclusion, Google’s launch of the Gemini 2.5 Computer Use model represents a significant milestone in the quest to create general-purpose AI agents. By enabling autonomous interaction with web interfaces, this model paves the way for a new era of AI-driven solutions that can enhance productivity, streamline workflows, and ultimately reshape the digital landscape. As developers and organizations explore the potential of this technology, the possibilities for innovation and transformation are vast, heralding a future where AI agents play an integral role in our daily lives.
