The Gemini 2.5 Computer Use model is available in preview via the API. This specialized model, built on Gemini 2.5 Pro, powers agents that can interact with user interfaces. Earlier this year, we announced that computer use capabilities would come to the Gemini API, enabling developers to build agents that operate computers. Today, we are introducing the Gemini 2.5 Computer Use model, a specialized model built on Gemini 2.5 Pro's advanced visual understanding and reasoning capabilities, which enables agents to understand and operate within user interfaces (UIs). It achieves state-of-the-art performance on multiple web and mobile control benchmarks, all at lower latency. The model is available via the Gemini API in Google AI Studio and Vertex AI.
While AI models can already interface with software through structured APIs, many digital tasks still require reasoning within graphical user interfaces (GUIs), for example, filling in and submitting forms. To complete these tasks, agents must read and act on web pages and applications just as humans do: by clicking, typing, and scrolling. The model's core capabilities are exposed through the new computer_use tool in the Gemini API and are designed to operate in a loop. Inputs to the tool are the user's request, a screenshot of the environment, and a history of recent actions. The request can also exclude certain UI actions from the model's repertoire or supply additional custom functions.
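The per-turn inputs described above can be sketched as a simple data structure. This is an illustrative shape only; the field names below are our own, not the official SDK's types.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical container for the inputs the computer_use tool consumes
# each turn: the user's request, the latest screenshot, recent action
# history, and any UI actions the caller wants to disallow.
@dataclass
class ComputerUseRequest:
    user_goal: str                                             # the user's request
    screenshot_png: bytes                                      # current UI screenshot
    action_history: List[str] = field(default_factory=list)    # recent actions taken
    excluded_actions: List[str] = field(default_factory=list)  # UI actions to disallow

# Example turn: the agent has already clicked and typed once.
req = ComputerUseRequest(
    user_goal="Find the cheapest flight to Tokyo",
    screenshot_png=b"\x89PNG...",                              # placeholder bytes
    action_history=["click_at(120, 340)", "type_text('Tokyo')"],
    excluded_actions=["drag_and_drop"],
)
```

Each iteration of the loop rebuilds this payload with a fresh screenshot and the updated history.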
The model processes these inputs and generates a response, typically as a function call for a specific UI action like clicking or typing. For certain actions, such as making a purchase, a separate confirmation step may be required from the end user. Once the client-side code executes the action, a new screenshot and the current URL are sent back to the Computer Use model to continue the loop. This process repeats for multiple actions until the task is either completed, an error occurs, or the interaction is ended by a safety response or user decision.
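The loop above can be sketched in client-side code. This is a minimal illustration with hypothetical model and client interfaces (the real SDK's types and termination signals differ); the "done" action and the helper names are assumptions for the sake of the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class UIAction:
    name: str                             # e.g. "click_at", "type_text", "done"
    requires_confirmation: bool = False   # e.g. completing a purchase

def run_agent(goal: str, model, client, max_turns: int = 20) -> str:
    """Drive the observe -> decide -> act loop until the task ends."""
    history: List[UIAction] = []
    screenshot, url = client.observe()    # initial screenshot + current URL
    for _ in range(max_turns):
        # Model sees the goal, the current screenshot/URL, and recent actions.
        action = model.next_action(goal, screenshot, url, history)
        if action.name == "done":
            return "completed"
        if action.requires_confirmation and not client.confirm(action):
            return "declined"             # end user rejected the step
        client.execute(action)            # click, type, scroll, ...
        history.append(action)
        screenshot, url = client.observe()  # fresh state for the next turn
    return "max_turns_reached"
```

The safety-relevant branch is the confirmation check: actions the model flags as consequential are surfaced to the end user before the client executes them, and a rejection ends the loop.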