TLDR: We adapted mainstream end-to-end models (Gemini 3 Pro, Claude Sonnet 4.5, KIMI K2.5, Qwen-3.5, and Seed-2.0-Pro), benchmarked their mobile-use capabilities on MobileWorld, analyzed where they currently succeed and fail, and demonstrated how these models actually work on real phones.

It's been a few months since we released MobileWorld, our benchmark for evaluating mobile agents on realistic, long-horizon tasks. Back then, we only evaluated models through agentic frameworks, pairing a reasoning LLM with a separate grounding model. But many frontier models have since shipped built-in GUI grounding capabilities, and the natural question is: can they handle mobile tasks end-to-end, without any external grounding module? We spent the last few weeks finding out, and this post shares everything we learned. We'll walk through how we adapted five frontier models for direct end-to-end evaluation, working out their coordinate systems, action formats, and multi-turn conversation structures, since none of this is well-documented. We'll show which model came out on top, what it actually costs to run them, and how choices like the number of history screenshots can make or break performance in ways that differ across models. Most importantly, we'll demonstrate these models working on real physical phones and show you how to set this up yourself in just 6 steps.
But first — here's what it actually looks like:
We got Gemini and Claude running on actual physical devices, not just emulators. Watch them navigate complex, multi-step tasks end-to-end:
Instruction: 帮我去小红书整理下我关注的WebAgentLab转发的5篇最近的论文title,整理给我 (Please go to Xiaohongshu and help me collect the titles of the 5 most recent papers reposted by the account WebAgentLab that I follow, and organize them for me.) Model: Claude
Instruction: Go to X (formerly Twitter) and navigate to Elon Musk's profile page. Find his 5 most recent tweets and provide a summary of their main topics and content. Model: Gemini
Now let's dig into how we got here.
If you've tried to use GPT-5, Gemini 3 Pro, or Claude 4.5 Sonnet as GUI agents, you'll know it's not as simple as sending screenshots and asking for actions. A review of the AndroidWorld leaderboards shows that most existing work builds agent frameworks on top of general-purpose models (e.g., AutoDevice using Gemini-3-Pro, DroidRun using GPT-5), while the end-to-end execution capabilities of these models on mobile GUI tasks have been largely overlooked. Each model has its own quirks around coordinate systems, action formats, and multi-turn conversation structures.
| Model | Coordinate System | Notes |
|---|---|---|
| Gemini 3 Pro | Relative (0-1000) | Uses normalized coordinates |
| Seed-2.0-Pro | | |
| Qwen-3.5 | | |
| Kimi-K2.5 | Relative (0-1) | |
| Claude 4.5 Sonnet | Absolute pixels | Requires image resize to 1280×720 |
| GPT-5 | — | Relatively weak grounding capabilities; not evaluated |
(For a deeper dive into grounding evaluation, see our grounding benchmark blog post.)
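To make these differences concrete, here is a minimal sketch of mapping each model's click coordinates onto actual device pixels, based on the conventions in the table above. The function name and model identifier strings are our own illustration, not any provider's official API.

```python
def to_device_pixels(x: float, y: float, model: str,
                     screen_w: int, screen_h: int) -> tuple[int, int]:
    """Map a model-emitted click coordinate to absolute screen pixels.

    Scale conventions follow the table above; the exact wire formats
    for each provider are assumptions for illustration.
    """
    if model == "gemini-3-pro":
        # Normalized to a 0-1000 grid
        return int(x / 1000 * screen_w), int(y / 1000 * screen_h)
    if model == "kimi-k2.5":
        # Normalized to 0-1
        return int(x * screen_w), int(y * screen_h)
    if model == "claude-sonnet-4.5":
        # Absolute pixels on the 1280x720 resized screenshot,
        # scaled back to the real screen resolution
        return int(x / 1280 * screen_w), int(y / 720 * screen_h)
    raise ValueError(f"unknown coordinate convention for {model}")

# A Gemini click at (500, 500) on a 1080x2400 screen lands mid-screen:
print(to_device_pixels(500, 500, "gemini-3-pro", 1080, 2400))  # (540, 1200)
```

Getting this mapping wrong is one of the easiest ways to make a capable model look broken, since every tap silently lands in the wrong place.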
To compare the models fairly, we wrapped each one in the same interface:

- A shared action space (`click`, `long_press`, `drag`, `scroll`, `input_text`, etc.) with explicit JSON schemas that include coordinate parameters. For example, `click` expects `{"action_type": "click", "coordinate": [x, y]}`, where coordinates are normalized to a configurable scale factor (0-1000 for Gemini, absolute pixels for Claude)
- A `Thought: ... Action: {...}` output structure, making it easy to parse and debug agent reasoning
- `ask_user` and `answer` actions to handle MobileWorld's user interaction tasks

Here are the results on MobileWorld:

| Model | GUI-Only | Ask User |
|---|---|---|
| Seed 2.0 pro | 63.2 | 61.4 |
| Gemini 3 pro | 51.3 | 29.5 |
| KIMI K2.5 | 49.6 | 51.2 |
| Claude Sonnet 4.5 | 47.8 | 38.6 |
| Qwen3.5-397B-A17B | 42.7 | 54.4 |
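As a concrete example of the `Thought: ... Action: {...}` structure mentioned above, here is a minimal parsing sketch. The function name and regex are our own illustration of the format, not code from our harness.

```python
import json
import re


def parse_step(output: str) -> tuple[str, dict]:
    """Split a 'Thought: ... Action: {...}' response into its parts.

    Assumes the action payload is a single JSON object, as in the
    schema examples above.
    """
    match = re.search(r"Thought:\s*(.*?)\s*Action:\s*(\{.*\})", output, re.S)
    if match is None:
        raise ValueError("response does not match the Thought/Action format")
    thought = match.group(1)
    action = json.loads(match.group(2))  # e.g. {"action_type": "click", ...}
    return thought, action


response = ('Thought: The search icon is in the top bar. '
            'Action: {"action_type": "click", "coordinate": [512, 88]}')
thought, action = parse_step(response)
print(action["action_type"])  # click
```

Keeping the action as strict JSON makes failed steps easy to diagnose: when a task goes wrong, the trace shows exactly what the model believed and what it tried to do.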