TLDR: We adapted five frontier end-to-end models — wait, no dashes: We adapted five frontier end-to-end models (Gemini 3 Pro, Claude Sonnet 4.5, KIMI K2.5, Qwen-3.5, and Seed-2.0-Pro), benchmarked their mobile-use capabilities on MobileWorld, analyzed where current models succeed and fail, and demonstrated how these models actually work on real phones.


It's been a few months since we released MobileWorld, our benchmark for evaluating mobile agents on realistic, long-horizon tasks. Back then, we only evaluated models through agentic frameworks, pairing a reasoning LLM with a separate grounding model. But many frontier models have since shipped built-in GUI grounding capabilities, and the natural question is: can they handle mobile tasks end-to-end, without any external grounding module? We spent the last few weeks finding out, and this post shares everything we learned. We'll walk through how we adapted five frontier models for direct end-to-end evaluation, which meant working out their coordinate systems, action formats, and multi-turn conversation structures, since none of this is well documented. We'll show which model came out on top, what it actually costs to run them, and how choices like the number of history screenshots can make or break performance in ways that differ across models. Most importantly, we'll demonstrate these models working on real physical phones and show you how to set this up yourself in just six steps.

But first — here's what it actually looks like:

See It in Action: Frontier Models Controlling Real Phones

We got Gemini and Claude running on actual physical devices, not just emulators. Watch them navigate complex, multi-step tasks end-to-end:

Instruction: 帮我去小红书整理下我关注的WebAgentLab转发的5篇最近的论文title,整理给我 (Please go to Xiaohongshu and help me collect the titles of the 5 most recent papers reposted by the account WebAgentLab that I follow, and organize them for me.) Model: Claude

Instruction: Go to X (formerly Twitter) and navigate to Elon Musk's profile page. Find his 5 most recent tweets and provide a summary of their main topics and content. Model: Gemini

Now let's dig into how we got here.
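Under the hood, driving a physical phone needs nothing exotic: capture a screenshot, send it to the model, and replay the model's chosen action with standard adb input events. Here's a rough sketch of the action-dispatch side; the `action` dict schema is our own simplification for illustration, not any provider's exact output format:

```python
def adb_action(action):
    """Translate one parsed agent action into an adb shell command string.

    `action` is a dict like {"type": "tap", "x": 540, "y": 1200}; this
    schema is a hypothetical simplification of the models' outputs.
    Coordinates are absolute device pixels.
    """
    t = action["type"]
    if t == "tap":
        return f"adb shell input tap {action['x']} {action['y']}"
    if t == "swipe":
        # duration in ms controls scroll speed; 300 ms is a sane default
        return (f"adb shell input swipe {action['x1']} {action['y1']} "
                f"{action['x2']} {action['y2']} {action.get('ms', 300)}")
    if t == "text":
        return f"adb shell input text {action['text']!r}"
    raise ValueError(f"unsupported action: {t}")
```

The observation side is symmetric: `adb exec-out screencap -p` yields the PNG you feed back to the model on the next turn.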


The Challenge: Making Frontier Models Work as End-to-End Mobile Agents

If you've tried to use GPT-5, Gemini 3 Pro, or Claude 4.5 Sonnet as GUI agents, you'll know it's not as simple as sending screenshots and asking for actions. A look at the AndroidWorld leaderboard shows that most existing work builds agent frameworks around general-purpose models (e.g., AutoDevice using Gemini-3-Pro, DroidRun using GPT-5), while the end-to-end execution capabilities of these models on mobile GUI tasks have been largely overlooked. Each model has its own quirks around coordinate systems, action formats, and multi-turn conversation structures.
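The multi-turn structure matters more than it sounds: keeping every past screenshot in context quickly blows up token cost, so a common trick is to keep only the last few screenshots as images and collapse older turns to text. A minimal sketch of that trimming, using a generic message schema of our own rather than any one provider's format:

```python
def build_messages(task, steps, max_screenshots=2):
    """Build the chat history for the next agent step.

    `steps` is a list of (screenshot, action_text) pairs from earlier turns.
    Only the last `max_screenshots` screenshots are kept as images; older
    turns keep their action text but drop the image. The message schema
    here is a hypothetical simplification.
    """
    messages = [{"role": "system", "content": task}]
    cutoff = len(steps) - max_screenshots
    for i, (shot, action) in enumerate(steps):
        if i >= cutoff:
            # recent turn: keep the actual screenshot bytes
            messages.append({"role": "user",
                             "content": [{"type": "image", "data": shot}]})
        else:
            # old turn: replace the image with a cheap text placeholder
            messages.append({"role": "user",
                             "content": "(earlier screenshot omitted)"})
        messages.append({"role": "assistant", "content": action})
    return messages
```

As we'll see below, the best value of `max_screenshots` differs across models, so it's worth tuning per model rather than fixing globally.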

| Model | Coordinate System | Notes |
| --- | --- | --- |
| Gemini 3 Pro | Relative (0-1000) | Uses normalized coordinates |
| Seed-2.0-Pro | Relative (0-1) | |
| Qwen-3.5 | Relative (0-1) | |
| Kimi-K2.5 | Relative (0-1) | |
| Claude 4.5 Sonnet | Absolute pixels | Requires image resize to 1280×720 |
| GPT-5 | N/A | Relatively weak grounding capabilities; not evaluated |

(For a deeper dive into grounding evaluation, see our grounding benchmark blog post.)
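Normalizing these conventions to device pixels is mechanical but easy to get subtly wrong. A minimal conversion helper, assuming the coordinate conventions listed in the table above (the system labels are our own shorthand):

```python
def to_pixels(x, y, system, width, height):
    """Convert a model-emitted coordinate pair to absolute screen pixels.

    system: "rel1000" for 0-1000 normalized coordinates (Gemini-style),
            "rel1"    for 0-1 normalized coordinates,
            "abs"     for coordinates already in pixels (Claude-style,
                      in which case width/height must match the resized
                      image you actually sent to the model).
    """
    if system == "rel1000":
        return round(x / 1000 * width), round(y / 1000 * height)
    if system == "rel1":
        return round(x * width), round(y * height)
    if system == "abs":
        return round(x), round(y)
    raise ValueError(f"unknown coordinate system: {system}")
```

The absolute-pixel case is the trap: if you resize screenshots to 1280×720 before sending them, the returned coordinates live in that resized space and must be rescaled to the device's native resolution before tapping.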

Results: What Actually Works?

Overall Performance

| Model | GUI-Only | Ask User |
| --- | --- | --- |
| Seed 2.0 pro | 63.2 | 61.4 |
| Gemini 3 pro | 51.3 | 29.5 |
| KIMI K2.5 | 49.6 | 51.2 |
| Claude Sonnet 4.5 | 47.8 | 38.6 |
| Qwen3.5-397B-A17B | 42.7 | 54.4 |