TLDR: We adapted mainstream end-to-end models (Gemini 3 Pro, Claude Sonnet 4.5, KIMI K2.5, Qwen-3.5, and Seed-2.0-Pro), benchmarked their mobile-use capabilities on MobileWorld, analyzed where they currently succeed and fail, and demonstrated how these models actually work on real phones.

It's been a few months since we released MobileWorld, our benchmark for evaluating mobile agents on realistic, long-horizon tasks. Back then, we only evaluated models through agentic frameworks, pairing a reasoning LLM with a separate grounding model. But many frontier models have since shipped built-in GUI grounding capabilities, and the natural question is: can they handle mobile tasks end-to-end, without any external grounding module? We spent the last few weeks finding out, and this post shares everything we learned. We'll walk through how we adapted five frontier models for direct end-to-end evaluation, working out their coordinate systems, action formats, and multi-turn conversation structures, since none of this is well-documented. We'll show which model came out on top, what it actually costs to run them, and how choices like the number of history screenshots can make or break performance in ways that differ across models. Most importantly, we'll demonstrate these models working on real physical phones and show you how to set this up yourself in just 6 steps.
But first — here's what it actually looks like:
We got Gemini and Claude running on actual physical devices, not just emulators. Watch them navigate complex, multi-step tasks end-to-end:
Instruction: 帮我去小红书整理下我关注的WebAgentLab转发的5篇最近的论文title,整理给我 (Please go to Xiaohongshu and help me collect the titles of the 5 most recent papers reposted by the account WebAgentLab that I follow, and organize them for me.) Model: Claude
Instruction: Go to X (formerly Twitter) and navigate to Elon Musk's profile page. Find his 5 most recent tweets and provide a summary of their main topics and content. Model: Gemini
Now let's dig into how we got here.
If you've tried to use GPT-5, Gemini 3 Pro, or Claude 4.5 Sonnet as GUI agents, you'll know it's not as simple as sending screenshots and asking for actions. A review of the AndroidWorld leaderboards shows that most existing work builds agent frameworks on top of general-purpose models (e.g., AutoDevice using Gemini-3-Pro, DroidRun using GPT-5), while the end-to-end execution capabilities of these models on mobile GUI tasks have been largely overlooked. Each model has its own quirks around coordinate systems, action formats, and multi-turn conversation structures.
| Model | Coordinate System | Notes |
|---|---|---|
| Gemini 3 Pro | Relative (0-1000) | Uses normalized coordinates |
| Seed-2.0-Pro | | |
| Qwen-3.5 | | |
| Kimi-K2.5 | Relative (0-1) | |
| Claude 4.5 Sonnet | Absolute pixels | Requires image resize to 1280×720 |
| GPT-5 | — | Relatively weak grounding capabilities; not evaluated |
(For a deeper dive into grounding evaluation, see our grounding benchmark blog post.)
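To make these differences concrete, here is a minimal sketch of mapping each model's click coordinates onto actual device pixels, based on the conventions in the table above. The function name and model identifier strings are our own illustration, not any provider's official API.

```python
def to_device_pixels(x: float, y: float, model: str,
                     screen_w: int, screen_h: int) -> tuple[int, int]:
    """Map a model-emitted click coordinate to absolute screen pixels.

    Scale conventions follow the table above; the exact wire formats
    for each provider are assumptions for illustration.
    """
    if model == "gemini-3-pro":
        # Normalized to a 0-1000 grid
        return int(x / 1000 * screen_w), int(y / 1000 * screen_h)
    if model == "kimi-k2.5":
        # Normalized to 0-1
        return int(x * screen_w), int(y * screen_h)
    if model == "claude-sonnet-4.5":
        # Absolute pixels on the 1280x720 resized screenshot,
        # scaled back to the real screen resolution
        return int(x / 1280 * screen_w), int(y / 720 * screen_h)
    raise ValueError(f"unknown coordinate convention for {model}")

# A Gemini click at (500, 500) on a 1080x2400 screen lands mid-screen:
print(to_device_pixels(500, 500, "gemini-3-pro", 1080, 2400))  # (540, 1200)
```

Getting this mapping wrong is one of the easiest ways to make a capable model look broken, since every tap silently lands in the wrong place.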
To compare the models fairly, we wrapped each one in the same interface:

- A shared action space (`click`, `long_press`, `drag`, `scroll`, `input_text`, etc.) with explicit JSON schemas that include coordinate parameters. For example, `click` expects `{"action_type": "click", "coordinate": [x, y]}`, where coordinates are normalized to a configurable scale factor (0-1000 for Gemini, absolute pixels for Claude)
- A `Thought: ... Action: {...}` output structure, making it easy to parse and debug agent reasoning
- `ask_user` and `answer` actions to handle MobileWorld's user interaction tasks

Here are the results on MobileWorld:

| Model | GUI-Only | Ask User |
|---|---|---|
| Seed 2.0 pro | 63.2 | 61.4 |
| Gemini 3 pro | 51.3 | 29.5 |
| KIMI K2.5 | 49.6 | 51.2 |
| Claude Sonnet 4.5 | 47.8 | 38.6 |
| Qwen3.5-397B-A17B | 42.7 | 54.4 |
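As a concrete example of the `Thought: ... Action: {...}` structure mentioned above, here is a minimal parsing sketch. The function name and regex are our own illustration of the format, not code from our harness.

```python
import json
import re


def parse_step(output: str) -> tuple[str, dict]:
    """Split a 'Thought: ... Action: {...}' response into its parts.

    Assumes the action payload is a single JSON object, as in the
    schema examples above.
    """
    match = re.search(r"Thought:\s*(.*?)\s*Action:\s*(\{.*\})", output, re.S)
    if match is None:
        raise ValueError("response does not match the Thought/Action format")
    thought = match.group(1)
    action = json.loads(match.group(2))  # e.g. {"action_type": "click", ...}
    return thought, action


response = ('Thought: The search icon is in the top bar. '
            'Action: {"action_type": "click", "coordinate": [512, 88]}')
thought, action = parse_step(response)
print(action["action_type"])  # click
```

Keeping the action as strict JSON makes failed steps easy to diagnose: when a task goes wrong, the trace shows exactly what the model believed and what it tried to do.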