Many major AI companies have released demo videos in which a model navigates a mobile or desktop interface with perfect precision. Reproducing those results is a different story. We spent a lot of time testing frontier models on GUI grounding — the seemingly simple task of clicking the right pixel — and found that the gap between reported numbers and real deployments hides in details that nobody publishes: grounding paradigms, coordinate systems, prompt templates, thinking patterns, and tool augmentations. This post summarizes our journey and the lessons we learned chasing down those details.

In 2026, almost every major multimodal model ships with a GUI grounding score in its technical report. The vision is compelling: imagine never touching a mouse again. You say "file a reimbursement for yesterday's lunch," and an AI agent opens the portal, clicks through three dropdown menus, uploads the receipt, and hits submit — all from a single sentence. That promise is what has made GUI grounding a headline metric for frontier multimodal models. But a metric on a spec sheet and a working agent on your desktop are very different things.
The major AI labs showcase grounding as a headline capability. Technical reports show impressive numbers. Demo videos show flawless execution. What they rarely share is how they got there: which grounding paradigm, which coordinate system, what image resolution, whether a zoom-in tool was involved, and what prompt was actually used. For anyone trying to build on top of these models, this opacity is the real obstacle.
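The coordinate-system detail alone can silently break a reproduction: some models emit coordinates normalized to a fixed grid (e.g., 0–1000), while others emit pixel coordinates in the resolution of the image they were shown. As a minimal sketch — the scheme names here are illustrative, not any specific model's documented convention — a conversion helper looks like:

```python
def to_screen_pixels(x, y, img_w, img_h, scheme="norm1000"):
    """Convert a model-emitted coordinate to screen pixels.

    The `scheme` values are illustrative assumptions:
      - "norm1000": coordinates normalized to a 0-1000 grid
      - "pixel":    coordinates already in pixels of the image the model saw
    """
    if scheme == "norm1000":
        return round(x / 1000 * img_w), round(y / 1000 * img_h)
    if scheme == "pixel":
        return x, y
    raise ValueError(f"unknown coordinate scheme: {scheme}")


# Example: a model outputs (500, 500) on a 0-1000 grid for a 1920x1080 screen.
print(to_screen_pixels(500, 500, 1920, 1080))
```

Picking the wrong scheme, or forgetting that the image was resized before being sent to the model, shifts every click and can turn a strong model into one that apparently misses everything.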
We set out to close this gap. We evaluated Gemini-3-Pro, Claude-Sonnet-4.5, Seed1.8, Kimi-K2.5 and our MAI-UI across two challenging benchmarks:
Our goal was threefold:
The way models approach grounding has evolved significantly over time. Understanding this evolution is essential context for reproducing the grounding performance of the latest models.
In earlier development stages, MLLMs lacked fine-grained grounding capabilities. Consequently, researchers employed specialized, small-scale models to detect UI elements and overlay numbered bounding boxes on the interface. In this paradigm, the MLLM functions as a reasoning engine, selecting the appropriate identifier based on the natural language instruction.
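The mechanics of this paradigm can be sketched in a few lines. This is a simplified illustration under our own assumptions (the prompt wording and box format are ours, not any lab's published template): an external detector supplies boxes, the prompt enumerates them, and the model's chosen index is mapped back to a click point at the box center.

```python
def build_som_prompt(instruction, boxes):
    """Build a Set-of-Marks-style text prompt.

    boxes: list of (x1, y1, x2, y2) tuples from an external UI-element
    detector. In practice the screenshot annotated with the same numbered
    marks is attached alongside this text.
    """
    lines = [
        f"Instruction: {instruction}",
        "The screenshot is annotated with numbered boxes:",
    ]
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        lines.append(f"  [{i}] box=({x1},{y1},{x2},{y2})")
    lines.append("Reply with the number of the element to click.")
    return "\n".join(lines)


def index_to_click(boxes, idx):
    """Map the model's chosen mark back to a click point (the box center)."""
    x1, y1, x2, y2 = boxes[idx]
    return ((x1 + x2) // 2, (y1 + y2) // 2)
```

Note the division of labor: the detector owns localization, so the MLLM never has to emit a coordinate — it only has to pick an integer, which is why this paradigm worked even before models had pixel-level grounding.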
We evaluated Claude-Sonnet-4.5, Seed1.8, and Kimi-K2.5 on the refined OSWorld-G benchmark, using Set-of-Marks (SoM) annotations extracted via OmniParser V2. To mitigate the visual noise shown in Figure 1, we provided the original unmarked images alongside the SoM visualizations, ensuring that bounding box overlays did not obscure critical interface details.
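For scoring, we follow the criterion commonly used in GUI grounding evaluation: a prediction counts as correct if the predicted click point falls inside the ground-truth element's bounding box. A minimal sketch of that scoring rule (our own simplified harness, not the benchmark's official code):

```python
def point_in_box(point, box):
    """True if the click point lies inside the (x1, y1, x2, y2) box."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def grounding_accuracy(pred_points, gt_boxes):
    """Fraction of predicted click points landing inside their target box."""
    hits = sum(point_in_box(p, b) for p, b in zip(pred_points, gt_boxes))
    return hits / len(gt_boxes)
```

Under this rule a model gets no partial credit for a near miss, which is part of why small coordinate-system mistakes show up as dramatic score drops.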