Look Where It Matters

High-Resolution Crops Retrieval for Efficient VLMs
Nimrod Shabtay1,2*, Moshe Kimhi1,3*, Artem Spector1, Sivan Haray1, Ehud Rivlin3
Chaim Baskin4, Raja Giryes2, Eli Schwartz1
1IBM Research   2Tel-Aviv University   3Technion   4Ben-Gurion University
*Equal contribution
Achieves Qwen2.5-VL accuracy using only 36% of the visual tokens.
4.4× faster than dynamic methods.

Abstract

Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational cost, while low-resolution inputs are efficient but may miss critical visual information such as small text.

We present AwaRes, a spatial-on-demand framework that resolves this accuracy–efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward combining semantic answer correctness with explicit crop-cost penalties.

AwaRes provides a practical, deployment-friendly path to high-detail VLM reasoning under tight compute budgets.

Overview

AwaRes processes a low-resolution global view by default. When additional detail is required, it invokes a structured tool-call to request only specific high-resolution sub-regions — then answers conditioned on both views.
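As a concrete illustration, a structured crop-request might look like the following (a minimal sketch; the tool name, field names, and message format are our assumptions, not the paper's exact protocol):

```python
# Hypothetical tool-call payload emitted by the model after viewing the
# low-resolution image. "retrieve_crops" and "crop_ids" are illustrative
# names; AwaRes's actual schema may differ.
tool_call = {
    "name": "retrieve_crops",
    "arguments": {"crop_ids": [3, 7]},  # indices into the predefined crop set
}
# The runtime decodes the request, returns the corresponding high-resolution
# crops as additional image inputs, and the model answers in the next turn.
```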

AwaRes teaser
Figure 1: Left: Given a low-resolution image, AwaRes uses tool-calling to request only the needed crops. Right: Accuracy vs. retained visual tokens — AwaRes achieves Qwen2.5-VL accuracy (80.3%) using only 36% of the visual tokens.

Main Results

AwaRes matches full-resolution performance while using only 36% of visual tokens, outperforming all efficient baselines across six benchmarks.

| Model | Chart | Doc | OCR | POPE | Real | V* | Avg |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 79.80 (1.00) | 94.00 (1.00) | 81.10 (1.00) | 87.87 (1.00) | 68.80 (1.00) | 71.20 (1.00) | 80.46 (1.00) |
| Qwen2.5-VL-7B-LR | 65.00 (0.25) | 91.00 (0.25) | 70.70 (0.25) | 84.41 (0.25) | 66.00 (0.25) | 63.20 (0.25) | 73.39 (0.25) |
| Holo-V (70%) | 69.32 (0.70) | 76.40 (0.70) | 72.40 (0.70) | 87.37 (0.70) | 68.89 (0.70) | 69.11 (0.70) | 73.92 (0.70) |
| SparseVLM (70%) | 75.80 (0.70) | 87.20 (0.70) | 79.30 (0.70) | 85.40 (0.70) | 68.50 (0.70) | 54.45 (0.70) | 75.11 (0.70) |
| VisionZIP (70%) | 76.72 (0.70) | 90.75 (0.70) | 72.70 (0.70) | 87.86 (0.70) | 64.84 (0.70) | 65.97 (0.70) | 76.47 (0.70) |
| VisionThink | 79.90 (1.15) | 90.35 (0.32) | 80.10 (0.83) | 86.70 (0.34) | 66.60 (0.55) | 71.73 (0.49) | 79.23 (0.61) |
| AwaRes (Ours) | 80.64 (0.32) | 94.43 (0.28) | 81.30 (0.42) | 85.73 (0.27) | 68.50 (0.43) | 71.20 (0.42) | 80.30 (0.36) |

Table 1: Accuracy per benchmark, with RTR in parentheses. RTR = fraction of visual tokens retained (lower is better). AwaRes matches the full-resolution model while retaining only 36% of the visual tokens.

Performance vs. Wall Clock Time

Performance vs wall clock time
Figure 3: AwaRes achieves sub-second latency across benchmarks via short tool calls, while VisionThink's reasoning traces increase decoding time (4.3s vs. 0.6s on ChartQA).

Qualitative Results

Examples of AwaRes adaptively selecting crops across charts, documents, and natural images. Each shows the question, tool call, low-res input, and retrieved high-res crop.

Method

AwaRes learns a coupled-decision policy (CDP) that jointly decides whether additional resolution is needed and where to acquire it, by selecting a subset of crops from a predefined candidate set.
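To make the candidate set concrete, one simple construction is a regular tiling of the full-resolution image (a sketch under our own assumptions; the paper defines the actual crop set):

```python
def make_candidate_crops(width: int, height: int, grid: int = 4):
    """Illustrative candidate crop set: a grid x grid tiling of the image.

    Returns (x0, y0, x1, y1) pixel boxes. Crop size, overlap, and count
    are assumptions for illustration, not AwaRes's actual configuration.
    """
    cell_w, cell_h = width / grid, height / grid
    return [
        (col * cell_w, row * cell_h, (col + 1) * cell_w, (row + 1) * cell_h)
        for row in range(grid)
        for col in range(grid)
    ]
```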

Step 1: Resolution Sufficiency Labeling

An LLM judge compares low-res vs. high-res outputs against ground truth to determine when cropping is needed.
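A minimal sketch of this labeling step, assuming a binary-verdict judge (function names are hypothetical):

```python
def label_resolution_sufficiency(judge, question, answer_lr, answer_hr, gt):
    """Compare low-res vs. high-res answers against ground truth.

    `judge(question, answer, gt)` is any LLM wrapper returning True when
    the answer matches the ground truth. The handling of samples where
    neither answer is correct is our assumption.
    """
    if judge(question, answer_lr, gt):
        return "LR"   # low resolution suffices: direct-answer trajectory
    if judge(question, answer_hr, gt):
        return "HR"   # cropping needed: route to the oracle grounding model
    return None       # neither resolution yields a correct answer; skip sample
```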

Step 2: Crop Target Construction

An oracle grounding model localizes the visual evidence, producing bounding boxes mapped to a discrete crop set via IoU.
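The IoU mapping itself is straightforward; a sketch follows (the IoU threshold is an assumed value):

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def map_box_to_crops(evidence_box, candidate_crops, thresh=0.1):
    """Map an oracle bounding box to the discrete crop set: keep every
    candidate whose IoU with the evidence box clears the threshold."""
    return [i for i, crop in enumerate(candidate_crops)
            if iou(evidence_box, crop) > thresh]
```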

Step 3: Cold-Start SFT

Supervised fine-tuning on multi-turn tool-use trajectories teaches the model the tool protocol and yields a reference policy.

Step 4: Multi-Turn GRPO

RL with a composite reward balancing correctness against crop-cost penalties refines tool usage for efficiency.

Data Curation Pipeline

Data curation pipeline
Figure 2: Each sample is processed at two resolutions. An LLM judge determines sufficiency; insufficient cases are routed to an oracle for crop localization, producing multi-turn tool-use trajectories.

Crop Annotation Examples

The oracle grounding model localizes answer regions, which are mapped to discrete crops for training.

Crop example - plane
Example 1: "What website is being advertised on the plane?" — The oracle localizes the text region; the selected crop reveals "Jetstar.com".
Crop example - document
Example 2: "What is the Grand Total for Net Block As of 31.3.2012?" — The oracle targets the relevant table cell; the crop enables reading "52708.86".

Training Pipeline

AwaRes is trained in two stages: supervised cold-start (SFT) to teach the tool-calling protocol, followed by multi-turn GRPO to optimize the accuracy–efficiency trade-off.

Stage 1: Cold-Start SFT

We fine-tune on a mixture of direct-answer trajectories (single turn, low-res only) and tool-call-then-answer trajectories (two turns with crop retrieval). This teaches the coupled-decision policy, jointly deciding whether to escalate and where to crop, and yields a reference policy $\pi_{\text{ref}}$ for the RL stage.

We minimize a weighted negative log-likelihood over assistant tokens:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} w_t \log \pi_\theta(y_t \mid h_t)$$

where $h_t$ is the dialogue history at step $t$ and $w_t$ is a per-token weight. We set $w_t{=}5$ for the tool-call turn (1 otherwise) to stabilize learning of the first-turn decision; this single change boosts accuracy from 77.9 to 79.7 and yields 100% tool-call formatting validity. We use trajectory-level SFT, optimizing the full two-turn interaction as a single sample. After SFT, the model is frozen as $\pi_{\text{ref}}$.
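A sketch of the weighted objective in PyTorch (the tensor layout and the per-token normalization are our assumptions; the paper's equation is a plain weighted sum):

```python
import torch
import torch.nn.functional as F

def weighted_sft_loss(logits, labels, turn_ids, tool_call_turn, w_tool=5.0):
    """Weighted NLL over assistant tokens.

    logits: (T, V) next-token predictions; labels: (T,) targets with -100
    on non-assistant positions; turn_ids: (T,) turn index of each token.
    Tokens in the tool-call turn receive weight w_tool = 5, others 1.
    """
    mask = labels != -100
    nll = F.cross_entropy(logits[mask], labels[mask], reduction="none")
    weights = torch.ones_like(nll)
    weights[turn_ids[mask] == tool_call_turn] = w_tool
    return (weights * nll).sum() / mask.sum()  # mean over assistant tokens
```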

Stage 2: Multi-Turn GRPO

The SFT model has learned the tool-calling protocol but still uses it inefficiently, often requesting crops when the low-resolution view already suffices. We apply GRPO on full multi-turn interactions to improve tool usage and overall efficiency, using a composite reward:

$$R(\tau) = R_{\text{ans}}(\hat{a}, a^\star) \;-\; C_{\text{tool}}(C, y)$$

$R_{\text{ans}}$ is the cosine similarity between sentence-transformer embeddings of the predicted and ground-truth answers. The tool cost is asymmetric: missing a necessary crop (the sufficiency label $y$ indicates high resolution is needed, yet the requested crop set $C$ is empty) is penalized more heavily than making an unnecessary request:

$$C_{\text{tool}}(C, y) = \begin{cases} \alpha_{\text{miss}} & \text{if } y{=}\texttt{HR} \text{ and } C{=}\emptyset \\[4pt] \alpha_{\text{use}} + \lambda |C| & \text{if } C{\neq}\emptyset \\[4pt] 0 & \text{if } y{=}\texttt{LR} \text{ and } C{=}\emptyset \end{cases}$$
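Putting the two terms together, a minimal sketch of the trajectory reward (the embedder choice and all cost coefficients are illustrative assumptions, not the paper's values):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder

def composite_reward(pred_answer, gold_answer, crops, label,
                     alpha_miss=0.5, alpha_use=0.1, lam=0.05):
    """R(tau) = R_ans - C_tool. `crops` is the set of requested crop ids;
    `label` is the sufficiency label ("HR" or "LR"). Coefficient values
    here are placeholders for illustration."""
    emb = embedder.encode([pred_answer, gold_answer], convert_to_tensor=True)
    r_ans = util.cos_sim(emb[0], emb[1]).item()
    if crops:                # any request pays a base cost plus a per-crop cost
        c_tool = alpha_use + lam * len(crops)
    elif label == "HR":      # needed high resolution but never asked
        c_tool = alpha_miss
    else:                    # correctly answered from the low-res view alone
        c_tool = 0.0
    return r_ans - c_tool
```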

For each prompt, we sample $G{=}8$ trajectories and compute group-relative advantages. The policy is optimized with a PPO-style clipped objective regularized toward $\pi_{\text{ref}}$:

$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}_{x} \Bigg[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|\tau_i|}\sum_{t=1}^{|\tau_i|} \min\!\Big(r_t^{(i)}\hat{A}_i,\; \text{clip}(r_t^{(i)}, 1{-}\epsilon, 1{+}\epsilon)\hat{A}_i\Big) - \beta\, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\Bigg]$$
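For reference, the group-relative advantage $\hat{A}_i$ reduces to a within-group standardization of the trajectory rewards (a sketch; the exact normalization is our assumption):

```python
import torch

def group_relative_advantages(rewards):
    """GRPO-style advantages for one prompt's group of G trajectories
    (G = 8 here): standardize rewards within the group."""
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)
```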

GRPO shifts the tool-usage distribution closer to the oracle annotations while also discovering new crop strategies not present in the supervision. The resulting policy matches full-resolution accuracy (80.3 vs. 80.5) while using only 36% of the visual tokens.

Full training hyperparameters and ablation details are provided in the paper.

Citation

If you find AwaRes useful in your research, please consider citing:

@article{shabtay2026look,
  title={Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs},
  author={Shabtay, Nimrod and Kimhi, Moshe and Spector, Artem and Haray, Sivan and Rivlin, Ehud and Baskin, Chaim and Giryes, Raja and Schwartz, Eli},
  journal={arXiv preprint arXiv:2603.16932},
  year={2026}
}