Look Where It Matters

High-Resolution Crops Retrieval for Efficient VLMs
Nimrod Shabtay1,2*, Moshe Kimhi1,3*, Artem Spector1, Sivan Haray1, Ehud Rivlin3
Chaim Baskin4, Raja Giryes2, Eli Schwartz1
1IBM Research   2Tel-Aviv University   3Technion   4Ben-Gurion University
*Equal contribution
Achieves Qwen2.5-VL accuracy using only 36% of the visual tokens.
4.4× faster than dynamic methods.

Abstract

Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational cost, while low-resolution inputs are efficient but may miss critical visual information such as small text.

We present AwaRes, a spatial-on-demand framework that resolves this accuracy–efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward combining semantic answer correctness with explicit crop-cost penalties.

AwaRes provides a practical, deployment-friendly path to high-detail VLM reasoning under tight compute budgets.

Overview

AwaRes processes a low-resolution global view by default. When additional detail is required, it invokes a structured tool-call to request only specific high-resolution sub-regions — then answers conditioned on both views.
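As a concrete illustration, a structured crop-request might look like the following (a minimal sketch; the tool name, field names, and message format are our assumptions, not the paper's exact protocol):

```python
# Hypothetical tool-call payload emitted by the model after viewing the
# low-resolution image. "retrieve_crops" and "crop_ids" are illustrative
# names; AwaRes's actual schema may differ.
tool_call = {
    "name": "retrieve_crops",
    "arguments": {"crop_ids": [3, 7]},  # indices into the predefined crop set
}
# The runtime decodes the request, returns the corresponding high-resolution
# crops as additional image inputs, and the model answers in the next turn.
```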

AwaRes teaser
Figure 1: Left: Given a low-resolution image, AwaRes uses tool-calling to request only the needed crops. Right: Accuracy vs. retained visual tokens — AwaRes achieves Qwen2.5-VL accuracy (80.3%) using only 36% of the visual tokens.

Main Results

AwaRes matches full-resolution performance while using only 36% of visual tokens, outperforming all efficient baselines across six benchmarks.

| Model | Chart | Doc | OCR | POPE | Real | V* | Avg |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 79.80 (1.00) | 94.00 (1.00) | 81.10 (1.00) | 87.87 (1.00) | 68.80 (1.00) | 71.20 (1.00) | 80.46 (1.00) |
| Qwen2.5-VL-7B-LR | 65.00 (0.25) | 91.00 (0.25) | 70.70 (0.25) | 84.41 (0.25) | 66.00 (0.25) | 63.20 (0.25) | 73.39 (0.25) |
| Holo-V (70%) | 69.32 (0.70) | 76.40 (0.70) | 72.40 (0.70) | 87.37 (0.70) | 68.89 (0.70) | 69.11 (0.70) | 73.92 (0.70) |
| SparseVLM (70%) | 75.80 (0.70) | 87.20 (0.70) | 79.30 (0.70) | 85.40 (0.70) | 68.50 (0.70) | 54.45 (0.70) | 75.11 (0.70) |
| VisionZIP (70%) | 76.72 (0.70) | 90.75 (0.70) | 72.70 (0.70) | 87.86 (0.70) | 64.84 (0.70) | 65.97 (0.70) | 76.47 (0.70) |
| VisionThink | 79.90 (1.15) | 90.35 (0.32) | 80.10 (0.83) | 86.70 (0.34) | 66.60 (0.55) | 71.73 (0.49) | 79.23 (0.61) |
| AwaRes (Ours) | 80.64 (0.32) | 94.43 (0.28) | 81.30 (0.42) | 85.73 (0.27) | 68.50 (0.43) | 71.20 (0.42) | 80.30 (0.36) |

Table 1: Accuracy per benchmark, with RTR in parentheses. RTR = fraction of visual tokens retained (lower is better). AwaRes matches the full-resolution model while retaining only 36% of the visual tokens.

Performance vs. Wall Clock Time

Performance vs wall clock time
Figure 3: AwaRes achieves sub-second latency across benchmarks via short tool calls, while VisionThink's reasoning traces increase decoding time (4.3s vs. 0.6s on ChartQA).

Qualitative Results

Examples of AwaRes adaptively selecting crops across charts, documents, and natural images. Each shows the question, tool call, low-res input, and retrieved high-res crop.

Method

AwaRes learns a coupled-decision policy (CDP) that jointly decides whether additional resolution is needed and where to acquire it, by selecting a subset of crops from a predefined candidate set.
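To make the candidate set concrete, one simple construction is a regular tiling of the full-resolution image (a sketch under our own assumptions; the paper defines the actual crop set):

```python
def make_candidate_crops(width: int, height: int, grid: int = 4):
    """Illustrative candidate crop set: a grid x grid tiling of the image.

    Returns (x0, y0, x1, y1) pixel boxes. Crop size, overlap, and count
    are assumptions for illustration, not AwaRes's actual configuration.
    """
    cell_w, cell_h = width / grid, height / grid
    return [
        (col * cell_w, row * cell_h, (col + 1) * cell_w, (row + 1) * cell_h)
        for row in range(grid)
        for col in range(grid)
    ]
```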

Step 1: Resolution Sufficiency Labeling

An LLM judge compares low-res vs. high-res outputs against ground truth to determine when cropping is needed.
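A minimal sketch of this labeling step, assuming a binary-verdict judge (function names are hypothetical):

```python
def label_resolution_sufficiency(judge, question, answer_lr, answer_hr, gt):
    """Compare low-res vs. high-res answers against ground truth.

    `judge(question, answer, gt)` is any LLM wrapper returning True when
    the answer matches the ground truth. The handling of samples where
    neither answer is correct is our assumption.
    """
    if judge(question, answer_lr, gt):
        return "LR"   # low resolution suffices: direct-answer trajectory
    if judge(question, answer_hr, gt):
        return "HR"   # cropping needed: route to the oracle grounding model
    return None       # neither resolution yields a correct answer; skip sample
```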

Step 2: Crop Target Construction

An oracle grounding model localizes the visual evidence, producing bounding boxes mapped to a discrete crop set via IoU.
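The IoU mapping itself is straightforward; a sketch follows (the IoU threshold is an assumed value):

```python
def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def map_box_to_crops(evidence_box, candidate_crops, thresh=0.1):
    """Map an oracle bounding box to the discrete crop set: keep every
    candidate whose IoU with the evidence box clears the threshold."""
    return [i for i, crop in enumerate(candidate_crops)
            if iou(evidence_box, crop) > thresh]
```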

Step 3: Cold-Start SFT

Supervised fine-tuning on multi-turn tool-use trajectories teaches the model the tool protocol and yields a reference policy.

Step 4: Multi-Turn GRPO

RL with a composite reward balancing correctness against crop-cost penalties refines tool usage for efficiency.

Data Curation Pipeline

Data curation pipeline
Figure 2: Each sample is processed at two resolutions. An LLM judge determines sufficiency; insufficient cases are routed to an oracle for crop localization, producing multi-turn tool-use trajectories.

Crop Annotation Examples

The oracle grounding model localizes answer regions, which are mapped to discrete crops for training.

Crop example - plane
Example 1: "What website is being advertised on the plane?" — The oracle localizes the text region; the selected crop reveals "Jetstar.com".
Crop example - document
Example 2: "What is the Grand Total for Net Block As of 31.3.2012?" — The oracle targets the relevant table cell; the crop enables reading "52708.86".

Training Pipeline

AwaRes is trained in two stages: supervised cold-start (SFT) to teach the tool-calling protocol, followed by multi-turn GRPO to optimize the accuracy–efficiency trade-off.

Stage 1: Cold-Start SFT

We fine-tune on a mixture of direct-answer trajectories (single turn, low-res only) and tool-call-then-answer trajectories (two turns with crop retrieval). This teaches the coupled-decision policy, jointly deciding whether to escalate and where to crop, and yields a reference policy $\pi_{\text{ref}}$ for the RL stage.

We minimize a weighted negative log-likelihood over assistant tokens:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} w_t \log \pi_\theta(y_t \mid h_t)$$

where $h_t$ is the dialogue history at step $t$ and $w_t$ is a per-token weight. We set $w_t{=}5$ for the tool-call turn (1 otherwise) to stabilize learning of the first-turn decision; this single change boosts accuracy from 77.9 to 79.7 and yields 100% tool-call formatting validity. We use trajectory-level SFT, optimizing the full two-turn interaction as a single sample. After SFT, the model is frozen as $\pi_{\text{ref}}$.
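A sketch of the weighted objective in PyTorch (the tensor layout and the per-token normalization are our assumptions; the paper's equation is a plain weighted sum):

```python
import torch
import torch.nn.functional as F

def weighted_sft_loss(logits, labels, turn_ids, tool_call_turn, w_tool=5.0):
    """Weighted NLL over assistant tokens.

    logits: (T, V) next-token predictions; labels: (T,) targets with -100
    on non-assistant positions; turn_ids: (T,) turn index of each token.
    Tokens in the tool-call turn receive weight w_tool = 5, others 1.
    """
    mask = labels != -100
    nll = F.cross_entropy(logits[mask], labels[mask], reduction="none")
    weights = torch.ones_like(nll)
    weights[turn_ids[mask] == tool_call_turn] = w_tool
    return (weights * nll).sum() / mask.sum()  # mean over assistant tokens
```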

Stage 2: Multi-Turn GRPO

The SFT model has learned the tool-calling protocol but still uses it inefficiently, often requesting crops when the low-resolution view already suffices. We apply GRPO on full multi-turn interactions to improve tool usage and overall efficiency, using a composite reward:

$$R(\tau) = R_{\text{ans}}(\hat{a}, a^\star) \;-\; C_{\text{tool}}(C, y)$$

$R_{\text{ans}}$ is the cosine similarity between sentence-transformer embeddings of the predicted and ground-truth answers. The tool cost is asymmetric: missing a necessary crop (the sufficiency label $y$ indicates high resolution is needed, yet the requested crop set $C$ is empty) is penalized more heavily than making an unnecessary request:

$$C_{\text{tool}}(C, y) = \begin{cases} \alpha_{\text{miss}} & \text{if } y{=}\texttt{HR} \text{ and } C{=}\emptyset \\[4pt] \alpha_{\text{use}} + \lambda |C| & \text{if } C{\neq}\emptyset \\[4pt] 0 & \text{if } y{=}\texttt{LR} \text{ and } C{=}\emptyset \end{cases}$$
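Putting the two terms together, a minimal sketch of the trajectory reward (the embedder choice and all cost coefficients are illustrative assumptions, not the paper's values):

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder

def composite_reward(pred_answer, gold_answer, crops, label,
                     alpha_miss=0.5, alpha_use=0.1, lam=0.05):
    """R(tau) = R_ans - C_tool. `crops` is the set of requested crop ids;
    `label` is the sufficiency label ("HR" or "LR"). Coefficient values
    here are placeholders for illustration."""
    emb = embedder.encode([pred_answer, gold_answer], convert_to_tensor=True)
    r_ans = util.cos_sim(emb[0], emb[1]).item()
    if crops:                # any request pays a base cost plus a per-crop cost
        c_tool = alpha_use + lam * len(crops)
    elif label == "HR":      # needed high resolution but never asked
        c_tool = alpha_miss
    else:                    # correctly answered from the low-res view alone
        c_tool = 0.0
    return r_ans - c_tool
```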

For each prompt, we sample $G{=}8$ trajectories and compute group-relative advantages. The policy is optimized with a PPO-style clipped objective regularized toward $\pi_{\text{ref}}$:

$$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}_{x} \Bigg[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{|\tau_i|}\sum_{t=1}^{|\tau_i|} \min\!\Big(r_t^{(i)}\hat{A}_i,\; \text{clip}(r_t^{(i)}, 1{-}\epsilon, 1{+}\epsilon)\hat{A}_i\Big) - \beta\, D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\Bigg]$$
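For reference, the group-relative advantage $\hat{A}_i$ reduces to a within-group standardization of the trajectory rewards (a sketch; the exact normalization is our assumption):

```python
import torch

def group_relative_advantages(rewards):
    """GRPO-style advantages for one prompt's group of G trajectories
    (G = 8 here): standardize rewards within the group."""
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)
```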

GRPO shifts the tool-usage distribution closer to the oracle annotations while also discovering new crop strategies not present in the supervision. The resulting policy matches full-resolution accuracy (80.3 vs. 80.5) while using only 36% of the visual tokens.

Full training hyperparameters and ablation details are provided in the paper.

Citation

If you find AwaRes useful in your research, please consider citing:

@article{shabtay2026look,
  title={Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs},
  author={Shabtay, Nimrod and Kimhi, Moshe and Spector, Artem and Haray, Sivan and Rivlin, Ehud and Baskin, Chaim and Giryes, Raja and Schwartz, Eli},
  journal={arXiv preprint arXiv:2603.16932},
  year={2026}
}