Look Where It Matters

High-Resolution Crops Retrieval for Efficient VLMs
Nimrod Shabtay1,2*, Moshe Kimhi1,3*, Artem Spector1, Sivan Haray1, Ehud Rivlin3
Chaim Baskin4, Raja Giryes2, Eli Schwartz1
1IBM Research   2Tel-Aviv University   3Technion   4Ben-Gurion University
*Equal contribution
Achieves Qwen2.5-VL accuracy using only 36% of the visual tokens.
4.4× faster than dynamic methods.

Abstract

Vision-language models (VLMs) typically process images at native high resolution, which forces a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational cost, while low-resolution inputs are cheap to process but can miss critical visual information such as small text.

We present AwaRes, a spatial-on-demand framework that resolves this accuracy–efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward combining semantic answer correctness with explicit crop-cost penalties.

AwaRes provides a practical, deployment-friendly path to high-detail VLM reasoning under tight compute budgets.

Overview

AwaRes processes a low-resolution global view by default. When additional detail is required, it invokes a structured tool-call to request only specific high-resolution sub-regions — then answers conditioned on both views.
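The page does not spell out the tool-call wire format; as a minimal sketch, the crop request could be a structured JSON call over indices into the candidate crop set (the `request_crops` name and schema below are assumptions, not the paper's exact protocol):

```python
import json

def parse_crop_request(tool_call: str):
    """Parse a model-emitted tool call such as
    {"name": "request_crops", "arguments": {"crops": [3, 7]}}
    and return the requested crop indices.
    An empty list means: answer from the low-res view alone."""
    call = json.loads(tool_call)
    if call.get("name") != "request_crops":
        raise ValueError(f"unknown tool: {call.get('name')}")
    return list(call["arguments"]["crops"])

# Example: the model asks for two high-resolution sub-regions.
indices = parse_crop_request(
    '{"name": "request_crops", "arguments": {"crops": [3, 7]}}'
)
```

The runtime would then render crops 3 and 7 at high resolution and append them to the context before the model produces its final answer.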

AwaRes teaser
Figure 1: Left: Given a low-resolution image, AwaRes uses tool-calling to request only the needed crops. Right: Accuracy vs. retained visual tokens — AwaRes achieves Qwen2.5-VL accuracy (80.3%) using only 36% of the visual tokens.

Method

AwaRes learns a coupled-decision policy (CDP) that jointly decides whether additional resolution is needed and where to acquire it, by selecting a subset of crops from a predefined candidate set.
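To make the coupled decision concrete: with a predefined candidate set, "whether" and "where" collapse into a single choice over crop subsets, where the empty subset means answering from the low-res view alone. A sketch (the `max_crops` budget is an assumption):

```python
from itertools import combinations

def action_space(num_candidate_crops, max_crops=2):
    """Enumerate the coupled decision as one choice over crop subsets.
    The empty tuple () means 'answer from the low-res view alone'; any
    non-empty subset names the high-res regions to retrieve.
    max_crops is an assumed per-query budget, not a value from the paper."""
    actions = [()]
    for k in range(1, max_crops + 1):
        actions.extend(combinations(range(num_candidate_crops), k))
    return actions

# With 4 candidate crops and a budget of 2, the policy chooses among
# 1 (no crop) + 4 (single crops) + 6 (pairs) = 11 actions.
```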

Step 1: Resolution Sufficiency Labeling

An LLM judge compares low-res vs. high-res outputs against ground truth to determine when cropping is needed.
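A minimal stand-in for this labeling step, with exact string matching in place of the LLM judge (the three label names are assumptions):

```python
def label_resolution_sufficiency(ans_lowres, ans_highres, ground_truth):
    """Stand-in for the LLM judge: case-insensitive exact matching replaces
    semantic judging. Cropping is labeled as needed exactly when the high-res
    answer is correct but the low-res answer is not."""
    match = lambda a, b: a.strip().lower() == b.strip().lower()
    if match(ans_lowres, ground_truth):
        return "low_res_sufficient"   # no crop needed; keep as a no-tool example
    if match(ans_highres, ground_truth):
        return "needs_crop"           # route to the oracle for crop localization
    return "unusable"                 # neither resolution suffices; drop the sample
```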

Step 2: Crop Target Construction

An oracle grounding model localizes the visual evidence, producing bounding boxes mapped to a discrete crop set via IoU.
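A sketch of the IoU mapping, assuming axis-aligned `(x1, y1, x2, y2)` boxes and an illustrative threshold (the paper's actual threshold and tie-breaking rules are not given on this page):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def map_to_crop_set(evidence_box, candidate_crops, iou_threshold=0.1):
    """Map an oracle evidence box to discrete crop targets: keep every
    candidate crop whose IoU with the evidence exceeds the (assumed)
    threshold; fall back to the best-overlapping crop if none passes."""
    scores = [iou(evidence_box, c) for c in candidate_crops]
    selected = [i for i, s in enumerate(scores) if s > iou_threshold]
    return selected or [max(range(len(scores)), key=scores.__getitem__)]
```

Evidence that straddles a crop boundary simply maps to both overlapping crops, so the trajectory requests each of them.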

Step 3: Cold-Start SFT

Supervised fine-tuning on multi-turn tool-use trajectories teaches the model the tool protocol and yields a reference policy.
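One such trajectory might be rendered as the following multi-turn conversation, using Example 1 from the crop annotations below (the role names, image placeholders, and `request_crops` format are assumptions):

```python
# Hypothetical rendering of one multi-turn tool-use SFT trajectory.
trajectory = [
    {"role": "user",
     "content": ["<low-res image>",
                 "What website is being advertised on the plane?"]},
    {"role": "assistant",                 # model decides more detail is needed
     "content": '{"name": "request_crops", "arguments": {"crops": [5]}}'},
    {"role": "tool",                      # runtime returns the high-res crop
     "content": ["<high-res crop 5>"]},
    {"role": "assistant",                 # final answer, grounded in the crop
     "content": "Jetstar.com"},
]
```

Sufficient-resolution samples yield the same format minus the tool turns, so SFT also teaches when not to call the tool.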

Step 4: Multi-Turn GRPO

RL with a composite reward balancing correctness against crop-cost penalties refines tool usage for efficiency.
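A sketch of the composite reward and the group-relative normalization GRPO applies to it (the linear crop-cost form and `lam` value are illustrative assumptions; `answer_score` is assumed to be a semantic-correctness score in [0, 1]):

```python
def composite_reward(answer_score, num_crops, lam=0.1):
    """Semantic answer correctness minus an explicit per-crop cost.
    The linear form and lam value are assumptions, not the paper's formula."""
    return answer_score - lam * num_crops

def grpo_advantages(rewards):
    """Group-relative advantages: each rollout's reward normalized by the
    mean and std of its sampling group, as in GRPO."""
    mu = sum(rewards) / len(rewards)
    std = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (std + 1e-6) for r in rewards]

# A correct answer using two crops scores below a correct answer using one,
# pushing the policy toward the cheapest crop set that still suffices.
```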

Data Curation Pipeline

Data curation pipeline
Figure 2: Each sample is processed at two resolutions. An LLM judge determines sufficiency; insufficient cases are routed to an oracle for crop localization, producing multi-turn tool-use trajectories.

Crop Annotation Examples

The oracle grounding model localizes answer regions, which are mapped to discrete crops for training.

Crop example - plane
Example 1: "What website is being advertised on the plane?" — The oracle localizes the text region; the selected crop reveals "Jetstar.com".
Crop example - document
Example 2: "What is the Grand Total for Net Block As of 31.3.2012?" — The oracle targets the relevant table cell; the crop enables reading "52708.86".

Main Results

AwaRes matches full-resolution performance while using only 36% of the visual tokens, achieving the best average accuracy among efficient baselines across six benchmarks.

Each cell shows accuracy (RTR).

| Model | Chart | Doc | OCR | POPE | Real | V* | Avg |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 79.80 (1.00) | 94.00 (1.00) | 81.10 (1.00) | 87.87 (1.00) | 68.80 (1.00) | 71.20 (1.00) | 80.46 (1.00) |
| Qwen2.5-VL-7B-LR | 65.00 (0.25) | 91.00 (0.25) | 70.70 (0.25) | 84.41 (0.25) | 66.00 (0.25) | 63.20 (0.25) | 73.39 (0.25) |
| Holo-V (70%) | 69.32 (0.70) | 76.40 (0.70) | 72.40 (0.70) | 87.37 (0.70) | 68.89 (0.70) | 69.11 (0.70) | 73.92 (0.70) |
| SparseVLM (70%) | 75.80 (0.70) | 87.20 (0.70) | 79.30 (0.70) | 85.40 (0.70) | 68.50 (0.70) | 54.45 (0.70) | 75.11 (0.70) |
| VisionZIP (70%) | 76.72 (0.70) | 90.75 (0.70) | 72.70 (0.70) | 87.86 (0.70) | 64.84 (0.70) | 65.97 (0.70) | 76.47 (0.70) |
| VisionThink | 79.90 (1.15) | 90.35 (0.32) | 80.10 (0.83) | 86.70 (0.34) | 66.60 (0.55) | 71.73 (0.49) | 79.23 (0.61) |
| AwaRes (Ours) | 80.64 (0.32) | 94.43 (0.28) | 81.30 (0.42) | 85.73 (0.27) | 68.50 (0.43) | 71.20 (0.42) | 80.30 (0.36) |

Table 1: RTR = fraction of visual tokens retained (lower is better). AwaRes matches full-resolution performance while retaining 36% of the visual tokens.
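As a sanity check, the Avg column follows directly from the per-benchmark cells; for the AwaRes row:

```python
# AwaRes per-benchmark (accuracy, RTR) pairs, copied from Table 1.
awares = {
    "Chart": (80.64, 0.32), "Doc": (94.43, 0.28), "OCR": (81.30, 0.42),
    "POPE": (85.73, 0.27), "Real": (68.50, 0.43), "V*": (71.20, 0.42),
}

avg_acc = sum(acc for acc, _ in awares.values()) / len(awares)
avg_rtr = sum(rtr for _, rtr in awares.values()) / len(awares)

print(f"Avg accuracy: {avg_acc:.2f}, Avg RTR: {avg_rtr:.2f}")  # 80.30, 0.36
```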

Performance vs. Wall Clock Time

Performance vs wall clock time
Figure 3: AwaRes achieves sub-second latency across benchmarks via short tool calls, while VisionThink's reasoning traces increase decoding time (4.3s vs. 0.6s on ChartQA).

From Over-Calling to Looking Where It Matters

Crop selection evolution
Figure 4: Crop selection evolution from oracle ground truth through SFT to GRPO. SFT over-uses the crop tool, with the share of low-res-only decisions dropping from 80.8% to 46.3%; GRPO's explicit tool-use cost corrects this, recovering to 72.2% low-res decisions.

Visual Results

Examples of AwaRes adaptively selecting crops across charts, documents, and natural images. Each shows the question, tool call, low-res input, and retrieved high-res crop.

Citation

If you find AwaRes useful in your research, please consider citing:

@misc{shabtay2026lookmattershighresolutioncrops,
  title={Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs},
  author={Nimrod Shabtay and Moshe Kimhi and Artem Spector and Sivan Haray and Ehud Rivlin and Chaim Baskin and Raja Giryes and Eli Schwartz},
  year={2026},
  eprint={2603.16932},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.16932},
}