AwaRes processes a low-resolution global view by default. When additional detail is required, it issues a structured tool call to request only specific high-resolution sub-regions, then answers conditioned on both views.
AwaRes learns a coupled-decision policy (CDP) that jointly decides whether additional resolution is needed and where to acquire it, by selecting a subset of crops from a predefined candidate set.
An LLM judge compares low-res vs. high-res outputs against ground truth to determine when cropping is needed.
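One plausible reading of this step is that a sample needs cropping when the judge marks the low-res answer wrong and the high-res answer right. The sketch below encodes that assumed rule; `needs_crop` and its boolean inputs are hypothetical names, and the actual decision rule may differ (for example, in how it treats samples where both answers are wrong).

```python
def needs_crop(low_res_correct: bool, high_res_correct: bool) -> bool:
    """Label a sample as needing high-res crops (assumed rule).

    Cropping is treated as useful only when extra resolution flips the
    judged outcome from wrong to right. If low-res is already correct,
    or high-res does not help, no crop is requested.
    """
    return (not low_res_correct) and high_res_correct


if __name__ == "__main__":
    # Low-res fails, high-res succeeds: extra resolution is needed.
    print(needs_crop(low_res_correct=False, high_res_correct=True))
    # Low-res already succeeds: stay with the cheap global view.
    print(needs_crop(low_res_correct=True, high_res_correct=True))
```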
An oracle grounding model localizes the visual evidence, producing bounding boxes mapped to a discrete crop set via IoU.
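The IoU mapping can be sketched as follows. This is a minimal illustration, not the authors' implementation: the candidate set (four image quadrants), the `threshold` value, and the fallback to the best-overlapping crop are all assumptions, since the source only states that boxes are mapped to a discrete crop set via IoU.

```python
from typing import Dict, List, Tuple

# (x0, y0, x1, y1) in coordinates normalized to [0, 1]
Box = Tuple[float, float, float, float]

# Hypothetical candidate set: the four image quadrants.
CANDIDATES: Dict[str, Box] = {
    "top_left": (0.0, 0.0, 0.5, 0.5),
    "top_right": (0.5, 0.0, 1.0, 0.5),
    "bottom_left": (0.0, 0.5, 0.5, 1.0),
    "bottom_right": (0.5, 0.5, 1.0, 1.0),
}


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def map_box_to_crops(
    box: Box,
    candidates: Dict[str, Box] = CANDIDATES,
    threshold: float = 0.3,  # assumed value, not from the source
) -> List[str]:
    """Select candidate crops whose IoU with the box meets the threshold.

    Falls back to the single best-overlapping crop when none qualifies,
    so every grounded box maps to at least one crop.
    """
    scores = {name: iou(box, crop) for name, crop in candidates.items()}
    selected = [name for name, score in scores.items() if score >= threshold]
    if not selected:
        selected = [max(scores, key=scores.get)]
    return selected
```

With these assumed candidates, a box such as `(0.1, 0.1, 0.4, 0.4)` maps to `top_left`, while a small box inside the lower-right quadrant falls below the threshold and is assigned to `bottom_right` through the fallback.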
Supervised fine-tuning on multi-turn tool-use trajectories teaches the model the tool protocol and yields a reference policy.
RL with a composite reward balancing correctness against crop-cost penalties refines tool usage for efficiency.
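A composite reward of this kind might look like the sketch below. The exact terms and weights are not given here, so the linear form, the per-crop penalty `crop_cost`, and the function name `composite_reward` are assumptions chosen only to illustrate the trade-off between correctness and crop usage.

```python
def composite_reward(
    is_correct: bool,
    num_crops: int,
    crop_cost: float = 0.1,  # assumed per-crop penalty, not from the source
) -> float:
    """Correctness reward minus a penalty that grows with crops requested."""
    if num_crops < 0:
        raise ValueError("num_crops must be non-negative")
    correctness = 1.0 if is_correct else 0.0
    return correctness - crop_cost * num_crops
```

Under this form, a correct answer with no crops scores highest, a correct answer with crops scores slightly lower, and an incorrect answer that still requested crops is penalized twice, which pushes the policy to request high-resolution regions only when they change the outcome.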
The oracle grounding model localizes answer regions, which are mapped to discrete crops for training.
AwaRes nearly matches full-resolution performance on average (80.30 vs. 80.46) while using only 36% of visual tokens, and it achieves the highest average accuracy among the efficient baselines across six benchmarks.
| Model | Chart | Chart RTR | Doc | Doc RTR | OCR | OCR RTR | POPE | POPE RTR | Real | Real RTR | V* | V* RTR | Avg | Avg RTR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 79.80 | 1.00 | 94.00 | 1.00 | 81.10 | 1.00 | 87.87 | 1.00 | 68.80 | 1.00 | 71.20 | 1.00 | 80.46 | 1.00 |
| Qwen2.5-VL-7B-LR | 65.00 | 0.25 | 91.00 | 0.25 | 70.70 | 0.25 | 84.41 | 0.25 | 66.00 | 0.25 | 63.20 | 0.25 | 73.39 | 0.25 |
| Holo-V (70%) | 69.32 | 0.70 | 76.40 | 0.70 | 72.40 | 0.70 | 87.37 | 0.70 | 68.89 | 0.70 | 69.11 | 0.70 | 73.92 | 0.70 |
| SparseVLM (70%) | 75.80 | 0.70 | 87.20 | 0.70 | 79.30 | 0.70 | 85.40 | 0.70 | 68.50 | 0.70 | 54.45 | 0.70 | 75.11 | 0.70 |
| VisionZIP (70%) | 76.72 | 0.70 | 90.75 | 0.70 | 72.70 | 0.70 | 87.86 | 0.70 | 64.84 | 0.70 | 65.97 | 0.70 | 76.47 | 0.70 |
| VisionThink | 79.90 | 1.15 | 90.35 | 0.32 | 80.10 | 0.83 | 86.70 | 0.34 | 66.60 | 0.55 | 71.73 | 0.49 | 79.23 | 0.61 |
| AwaRes (Ours) | 80.64 | 0.32 | 94.43 | 0.28 | 81.30 | 0.42 | 85.73 | 0.27 | 68.50 | 0.43 | 71.20 | 0.42 | 80.30 | 0.36 |
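Table 1: Accuracy and RTR per benchmark. RTR = fraction of visual tokens retained relative to full resolution (lower is better). AwaRes comes within 0.16 points of the full-resolution average accuracy (80.30 vs. 80.46) while retaining 36% of the visual tokens.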
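The Avg columns are consistent with unweighted means over the six benchmarks. As a quick check, the snippet below recomputes the AwaRes averages from the per-benchmark values in the table; the variable names are illustrative only.

```python
# Per-benchmark AwaRes values copied from Table 1.
awares_acc = {"Chart": 80.64, "Doc": 94.43, "OCR": 81.30,
              "POPE": 85.73, "Real": 68.50, "V*": 71.20}
awares_rtr = {"Chart": 0.32, "Doc": 0.28, "OCR": 0.42,
              "POPE": 0.27, "Real": 0.43, "V*": 0.42}

# Unweighted means across the six benchmarks.
avg_acc = sum(awares_acc.values()) / len(awares_acc)
avg_rtr = sum(awares_rtr.values()) / len(awares_rtr)

print(f"{avg_acc:.2f}")  # 80.30
print(f"{avg_rtr:.2f}")  # 0.36
```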
Examples of AwaRes adaptively selecting crops across charts, documents, and natural images. Each shows the question, tool call, low-res input, and retrieved high-res crop.