ResAdapt — Adaptive Resolution for Efficient Multimodal Reasoning

Abstract

Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. Existing efficiency strategies only partially resolve this tension: model-side token compression discards fine-grained evidence after encoding and can disrupt optimized inference kernels, whereas output-side agentic reasoning adds iterative latency and can still miss decisive cues when the initial view is too coarse.

We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an Input-side adaptation framework that learns how much visual budget each frame should receive before encoding.

ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy–cost learning signal. We further introduce a temporal-similarity regularizer that suppresses redundant high-budget allocation on adjacent similar frames, encouraging differentiated, content-aware allocation in a single forward pass.

Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency–accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16× more frames at the same visual budget while delivering over 15% performance gain. The learned policy exhibits open-loop active perception, concentrating visual budget on information-dense content without modifying the backbone architecture.

Paradigm

ResAdapt overview

Input-side adaptation vs. token reduction after encoding

Comparison of token-reduction paradigms. We introduce input-side adaptation, which dynamically allocates variable image resolutions or video frame quantities before visual encoding. This mitigates initial feature explosion, preserves fine-grained information for the encoder, keeps the backbone’s native token interface, and runs in a single non-iterative pass.

Figure 1 Comparison of efficiency paradigms; ResAdapt allocates budget on the input side.

Method

ResAdapt architecture

Allocator + frozen MLLM, trained with CAPO

Overview of the ResAdapt framework. (a) A lightweight vision Allocator pairs with a frozen MLLM; the Allocator assigns token budgets per frame from task context, which determine spatial resolutions for the dynamic visual encoder. (b) End-to-end training uses Cost-Aware Policy Optimization (CAPO) with reinforcement learning to balance accuracy and cost.

Figure 2 Allocator–backbone interface and CAPO training loop.

Experiments

Main results

Main metrics (auto carousel), latency, operators, reward / CAPO ablations, and temporal analysis — each carousel highlights the current panel below its title.

Main results. Video QA, temporal grounding, and image reasoning under controlled visual budget. The line below matches the panel currently shown (auto-advances every few seconds; pauses on hover).

Video QA — efficiency–accuracy under budget.

Video QA results — (a) Video QA — efficiency–accuracy under budget

Temporal grounding results — (b) Temporal grounding

Image reasoning results — (c) Image reasoning

1 / 3

Figure 3 Main results across video QA, temporal grounding, and image reasoning (full tables and splits in the paper).

Runtime overhead. Inference cost relative to baselines.

Figure 4 Runtime / overhead analysis.

Operator generalization. Cross-operator robustness of the learned policy.

Figure 5 Generalization across visual operators.

Reward design and CAPO ablations. Training / validation curves for policy adaptivity, mean predicted scale, and reward variants; see the paper for definitions.

Per-sample adaptivity and scale range — training vs. validation (CAPO vs. direct cost / no cost).

Policy adaptivity and resolution bands — Policy adaptivity and resolution-band behaviour under CAPO

Reward schedule evaluation variant — Mean predicted scale (train / val)

Reward scaling ablation — Validation: mean predicted scale (ablations)

1 / 3

Figure 6 Reward-training and testing ablations (CAPO); refer to the paper for full experimental protocol.

Temporal regularization and allocation behaviour. Similarity loss (L_sim), dataset-level allocation on VideoMME, and emergent active perception.

L_sim ablation — temporal regularization complements CAPO (scale vs. frames).

Temporal regularization L_sim ablation — Temporal regularization complements CAPO (L_sim ablation)

Global allocation statistics VideoMME — Global allocation statistics on VideoMME

1 / 3

Figure 7 Temporal similarity regularization, dataset-level allocation statistics, and qualitative active perception; details and definitions in the paper.

Appendix

Case studies (VideoMME / Video-MMMU)

Qualitative grids from the paper appendix (32 sampled frames; warmer borders = higher scale). Prompts match Appendix §Qualitative Case Studies.

Qualitative case studies. Each slide shows the task prompt (Q:) and the predicted per-frame scale layout. The carousel below matches the style of the main-result panels: auto-advance, pause on hover, pause when the tab is hidden.

Case 1 — Video-MMMU Comprehension