Despite recent advances in general-purpose robotic manipulation, real-world multi-object clutter remains challenging for today's prevalent approaches. Complexity scales with the number of objects and contacts, harder dynamics, distractors, and task ambiguity. Bridging this gap to real-world deployment requires effective scene abstractions, yet current methods rely heavily on manually engineered representations, an approach that does not scale.
We instead propose to automatically generate scene- and task-specific, adaptive abstractions without manual intervention. VLM-Focus produces a de-cluttered, abstracted scene representation by merging (e.g., stacked objects) or pruning (e.g., distant objects) scene entities in a closed loop, in response to task progress. In our experiments, we show that VLM-Focus improves classical planning, model-based control, and a vision-language-action model across a diverse set of highly cluttered manipulation scenes.
VLM-Focus is a general, task-agnostic framework for constructing task-focused, dynamic scene abstractions using vision-language models and iterative task feedback. Given a natural language task description and a scene description, a VLM (GPT-4.1 mini) is prompted to output: (i) a minimal set of task-relevant objects critical for task completion, and (ii) groups of objects that may be merged into a single composite entity.
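A minimal sketch of this abstraction query, assuming the official OpenAI Python client with a JSON response format; the prompt wording, output schema, and `query_abstraction` helper are illustrative, not the exact prompts used by VLM-Focus.

```python
import json
from openai import OpenAI  # official OpenAI Python client

client = OpenAI()

def query_abstraction(task: str, scene_objects: list[str]) -> dict:
    """Ask the VLM for (i) task-relevant objects and (ii) mergeable groups.

    Illustrative prompt and schema; not the exact prompts used by VLM-Focus.
    """
    prompt = (
        f"Task: {task}\n"
        f"Scene objects: {', '.join(scene_objects)}\n"
        "Return JSON with keys:\n"
        '  "relevant": minimal list of objects critical for task completion,\n'
        '  "merge_groups": lists of objects that move as one rigid unit.'
    )
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# e.g. {"relevant": ["blue_block", "tray"],
#       "merge_groups": [["tray", "red_block"]]}
abstraction = query_abstraction(
    "put the blue block on the tray",
    ["blue_block", "red_block", "tray", "mug", "banana"],
)
```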
Figure 2. System overview of VLM-Focus. Given the full scene and a task description in the prompt, VLM-Focus (blue) prunes and merges objects to produce an abstracted scene, which is passed to the planner or controller (grey). When triggered by failure or timeout, VLM-Focus revises its abstraction. The approach generalizes to TAMP, model-based controllers, and VLAs.
Objects excluded from the task-relevant set are omitted from the planner/controller's world model — eliminating unnecessary decision variables (TAMP), contact constraints (MPC), and visual distractors (VLA).
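Concretely, pruning amounts to filtering the world model before it reaches the downstream system; a sketch, assuming a hypothetical `scene` mapping from object names to their state/geometry:

```python
def prune_scene(scene: dict, relevant: set[str]) -> dict:
    """Drop task-irrelevant objects from the world model.

    Pruned objects contribute no decision variables (TAMP),
    contact pairs (MPC), or pixels (VLA) downstream.
    """
    return {name: obj for name, obj in scene.items() if name in relevant}
```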
Groups of objects that are functionally or dynamically coupled are replaced by a single composite entity whose collision geometry is the union of its members, reducing rigid-body count and contact interactions.
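One way to realize such a composite entity, sketched here with trimesh (an assumption; the source does not specify a geometry library): the members' collision meshes are concatenated so the downstream planner treats the group as a single rigid body.

```python
import trimesh

def merge_group(scene: dict, group: list[str], name: str) -> dict:
    """Replace a group of objects with one composite rigid body.

    The composite's collision geometry is the union of the members'
    meshes, approximated here by concatenation, which is sufficient
    for collision checking. Assumes `scene` values are trimesh meshes.
    """
    composite = trimesh.util.concatenate([scene[m] for m in group])
    scene = {k: v for k, v in scene.items() if k not in group}
    scene[name] = composite
    return scene
```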
When a downstream planner finds the abstracted scene infeasible or a controller fails, that feedback is passed back to the VLM, which iteratively corrects the abstraction by restoring over-pruned objects or refining merges. The loop is triggered by timeouts or error codes from the downstream system, as sketched below.
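Putting the sketches above together, the feedback loop might look like the following; `plan_and_execute` and its `success`/`error` fields are hypothetical stand-ins for the TAMP/MPC/VLA backends and their failure signals.

```python
MAX_ITERS = 3

def run_with_focus(task: str, scene: dict):
    """Closed-loop abstraction: query, prune/merge, plan, re-query on failure."""
    feedback = ""
    for _ in range(MAX_ITERS):
        abstraction = query_abstraction(task + feedback, list(scene))
        # Keep task-relevant objects plus any members of a merge group.
        keep = set(abstraction["relevant"]) | {
            m for g in abstraction["merge_groups"] for m in g
        }
        abstracted = prune_scene(scene, keep)
        for i, group in enumerate(abstraction["merge_groups"]):
            abstracted = merge_group(abstracted, group, f"composite_{i}")
        result = plan_and_execute(task, abstracted)  # hypothetical backend call
        if result.success:
            return result
        # Infeasibility, timeout, or error codes flow into the next query.
        feedback = f"\nPrevious attempt failed: {result.error}. Revise the abstraction."
    return result
```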
We evaluate VLM-Focus across three planning and control paradigms on cluttered tabletop manipulation tasks, comparing against baselines operating on unfiltered scenes.
Fig. 3. Three TAMP environments of increasing complexity: (a) light-clutter tabletop, (b) heavy-clutter tabletop, (c) clutter + stack. The stack is a tray (black) with two goal objects (blue).
Fig. 4. Baseline-TAMP vs. VLM-Focus-TAMP success rates (top) and runtimes (bottom). VLM-Focus achieves higher success rates and lower runtimes on the two more complex tasks. Iterations 1 and 2 are shown stacked.
Fig. 5. Examples of scene pruning for (a) Heavy Clutter and (b) Clutter & Stack. Semi-transparent objects are pruned; the stack is merged into a single composite object.
Fig. 6. Example TAMP scene where a distance-based baseline fails. The nearest objects to the goal (highlighted) do not include a critical blocker farther away that must be removed before the target can be reached. VLM-Focus identifies the blocker as task-relevant.
TAMP — Heavy Clutter
TAMP — Clutter & Stack
Qualitative results showing VLM-Focus pruning and planning in heavy-clutter and clutter+stack environments.
Fig. 7. Execution time vs. object count (log scale). Baseline C3+ degrades sharply — exceeding 1600s at 7 objects. VLM-Focus stays within ~30–125s across all counts.
Fig. 8. Hardware task outcomes for the pruning and merging experiments, broken down by iteration. The spread of successes across iterations highlights the value of iterative re-querying.
Fig. 9. VLM-Focus vs. nearest-k and radius-based geometric pruning baselines (n=30). VLM-Focus captures task-relevant objects regardless of proximity, outperforming heuristics.
Fig. 10. Three iterations of the pruned (top) and merged (bottom) object sets. Pruned objects shown in gray; successive iterations adapt to the evolving environment state. Merged groups (uniform colors) become more focused over iterations.
C3+ — Pruning on Hardware
C3+ — Merging on Hardware
Real robot planar pushing experiments demonstrating pruning and merging strategies across multiple feedback iterations.
Fig. 11. VLA pruning visualization. Grey objects are pruned; colored mask overlays are task-relevant. Goal (green) and target (purple) are selected; a second distractor goal is selected due to prompt ambiguity.
Fig. 12. Task success rates on LIBERO-Spatial (100 episodes, 10 tasks). Clutter drops baseline from 0.85 → 0.50; VLM-Focus recovers to 0.70.
Fig. 13. Real hardware VLA experiment. VLM-Focus applied on hand images (b) without simulator bounding boxes. GroundingDINO + SAM + LaMa segment and inpaint task-irrelevant objects (c).
Table I — Hardware VLA (π₀.₅), n=10 per fruit
| Method | Task | 🍍 | 🍐 | 🍌 | 🍎 |
|---|---|---|---|---|---|
| VLA (full scene) | Pick Up | 0.8 | 0.7 | 0.0 | 0.0 |
| VLA (full scene) | Place | 0.2 | 0.0 | 0.0 | 0.0 |
| + VLM-Focus | Pick Up | 1.0 | 0.7 | 0.7 | 0.7 |
| + VLM-Focus | Place | 0.5 | 0.3 | 0.3 | 0.7 |
VLA results in LIBERO-Spatial simulation. Clutter drops baseline success from 0.85 to 0.50; VLM-Focus recovers to 0.70 by inpainting task-irrelevant objects before querying the policy.
Front Camera
Hand Camera
Hand Camera — Inpainted
Real hardware fruit pick-and-place using the full open-vocabulary pipeline (GroundingDINO + SAM + LaMa). The inpainted hand-camera view (right) shows distractors removed before querying the π₀.₅ policy.
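For reference, one possible wiring of this open-vocabulary pipeline, assuming the reference GroundingDINO inference utilities, the segment-anything predictor, and the simple-lama-inpainting wrapper; checkpoint paths, thresholds, and the `inpaint_distractors` helper are placeholders, not the exact implementation.

```python
import numpy as np
import torch
from PIL import Image
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor
from simple_lama_inpainting import SimpleLama

# Placeholder config and checkpoint paths.
dino = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
lama = SimpleLama()

def inpaint_distractors(image_path: str, distractor_prompt: str) -> Image.Image:
    """Detect task-irrelevant objects, segment them, and inpaint them away."""
    image_np, image_tensor = load_image(image_path)
    boxes, logits, phrases = predict(
        model=dino, image=image_tensor, caption=distractor_prompt,
        box_threshold=0.35, text_threshold=0.25,
    )
    # GroundingDINO returns normalized (cx, cy, w, h); convert to pixel xyxy.
    h, w = image_np.shape[:2]
    boxes = boxes * torch.tensor([w, h, w, h])
    boxes_xyxy = torch.stack([
        boxes[:, 0] - boxes[:, 2] / 2, boxes[:, 1] - boxes[:, 3] / 2,
        boxes[:, 0] + boxes[:, 2] / 2, boxes[:, 1] + boxes[:, 3] / 2,
    ], dim=1)
    # Union of per-box SAM masks covers all detected distractors.
    predictor.set_image(image_np)
    mask_total = np.zeros((h, w), dtype=np.uint8)
    for box in boxes_xyxy.numpy():
        masks, _, _ = predictor.predict(box=box, multimask_output=False)
        mask_total |= masks[0].astype(np.uint8)
    # LaMa fills the masked regions with plausible background.
    return lama(Image.fromarray(image_np), Image.fromarray(mask_total * 255))
```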