VLM-Focus: Task-Relevant Scene Reduction for Planning and Control in Clutter

Aileen Liao1 ·  Rachel Holladay1 ·  Dinesh Jayaraman1 ·  Michael Posa1
1University of Pennsylvania
Paper (arXiv coming soon) · Video · Code (coming soon) · BibTeX (coming soon)
VLM-Focus teaser

Figure 1. Scene complexity is reduced via pruning and merging. (Left) Control task: move green "G" to a target. VLM-Focus identifies "S", "T", "R" (gray) can be ignored; "A", "E", "I" (pink) are relevant but mergeable; "P" (dark blue) and "D" (light blue) must be modeled individually. (Top right) TAMP: robot must pick three green cubes — VLM-Focus prunes the vast majority of distractors. (Bottom right) VLA: distractor bowls are pruned before querying the policy.

Abstract

Despite recent advances in general-purpose robotic manipulation, real-world multi-object clutter remains challenging for today's prevalent approaches. Complexity grows with the number of objects and collisions, harder dynamics, distractors, and task ambiguity. Bridging this gap to real-world deployment requires effective scene abstractions, yet current methods rely heavily on manually engineered representations, an approach that does not scale.

We instead propose to automatically generate scene-specific, task-specific, adaptive abstractions, without manual intervention. VLM-Focus produces a de-cluttered abstracted scene representation by merging (e.g., stacked objects) or pruning (e.g., distant objects) scene entities in a closed loop in response to task progress. In our experiments, we show that VLM-Focus improves classical planning, model-based control, and a vision-language-action model across a diverse set of highly cluttered manipulation scenes.

Video

Method

VLM-Focus is a general, task-agnostic framework for constructing task-focused, dynamic scene abstractions using vision-language models and iterative task feedback. Given a natural language task description and a scene description, a VLM (GPT-4.1 mini) is prompted to output: (i) a minimal set of task-relevant objects critical for task completion, and (ii) groups of objects that may be merged into a single composite entity.
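As a concrete sketch of these two outputs, the VLM reply can be parsed into a task-relevant set and merge groups. The JSON schema, key names, and mocked reply below are our illustration under stated assumptions, not the paper's exact prompt or interface:

```python
import json

# Hypothetical prompt -- illustrates the two requested outputs,
# not the paper's exact wording.
PROMPT = (
    "Task: {task}\n"
    "Scene objects: {objects}\n"
    'Reply with JSON: {{"relevant": [...], "merge_groups": [[...], ...]}}'
)

def parse_abstraction(vlm_reply, scene_objects):
    """Turn the VLM's JSON reply into (relevant, merge_groups, pruned)."""
    data = json.loads(vlm_reply)
    # Drop any hallucinated names that are not actually in the scene.
    relevant = [o for o in data["relevant"] if o in scene_objects]
    groups = [[o for o in g if o in relevant] for g in data["merge_groups"]]
    pruned = [o for o in scene_objects if o not in relevant]
    return relevant, groups, pruned

# Mocked reply for the Fig. 1 control task (letter blocks on a table).
scene = ["G", "S", "T", "R", "A", "E", "I", "P", "D"]
reply = '{"relevant": ["G", "A", "E", "I", "P", "D"], "merge_groups": [["A", "E", "I"]]}'
relevant, groups, pruned = parse_abstraction(reply, scene)
print(pruned)   # ['S', 'T', 'R'] -- ignored, as in Fig. 1
print(groups)   # [['A', 'E', 'I']] -- merged into one composite
```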

System overview of VLM-Focus

Figure 2. System overview of VLM-Focus. Given the full scene and a task description in the prompt, VLM-Focus (blue) prunes and merges objects to produce an abstracted scene, which is passed to the planner or controller (grey). When triggered by a failure or timeout, VLM-Focus revises its scene abstraction. The approach generalizes to TAMP, model-based controllers, and VLAs.


✂️  Pruning

Objects excluded from the task-relevant set are omitted from the planner/controller's world model — eliminating unnecessary decision variables (TAMP), contact constraints (MPC), and visual distractors (VLA).
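Because candidate object-object contact pairs grow quadratically with object count, pruning pays off directly in the controller's model size. A minimal sketch (object names and the dictionary world-model layout are our assumptions):

```python
from itertools import combinations

def prune(world, relevant):
    """Keep only task-relevant objects in the world model."""
    return {name: geom for name, geom in world.items() if name in relevant}

def contact_pairs(world):
    """All candidate object-object contact pairs a controller would model."""
    return list(combinations(sorted(world), 2))

# Toy scene: 7 objects, only 3 of which matter for the task.
world = {f"obj_{i}": {"shape": "box"} for i in range(7)}
relevant = {"obj_0", "obj_1", "obj_2"}

print(len(contact_pairs(world)))                    # 21 pairs before pruning
print(len(contact_pairs(prune(world, relevant))))   # 3 pairs after
```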

🔗  Merging

Groups of objects that are functionally or dynamically coupled are replaced by a single composite entity whose collision geometry is the union of its members, reducing rigid-body count and contact interactions.
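A minimal sketch of merging, representing each body's collision geometry as a list of primitive shapes (the data layout and body names are our assumptions, not the paper's implementation):

```python
def merge(bodies, groups):
    """Replace each group with one composite body whose collision geometry
    is the union (here: concatenation) of its members' primitives."""
    out = dict(bodies)
    for i, group in enumerate(groups):
        union = []
        for name in group:
            union.extend(out.pop(name))
        out[f"composite_{i}"] = union
    return out

# Toy stack: a tray plus two goal objects, each with one collision primitive.
bodies = {"tray": ["box_tray"], "cube_a": ["box_a"],
          "cube_b": ["box_b"], "mug": ["cyl_mug"]}
merged = merge(bodies, [["tray", "cube_a", "cube_b"]])
print(len(merged))             # 2 rigid bodies instead of 4
print(merged["composite_0"])   # ['box_tray', 'box_a', 'box_b']
```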

🔄  Closed-Loop Re-prompting

When a downstream planner finds the abstracted scene infeasible or a controller fails, that feedback is passed back to the VLM — which iteratively corrects the abstraction, restoring over-pruned objects or refining merges. This loop is triggered by timeouts or error codes from downstream systems.
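The re-prompting loop can be sketched as follows. The stub VLM, stub planner, failure message, and retry budget are placeholders of ours; in the system, the feedback comes from real planner/controller error codes or timeouts:

```python
def focus_loop(query_vlm, plan, max_iters=3):
    """Re-query the VLM with downstream feedback until planning succeeds."""
    feedback = None
    for it in range(1, max_iters + 1):
        abstraction = query_vlm(feedback)   # feedback is None on first call
        ok, info = plan(abstraction)
        if ok:
            return abstraction, it
        feedback = info                     # e.g. "infeasible: restore blocker"
    return None, max_iters

# Stubs standing in for the VLM and planner. The first abstraction
# over-prunes a blocking object; the feedback restores it.
def query_vlm(feedback):
    keep = {"goal", "target"}
    if feedback:                            # restore what the planner asked for
        keep.add(feedback.split()[-1])
    return keep

def plan(abstraction):
    if "blocker" not in abstraction:
        return False, "infeasible: restore blocker"
    return True, None

abstraction, iters = focus_loop(query_vlm, plan)
print(iters)                 # 2 -- succeeded on the second query
print(sorted(abstraction))   # ['blocker', 'goal', 'target']
```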



Experiments & Results

We evaluate VLM-Focus across three planning and control paradigms on cluttered tabletop manipulation tasks, comparing against baselines operating on unfiltered scenes.

TAMP environments

Fig. 3. Three TAMP environments of increasing complexity: (a) light-clutter tabletop, (b) heavy-clutter tabletop, (c) clutter + stack. The stack is a tray (black) with two goal objects (blue).

TAMP results

Fig. 4. Baseline-TAMP vs. VLM-Focus-TAMP success rates (top) and runtimes (bottom). VLM-Focus achieves higher success and lower runtimes on the two more complex tasks. Iterations 1 and 2 are shown stacked.

TAMP pruning examples

Fig. 5. Examples of scene pruning for (a) Heavy Clutter and (b) Clutter & Stack. Semi-transparent objects are pruned; the stack is merged into a single composite object.

TAMP anecdote — distance-based baseline failure

Fig. 6. Example TAMP scene where a distance-based baseline fails. The set of nearest objects to the goal (highlighted) fails to include a critical blocking object further away, which must be removed to enable a successful approach to the target. VLM-Focus identifies the blocker as task-relevant.

TAMP — Heavy Clutter

TAMP — Clutter & Stack

Qualitative results showing VLM-Focus pruning and planning in heavy-clutter and clutter+stack environments.

C3 Scaling

Fig. 7. Execution time vs. object count (log scale). Baseline C3+ degrades sharply — exceeding 1600s at 7 objects. VLM-Focus stays within ~30–125s across all counts.

C3 Hardware

Fig. 8. Hardware task outcomes for pruning and merging experiments, broken down by iteration. Success distributed across iterations highlights the value of iterative re-querying.

C3 Baselines

Fig. 9. VLM-Focus vs. nearest-k and radius-based geometric pruning baselines (n=30). VLM-Focus captures task-relevant objects regardless of proximity, outperforming heuristics.

C3 Pruning scene graph

C3 Merging scene graph

Fig. 10. Three iterations of the pruned (top) and merged (bottom) object sets. Pruned objects shown in gray; successive iterations adapt to the evolving environment state. Merged groups (uniform colors) become more focused over iterations.

C3+ — Pruning on Hardware

C3+ — Merging on Hardware

Real robot planar pushing experiments demonstrating pruning and merging strategies across multiple feedback iterations.

VLA Pruning

Fig. 11. VLA pruning visualization. Grey objects are pruned; colored mask overlays are task-relevant. Goal (green) and target (purple) are selected; a second distractor goal is selected due to prompt ambiguity.

VLA Success Rates

Fig. 12. Task success rates on LIBERO-Spatial (100 episodes, 10 tasks). Added clutter drops the baseline from 0.85 to 0.50; VLM-Focus recovers performance to 0.70 by inpainting task-irrelevant objects before querying the policy.

VLA Hardware

Fig. 13. Real hardware VLA experiment. VLM-Focus applied on hand images (b) without simulator bounding boxes. GroundingDINO + SAM + LaMa segment and inpaint task-irrelevant objects (c).

Table I — Hardware VLA (π₀.₅), n=10 per fruit

Method             Task      🍍    🍐    🍌    🍎
VLA (full scene)   Pick Up   0.8   0.7   0.0   0.0
VLA (full scene)   Place     0.2   0.0   0.0   0.0
+ VLM-Focus        Pick Up   1.0   0.7   0.7   0.7
+ VLM-Focus        Place     0.5   0.3   0.3   0.7


Front Camera

Hand Camera

Hand Camera — Inpainted

Real hardware fruit pick-and-place using the full open-vocabulary pipeline (GroundingDINO + SAM + LaMa). The inpainted hand-camera view (right) shows distractors removed before querying the π₀.₅ policy.