A Careful Control Loop
When reading the figure for this section (grounding language in perception: referring expressions), treat it as an interface check: identify the language input, grounding evidence, action representation, safety gate, and logged result before accepting the agent behavior described below.
Build And Evaluation Checklist
This section explains how words like 'the mug beside the kettle' become probabilities over visible entities. It covers both the matching objective and the failure cases created by occlusion, symmetry, and stale state estimates.
Production and evaluation contract. The evaluation artifact is not a generic retrieval score. It should log the scene, the referring expression, the candidate set, the chosen referent, and whether the downstream action succeeded with that choice.
Before trusting any claimed improvement in referring-expression grounding, name the language interface, the grounded world state, the executable action contract, and the evidence artifact.
As an exercise, write one evidence row recording the instruction, world-state estimate, chosen action, verifier result, and failure label. Then identify which field would change first if the command were misunderstood.
Grounding language in perception is the bridge between words and scene variables. Referring expressions are useful only when the robot can resolve them into one object or region under the current viewpoint and state uncertainty.
This section shows how embodied agents resolve names, attributes, and relations such as color, position, containment, and ownership into specific entities visible to the robot.
The practical question is how to combine visual evidence with relational language when several objects share the same category or attribute.
Referring expressions are not labels; they are filters over a candidate set. The right target emerges only after the system scores attributes and relations jointly.
Theory
Given objects $z_1, \ldots, z_n$ extracted from the observation $o_t$ at time $t$ and a referring expression $x$, the grounding problem is $$\hat z = \arg\max_{z_i} \; p(z_i \mid x, o_t), \qquad p(z_i \mid x, o_t) \propto p(x \mid z_i, o_t)\, p(z_i \mid o_t).$$ The perceptual prior $p(z_i \mid o_t)$ says what objects are present; the language likelihood $p(x \mid z_i, o_t)$ says which of those objects best matches the expression.
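To make the factorization concrete, the sketch below multiplies an assumed language likelihood by an assumed perceptual prior and normalizes over the candidate set. The object ids and every number are illustrative rather than measured, and using detector confidence as a stand-in for $p(z_i \mid o_t)$ is a simplifying assumption.

```python
def posterior(likelihood, prior):
    """Return normalized p(z_i | x, o_t) from p(x | z_i, o_t) and p(z_i | o_t)."""
    unnormalized = {oid: likelihood[oid] * prior[oid] for oid in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no candidate has nonzero support")
    return {oid: value / total for oid, value in unnormalized.items()}


# Assumption: detector confidence stands in for the perceptual prior.
prior = {"obj_1": 0.9, "obj_2": 0.8, "obj_3": 0.95}
# Assumption: hand-set likelihoods of the expression given each object.
likelihood = {"obj_1": 0.6, "obj_2": 0.3, "obj_3": 0.05}

post = posterior(likelihood, prior)
best = max(post, key=post.get)
print(best, {oid: round(p, 3) for oid, p in post.items()})
```

Keeping the full posterior, rather than only `best`, is what lets a later stage compare the top two candidates instead of trusting a single point estimate.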
Relational language makes the problem harder because the referent depends on other objects. In 'the mug beside the kettle,' the target score depends on both the mug's own features and the probability that a nearby kettle exists and is correctly localized. Grounding therefore inherits every weakness of the detector and every ambiguity of the language model.
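One way to expose that dependence is to discount a soft spatial relation by the anchor's detection confidence. The sketch below is a minimal illustration: the Gaussian-shaped `beside` score, the 0.3 m scale, and the choice to keep only the best anchor hypothesis are assumptions made for this example, not a standard formulation.

```python
import math


def beside(target_xy, anchor_xy, scale=0.3):
    """Soft 'beside' score in (0, 1] that decays with centroid distance in metres."""
    distance = math.dist(target_xy, anchor_xy)
    return math.exp(-((distance / scale) ** 2))


def relation_score(target_xy, anchors):
    """Best anchor hypothesis, discounted by the detector's confidence in that anchor."""
    return max((conf * beside(target_xy, xy) for xy, conf in anchors), default=0.0)


near_mug, far_mug = (0.1, 0.0), (1.0, 0.0)
confident_kettle = [((0.0, 0.0), 0.9)]  # (position, detection confidence)
doubtful_kettle = [((0.0, 0.0), 0.3)]

print(relation_score(near_mug, confident_kettle))  # near a well-detected kettle
print(relation_score(far_mug, confident_kettle))   # far from the kettle
print(relation_score(near_mug, doubtful_kettle))   # near a poorly detected kettle
print(relation_score(near_mug, []))                # no kettle detected at all
```

The same mug in the same place receives a weaker relation score when the kettle detection is doubtful, which is the inheritance of detector weakness described above.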
A robust system scores three kinds of evidence: unary attributes such as color or category, binary relations such as left-of or inside, and dialogue context such as the last mentioned object. The winner is the object whose combined evidence stays strongest after these factors are multiplied (as probabilities) or summed (as log-scores or other additive scores).
Worked Example
Code Fragment 1 scores three candidate objects against category, color, and spatial relation, as if resolving a phrase such as 'the red mug beside the kettle'. It is deliberately tiny, but it exposes the same reasoning pattern used by larger grounding models and dialogue systems.
# Resolve a referring expression using attribute and relation evidence.
# Each candidate receives unary scores and a relation score to the kettle.
# The selected object is the one with the highest combined grounding score.
candidates = [
    {"id": "obj_1", "label": "mug", "color": "red", "near_kettle": True},
    {"id": "obj_2", "label": "mug", "color": "blue", "near_kettle": True},
    {"id": "obj_3", "label": "bowl", "color": "red", "near_kettle": False},
]

scores = {}
for obj in candidates:
    unary = 0.7 if obj["label"] == "mug" else 0.1
    color = 0.4 if obj["color"] == "red" else 0.0
    relation = 0.5 if obj["near_kettle"] else -0.2
    scores[obj["id"]] = round(unary + color + relation, 2)

print(scores)
print(max(scores, key=scores.get))
The expected output is the score table {'obj_1': 1.6, 'obj_2': 1.2, 'obj_3': 0.3}, followed by the winning referent id obj_1. Both mugs satisfy the kettle relation, so obj_1 beats obj_2 only through the 0.4 color score; the category and relation terms are what push the red bowl obj_3 to the bottom. If obj_2 won instead, the likely failure is missing color evidence rather than missing category or relational evidence.
In practice, Grounding DINO, OWL-ViT, SAM 2, and open-vocabulary VLMs can provide the candidate boxes or masks in a few lines of code. Those tools replace manual proposal generation and feature extraction, but the system still needs explicit relation reasoning and a downstream verifier.
Practical Recipe
- Detect or segment a candidate set before asking the language model to choose among them.
- Score attributes and relations separately so you can diagnose which signal failed.
- Keep the candidate set visible to the planner instead of passing only the winning object id.
- When the top two candidates are close, trigger a clarification question or an active view change.
- Re-ground after any action that changes visibility, object pose, or scene layout.
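Scoring attributes and relations separately pays off when you ask which signal actually separated the winner from the runner-up. The sketch below keeps per-factor scores for three hypothetical candidates and reports the factor with the largest gap; the factor names and values are illustrative assumptions.

```python
# Per-factor scores for a phrase like "the red mug beside the kettle".
# All ids and numbers are illustrative assumptions.
factors = {
    "obj_1": {"category": 0.7, "color": 0.4, "relation": 0.5},
    "obj_2": {"category": 0.7, "color": 0.0, "relation": 0.5},
    "obj_3": {"category": 0.1, "color": 0.4, "relation": -0.2},
}


def rank(factors):
    """Return object ids sorted by summed factor score, best first."""
    totals = {oid: sum(parts.values()) for oid, parts in factors.items()}
    return sorted(totals, key=totals.get, reverse=True)


def deciding_factor(factors, winner, runner_up):
    """Return the factor with the largest winner-minus-runner-up gap, plus all gaps."""
    gaps = {name: factors[winner][name] - factors[runner_up][name]
            for name in factors[winner]}
    return max(gaps, key=gaps.get), gaps


order = rank(factors)
winner, runner_up = order[0], order[1]
name, gaps = deciding_factor(factors, winner, runner_up)
print(winner, runner_up, name)  # obj_1 obj_2 color
```

If the deciding factor is one that a viewpoint change could erase, such as a relation to an anchor near the edge of the frame, that is a signal to re-ground before acting.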
A common benchmark shortcut is to evaluate referring-expression accuracy on still images while downstream execution uses a moving camera and partial views. That mismatch makes grounding look solved even when the live system loses the referent after one arm motion.
A service robot hearing 'hand me the notebook under the lamp' must localize both the notebook and the lamp, reason about the spatial relation between them, and preserve that relation after viewpoint changes. If the lamp leaves the frame, the system needs either memory or a new view, not blind confidence.
Referring expressions are what happen when humans assume everyone in the room is already looking at the same scene. Robots are polite enough to pretend they are, right up until they pick the wrong mug.
Recent grounding work couples open-vocabulary detectors with segmentation, 3D scene memory, and dialogue. The active frontier is deciding when the agent should move the camera, query the user, or use past context to disambiguate the referent.
Can your system explain which attribute or relation eliminated the runner-up candidate, and what it would do if that evidence disappeared after a viewpoint change?
Referring expressions are a clean example of embodied partial observability. The agent may know the words but not the full scene, or it may see the scene but lack one relational anchor. Good systems therefore maintain uncertainty over the referent rather than forcing a premature point estimate.
This is also why closed-loop evaluation matters. A target selection model can have strong top-1 accuracy and still be a poor embodied component if it never signals uncertainty and therefore never triggers clarification or camera motion. The value of grounding lies in the final action outcome, not only in the static matching score.
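The gap between static accuracy and embodied usefulness can be measured directly once each episode logs both the chosen referent and the action outcome. The rows below are made-up illustrative data, and the field names (`gold`, `action_ok`) are assumptions for this sketch.

```python
# Illustrative, made-up episode rows; "gold" is an annotated referent id.
episodes = [
    {"expression": "the mug beside the kettle", "candidates": ["obj_1", "obj_2"],
     "chosen": "obj_1", "gold": "obj_1", "action_ok": True},
    {"expression": "the notebook under the lamp", "candidates": ["obj_3", "obj_4"],
     "chosen": "obj_3", "gold": "obj_3", "action_ok": False},  # referent lost mid-motion
    {"expression": "the red bowl", "candidates": ["obj_5", "obj_6"],
     "chosen": "obj_6", "gold": "obj_5", "action_ok": False},
    {"expression": "the box under the table", "candidates": ["obj_7", "obj_8"],
     "chosen": "obj_7", "gold": "obj_7", "action_ok": True},
]

n = len(episodes)
# Static matching score: did the model pick the annotated referent?
top1 = sum(e["chosen"] == e["gold"] for e in episodes) / n
# Closed-loop score: did the downstream action succeed?
task_success = sum(e["action_ok"] for e in episodes) / n
# Episodes grounded correctly but failed anyway.
lost_after_grounding = sum(
    e["chosen"] == e["gold"] and not e["action_ok"] for e in episodes
) / n

print(top1, task_success, lost_after_grounding)  # 0.75 0.5 0.25
```

The third number is the one a still-image benchmark cannot see: groundings that were correct at selection time but did not survive execution.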
| Tool or Library | Role in the Topic | Builder Advice |
|---|---|---|
| Grounding DINO | Open-vocabulary region proposals from text prompts. | Use it when object categories are not fixed ahead of time. |
| SAM 2 | Mask refinement and object persistence across frames. | Use it when manipulation requires accurate support surfaces or object boundaries. |
| OWL-ViT | Zero-shot text-conditioned detection. | Use it when you need fast category queries without training a custom detector. |
| RTAB-Map or a semantic map | Persistent world memory for entities and relations. | Use it when the referent may leave the current camera frame. |
| TEACh or ALFRED | Embodied datasets where referents matter for action. | Use them when static phrase grounding metrics are too weak for the downstream task. |
Code Fragment 2 records the grounding result as an auditable object. The important fields are the chosen referent and the score margin over the runner-up, because that margin should drive clarification or active sensing.
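A minimal sketch of such a record follows. It is an illustration under stated assumptions, not necessarily the fragment the text refers to: the `GroundingRecord` field names, the example scores, and the 0.15 margin threshold are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class GroundingRecord:
    expression: str
    frame_timestamp: float  # ties the grounding to a camera frame (seconds)
    scores: dict            # full candidate set with combined scores
    winner: str
    runner_up: str
    margin: float
    decision: str           # "execute" or "clarify"


def make_record(expression, scores, frame_timestamp, margin_threshold=0.15):
    """Build an auditable record; the threshold value is an assumption."""
    if len(scores) < 2:
        raise ValueError("need at least two candidates to compute a margin")
    ranked = sorted(scores, key=scores.get, reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    margin = round(scores[winner] - scores[runner_up], 2)
    decision = "execute" if margin >= margin_threshold else "clarify"
    return GroundingRecord(expression, frame_timestamp, dict(scores),
                           winner, runner_up, margin, decision)


# Two mugs both sit near the kettle, so their scores are nearly tied.
record = make_record(
    "the mug beside the kettle",
    {"obj_1": 1.25, "obj_2": 1.2, "obj_3": 0.3},
    frame_timestamp=12.4,
)
print(json.dumps(asdict(record), indent=2))
```

Storing the whole score dictionary, not just the winner, keeps the candidate set available to the planner and to anyone auditing the decision later.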
- Store the candidate list, the winning id, and the runner-up gap in one record.
- Tie the winner to the current camera frame or map timestamp so stale groundings are detectable.
- Route low-margin groundings to clarification or view-planning instead of execution.
- Log whether the downstream action preserved the intended relation after grasp or motion.
- Benchmark on scenes with distractors, occlusion, and viewpoint change, not only on clean still images.
The expected output is a compact grounding artifact with a winner, a runner-up, and a margin small enough to trigger clarify rather than execute. In a live system this exact trace is what separates a calibrated referential agent from an overconfident one: the same top prediction is preserved, but the action gate changes because the uncertainty is still operationally significant.
When grounding fails, separate detector failures, relation failures, stale-memory failures, and confidence-calibration failures. Different fixes apply to each: better proposals, explicit relation modeling, view planning, or threshold tuning.
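The four-way split can be turned into a small triage function over logged diagnostics. The sketch below is one possible ordering of checks, offered as an assumption: the argument names, the 2-second staleness limit, and the 0.15 margin threshold are hypothetical, while the label-to-fix mapping restates the text.

```python
# Fixes named in the text, keyed by failure label.
FIXES = {
    "detector": "better proposals",
    "relation": "explicit relation modeling",
    "stale_memory": "view planning",
    "calibration": "threshold tuning",
}


def failure_label(gold_in_candidates, grounding_age_s, relation_correct,
                  margin, max_age_s=2.0, margin_threshold=0.15):
    """Label one failed grounding episode; the check order is an assumption."""
    if not gold_in_candidates:
        return "detector"        # the true referent was never proposed
    if grounding_age_s > max_age_s:
        return "stale_memory"    # grounding predates the current scene state
    if not relation_correct:
        return "relation"        # anchor or spatial relation was misjudged
    if margin >= margin_threshold:
        return "calibration"     # wrong, yet confident enough to execute
    return "unclassified"


label = failure_label(gold_in_candidates=True, grounding_age_s=0.4,
                      relation_correct=True, margin=0.6)
print(label, "->", FIXES[label])  # calibration -> threshold tuning
```

Checking the detector first reflects the idea that no amount of relation reasoning helps when the referent is missing from the candidate set; a different ordering may suit systems where stale memory dominates.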
Embodied grounding succeeds when words, scene evidence, and uncertainty are represented in the same decision loop.
Design a grounding record for the phrase 'the box under the table near the door.' List the unary and relational scores you would log, and say which ambiguity should trigger an active view change.
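Liu et al. (2023). "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection." arXiv:2303.05499.
Grounding DINO is a widely used reference for text-conditioned region proposals that can serve embodied grounding pipelines.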
Meta AI (2024). "SAM 2: Segment Anything in Images and Videos."
SAM 2 is useful when object masks and persistence matter more than coarse boxes, especially for manipulation.
Padmakumar et al. (2022). "TEACh: Task-driven Embodied Agents that Chat." AAAI.
TEACh is a strong reference for grounding in the presence of dialogue and hidden world state.