Section 33.2: SayCan: affordance-grounded planning | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Figure 33.2A: SayCan grounding. An LLM scores candidate action phrases by language likelihood while a value function scores them by physical affordance, and the product selects the action that is both linguistically plausible and physically feasible.

Read the figure as the SayCan product rule in system form: language likelihood proposes what is useful, affordance likelihood estimates what is possible, and the robot acts only where both scores support an executable skill.

Figure 33.2: A closed-loop map for SayCan: affordance-grounded planning. The diagram forces the reader to name the input, model boundary, action interface, and evidence record before trusting the system.

Build And Evaluation Checklist

Depth and self-containment. Readers should leave with the exact factorization used by SayCan and a clear view of why language plausibility alone is insufficient for robot planning. The section must also clarify where the affordance score comes from and what it assumes.

Production and evaluation contract. The artifact must record candidate skills, language-model probabilities, affordance values, the combined score, and the selected action. Only then can one audit whether the planner failed semantically or physically.

Checklist Memory Anchor

For SayCan: affordance-grounded planning, name the language interface, grounded world state, executable action contract, and evidence artifact before trusting any claimed improvement.

Mini Audit Exercise

For SayCan: affordance-grounded planning, write one evidence row recording instruction, world-state estimate, chosen action, verifier result, and failure label. Then identify which field would change first under command misunderstanding.

Big Picture

SayCan is the canonical pattern for combining language priors with physical affordances. It lets the LLM suggest what sounds right while a grounded value function asks what is actually executable now.

This section explains why affordance grounding is the natural antidote to free-text planning in robotics and why the combination is stronger than either source of evidence alone.

The practical question is how to combine semantic relevance and executability without letting one wash out the other.

Action Is The Test

SayCan works because semantic plausibility and physical feasibility answer different questions. One says what the human probably wants next; the other says what the robot can actually do now.

Theory

SayCan scores each candidate skill $k$ with a language prior and an affordance value: $$k^* = \arg\max_k \; p_\text{LLM}(k \mid x, h_t) \cdot V_k(s_t).$$ Here $x$ is the user instruction, $h_t$ is the history of skills selected so far, and $s_t$ is the current world state. The language term prefers semantically appropriate next steps, while the value term estimates whether the robot can execute that step successfully in the current state.

The multiplication matters. A skill with high semantic probability but near-zero affordance should be rejected, and a highly executable skill with no semantic relevance should not dominate just because it is easy. The method therefore depends on score calibration and on the quality of the candidate skill library.

Mechanism

A good mental model is product-of-experts planning. The LLM narrows the skill search to task-consistent options, and the affordance model removes options that are impossible or low value in the current world state.

Worked Example

Code Fragment 1 implements the core SayCan score on three skills. The example is tiny, but it makes the product structure visible and shows how the selected action can differ from the highest language score alone.

# Combine semantic plausibility with grounded affordance values.
# The best skill is not the one with the largest language score alone.
# Product scoring removes semantically attractive but infeasible actions.
skills = {
    "pick_sponge": {"p_llm": 0.55, "affordance": 0.92},
    "turn_on_sink": {"p_llm": 0.30, "affordance": 0.95},
    "wipe_spill": {"p_llm": 0.80, "affordance": 0.18},
}

combined = {name: round(v["p_llm"] * v["affordance"], 3) for name, v in skills.items()}
print(combined)
print(max(combined, key=combined.get))

{'pick_sponge': 0.506, 'turn_on_sink': 0.285, 'wipe_spill': 0.144}
pick_sponge

The expected output is a ranking in which the chosen skill is not merely the most semantically plausible sentence completion, but the action with the highest joint semantic and affordance score. Here `pick_sponge` wins because it is both relevant and executable in the present scene, while `wipe_spill` is semantically tempting but premature: its affordance of 0.18 pulls its product down to 0.144.

Code Fragment 1: This score composition shows why `wipe_spill` loses even though it has the strongest semantic score. The current world state makes it a poor next action, so the product favors `pick_sponge`, which is both relevant and executable.

Library Shortcut

The same idea can be implemented in a few lines using an LLM API plus an affordance model wrapped behind a typed tool interface. Such tooling removes prompt and schema boilerplate, but it does not remove the need to calibrate the affordance score and curate the candidate skill set.

Practical Recipe

Define a compact skill library whose actions expose clear preconditions and effects.
Generate only semantically plausible skill candidates rather than scoring the entire API surface.
Estimate affordance or value in the current state before execution, not from a stale scene snapshot.
Normalize or calibrate the two scores so one term does not dominate by scale alone.
Inspect failure cases where the right long-horizon plan starts with a low-probability semantic step.
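The normalization step in the recipe above can be sketched as follows. This is a minimal illustration, assuming the language model returns a raw log-likelihood per candidate skill description. The function names and all numeric values are hypothetical and chosen only to make the arithmetic visible.

```python
import math

def normalize_language_scores(log_likelihoods):
    """Softmax over candidate log-likelihoods so the language term sums to 1."""
    peak = max(log_likelihoods.values())  # subtract the max for numerical stability
    exp_scores = {k: math.exp(v - peak) for k, v in log_likelihoods.items()}
    total = sum(exp_scores.values())
    return {k: v / total for k, v in exp_scores.items()}

def select_skill(log_likelihoods, affordances):
    """Multiply the normalized language term by the affordance term and take the argmax."""
    p_llm = normalize_language_scores(log_likelihoods)
    combined = {k: p_llm[k] * affordances[k] for k in p_llm}
    return max(combined, key=combined.get), combined

# Hypothetical inputs: raw log-likelihoods and affordances assumed to lie in [0, 1].
log_likelihoods = {"pick_sponge": -2.0, "turn_on_sink": -3.0, "wipe_spill": -1.5}
affordances = {"pick_sponge": 0.92, "turn_on_sink": 0.95, "wipe_spill": 0.18}

best, combined = select_skill(log_likelihoods, affordances)
print(best)
```

Normalizing over the candidate set puts the language term on a probability scale, so that it is comparable with an affordance value interpreted as a success probability. Whether that interpretation holds for a given value model is a separate calibration question.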

Common Failure Mode

SayCan can fail if the candidate skill set is too narrow, the value functions are poorly calibrated, or the semantic model over-prefers narratively obvious steps that are not optimal for the current embodiment.

Practical Example

In kitchen cleanup, 'wipe the spill' sounds like the right next step, but the robot may first need to pick the sponge or move a blocking bowl. Affordance grounding keeps the planner from issuing impossible or premature skills.
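The kitchen scenario can be sketched as a toy closed loop in which both scores are re-evaluated after every executed skill. Everything here is a hand-coded stand-in: `language_scores` and `affordances` are hypothetical placeholders for a real LLM and a learned value function, the state model assumes every skill succeeds, and the affordance of 1.0 for the `done` skill is an assumption of this sketch.

```python
def language_scores(history):
    """Stand-in for p_LLM(k | x, h_t): hand-coded numbers, not a real model call."""
    scores = {"pick_sponge": 0.55, "wipe_spill": 0.80, "done": 0.05}
    if "pick_sponge" in history:
        scores["pick_sponge"] = 0.05  # already done, so no longer a likely next step
    if "wipe_spill" in history:
        scores["wipe_spill"] = 0.05
        scores["done"] = 0.90  # task looks complete
    return scores

def affordances(state):
    """Stand-in for V_k(s_t): hand-coded success estimates for the current state."""
    can_wipe = state["holding_sponge"] and state["spill_present"]
    return {
        "pick_sponge": 0.05 if state["holding_sponge"] else 0.92,
        "wipe_spill": 0.90 if can_wipe else 0.18,
        "done": 1.0,  # assumption: terminating is always feasible
    }

def apply_skill(skill, state):
    """Toy transition model that assumes the selected skill succeeds."""
    new_state = dict(state)
    if skill == "pick_sponge":
        new_state["holding_sponge"] = True
    elif skill == "wipe_spill" and state["holding_sponge"]:
        new_state["spill_present"] = False
    return new_state

state = {"holding_sponge": False, "spill_present": True}
history = []
for _ in range(5):  # hard step limit so the loop always terminates
    p = language_scores(history)
    v = affordances(state)
    skill = max(p, key=lambda k: p[k] * v[k])
    if skill == "done":
        break
    history.append(skill)
    state = apply_skill(skill, state)

print(history)  # → ['pick_sponge', 'wipe_spill']
```

In the first step `wipe_spill` has the highest language score but loses on the product. Once the sponge is held, its affordance rises and it is selected.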

Memory Hook

SayCan is the polite adult in the room. It lets the language model dream big, then asks whether the robot can actually reach the sponge before promising heroics.

Research Frontier

Recent work extends the SayCan idea with better search, richer world models, and longer-horizon credit assignment, sometimes adding heuristic planners or learned payoff estimates on top of the original language-times-affordance product.

Self Check

If a skill has the highest language score but the lowest affordance, can you say where it should still appear in the diagnostic trace, and why it should not be selected for execution?

The scientific subtlety in SayCan is calibration. The product formula is simple, but only meaningful if the two terms are roughly comparable in their interpretation. A miscalibrated value model can dominate the semantic term and reduce the planner to greedily choosing whichever skill is easiest right now.
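One way to probe calibration is a reliability table that compares predicted affordance against observed success on logged executions. The sketch below assumes a log of `(predicted_affordance, succeeded)` pairs. The records are invented for illustration, and the function name and bin count are arbitrary choices rather than part of SayCan.

```python
def reliability_table(records, n_bins=5):
    """Group logged predictions into equal-width bins and compare them with outcomes.

    Returns rows of (bin_index, count, mean_predicted, empirical_success_rate).
    """
    bins = [[] for _ in range(n_bins)]
    for predicted, succeeded in records:
        idx = min(int(predicted * n_bins), n_bins - 1)  # clamp predicted == 1.0
        bins[idx].append((predicted, succeeded))

    rows = []
    for idx, items in enumerate(bins):
        if not items:
            continue  # skip empty bins
        mean_predicted = sum(p for p, _ in items) / len(items)
        success_rate = sum(1 for _, ok in items if ok) / len(items)
        rows.append((idx, len(items), mean_predicted, success_rate))
    return rows

# Hypothetical execution log: (predicted affordance, did the skill succeed?)
records = [
    (0.90, True), (0.95, False), (0.92, False), (0.88, True),
    (0.25, False), (0.15, False), (0.50, True), (0.55, False),
]

for idx, count, mean_predicted, success_rate in reliability_table(records):
    print(f"bin {idx}: n={count} predicted={mean_predicted:.2f} observed={success_rate:.2f}")
```

In this invented log the top bin predicts roughly 0.9 but succeeds only half the time. An overconfident value model of that kind would let easy-looking skills win the product too often.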

The method also inherits the classic option-discovery problem from hierarchical RL. It can only select among skills it already knows. If the correct subtask is missing from the library, no amount of language fluency will recover it, which is why skill design and affordance learning remain central.

Tool Choices Around SayCan-style Planning

Tool or Library	Role in the Topic	Builder Advice
LLM API with structured outputs	Candidate skill proposal.	Use it when the skill library is large enough that language can prune it meaningfully.
RL or success-value model	Affordance estimate for each skill.	Use it when executability depends on the current scene and embodiment.
BehaviorTree.CPP	Execution shell for chosen skills.	Use it when each skill needs explicit retry and failure handling.
ROS 2 actions	Typed skill invocation.	Use actions when each selected skill is long running and needs feedback.
EmbodiedBench or task-specific simulator	Construct-matched evaluation.	Use a matched benchmark when comparing SayCan-style planners against simpler baselines.

Code Fragment 2 stores the separate scores as an audit artifact rather than only the winning skill. This is the minimum needed to understand whether the semantic prior or the affordance estimator caused a bad decision.

Log the candidate skills and both scores for each decision point.
Keep the value-estimation state snapshot or seed so scores can be reproduced.
Store the chosen skill and the first rejected alternative for debugging.
Measure how often the affordance term changes the top language choice.
Benchmark with the same skill library and same execution stack when comparing alternatives.
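A record covering the logging items above might look like the following sketch. The field names, the `make_record` helper, and the snapshot identifier are hypothetical. The scores reuse the three-skill example from earlier in the section, and the helper assumes at least two candidates.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class DecisionRecord:
    instruction: str
    state_snapshot_id: str          # lets the affordance values be recomputed later
    candidates: dict                # skill -> {"p_llm", "affordance", "combined"}
    chosen: str
    first_rejected: str
    top_language_choice: str
    affordance_changed_choice: bool

def make_record(instruction, state_snapshot_id, skills):
    """Build one audit record for a single decision point (needs >= 2 candidates)."""
    candidates = {
        name: {**scores, "combined": round(scores["p_llm"] * scores["affordance"], 3)}
        for name, scores in skills.items()
    }
    ranked = sorted(candidates, key=lambda n: candidates[n]["combined"], reverse=True)
    top_language = max(candidates, key=lambda n: candidates[n]["p_llm"])
    return DecisionRecord(
        instruction=instruction,
        state_snapshot_id=state_snapshot_id,
        candidates=candidates,
        chosen=ranked[0],
        first_rejected=ranked[1],
        top_language_choice=top_language,
        affordance_changed_choice=(ranked[0] != top_language),
    )

skills = {
    "pick_sponge": {"p_llm": 0.55, "affordance": 0.92},
    "turn_on_sink": {"p_llm": 0.30, "affordance": 0.95},
    "wipe_spill": {"p_llm": 0.80, "affordance": 0.18},
}
record = make_record("clean up the spill", "snapshot-0001", skills)
print(json.dumps(asdict(record), indent=2))
```

Averaging `affordance_changed_choice` over many decision points gives the rate at which the affordance term overrides the top language choice.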

The expected output is an audit tuple that preserves both factors of the SayCan product. If a future run selected the wrong skill with a high semantic score but a weak affordance score, the repair target would be calibration or candidate generation rather than the planner prompt alone.

Code Fragment 2: This audit record keeps the two terms of the SayCan product separate, which is essential for failure analysis. If the planner chose poorly, this trace tells you whether the semantic prior, the affordance model, or their calibration was at fault.

If the chosen skill is poor, first check candidate generation, then affordance calibration, then library coverage. SayCan errors often come from what is missing from the candidate set, not only from how the final score is computed.

Key Takeaway

SayCan succeeds by treating language and affordance as complementary experts rather than competing controllers.

Exercise 33.2.1

Construct a three-skill example where the top semantic choice is not executable and the top affordance choice is semantically irrelevant. Show how the product rule resolves the conflict and when it might still fail.

Bibliography and Further Reading

Primary Sources and Tools

Ahn et al. (2022). "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." arXiv.

This is the primary SayCan source and the definitive reference for the language-times-affordance factorization.

Hazra et al. (2023). "SayCanPay: Heuristic Planning with Large Language Models using Learnable Domain Knowledge." arXiv.

SayCanPay is a useful follow-on showing how the original idea can be extended with heuristic planning and payoff estimates.

MoveIt 2 Documentation.

MoveIt is relevant because many SayCan-style systems still hand off chosen subgoals to classical geometric planning stacks.
