Section 4.5: 2D and 3D transformations; transform trees (tf in ROS) | Building Embodied AI: From Perception to Autonomous Action

A Careful Control Loop

Figure 4.5A: A ROS-style transform tree for a mobile manipulator: world-to-odom, odom-to-base, base-to-arm, arm-to-gripper, and camera frames, with tf lookups resolving any pair at a given timestamp.

Big Picture

A transform tree (tf in ROS) turns many local 2D and 3D frame relationships into one queryable, robot-wide spatial memory. A mobile robot may have frames for map, odom, base_link, lidar, camera, wrist, and gripper. The transform tree lets any subsystem ask, "Where is this point in the frame I control?" without every subsystem manually knowing every calibration edge.

This section develops the technical contract for 2D and 3D transforms as a graph problem. First we define a transform tree as a directed acyclic frame graph with one parent per child. Then we show how a lookup composes edges along a path. Finally we connect the math to the ROS tf2 discipline of stamped transforms, buffer windows, and explicit lookup times.

The key question is practical: when a camera detects an obstacle, which chain of transforms converts that obstacle into the planning frame at the time the planner needs it?

Theory

A transform tree stores edges such as $T_{\text{map},\text{odom}}$, $T_{\text{odom},\text{base}}$, and $T_{\text{base},\text{camera}}$, where $T_{a,b}$ maps coordinates expressed in frame $b$ into frame $a$. A lookup from camera to map follows the unique path through the tree and composes the transforms in path order, so that adjacent subscripts match:

$$T_{\text{map},\text{camera}} = T_{\text{map},\text{odom}}T_{\text{odom},\text{base}}T_{\text{base},\text{camera}}.$$

This is why tf2 insists on parent and child frame names. Without the names, a transform matrix is only a 4 by 4 array. With the names and timestamp, it becomes a claim about where one coordinate convention sits relative to another at a specific time.
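The path-following lookup described above can be sketched in a few lines. This is a minimal illustration, not the tf2 implementation: the `tree` dictionary, the `root_from` and `lookup` helpers, and the extra lidar frame are hypothetical names introduced here, and the edges are translation-only to keep the arithmetic easy to check.

```python
import numpy as np

def translate(x, y, z):
    """Return a 4x4 homogeneous transform with translation only."""
    transform = np.eye(4)
    transform[:3, 3] = [x, y, z]
    return transform

# Hypothetical tree: child -> (parent, parent_from_child).
# One parent per child is what makes every path unique.
tree = {
    "odom": ("map", translate(2.0, 0.0, 0.0)),
    "base_link": ("odom", translate(0.5, 1.0, 0.0)),
    "camera": ("base_link", translate(0.2, 0.0, 0.8)),
    "lidar": ("base_link", translate(0.0, 0.0, 0.4)),
}

def root_from(frame):
    """Walk up to the root; return (root_name, root_from_frame)."""
    transform = np.eye(4)
    while frame in tree:
        parent, parent_from_child = tree[frame]
        transform = parent_from_child @ transform
        frame = parent
    return frame, transform

def lookup(target, source):
    """Return target_from_source by meeting at the shared root."""
    target_root, root_from_target = root_from(target)
    source_root, root_from_source = root_from(source)
    if target_root != source_root:
        raise LookupError(f"{target} and {source} are not connected")
    return np.linalg.inv(root_from_target) @ root_from_source

map_from_camera = lookup("map", "camera")
lidar_from_camera = lookup("lidar", "camera")
print(map_from_camera[:3, 3].round(3).tolist())
print(lidar_from_camera[:3, 3].round(3).tolist())
```

The second lookup shows why the tree matters: lidar and camera share no direct edge, yet the query resolves through their common ancestor without either sensor driver knowing the other's calibration.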

2D transforms are the same idea with fewer degrees of freedom. A planar robot often uses $(x, y, \theta)$ and the group SE(2). A flying robot, manipulator, or camera-bearing humanoid needs SE(3), because roll, pitch, yaw, and vertical translation are load-bearing state variables.
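To make the SE(2) and SE(3) relationship concrete, the sketch below builds a planar pose as a 3 by 3 homogeneous matrix and embeds it in a 4 by 4 matrix with yaw-only rotation. The helper names `se2` and `se2_to_se3` and the pose values are assumptions chosen for illustration.

```python
import numpy as np

def se2(x, y, theta):
    """3x3 homogeneous matrix for a planar pose (theta in radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x],
                     [s, c, y],
                     [0.0, 0.0, 1.0]])

def se2_to_se3(planar):
    """Embed a planar transform in SE(3): yaw-only rotation, zero height."""
    spatial = np.eye(4)
    spatial[:2, :2] = planar[:2, :2]
    spatial[:2, 3] = planar[:2, 2]
    return spatial

# Robot at (0.5, 1.0) in odom, rotated 90 degrees so it faces +y.
odom_from_base = se2(0.5, 1.0, np.pi / 2)

# A point 1 m straight ahead of the robot, in homogeneous planar coordinates.
point_base = np.array([1.0, 0.0, 1.0])
point_odom = odom_from_base @ point_base

# The same query in 3D gives the same x and y, with z left at zero.
point_odom_3d = se2_to_se3(odom_from_base) @ np.array([1.0, 0.0, 0.0, 1.0])
print(point_odom[:2].round(3).tolist(), point_odom_3d[:3].round(3).tolist())
```

The planar form carries three numbers and the spatial form six, but composition and point transformation are the same matrix products in both cases.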

Mechanism

A tf buffer is a time-indexed graph. Static edges store calibration, such as base to camera. Dynamic edges store motion estimates, such as map to odom or odom to base. A correct lookup must choose both a path and a time; spatial correctness and temporal correctness are inseparable.
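The time dimension can be sketched with a toy buffer for one dynamic edge. This is an illustrative stand-in, not the tf2 buffer: the `EdgeBuffer` class is hypothetical, it stores translation only, and it interpolates linearly (a real buffer also has to interpolate rotation). The sample values assume a robot driving at 1 m/s along x.

```python
import bisect
import numpy as np

class EdgeBuffer:
    """Hypothetical time-indexed store for one translation-only dynamic edge."""

    def __init__(self):
        self.stamps = []        # seconds, kept sorted
        self.translations = []  # parent_from_child translation at each stamp

    def insert(self, stamp, translation):
        index = bisect.bisect(self.stamps, stamp)
        self.stamps.insert(index, stamp)
        self.translations.insert(index, np.asarray(translation, dtype=float))

    def lookup(self, stamp):
        """Interpolate linearly; refuse to extrapolate outside the window."""
        if not self.stamps or not (self.stamps[0] <= stamp <= self.stamps[-1]):
            raise LookupError(f"time {stamp} is outside the buffered window")
        hi = bisect.bisect_left(self.stamps, stamp)
        if self.stamps[hi] == stamp:
            return self.translations[hi]
        lo = hi - 1
        weight = (stamp - self.stamps[lo]) / (self.stamps[hi] - self.stamps[lo])
        return (1.0 - weight) * self.translations[lo] + weight * self.translations[hi]

odom_from_base = EdgeBuffer()
odom_from_base.insert(10.00, [0.0, 0.0, 0.0])
odom_from_base.insert(10.10, [0.1, 0.0, 0.0])  # 1 m/s along x

image_stamp = 10.05
pose_at_image = odom_from_base.lookup(image_stamp)  # time the image was taken
pose_latest = odom_from_base.lookup(10.10)          # newest available pose
placement_error = float(np.linalg.norm(pose_latest - pose_at_image))
print(round(placement_error, 3))
```

Querying at the image timestamp rather than at the newest pose is the difference between the two lookups; the gap between them grows with robot speed and sensor latency.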

Worked Example

Code Fragment 4.5.1 implements the smallest useful transform-tree lookup. It stores three edges, composes the path from map to camera, and applies the resulting transform to one point reported by the camera.

# Compose a tf-style path from map to camera and transform one point.
# Each edge is named by parent and child frame to prevent silent direction bugs.
# The example omits rotation so the path arithmetic is easy to inspect.
import numpy as np

def translate(x, y, z):
    transform = np.eye(4)
    transform[:3, 3] = [x, y, z]
    return transform

edges = {
    ("map", "odom"): translate(2.0, 0.0, 0.0),
    ("odom", "base_link"): translate(0.5, 1.0, 0.0),
    ("base_link", "camera"): translate(0.2, 0.0, 0.8),
}

path = [("map", "odom"), ("odom", "base_link"), ("base_link", "camera")]
map_from_camera = np.eye(4)
for edge in path:
    map_from_camera = map_from_camera @ edges[edge]

point_camera = np.array([1.0, 0.0, 0.0, 1.0])
point_map = map_from_camera @ point_camera
print(point_map[:3].round(3).tolist())

[3.7, 1.0, 0.8]

Code Fragment 4.5.1 composes the named edges map to odom, odom to base_link, and base_link to camera. The resulting point_map value shows how a camera measurement becomes planner-ready map-frame evidence.

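Expected output: with no rotation in any edge, the point moves by the sum of the three translations, (2.7, 1.0, 0.8), so the camera-frame point (1, 0, 0) lands at (3.7, 1.0, 0.8) in the map frame. If a real tf2 lookup gives a different direction, inspect whether the code requested source-to-target or target-to-source, and whether the lookup time matches the sensor timestamp.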

Library Shortcut

The hand-built fragment keeps frame semantics visible. In production, SciPy Rotation handles rotation representations, ROS 2 tf2 keeps a time-buffered frame tree, spatialmath-python gives compact pose algebra, Drake exposes typed rigid transforms, and OpenCV calibration anchors camera intrinsics and extrinsics. The shortcut removes boilerplate, but the hand-built version remains the debugging oracle.

Failure Modes

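Wrong lookup direction. Requesting tf.lookup("camera", "map") instead of tf.lookup("map", "camera") returns the inverse of the intended transform. In SE(3) the inverse of $(R, t)$ is $(R^\top, -R^\top t)$, which is neither the original transform nor its matrix transpose. In tf2 the call is lookup_transform(target_frame, source_frame, time), so the target frame comes first. One-point sanity checks (does the camera appear in front of the robot?) are faster than reading quaternion signs.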
Timestamp mismatch. tf2 interpolates between buffered transforms. If you look up the camera-to-odom transform at wall time rather than the camera image timestamp, you introduce latency-proportional pose error. For a robot moving at 1 m/s and a 50 ms latency, that is 5 cm of systematic placement error.
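Static transform republished on every tick. Publishing a calibration edge (base to camera) as a dynamic transform sends every downstream subscriber the same value again on each tick, and a lookup can fail with an extrapolation error if that publisher stalls. Use StaticTransformBroadcaster in ROS 2 for edges that never move; a static transform is treated as valid at any lookup time.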
Cycle in the tree. tf does not reject a broadcast at publish time when two nodes each claim to be the other's parent. The error appears far downstream as a failed lookup, an impossible pose, or a buffer timeout, not at the node where the cycle was introduced.
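The lookup-direction failure is easy to reproduce offline. The sketch below is a hedged illustration with made-up values: `make_transform` and `invert` are hypothetical helpers, and the pose (a 90 degree yaw plus an offset) is chosen only so the result is easy to verify by hand.

```python
import numpy as np

def make_transform(yaw, translation):
    """Build a 4x4 transform from a yaw angle (radians) and a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    transform = np.eye(4)
    transform[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    transform[:3, 3] = translation
    return transform

def invert(transform):
    """Closed-form SE(3) inverse: rotation R^T, translation -R^T t."""
    rotation, translation = transform[:3, :3], transform[:3, 3]
    inverse = np.eye(4)
    inverse[:3, :3] = rotation.T
    inverse[:3, 3] = -rotation.T @ translation
    return inverse

map_from_camera = make_transform(np.pi / 2, [2.7, 1.0, 0.8])
camera_from_map = invert(map_from_camera)  # what the reversed lookup returns

# Round trip: composing both directions must give the identity.
round_trip_ok = np.allclose(map_from_camera @ camera_from_map, np.eye(4))

# The plain matrix transpose is not the inverse; its bottom row is not [0, 0, 0, 1].
transpose_is_inverse = np.allclose(map_from_camera.T, camera_from_map)

# One-point sanity check: a point 1 m along the camera's x axis.
point_camera = np.array([1.0, 0.0, 0.0, 1.0])
point_map = map_from_camera @ point_camera
point_wrong = camera_from_map @ point_camera  # result of the reversed lookup
print(round_trip_ok, transpose_is_inverse)
print(point_map[:3].round(3).tolist(), point_wrong[:3].round(3).tolist())
```

Comparing `point_map` with `point_wrong` for a point whose answer is known in advance catches a reversed lookup before it reaches the planner.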

Memory Hook

The tf tree is implicit matrix multiplication made explicit, named, and time-stamped. Every silent frame-direction bug in robot code is really a silent matrix-order bug that the transform tree disciplines away.

Research Frontier

tf trees assume rigid links between frames. Research on deformable robots, soft actuators, and contact-rich manipulation requires probabilistic or deformable frame representations. Factor-graph libraries such as GTSAM attach a noise model to each constraint so that a SLAM back-end can propagate uncertainty through the graph of poses. Neural implicit representations (NeRF-based SLAM) take a different approach: rather than maintaining a frame tree, they embed geometry directly in a continuous function and query poses by optimization. Both directions are active, and neither has displaced tf2 for real-time reactive control as of 2026.

Transform-tree bugs look like weak perception or control. Check parent-child direction, timestamp, static-vs-dynamic classification, and buffer latency before changing the robot policy.

Section References

Foote, T. "tf: The transform library." IEEE Conference on Technologies for Practical Robot Applications (TePRA), 2013.

The design document for the ROS tf system: frame naming, parent-child conventions, time-buffered lookup, and the motivation for separating static from dynamic edges.

Lynch, K. M., and Park, F. C. "Modern Robotics: Mechanics, Planning, and Control." Cambridge University Press, 2017. http://modernrobotics.org

Establishes the screw-theory view of SE(2) and SE(3) composition used throughout this chapter; the transform-tree lookup is Chapter 3 composition in graph form.

ROS 2 tf2 documentation. https://docs.ros.org/en/rolling/Concepts/Intermediate/About-Tf2.html

The authoritative reference for buffer windows, lookup API, static vs. dynamic broadcasters, and tf2 migration from ROS 1.

Exercise 4.5.1

Extend Code Fragment 4.5.1 with a rotation. Give the odom-to-base_link edge a 90° yaw rotation (a rotation about z that maps the x axis onto the y axis and the y axis onto the negative x axis). Compose the full path map to camera and verify: (a) the camera origin in map coordinates, (b) that a unit vector pointing forward in the camera frame maps to the correct direction in the map frame, and (c) that map_from_camera @ camera_from_map = I. Explain which intermediate transform is most likely to be wrong if the robot turns left when commanded to go forward.