Humans train robots to complete tasks in one environment and expect those robots to perform the same tasks in new environments. As humans, we know which aspects of the environment (i.e., the state) are relevant to the task. But there are also things that do not matter, e.g., the color of the table or the presence of clutter in the background. Ideally, the robot's policy learns to ignore these irrelevant state components. Achieving this invariance improves generalization: the robot does not factor irrelevant variables into its control decisions, making the policy more robust to environment changes. In this paper we therefore propose a self-supervised method to learn a mask which, when multiplied by the observed state, transforms that state into a latent representation biased towards relevant elements. Our method --- which we call TransMASK --- can be combined with a variety of imitation learning frameworks (such as diffusion policies) without any additional labels or alterations to the loss function. To achieve this, we recognize that during training the learned policy is updated to better match the human's true policy. This true policy depends only on the relevant parts of the state; hence, as gradients pass back through the learned policy and our proposed mask, they increase the mask weights of the elements that help the robot better imitate the human. We can therefore train TransMASK at the same time as we learn the policy. By normalizing the magnitude of each row of TransMASK, we force the mask to align with the Jacobian of the expert policy: columns that correspond to relevant states have large magnitudes, while columns for irrelevant states approach zero magnitude. We compare our approach to other methods that extract relevant states for downstream imitation learning. Across experiments with visual and non-visual states, we find that TransMASK yields policies that are more robust to distribution shifts in irrelevant features.
A robot learning to pick up a green block. In the imitation learning setting, an expert provides demonstrations by teleoperating the robot. These demonstrations include visual observations of the environment, the robot's proprioceptive state, and the expert actions corresponding to each observation. While the expert's actions are influenced only by certain task-relevant features (e.g., the green block, the robot's position relative to the block), the robot's observations contain information about the entire scene. A policy trained on these demonstrations can therefore be extremely sensitive to environmental perturbations. For instance, when a policy trained on demonstrations over a wooden table is deployed on an identical task over a marble table, the visual observations become out-of-distribution and the policy fails. Our insight is that a robust policy must attend to the same features a human expert would. We therefore design policies with structures that mask out task-irrelevant information.
Typically in imitation learning, high-dimensional image observations are encoded into a feature space using vision encoders such as ResNet. We assume these features are disentangled, i.e., each element of the feature vector corresponds to either task-relevant or task-irrelevant information. TransMASK, our proposed approach for robust imitation learning, learns a mask \(M\) which transforms the input feature state \(s\) into a representation \(z = M s\). This mask is a square matrix with the same dimension as \(s\). Importantly, \(M\) is a sparse, constant matrix, ensuring that the representation remains correlated with the state and that only certain elements of \(s\) are included in \(z\).
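As a concrete sketch of the masking operation, consider a hypothetical 4-dimensional disentangled state in which only the first two elements (the block's position) are task-relevant. The mask values below are illustrative stand-ins for a converged mask, not learned quantities:

```python
import numpy as np

# Hypothetical disentangled state: [block_x, block_y, table_color, clutter].
s = np.array([0.3, -0.5, 0.9, 0.2])

# An illustrative converged sparse mask: rows pass through the two
# task-relevant elements and suppress the two irrelevant ones.
M = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

# The representation z = M s keeps the block position and zeros the rest.
z = M @ s
print(z)  # block position retained; irrelevant components zeroed
```

Because \(M\) is a linear map on \(s\), the representation \(z\) stays interpretable: each coordinate of \(z\) is a weighted combination of state elements, and sparsity in \(M\) directly reads off which elements the policy may attend to.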
We parameterize the mask matrix \(M\) with learnable parameters \(\theta\). We train TransMASK with the standard imitation learning objective \(L(\psi, M) = \sum_{(s, a) \in \mathcal{D}} \frac{1}{2} \| \pi_\psi(M s) - a \|^2\), where the state \(s\) can be privileged information (e.g., the exact locations of objects) or disentangled features obtained from image observations. The policy \(\pi\), parameterized by \(\psi\), is conditioned on the representation \(z = M s\) instead of the raw state. Our insight is that, since the expert's actions are influenced exclusively by task-critical information, the Jacobian of the expert policy has large values in columns corresponding to task-relevant elements of \(s\) and near-zero values in columns corresponding to task-irrelevant elements. As training converges, the robot policy \(\pi\) approaches the expert policy, and its Jacobian approaches the same sparse structure. During optimization, the gradients of \(\pi\) update the parameters \(\theta\) so that each row of \(M\) emphasizes task-relevant elements of \(s\) while suppressing irrelevant ones. The learned mask thus transforms the state into a representation that removes the extraneous information about the environment.
We evaluate on three table-top manipulation tasks: Pick, Stack, and Scoop. For each task, training demonstrations are collected exclusively in a robot scene with a wooden table. We call this scene In-Domain (ID) because observations from this scene lie within the training distribution during evaluation. Next, we evaluate the policy in a different robot scene where the table is covered with a white sheet. We label this scene Out-of-Domain (OOD) because the change in background introduces a significant distributional shift in the image observations. We show that our method, which extracts task-relevant features from the observations while suppressing irrelevant ones, learns a policy that is robust to these environmental disturbances.