Towards Balanced Behavior Cloning from Imbalanced Datasets

Sagar Parekh      Heramb Nemlekar      Dylan Losey
Collab, Virginia Tech
Under Review

Abstract

Robots should be able to learn complex behaviors from human demonstrations. In practice, these human-provided datasets are inevitably imbalanced: i.e., the human demonstrates some subtasks more frequently than others. State-of-the-art methods default to treating each element of the human's dataset as equally important. So if, for instance, the majority of the human's data focuses on reaching a goal, and only a few state-action pairs demonstrate avoiding an obstacle, the learning algorithm will place greater emphasis on goal reaching. More generally, misalignment between the relative amounts of data and the importance of that data causes fundamental problems for imitation learning approaches. In this paper we analyze and develop learning methods that automatically account for mixed datasets. We formally prove that imbalanced data leads to imbalanced policies when each state-action pair is weighted equally; these policies emulate the most represented behaviors, and not the human's complex, multi-task demonstrations. We next explore algorithms that rebalance offline datasets (i.e., reweight the importance of different state-action pairs) without human oversight. Reweighting the dataset can enhance the overall policy performance. However, there is no free lunch: each method for autonomously rebalancing brings its own pros and cons. We formulate these advantages and disadvantages, helping other researchers identify when each type of approach is most appropriate. We conclude by introducing a novel meta-gradient rebalancing algorithm that addresses the primary limitations of existing approaches. Our experiments show that dataset rebalancing leads to better downstream learning, improving the performance of general imitation learning algorithms without requiring additional data collection.

Effect of Data Imbalance on Learned Policy


Our proposition is that the policy learned from an imbalanced dataset is biased towards the behaviors that are more frequently represented in the dataset. This comes at the cost of the policy ignoring the underrepresented behaviors. We formulate the behaviors as sub-policies \(\pi_i(a\mid s_i)\) of the human, i.e., humans use different sub-policies to demonstrate different behaviors. For example, when demonstrating how to organize a table, humans use sub-policy \(\pi_1(a\mid s_1)\) to demonstrate picking up an object and sub-policy \(\pi_2(a\mid s_2)\) to demonstrate opening a drawer. Assuming that these sub-policies are Gaussian, when we learn a policy \(\pi_{\theta}\) on an imbalanced dataset, the optimal value of the parameters is \(\theta^* = \sum_{i=1}^{k} \rho_i \cdot \theta_i\), where \(\rho_i\) is the joint probability density of states and actions from sub-policy \(\pi_i\). For a detailed proof, please refer to our paper.
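As a rough illustration of this result (a minimal NumPy sketch, not the code from the paper), consider fitting a single fixed-variance Gaussian policy to an imbalanced mixture of two Gaussian sub-policies, dropping the state dependence for simplicity; the means and mixture fractions below are made up for the example. The maximum-likelihood estimate lands at the density-weighted average \(\sum_i \rho_i \theta_i\), close to the dominant behavior and far from the underrepresented one.

import numpy as np

rng = np.random.default_rng(0)

theta_1, theta_2 = 1.0, -3.0   # means of the two sub-policies (hypothetical values)
rho_1, rho_2 = 0.9, 0.1        # fraction of the dataset drawn from each sub-policy
n = 10_000

# Imbalanced dataset: 90% of the actions come from sub-policy 1.
actions = np.concatenate([
    rng.normal(theta_1, 0.1, int(rho_1 * n)),
    rng.normal(theta_2, 0.1, int(rho_2 * n)),
])

# Maximum-likelihood fit of a single Gaussian policy (equal weight per sample).
theta_hat = actions.mean()

print(f"learned theta:        {theta_hat:.3f}")
print(f"rho-weighted mixture: {rho_1 * theta_1 + rho_2 * theta_2:.3f}")  # 0.600

# The learned policy sits near the dominant behavior (theta_1 = 1.0)
# and far from the underrepresented one (theta_2 = -3.0).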

What is Important for Balancing Datasets?


One straightforward way to balance a dataset is to equally weight each behavior. This means we can upsample the underrepresented behaviors until the joint probability density of states and actions for each of the \(k\) sub-policies becomes equal to \(\frac{1}{k}\). This method is inspired by the common approach to class imbalance in classification tasks. However, in imitation learning it can be limiting because it does not take into account the relative difficulty of different behaviors. For example, consider a dataset containing two subtasks of varying difficulty: throwing a ball at a moving target (hard) and dropping the ball into a large stationary bin (easy). Typically, the robot requires more data and training iterations to learn the first subtask than the second. Because of this, when we train the robot by giving equal weight to both subtasks, the robot may still learn to drop the ball more accurately than to throw it. Instead, we want to dynamically change the weights for each behavior based on its relative difficulty. Specifically, we increase the weights for sub-policies that have a higher expected loss by maximizing the following objective over the weights \(\alpha\): \(\mathcal{L}_{eq-l} = \sum_{i=1}^{k} \alpha_{i} \left( \mathbb{E}_{(s,a) \sim\mathcal{D}_{i}} D_{KL} (\pi_{i} || \pi_{\theta}) - \mathcal{L}_{i}^{ref} \right)\). Here, \(\mathcal{L}_{i}^{ref}\) is the target loss for each sub-policy and represents its desired accuracy.
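The snippet below is a minimal PyTorch sketch of one update under this objective, not the authors' implementation: the KL term is replaced by a mean-squared-error surrogate, and policy, datasets (a list of per-behavior (states, actions) tensors), alphas, and the reference targets refs are hypothetical placeholders. The policy parameters descend on the weighted objective while the weights \(\alpha\) ascend with a simple gradient step, so behaviors whose loss still exceeds their reference receive more emphasis.

import torch

def rebalanced_step(policy, datasets, alphas, refs, opt_theta, lr_alpha=0.1):
    """One min-max step: descend the policy on the weighted loss, ascend alpha.

    datasets: list of (states, actions) tensors, one per behavior.
    alphas, refs: plain (non-learnable) 1-D tensors of length k.
    """
    losses = []
    for states, actions in datasets:
        pred = policy(states)                          # predicted actions
        losses.append(((pred - actions) ** 2).mean())  # MSE surrogate for the KL term
    losses = torch.stack(losses)

    # Weighted objective: sum_i alpha_i * (L_i - L_i^ref)
    objective = (alphas * (losses - refs)).sum()

    # Minimize over the policy parameters theta.
    opt_theta.zero_grad()
    objective.backward()
    opt_theta.step()

    # Maximize over alpha: raise the weight of behaviors whose loss still
    # exceeds their reference, then keep the weights positive and normalized.
    with torch.no_grad():
        alphas += lr_alpha * (losses.detach() - refs)
        alphas.clamp_(min=1e-6)
        alphas /= alphas.sum()
    return losses.detach()

How the reference targets refs are obtained is the subject of the next section.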

How Do We Calculate the Target Loss \(\mathcal{L}^{ref}\)?


We propose a meta-gradient approach for estimating the lowest achievable training loss for each behavior and setting it as that behavior's reference. We start by defining the expected loss for behavior \(\pi_{i}\): \(\mathcal{L}_{i} = \mathbb{E}_{(s,a)\sim\mathcal{D}_{i}} D_{KL}(\pi_{i}||\pi_{\theta_{\alpha}})\). To determine the minimum value of this loss, we must find weights \(\alpha^{*}\) that make the best use of the entire dataset for learning \(\pi_{i}\). To learn these weights and compute the reference loss for each behavior, we first update the policy parameters using the loss \(\mathcal{L}_{\alpha}(\theta) = \sum_{i=1}^{k} \alpha_{i} \left(\mathbb{E}_{(s,a)\sim\mathcal{D}_{i}} D_{KL}(\pi_{i}||\pi_{\theta_{\alpha}})\right)\) for one (or a few) gradient step(s): \(\theta_{new} = \theta - \beta_{1} \nabla_{\theta}\mathcal{L}_{\alpha}(\theta)\). Next, we update the weights \(\alpha\) using \(\alpha_{new} = \alpha - \beta_{2} \nabla_{\alpha}\mathcal{L}_{i}(\theta_{new})\). Once the weights converge to \(\alpha_{new} = \alpha^{*}\), we compute the corresponding loss \(\mathcal{L}_{i}^{min}\) and set it as the reference \(\mathcal{L}_{i}^{ref}\) for that behavior. The reference losses \(\mathcal{L}_{1}^{min}, \ldots, \mathcal{L}_{k}^{min}\) computed by our approach represent the best achievable training performance for each behavior, avoiding the pitfalls of overestimating or underestimating the targets.
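To make the bilevel update concrete, here is a minimal PyTorch sketch under simplifying assumptions (again, not the code released with the paper): the policy is linear with parameter tensor theta, the KL term is replaced by a mean-squared-error surrogate, and datasets is a hypothetical list of per-behavior (states, actions) tensors. The inner step performs one differentiable update of \(\theta\) on the \(\alpha\)-weighted loss; the outer step backpropagates the target behavior's loss through that update to adjust \(\alpha\).

import torch

def estimate_reference_loss(theta, datasets, target_idx,
                            beta1=0.1, beta2=0.01, meta_steps=200):
    """Estimate L_i^min for behavior `target_idx` by tuning the weights alpha."""
    theta = theta.detach().requires_grad_(True)   # current policy parameters
    alphas = torch.full((len(datasets),), 1.0 / len(datasets), requires_grad=True)

    def behavior_losses(params):
        # MSE surrogate for the KL term, one loss per behavior dataset.
        return torch.stack([((s @ params - a) ** 2).mean() for s, a in datasets])

    for _ in range(meta_steps):
        # Inner step: one differentiable gradient update of theta on the
        # alpha-weighted loss, theta_new = theta - beta1 * grad_theta L_alpha(theta).
        weighted = (alphas * behavior_losses(theta)).sum()
        grad_theta = torch.autograd.grad(weighted, theta, create_graph=True)[0]
        theta_new = theta - beta1 * grad_theta

        # Outer step: update alpha to reduce the target behavior's loss under
        # theta_new, alpha_new = alpha - beta2 * grad_alpha L_i(theta_new).
        loss_i = behavior_losses(theta_new)[target_idx]
        grad_alpha = torch.autograd.grad(loss_i, alphas)[0]
        with torch.no_grad():
            alphas -= beta2 * grad_alpha
            alphas.clamp_(min=0.0)

    # Loss of behavior i under the (near-)converged weights alpha*.
    return loss_i.detach()

The returned value plays the role of \(\mathcal{L}_{i}^{ref}\) in the weighted objective from the previous section; running the routine once per behavior yields the full set of references \(\mathcal{L}_{1}^{min}, \ldots, \mathcal{L}_{k}^{min}\).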

BibTeX

@misc{parekh2025balancedbehaviorcloningimbalanced,
      title={Towards Balanced Behavior Cloning from Imbalanced Datasets}, 
      author={Sagar Parekh and Heramb Nemlekar and Dylan P. Losey},
      year={2025},
      eprint={2508.06319},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2508.06319},
}

Acknowledgements

This work was supported by the USDA National Institute of Food and Agriculture, Grant 2022-67021-37868.