IEEE Robotics & Automation Magazine - June 2022 - 80

● Step A: online RL of a control policy, using privileged information
● Step B: policy analysis and sample collection, using privileged information
● Step C: offline supervised learning of the SE
● Step D: online adaptation learning of the control policy to the SE.
We rely on privileged information in the form of rich and
accurate state measurements, which are often available in a
lab setting via external sensing, such as motion capture. We
also rely on a means of automating sample collection with a
specified distribution. In the case of the Furuta pendulum,
state measurements are readily available from the joint encoders,
and samples can be easily gathered using a standard combination
of energy-pumping and linear-quadratic regulator
(LQR) controllers. An important benefit of automating sample
collection is that it makes it possible to quickly and easily
collect new data sets. As is always the case, development is an
iterative process, and speeding it up is critical, yet seldom discussed
in the literature [7].
Step A: Learning the Control Policy
To focus on a reliable and sample-efficient training process,
we train a Proximal Policy Optimization (PPO) [19] RL agent
using privileged information as input. In the case of the Furuta
platform, the agent learns to swing up and balance the pole
in approximately 12 h of interaction time, which is equivalent
to 8 h of samples gathered for learning and 4 h for resetting.
The entire process is automated and could be run in a single
session without any intervention.
To enable the agent to learn this task reliably, it was important
to tune the reward function and adjust hyperparameters
based on knowledge about the system. We use a continuous
reward function, which accelerates training by providing a
reliable, steady increase in the accumulated reward. For the
Furuta pendulum, we use a quadratic reward penalizing the
angle positions of the pendulum, with
$$r_t = 1 - \frac{4}{5}\,\frac{|\alpha_t|}{180^\circ} - \frac{1}{5}\left(\frac{|\theta_t|}{180^\circ}\right)^{2}. \tag{1}$$
We train the agent with a small learning rate and clipping
factor (see Table S1), which also helps to reliably increase the
reward across training episodes. Agents with a large learning
rate learned the swing-up task more quickly but were not able
to learn to balance the pendulum reliably: they were susceptible
to "fatal forgetting," or sudden large drops in their reward.
We surmise that this is because balancing requires very precise
control inputs and therefore a smaller learning rate.
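To illustrate why a small clipping factor limits how far a single update can move the policy, the following is a minimal sketch of the standard PPO clipped surrogate objective for one sample. The function name and the numeric values are illustrative assumptions; the actual hyperparameters are those listed in Table S1 and are not reproduced here.

```python
def clipped_surrogate(ratio, advantage, clip_eps):
    """Standard PPO clipped objective for a single sample (to be maximized).

    ratio     : pi_new(a|s) / pi_old(a|s)
    advantage : advantage estimate for the sampled action
    clip_eps  : clipping factor (hypothetical value in the usage below)
    """
    # Restrict the probability ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# With a small clipping factor, a large ratio earns no extra objective,
# so the optimizer has no incentive to move the policy further.
gain_small_clip = clipped_surrogate(ratio=1.5, advantage=1.0, clip_eps=0.1)
gain_large_clip = clipped_surrogate(ratio=1.5, advantage=1.0, clip_eps=0.4)
```

In this sketch, the smaller clipping factor caps the objective sooner, which is consistent with the more conservative, steadier updates described above.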
Step B: Policy Analysis and Data Collection
Based on the control policy trained on privileged information,
we empirically identify minimum precision requirements by
injecting noise into the state until the task can no longer be
fulfilled. This threshold is then used as the convergence criterion
for "Step C: Learning Precise SE." For the Furuta pendulum,
we add zero-mean Gaussian noise to the angles and
propagate it via finite differences to the angular velocities. At
a sampling frequency of 120 Hz, the agent can tolerate noise
with a standard deviation of less than 1°. We also noticed that this
level of precision is necessary only to balance the pendulum
near the equilibrium point; the policy is able to swing up the
pendulum even with higher noise. Based on this observation,
we separately collect data for the swing-up and balancing
portions of the task (see "Reproducible Platform"). The convergence
criteria are then tested only on images relevant for
balancing, which we heuristically determined as |α| < 10°.
As we will see in the "Step C: Learning Precise SE" section,
converging to high precision across the entire state space requires
not only more training time but also a larger DNN.
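The noise-injection test described in this step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the seed, and the choice to set the first velocity sample to zero are assumptions. Angles are in degrees and the default rate is the 120 Hz mentioned above.

```python
import random


def noisy_states(angles_deg, sigma_deg, rate_hz=120.0, seed=0):
    """Add zero-mean Gaussian noise to angles and derive velocities.

    Velocities are obtained by finite differences of the noisy angles,
    so the injected angle noise propagates into the velocity signal.
    """
    rng = random.Random(seed)
    noisy = [a + rng.gauss(0.0, sigma_deg) for a in angles_deg]
    # First sample has no predecessor; its velocity is set to zero (assumption).
    velocities = [0.0] + [
        (noisy[k] - noisy[k - 1]) * rate_hz for k in range(1, len(noisy))
    ]
    return noisy, velocities


# A pendulum held perfectly still, observed with 1 degree of angle noise.
angles, velocities = noisy_states([0.0] * 1000, sigma_deg=1.0)
```

For independent noise samples, the finite difference has a standard deviation of √2·σ multiplied by the sampling rate, so even small angle noise becomes large velocity noise at 120 Hz. This is one way to see why the tolerable angle noise is so low when balancing.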
Step C: Learning Precise SE
Precise predictions require an unknown minimum network
capacity, which makes it difficult to reduce the execution
time with limited computational resources. We balance this
tradeoff with a deliberate choice of the DNN architecture, a
biased data set, and data augmentation methods.
To increase precision, we simplify the learning task and
train a DNN by using standard convolutional layers to estimate
only the pose from a single image. Velocities are then
computed from a buffer of previously estimated positions
and velocities via finite differences and a first-order lowpass
filter (see Figure 3). This structure reduces the SE's
prediction error to roughly a fifth compared to a recurrent
neural network architecture of similar size, which we speculate
is due to the freed capacity being available for higher
accuracy on a simpler task. Alternatively, velocities could be
estimated by using a history of images as input, but again,
this would significantly increase the network size, which we
need to reduce as much as possible.
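As a rough illustration of the velocity computation described above, the sketch below combines a finite difference of consecutive position estimates with a first-order low-pass filter. The class name, the smoothing factor `alpha`, and the 120 Hz rate are assumptions for illustration, and angle wrap-around is ignored for brevity.

```python
class VelocityEstimator:
    """Finite-difference velocity with a first-order low-pass filter."""

    def __init__(self, rate_hz=120.0, alpha=0.5):
        self.rate_hz = rate_hz    # assumed estimation rate
        self.alpha = alpha        # hypothetical smoothing factor in (0, 1]
        self.prev_angle = None    # last estimated position
        self.velocity = 0.0       # last filtered velocity

    def update(self, angle):
        """Consume one new position estimate and return the filtered velocity."""
        if self.prev_angle is not None:
            raw = (angle - self.prev_angle) * self.rate_hz  # finite difference
            # First-order low-pass: move part of the way toward the raw value.
            self.velocity += self.alpha * (raw - self.velocity)
        self.prev_angle = angle
        return self.velocity


# A steady ramp of 1 degree per frame approaches 120 deg/s as the filter settles.
estimator = VelocityEstimator()
history = [estimator.update(a) for a in (0.0, 1.0, 2.0, 3.0)]
```

Only the previous position and the previous filtered velocity need to be stored, so the DNN itself can stay limited to single-image pose estimation.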
We also down-sample the input image from 540 × 720 to
220 × 220 pixels, which enables the DNN depth to be
increased; we found this was more important for precision
than a higher image resolution. To compensate for the down-sampling,
we use a very small stride of one pixel per step.
With a depth of 12 layers, the SE reaches a precision that is
able to distinguish individual pixels.
Despite these measures, the limited network size makes
it difficult for the DNN to converge to a low error everywhere.
Precise state estimates are often not needed
throughout the entire state space, and we can evaluate
where the SE should be more precise based on the policy
analysis conducted in "Step B: Policy Analysis and Data
Collection." For the Furuta pendulum, we bias the training
data set to be more densely sampled around the upper
equilibrium point. An SE trained on a very biased data set
can meet our convergence criteria after just four episodes
of training. Due to its reliably low prediction error for
small angles (refer to Figure 4), the RL agent could also
adapt much faster.
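One simple way to build such a biased data set is to mix the separately collected balancing and swing-up samples in a fixed proportion. The sketch below assumes this approach; the function name, the 80/20 split, and sampling with replacement are illustrative choices, not values taken from the article.

```python
import random


def biased_dataset(balance_samples, swingup_samples, n,
                   balance_fraction=0.8, seed=0):
    """Draw n samples, over-representing the region near the upper equilibrium.

    balance_fraction is a hypothetical mixing ratio; samples are drawn
    with replacement from each pool and then shuffled together.
    """
    rng = random.Random(seed)
    n_balance = round(n * balance_fraction)
    picked = [rng.choice(balance_samples) for _ in range(n_balance)]
    picked += [rng.choice(swingup_samples) for _ in range(n - n_balance)]
    rng.shuffle(picked)
    return picked


# Placeholder labels stand in for (image, pose) pairs.
data = biased_dataset(["balance"] * 5, ["swingup"] * 5, n=10)
```

Raising `balance_fraction` concentrates the limited network capacity on small angles, at the cost of larger errors elsewhere in the state space.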
To avoid overfitting to the training data set, and to increase
the SE's robustness, we also apply data augmentation methods
[20] during training. The input images are randomly zoomed,