
point, $x$, conditioned on the training data. The predicted output value at $x$ is normally distributed, with mean $\hat{\mu}(x)$ and
variance $\hat{\sigma}^2(x)$ given by
$$\hat{\mu}(x) = \mu(x) + k(x, X)\,[K(X, X) + \omega^2 I]^{-1} y, \tag{2}$$






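Algorithm 1: GP-VI-MFRL
1: procedure
2:   Input: confidence parameters $\sigma_{th}$ and $\sigma_{th}^{sum}$; simulator chain $\langle R$, fidelity parameters $\beta$, state mappings $\rho \rangle$; $L$.
3:   Initialize: transition functions $\hat{P}^{a}_{ss'}(i)$ and $D_i$ for $i \in \{1, \ldots, d\}$; change = False.
4:   Initialize: $i \leftarrow 1$; $\hat{Q}_i \leftarrow$ Planner($\hat{P}^{a}_{ss'}(i)$).
5:   while terminal condition is not met
6:     $a_t \leftarrow \arg\max_a \hat{Q}_i(s_t, a)$
7:     if $\sigma_i(s_t, a_t) \le \sigma_{th}$: change = True
8:     if $\sigma_{i-1}(\rho_i(s_t), a_t) \ge \sigma_{th}$ and change and $i > 1$
9:       $s_t \leftarrow \rho_i(s_t)$, $i \leftarrow i - 1$, continue
10:    $\langle s_{t+1}, r_{t+1} \rangle \leftarrow$ execute $a_t$ in $R_i$
11:    append $\langle s_t, a_t, s_{t+1}, r_t \rangle$ to $D_i$
12:    $\hat{P}^{a}_{ss'}(i) \leftarrow$ update $GP_i$ using $D_i$
13:    $\hat{Q}_i \leftarrow$ call Planner with input $\hat{P}^{a}_{ss'}(i)$
14:    $t \leftarrow t + 1$
15:    if $\sum_{j=t-L}^{t-1} \sigma_i(s_j, a_j) \le \sigma_{th}^{sum}$ and $t > L$ and $i < d$
16:      $s_t \leftarrow \rho_{i+1}^{-1}(s_t)$, $i \leftarrow i + 1$;
17:      change = False
18: end procedure
20: procedure Planner($\hat{P}^{a}_{ss'}(i)$)
21:   Initialize: $Q(s, a) = 0$ for each $(s, a)$, $\Delta = \infty$
22:   while $\Delta > 0.1$
23:     $\Delta \leftarrow 0$
24:     for every $(s, a)$
25:       temp $\leftarrow Q(s, a)$, $Q(s, a) = \hat{Q}_{i-1}(\rho_i(s), a) + \beta_i$
26:       for $k \in \{i, \ldots, d\}$
27:         $s_k = \rho_k^{-1} \cdots \rho_{i+2}^{-1} \rho_{i+1}^{-1}(s)$
28:         if $\sigma_k(s_k, a) \le \sigma_{th}$: $\hat{P}^{a}_{ss'}(i) = \hat{P}^{a}_{ss'}(k)$
29:       $Q(s, a) \leftarrow \sum_{s'} \hat{P}^{a}_{ss'} [R^{a}_{ss'} + \gamma \max_{a'} Q(s', a')]$
30:       $\Delta \leftarrow \max(\Delta, |\text{temp} - Q(s, a)|)$
31:   return $Q(s, a)$
32: end procedure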
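The Planner procedure in Algorithm 1 is, at its core, a value-iteration loop over a learned transition model. The following is a minimal Python sketch of that loop only, under stated assumptions: states and actions are small finite sets, the model is a plain dictionary of successor probabilities, and the names `value_iteration`, `P`, and `R` are illustrative rather than taken from the paper. The multi-fidelity steps (seeding Q from the lower-fidelity simulator plus a fidelity parameter, and borrowing a higher-fidelity model when it is confident) are deliberately left out.

```python
from typing import Dict, List, Tuple

State = int
Action = int
# P[(s, a)] maps each successor state s' to its (estimated) probability.
Transitions = Dict[Tuple[State, Action], Dict[State, float]]
# R[(s, a, s')] is the known reward; missing entries count as 0.
Rewards = Dict[Tuple[State, Action, State], float]


def value_iteration(
    states: List[State],
    actions: List[Action],
    P: Transitions,
    R: Rewards,
    gamma: float = 0.95,
    tol: float = 0.1,  # same stopping threshold as the pseudocode
) -> Dict[Tuple[State, Action], float]:
    """Sweep Bellman backups over all (s, a) until the largest change is <= tol."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    delta = float("inf")
    while delta > tol:
        delta = 0.0  # reset the largest change seen in this sweep
        for s in states:
            for a in actions:
                old = Q[(s, a)]
                Q[(s, a)] = sum(
                    p * (R.get((s, a, s2), 0.0)
                         + gamma * max(Q[(s2, b)] for b in actions))
                    for s2, p in P[(s, a)].items()
                )
                delta = max(delta, abs(old - Q[(s, a)]))
    return Q


# Hypothetical two-state example: action 1 switches state, action 0 stays.
# Arriving in (or staying in) state 1 yields reward 1.
states, actions = [0, 1], [0, 1]
P = {(0, 0): {0: 1.0}, (0, 1): {1: 1.0}, (1, 0): {1: 1.0}, (1, 1): {0: 1.0}}
R = {(0, 1, 1): 1.0, (1, 0, 1): 1.0}
Q = value_iteration(states, actions, P, R, gamma=0.5, tol=1e-6)
```

In this toy problem the greedy policy moves from state 0 to state 1 and then stays there. A tighter `tol` than the pseudocode's 0.1 is used in the example only so the resulting values are easy to check by hand.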





$$\hat{\sigma}^2(x) = k(x, x) - k(x, X)\,[K(X, X) + \omega^2 I]^{-1} k(X, x), \tag{3}$$

$$K_{x_l, x_m} = \sigma^2 \exp\left(-\frac{1}{2} \sum_{d=1}^{D} \left(\frac{x_{d_l} - x_{d_m}}{l_d}\right)^2\right) + \omega^2;$$

$\sigma^2$, $l_d$, and $\omega^2$ are hyperparameters that can be either set by
the user or learned online through the training data; and $D$ is
the dimension of the training inputs.
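To make the predictive equations (2) and (3) concrete, here is a small dependency-free Python sketch of zero-mean GP prediction with a squared-exponential kernel. It is an illustration under assumptions, not the authors' implementation: the function names (`se_kernel`, `solve`, `gp_predict`) and default hyperparameter values are made up for this example, the noise term $\omega^2$ is applied only through the $\omega^2 I$ term of (2) and (3), and a plain Gaussian-elimination solver stands in for the Cholesky factorization a numerical library would normally use.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def se_kernel(x1: Vector, x2: Vector, sigma2: float, lengths: Vector) -> float:
    """Squared-exponential covariance between two inputs (noise excluded)."""
    sq = sum(((a - b) / l) ** 2 for a, b, l in zip(x1, x2, lengths))
    return sigma2 * math.exp(-0.5 * sq)


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def gp_predict(
    X: List[Vector],
    y: List[float],
    x_star: Vector,
    sigma2: float = 1.0,
    lengths: Vector = (1.0,),
    omega2: float = 1e-2,
) -> Tuple[float, float]:
    """Posterior mean and variance at x_star for a zero-mean GP prior."""
    n = len(X)
    # K(X, X) + omega^2 I
    K = [[se_kernel(X[i], X[j], sigma2, lengths) + (omega2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [se_kernel(x_star, X[i], sigma2, lengths) for i in range(n)]
    alpha = solve(K, list(y))   # [K + omega^2 I]^{-1} y
    v = solve(K, k_star)        # [K + omega^2 I]^{-1} k(X, x)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = se_kernel(x_star, x_star, sigma2, lengths) - sum(
        k * w for k, w in zip(k_star, v))
    return mean, var


# Hypothetical 1-D data: the prediction is confident near the data
# and reverts to the prior (mean 0, variance sigma2) far away from it.
X_train = [(0.0,), (1.0,), (2.0,)]
y_train = [0.0, 1.0, 0.0]
mean_near, var_near = gp_predict(X_train, y_train, (1.0,))
mean_far, var_far = gp_predict(X_train, y_train, (10.0,))
```

The predicted variance is what the algorithms compare against their confidence thresholds: it is small close to observed inputs and grows back to the prior variance in unexplored regions.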
In the GPQ-MFRL algorithm, we use GPs to learn Q values. GPs are proven to be consistent function approximators
in RL with convergence guarantees [19]. A set of state-action
pairs is the input to the GP, and Q values are the output/observation values to be predicted. In the GP-VI-MFRL algorithm,
we use GPs to learn the transition function. The input to the
GPs is a set of observed state-action pairs, and we predict the
next state as output for a newly observed state-action pair.
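As an illustration of the input/output layout described above for the transition model, the sketch below turns a list of observed transitions into GP regression data. Everything in it is an assumption made for the example: the helper name `build_regression_data`, the encoding of the discrete action as one extra numeric feature, and the choice of one target vector per next-state dimension (so that each dimension could be fitted by its own GP). The source does not state which encoding is actually used.

```python
from typing import List, Sequence, Tuple

# One observed transition: (s_t, a_t, s_{t+1}); the action is a discrete index.
Transition = Tuple[Sequence[float], int, Sequence[float]]


def build_regression_data(
    transitions: List[Transition],
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return GP inputs (state + action) and one target list per state dimension."""
    # Input row: state features followed by the action index (assumed encoding).
    inputs = [list(s) + [float(a)] for s, a, _ in transitions]
    n_out = len(transitions[0][2])
    # targets[d][n] is dimension d of the next state in transition n.
    targets = [[s_next[d] for _, _, s_next in transitions] for d in range(n_out)]
    return inputs, targets


# Hypothetical 2-D states with two recorded transitions.
data = [((0.0, 0.0), 1, (0.0, 1.0)), ((0.0, 1.0), 0, (1.0, 1.0))]
inputs, targets = build_regression_data(data)
```

Each row of `inputs` pairs with the same-index entry of every list in `targets`, which is the shape a per-dimension GP regressor would train on.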

Figure 2. The simulators are represented by $R_1, R_2, \ldots, R_d$. The
algorithms determine the simulator in which the current action
is to be executed by the agent. Also, the action values in the
chosen simulator are updated and used to select the best action,
using the information from higher and lower simulators.


where $K(X, X)$ is the kernel. The entry $K_{x_l, x_m}$ gives the covariance between two inputs $x_l$ and $x_m$, and $\mu(x)$ in (2) is the
prior mean of the output value at $x$.
We use a zero-mean prior and a squared-exponential kernel for $K_{x_l, x_m}$.






Algorithm Description
In this section, we first describe both versions of our algorithm and then
compare the proposed algorithms with baseline strategies
through simulations. A flowchart of our algorithms is shown in
Figure 2. We make the following assumptions for both algorithms.
● The reward function is known to the agent; we make this
assumption for ease of exposition. In general, one can use
GPs to estimate the reward function as well. This assumption is required only for the GP-VI-MFRL algorithm.
● The state space in simulator $R_{i-1}$ is a subset of the state
space in simulator $R_i$. The many-to-one mapping $\rho_i$ maps
states from simulator $R_i$ to states in simulator $R_{i-1}$. We
give an example of such a mapping in subsequent sections.
Let $\rho_i^{-1}$ denote the respective inverse mapping (which can
be one-to-many) from states in $R_{i-1}$ to states in $R_i$. The
action space is discrete and the same for all of the simulators
and the real world.
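The many-to-one state mapping and its one-to-many inverse can be pictured with a small Python sketch. This is a hypothetical illustration, not the example the paper gives later: it assumes the higher-fidelity simulator tracks a grid position plus a heading, while the lower-fidelity one tracks only the position. The names `rho`, `rho_inverse`, and `HEADINGS` are invented for this sketch.

```python
from typing import List, Tuple

# Hypothetical extra state variable present only in the higher-fidelity simulator.
HEADINGS = ("N", "E", "S", "W")

LowState = Tuple[int, int]        # (x, y) in the lower-fidelity simulator
HighState = Tuple[int, int, str]  # (x, y, heading) in the higher-fidelity one


def rho(state: HighState) -> LowState:
    """Many-to-one: drop the heading to obtain the lower-fidelity state."""
    x, y, _heading = state
    return (x, y)


def rho_inverse(state: LowState) -> List[HighState]:
    """One-to-many: every higher-fidelity state that maps to the given state."""
    x, y = state
    return [(x, y, h) for h in HEADINGS]


low = rho((2, 3, "E"))
candidates = rho_inverse(low)
```

Because the inverse returns several candidates, moving up to a higher-fidelity simulator requires some rule for picking one of them; this passage does not say which rule is used, so the sketch returns all candidates and leaves that choice open.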
GP-VI-MFRL Algorithm
Algorithm 1 consists of a model learner and a planner. The
model learner learns the transition functions using GP regression. We use VI [18] as our planner to calculate the optimal
policy with learned transition functions. Let $s_{t+1} = f(s_t, a_t)$
be the (unknown) transition function that must be learned.
We observe transitions $D = \{\langle s_t, a_t, s_{t+1} \rangle\}$. Our goal is to learn
an estimate $\hat{f}(s, a)$ of $f(s, a)$. We can then use this estimated $\hat{f}$ for unvisited state-action pairs (in place of $f$) during VI


