generally regarded as an essential step in some areas. For instance,
in natural language processing, most state-of-the-art methods use
these pre-trained models [70]. In addition, modern deep learning-based
3D object reconstruction [72] and disparity estimation in
stereo vision [73] rely on self-supervised learning to overcome the
time-consuming manual annotation of training data.
A common approach for meta-learning in Bayesian statistics
is to recast the problem as hierarchical Bayes [74], with the
prior $p(\theta_t \mid \xi)$ for each task conditioned on a new global variable
$\xi$ (Figure 6d). $\xi$ can represent continuous metaparameters
or discrete information about the structure of the BNN,
i.e., to learn probable functional models, or the underlying subgraph
of the PGM, i.e., to learn probable stochastic models.
Multiple levels can be added to organize the tasks in a more
complex hierarchy if needed. Here, we present only the case
with one level since the generalization is straightforward. With
this broad Bayesian understanding of meta-learning, both transfer
learning and self-supervised learning are special cases of
meta-learning. The general posterior becomes:
\[
p(\xi, \theta_T \mid D_T) \propto \left( \prod_{t \in T} p(D_{y,t} \mid D_{x,t}, \theta_t)\, p(\theta_t \mid \xi) \right) p(\xi), \tag{32}
\]
where $\theta_T = \{\theta_t\}_{t \in T}$ and $D_T = \{D_t\}_{t \in T}$ collect the parameters and datasets of all tasks $t \in T$.
In practice, the problem is often approached with empirical
Bayes (Section V-D), and only a point estimate $\hat{\xi}$ is considered
for the global variable, ideally the MAP estimate obtained by
marginalizing $p(\xi, \theta_T \mid D_T)$ over the $\theta_t$ and selecting the most likely point,
but this is not always the case.
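To make Eq. (32) concrete, the following is a minimal sketch, assuming a toy setup of one-dimensional linear tasks with Gaussian likelihoods, Gaussian task priors centred on $\xi$, and a broad Gaussian hyperprior on $\xi$; all names, shapes, and constants are illustrative rather than taken from the article.

```python
# Toy unnormalized log-posterior of Eq. (32): each task t has a 1D linear model
# y = x * theta_t + noise, the task priors p(theta_t | xi) are Gaussians centred
# on the global variable xi, and p(xi) is a broad Gaussian hyperprior.
import numpy as np

def gaussian_logpdf(x, mean, std):
    return -0.5 * np.sum(((x - mean) / std) ** 2 + np.log(2 * np.pi * std ** 2))

def log_joint(xi, thetas, tasks, noise_std=0.1, prior_std=1.0, hyper_std=10.0):
    """log p(xi, theta_T, D_T) up to a constant, i.e. the numerator of Eq. (32)."""
    logp = gaussian_logpdf(xi, 0.0, hyper_std)                   # log p(xi)
    for theta_t, (x_t, y_t) in zip(thetas, tasks):
        logp += gaussian_logpdf(theta_t, xi, prior_std)          # log p(theta_t | xi)
        logp += gaussian_logpdf(y_t, x_t * theta_t, noise_std)   # log p(D_{y,t} | D_{x,t}, theta_t)
    return logp

# two toy tasks sharing a similar slope
rng = np.random.default_rng(0)
tasks = []
for slope in (1.9, 2.1):
    x = rng.normal(size=20)
    tasks.append((x, slope * x + 0.1 * rng.normal(size=20)))
print(log_joint(xi=2.0, thetas=[1.9, 2.1], tasks=tasks))
```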
In transfer learning, the usual approach would be to set
$\xi = \hat{\theta}_m$, with $\hat{\theta}_m$ being the coefficients of the main task. The
new prior can then be obtained from $\xi$, for example:
\[
p(\theta \mid \xi) = \mathcal{N}(\gamma(\xi), \sigma I), \tag{33}
\]
where $\gamma$ is a selection of the parameters to transfer and $\sigma$ is a
parameter to tune manually. Unselected parameters are assigned
a new prior, with a mean of 0 by convention. If a BNN has
been trained for the main task, then $\sigma$ can be estimated from
the previous posterior, with an increment to account for the
additional uncertainty caused by the domain shift.
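The construction in Eq. (33) can be sketched as follows for a PyTorch model; the architecture, the layer names, and the selection rule standing in for $\gamma$ are assumptions made for illustration, and $\sigma$ is left as a manually chosen constant.

```python
# Hypothetical sketch of the transfer prior of Eq. (33): parameters selected by the
# rule standing in for gamma (here: everything except the last layer) keep the
# main-task coefficients theta_m as prior mean, the rest get a zero mean.
import torch
import torch.nn as nn

main_net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
# ... assume main_net has been trained on the main task ...
theta_m = {name: p.detach().clone() for name, p in main_net.named_parameters()}

def transfer_prior(theta_m, sigma=0.1, transfer=lambda name: not name.startswith("2.")):
    """Return per-parameter Normal priors N(gamma(xi), sigma * I)."""
    priors = {}
    for name, value in theta_m.items():
        mean = value if transfer(name) else torch.zeros_like(value)  # gamma: selection
        priors[name] = torch.distributions.Normal(mean, sigma)
    return priors

priors = transfer_prior(theta_m)
# log-prior of a candidate parameter set, e.g. during MCMC or variational inference:
theta = {name: p for name, p in main_net.named_parameters()}
log_prior = sum(priors[n].log_prob(theta[n]).sum() for n in priors)
print(float(log_prior))
```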
Self-supervised learning can be implemented in two steps.
The first step learns the pretext task while the second one performs
transfer learning. This can be considered overly complex
but might be required if the pretext task has a high computational
complexity (e.g., BERT models in natural language processing
[70]). Recent contributions
[75] have shown that
jointly learning the pretext task and the final task (Figure 6e)
can improve the results obtained in self-supervised learning.
This approach, which is closer to hierarchical Bayes, also allows
setting the prior a single time while still retaining the benefits
of self-supervised learning.
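As a purely illustrative sketch of such joint training (not the specific setup of [75]), the following assumes a shared encoder with a supervised head for the final task and a rotation-prediction head for the pretext task, combined in a single weighted loss.

```python
# Hypothetical joint pretext + final-task training step with a shared encoder,
# in the spirit of Figure 6e. The architecture, the rotation-prediction pretext
# task, and the weight `lam` are illustrative assumptions.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU())
final_head = nn.Linear(128, 10)    # supervised final task (e.g. digit classification)
pretext_head = nn.Linear(128, 4)   # pretext task (e.g. predict one of 4 rotations)

params = list(encoder.parameters()) + list(final_head.parameters()) + list(pretext_head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
ce = nn.CrossEntropyLoss()
lam = 0.5  # relative weight of the pretext loss

def joint_step(x_labeled, y, x_unlabeled, rot_labels):
    opt.zero_grad()
    loss = ce(final_head(encoder(x_labeled)), y) \
         + lam * ce(pretext_head(encoder(x_unlabeled)), rot_labels)
    loss.backward()
    opt.step()
    return loss.item()

# one dummy step with random data
x_l, y = torch.randn(8, 1, 28, 28), torch.randint(0, 10, (8,))
x_u, r = torch.randn(32, 1, 28, 28), torch.randint(0, 4, (32,))
print(joint_step(x_l, y, x_u, r))
```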
V. Bayesian Inference Algorithms
A priori, a BNN does not require a learning phase, as one just
needs to sample the posterior and do model averaging; see
Algorithm 1. However, sampling the posterior is not easy in
the general case. While the conditional probability $P(D \mid H)$ of
the data and the probability $P(H)$ of the model are given by the
stochastic model, the integral for the evidence term
$\int_H P(D \mid H')\, P(H')\, dH'$ might be excessively difficult to compute.
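A minimal sketch of the model averaging step mentioned above, in the spirit of Algorithm 1: it assumes that posterior weight samples and a `predict(x, theta)` function are provided by whichever inference method is used.

```python
# Monte Carlo model averaging: average the network output over posterior
# weight samples. `predict(x, theta)` is assumed to evaluate the model with
# weights theta; `posterior_samples` would come from MCMC or variational inference.
import numpy as np

def posterior_predict(x, posterior_samples, predict):
    """Average the network output over posterior weight samples."""
    outputs = [predict(x, theta) for theta in posterior_samples]
    # np.std(outputs, axis=0) would give a simple uncertainty estimate
    return np.mean(outputs, axis=0)  # predictive mean

# toy usage with a 1D "network" f(x) = theta * x and three posterior samples
predict = lambda x, theta: theta * x
print(posterior_predict(np.array([1.0, 2.0]), [0.9, 1.0, 1.1], predict))
```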
For nontrivial models, even if the evidence has been
computed, directly sampling the posterior is prohibitively
difficult due to the high dimensionality of the sampling
space. Instead of traditional methods such as inversion or
rejection sampling, dedicated algorithms are used to sample the
posterior. The most popular ones are Markov chain Monte Carlo
(MCMC) methods [76], a family of algorithms that exactly
sample the posterior, and variational inference [77], a method
for learning an approximation of the posterior; see Figure 2.
This section reviews these methods. First, in subsections V-A
and V-B, we introduce MCMC and variational inference as
they are used in traditional Bayesian statistics. Then, in subsection
V-E, we review different simplifications or approximations
that have been proposed for deep learning. We also provide a
practical example in the Supplementary Material (Practical
example III), which compares different learning strategies.
A. Markov Chain Monte Carlo (MCMC)
The idea behind MCMC methods is to construct a Markov
chain, a sequence of random samples $S_i$, which probabilistically
depend only on the previous sample $S_{i-1}$, such that the $S_i$ are
distributed following a desired distribution. Unlike standard sampling
methods such as rejection or inversion sampling, most
MCMC algorithms require an initial burn-in time before the
Markov chain converges to the desired distribution. Moreover,
the successive $S_i$'s might be autocorrelated. This means that a
large set of samples $\Theta$ has to be generated and subsampled to
obtain approximately independent samples from the underlying
distribution. The final collection of samples $\Theta$ has to be stored
after training, which is expensive for most deep learning models.
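A minimal sketch of the burn-in and subsampling (thinning) step just described; the burn-in length and thinning factor are illustrative and would normally be chosen from convergence diagnostics.

```python
# Discard the first samples produced before the chain has converged (burn-in),
# then keep every k-th sample to reduce autocorrelation (thinning).
import numpy as np

def thin_chain(samples, burn_in=1000, keep_every=10):
    """Return approximately independent samples from a raw MCMC chain."""
    samples = np.asarray(samples)
    return samples[burn_in::keep_every]

raw_chain = np.random.randn(20000)   # stand-in for a raw chain of samples S_i
theta_samples = thin_chain(raw_chain)
print(theta_samples.shape)           # (1900,)
```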
Despite their inherent drawbacks, MCMC methods can be
considered among the best available and the most popular solutions
for sampling from exact posterior distributions in Bayesian
statistics [78]. However, not all MCMC algorithms are relevant
for Bayesian deep learning. Gibbs sampling [79], for example, is
very popular in general statistics and unsupervised machine
learning but is very ill-suited for BNNs. The most relevant
MCMC method for BNNs is the Metropolis-Hastings algorithm
[80]. The property that makes the Metropolis-Hastings
algorithm popular is that it does not require knowledge about
the exact probability distribution $P(x)$ to sample from. Instead, a
function $f(x)$ that is proportional to that distribution is sufficient.
This is the case for a Bayesian posterior distribution, which
is usually quite easy to compute except for the evidence term.
The Metropolis-Hastings algorithm, see Algorithm 4,
starts with a random initial guess, $\theta_0$, and then samples a new
candidate point $\theta'$ around the previous $\theta$, using a proposal
distribution $Q(\theta' \mid \theta)$. If $\theta'$ is more likely than $\theta$ according to the
target distribution, it is accepted. If it is less likely, it is accepted
with a certain probability or rejected otherwise.
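The loop described above can be sketched as follows for a toy one-dimensional model, assuming a symmetric Gaussian random-walk proposal so that the proposal density cancels in the acceptance ratio; the target $f$ is the prior times the likelihood, i.e., the unnormalized posterior, and all constants are illustrative.

```python
# Metropolis-Hastings with a symmetric Gaussian random-walk proposal Q(theta' | theta).
# Only an unnormalized density f is needed: the evidence cancels in the acceptance ratio.
import numpy as np

rng = np.random.default_rng(0)
x_data = rng.normal(size=30)
y_data = 2.0 * x_data + 0.5 * rng.normal(size=30)

def log_f(theta):
    """log of f(theta) = p(theta) * p(D | theta), up to the (unknown) evidence."""
    log_prior = -0.5 * theta ** 2                                   # N(0, 1) prior
    log_lik = -0.5 * np.sum((y_data - theta * x_data) ** 2) / 0.25  # Gaussian noise, sigma = 0.5
    return log_prior + log_lik

def metropolis_hastings(log_f, theta0=0.0, n_steps=5000, step=0.1):
    theta, samples = theta0, []
    for _ in range(n_steps):
        proposal = theta + step * rng.normal()                  # sample from Q(. | theta)
        # accept with probability min(1, f(proposal) / f(theta))
        if np.log(rng.uniform()) < log_f(proposal) - log_f(theta):
            theta = proposal
        samples.append(theta)
    return np.array(samples)

chain = metropolis_hastings(log_f)
print(chain[1000:].mean())   # posterior mean of the slope, after burn-in
```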