IEEE Spectrum - North American - March 2016 - 35

can then be restarted from the last valid
checkpoint instead of beginning some
immense calculation anew.
This approach won't work indefinitely,
though, because as computers get bigger,
the time needed to create a checkpoint
increases. Eventually, this interval will
become longer than the typical period
before the next fault. A challenge for
exascale computing is what to do about
this grim reality.
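To put rough numbers on that trend, here is a minimal sketch using Young's classic first-order approximation for the best spacing between checkpoints. The article does not use this model, so treat it as an outside illustration. The function names, the one-hour fault interval, and the checkpoint costs are assumptions chosen for the example, and the approximation is only trustworthy when a checkpoint takes much less time than the typical gap between faults.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def wasted_fraction(checkpoint_cost: float, mtbf: float) -> float:
    """Approximate share of machine time lost to checkpoints and rework.

    First-order model: every interval pays checkpoint_cost once, and each
    fault throws away about half an interval of work. Capped at 1.0.
    """
    interval = young_interval(checkpoint_cost, mtbf)
    waste = checkpoint_cost / interval + interval / (2.0 * mtbf)
    return min(1.0, waste)


if __name__ == "__main__":
    mtbf = 3600.0  # assumed: one fault per hour, expressed in seconds
    for cost in (10.0, 60.0, 600.0, 1800.0):  # assumed checkpoint times
        share = wasted_fraction(cost, mtbf)
        print(f"checkpoint of {cost:6.0f} s -> roughly {share:.0%} of time wasted")
```

In this crude model the wasted share reaches 100 percent once a checkpoint takes about half the mean time between faults, which is the regime the paragraph above warns about.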
Several groups are trying to improve
the speed of writing checkpoints. To the
extent they are successful, these efforts
will forestall the need to do something
totally different. But ultimately, applications will have to be rewritten to withstand a constant barrage of faults and
keep on running.
Unfortunately, today's programming
models and languages don't offer any
mechanism for such dynamic recovery from faults. In June 2012, members
of an international forum composed of
vendors, academics, and researchers
from the United States, Europe, and Asia
met and discussed adding resilience to
the Message Passing Interface, or MPI, the
programming model used in nearly all
supercomputing code. Those present at
that meeting voted that the next version
of MPI would have no resilience capabilities added to it. So for the foreseeable future, programming models will
continue to offer no methods for fault notification or recovery.

One reason is that there is no standard
that describes the types of faults that
the software will be notified about and
the mechanism for that notification. A
standard fault model would also define
the actions and services available to the
software to assist in recovery. Without
even a de facto fault model to go by, it
was not possible for these forum members to decide how to augment MPI for
greater resilience.
So the first order of business is for the
supercomputer community to agree on
a standard fault model. That's more difficult than it sounds because some faults
might be easy for one manufacturer to
deal with and hard for another. So there
are bound to be fierce squabbles. More
important, nobody really knows what
problems the fault model should address.
What are all the possible errors that affect
today's supercomputers? Which are most
common? Which errors are most concerning? No one yet has the answers.
And while I've talked a lot about faults
causing machines to crash, these are
not, in fact, the most dangerous. More
menacing are the errors that allow the
application to run to the end and give
an answer that looks correct but is actually wrong. You wouldn't want to fly in
an airliner designed using such a calculation. Nor would you want to certify a
new nuclear reactor based on one. These
undetected errors (their types, rates,
and impact) are the scariest aspect of
supercomputing's monster in the closet.
Given all the gloom and doom
I've shared, you might wonder: How
can an exascale supercomputer ever be
expected to work? The answer may lie
in a handful of recent studies for which
researchers purposely injected different types of errors inside a computer
at random times and locations while it
was running an application. Remarkably enough, 90 percent of those errors
proved to be harmless.
One reason for that happy outcome
is that a significant fraction of the computer's main memory is usually unused.
And even if the memory is being used,
the next action on a memory cell after
the bit it holds is erroneously flipped
may be to write a value to that cell. If
so, the earlier bit flip will be harmless.
If instead the next action is to read
that memory cell, an incorrect value
flows into the computation. But the
researchers found that even when a
bad value got into a computation, the
final result of a large simulation was
often the same.
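The masking effects described above can be seen in a toy fault-injection experiment. The sketch below is not the researchers' methodology: it models memory as a short list of integers, runs a made-up trace of reads and writes, and tries every possible single-bit flip exhaustively rather than sampling at random. The names `run`, `campaign`, and `TRACE` are hypothetical, and the harmless fraction it reports says nothing about real machines.

```python
def run(trace, memory_size, fault=None):
    """Execute a toy trace, optionally flipping one bit along the way.

    trace: list of ("write", cell, value) or ("read", cell) operations.
    fault: (step, cell, bit) or None; the flip lands just before `step` runs.
    Returns the sum of all values read, standing in for the final result.
    """
    memory = [0] * memory_size
    result = 0
    for step, op in enumerate(trace):
        if fault is not None and fault[0] == step:
            _, cell, bit = fault
            memory[cell] ^= 1 << bit  # the injected bit flip
        if op[0] == "write":
            memory[op[1]] = op[2]  # a write erases any earlier flip
        else:
            result += memory[op[1]]  # a read lets a flip reach the result
    return result


def campaign(trace, memory_size, bits=8):
    """Try every (step, cell, bit) flip and count the harmless ones."""
    golden = run(trace, memory_size)
    harmless = total = 0
    for step in range(len(trace)):
        for cell in range(memory_size):
            for bit in range(bits):
                total += 1
                if run(trace, memory_size, (step, cell, bit)) == golden:
                    harmless += 1
    return harmless, total


# Assumed workload: only cells 0 and 1 of an 8-cell memory are ever used.
TRACE = [
    ("write", 0, 5),
    ("write", 1, 7),
    ("read", 0),
    ("write", 0, 9),
    ("read", 0),
    ("read", 1),
]

if __name__ == "__main__":
    harmless, total = campaign(TRACE, memory_size=8)
    print(f"{harmless} of {total} injected flips left the result unchanged")
```

Flips in the six unused cells, and flips that are overwritten before the next read, never reach the result. Only flips whose next access is a read change it.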
Errors don't, however, limit themselves to data values: They can affect the
machine instructions held in memory,
too. The area of memory occupied by
machine instructions is much smaller
than the area taken up by the data, so the
probability of a cosmic ray corrupting an
instruction is smaller. But it can be much
more catastrophic. If a bit is flipped in a
machine instruction that is then executed,
the program will most likely crash. On
the other hand, if the error hits in a part
of the code that has already executed,
or in a path of the code that doesn't get
executed, the error is harmless.
There are also errors that can occur in
silicon logic. As a simple example, imagine that two numbers are being multiplied, but because of a transient error in
the multiplication circuitry, the result is
incorrect. How far off it will be can vary
greatly depending on the location and
timing of the error.
As with memory, flips that occur in silicon logic that is not being used are harmless. And even if this silicon is being used,
any flips that occur outside the narrow
time window when the calculation is
taking place are also harmless. What's
more, a bad multiplication is much like
a bad memory value going into the computation: Many times these have little
or no effect on the final result.
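How much a faulty multiplication is off by depends heavily on which bit is hit. As a rough illustration, the sketch below flips a single bit in the 64-bit IEEE 754 encoding of a product and reports the relative error. It models a flip in the stored result rather than inside real multiplier circuitry, and the helper names and operands are assumptions made for the example.

```python
import struct


def flip_bit(x: float, bit: int) -> float:
    """Return x with one bit of its 64-bit IEEE 754 encoding inverted."""
    (encoded,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", encoded ^ (1 << bit)))[0]


def relative_error(a: float, b: float, bit: int) -> float:
    """Relative error of a * b when the given bit of the product flips."""
    exact = a * b
    faulty = flip_bit(exact, bit)
    return abs(faulty - exact) / abs(exact)


if __name__ == "__main__":
    # Bit 0 is the least significant mantissa bit, bit 51 the most
    # significant mantissa bit, bit 62 the top exponent bit, bit 63 the sign.
    for bit in (0, 20, 51, 62, 63):
        err = relative_error(3.0, 7.0, bit)
        print(f"bit {bit:2d}: relative error {err:.3e}")
```

A flip in a low mantissa bit is lost in the rounding noise, while a flip in the exponent or sign changes the answer completely.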
So many of the faults that arise in
future supercomputers will no doubt be
innocuous. But the ones that do matter
are nevertheless increasing at an alarming rate. So the supercomputing community must somehow address the serious
hardware and software challenges they
pose. What to do is not yet clear, but it's
clear we must do something to prevent
this monster from eating us alive.
Post your comments at http://spectrum.ieee.org/resilience0316

