IEEE Spectrum July, 2013 - 31

The Human Genome Project's goal was to sequence the 3 billion letters that make up the
genome of a human being. Because humans are more than 99 percent genetically identical,
this first genome has been used as a "reference" to guide future analyses. A larger,
ongoing project is the 1000 Genomes Project, aimed at compiling a more comprehensive
picture of how genomes vary among individuals and ethnic groups. For the U.S. National
Institutes of Health's Cancer Genome Atlas, researchers are sequencing samples from more
than 20 different types of tumors to study how the mutated genomes present in cancer cells
differ from normal genomes, and how they vary among different types of cancer.

Figure caption: To sequence a species' genome for the first time, the double-stranded DNA
molecule is first split down the middle [1]. This creates a single-stranded template. The
template is copied many times [2]. All the templates are then diced up at random, to create
many small fragments of different lengths [3]. A sequencing machine determines the order of
nucleotides on each of those short fragments [4]. Then a software program looks for the
places where the sequences of letters overlap and constructs an "assembly graph" [5].
Finally, the program constructs one continuous sequence of letters: the genome [6].

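Steps [5] and [6] of the figure caption (finding overlaps, recording them in an assembly graph, and producing one continuous sequence) can be illustrated with a toy sketch in Python. It assumes error-free reads taken from a single strand, with no read contained inside another, and it merges the best-overlapping pair greedily. That greedy rule is a simplification chosen for illustration, not the method of any particular assembler; the function names and the example reads are hypothetical.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such match is shorter than `min_len`.
    """
    max_k = min(len(a), len(b))
    for k in range(max_k, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_len=3):
    """Toy assembly graph: maps each ordered pair (a, b) to its overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        k = overlap(a, b, min_len)
        if k:
            graph[(a, b)] = k
    return graph


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        graph = build_overlap_graph(reads, min_len)
        if not graph:
            break  # nothing overlaps any more; return the pieces we have
        (a, b), k = max(graph.items(), key=lambda item: item[1])
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[k:])  # glue b onto a, skipping the shared letters
    return reads


# Three hypothetical reads sampled from the made-up sequence ATGGCGTGCAAT.
print(greedy_assemble(["GCGTGCA", "TGCAAT", "ATGGCGT"]))  # ['ATGGCGTGCAAT']
```

Rebuilding the whole graph after every merge is wasteful but keeps the sketch short. It also hints at why the real problem is so demanding: the number of candidate pairs grows with the square of the number of reads.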
Ideally, a DNA sequencer would simply take
a biological sample and churn out, in order,
the complete nucleotide sequence of the DNA
molecule contained therein. At the moment,
though, no sequencing technology is capable
of this. Instead, modern sequencers produce
a vast number of short strings of letters from
the DNA. Each string is called a sequencing
read, or "read" for short. A modern sequencer
produces reads that are a few hundred or perhaps a few thousand nucleotides long.
The aggregate of the millions of reads generated by the sequencer covers the person's entire genome many times over. For example, the
HiSeq 2000 machine, made by the San Diego-
based biotech company Illumina, is one of the
most powerful sequencers available. It can
sequence roughly 600 billion nucleotides in
about a week, in the form of 6 billion reads of 100 nucleotides each.
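The phrase "many times over" can be made concrete with the figures quoted so far: 6 billion reads of 100 nucleotides each, set against the 3 billion letters of a human genome. The short Python calculation below is a back-of-the-envelope sketch. It assumes every read is usable and that reads fall evenly across the genome, neither of which holds exactly in practice; the function name is hypothetical.

```python
def average_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average number of times each genome position is read."""
    total_nucleotides = num_reads * read_length
    return total_nucleotides / genome_length


# Figures from the text: one HiSeq 2000 run versus one human genome.
depth = average_coverage(
    num_reads=6_000_000_000,
    read_length=100,
    genome_length=3_000_000_000,
)
print(depth)  # 200.0
```

Under these idealized assumptions, a single week-long run would cover each position of one human genome about 200 times on average.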
For comparison, an entire human genome contains 3 billion nucleotides. And the human genome
isn't a particularly long one: a pine tree genome has 24 billion nucleotides.

Thus our first daunting task upon receiving the reads is to stitch them together into
longer, more interpretable units, such as genes. For an organism that has never been fully
sequenced before, like the pine tree, it's a massive challenge to assemble the genome from
scratch, or de novo.

How can we assemble a genome for the first time if we have no knowledge of what the finished
product should look like? Imagine taking 100 copies of the Charles Dickens novel A Tale of
Two Cities and dropping them all into a paper shredder, yielding a huge number of snippets
the size of fortune-cookie slips. The first step to reassembling the novel would be to find
snippets that overlap: "It was the best" and "the best of times," for example. A de novo
assembly algorithm for DNA data does something analogous. It finds reads whose sequences
"overlap" and records those overlaps in a huge diagram called an assembly graph. For a large
genome, this graph can occupy many terabytes of RAM, and completing the genome sequence can
require weeks or months of computation on a world-class supercomputer.

We have an easier job when we're studying a species whose genome has already been assembled.
If we're examining mutations in human cancer genomes, for example, we can download the
previously assembled human genome from the National Institutes of Health website and use it
as a reference. For each read, we find the point where that string of letters best matches
the genome, using an approximate matching algorithm; the process is similar to how your
spell-check program finds the correct spelling based on your misspelled word. The place
where the read sequence most closely matches the reference sequence is our best guess as to
where it belongs. Thanks to the Human Genome Project and similar projects for other species
(mouse, fruit fly, chicken, cow, and thousands of microbial species, for example), many
assembled genomes are available for use as references for this task, which is called read
alignment.

In general, these reference genomes are far too long for brute-force scanning algorithms:
those that simply start at the beginning of the sequence and work their way through the
entire genome, looking for the part that best matches the read in question. Instead,
researchers have lately focused on building an effective genome index, which allows them to
rapidly home in on only those portions of the reference genome that contain good matches.
Just like
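The idea of using an index to home in on promising portions of the reference, and then matching approximately there, can be sketched in Python. The text does not say which index structure is meant, so this example substitutes a plain dictionary of short substrings (k-mers) as a stand-in. It also assumes that mismatches are single-letter substitutions only, with no inserted or deleted letters. All names, parameters, and sequences here are hypothetical.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Map every length-k substring (k-mer) of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align(read: str, reference: str, index: dict, k: int, max_mismatches: int = 1):
    """Return (position, mismatches) for the best placement found, or None.

    Seeds are non-overlapping k-mers of the read. A read with fewer
    mismatches than seeds must contain at least one exact seed, so only
    the positions suggested by the index need to be checked in full.
    """
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset  # where the whole read would begin
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)

    best = None
    for start in sorted(candidates):
        d = hamming(read, reference[start:start + len(read)])
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (start, d)
    return best


reference = "ATGGCGTGCAATCCGTTAGC"   # made-up 20-letter "genome"
index = build_index(reference, k=4)

print(align("GTGCAATC", reference, index, k=4))  # (5, 0): exact match
print(align("GAGCAATC", reference, index, k=4))  # (5, 1): one substitution
print(align("TTTTTTTT", reference, index, k=4))  # None: no seed hits
```

The index is built once per reference and reused for every read, so the cost of scanning the genome is paid a single time rather than once per read.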

De novo sequencing (illustration by L-Dopa)

