COMPARISON OF NEXT-GENERATION SEQUENCING SYSTEMS

LIN LIU, YINHU LI, SILIANG LI, NI HU, YIMIN HE, RAY PONG, DANNI LIN, LIHUA LU, and MAGGIE LAW

10.1 INTRODUCTION

Deoxyribonucleic acid (DNA) was demonstrated to be the genetic material by Oswald Theodore Avery in 1944. Its double-helical structure, composed of four bases, was determined by James D. Watson and Francis Crick in 1953, leading to the central dogma of molecular biology. In most cases, genomic DNA defines the species and the individual, which makes the DNA sequence fundamental to research on the structures and functions of cells and to decoding the mysteries of life [1]. DNA sequencing technologies help biologists and health care providers in a broad range of applications such as molecular cloning, breeding, finding pathogenic genes, and comparative and evolutionary studies. Ideally, DNA sequencing technologies should be fast, accurate, easy to operate, and cheap. In the past thirty years, DNA sequencing technologies and applications have undergone tremendous development and have acted as the engine of the genome era, which is characterized by vast amounts of genome data and, subsequently, a broad range of research areas and applications. It is therefore necessary to look back on the history of sequencing technology development, to review the next-generation sequencing (NGS) systems (454, GA/HiSeq, and SOLiD), to compare their advantages and disadvantages, to discuss their various applications, and to evaluate the recently introduced PGM (personal genome machine) and third-generation sequencing technologies and applications. All of these aspects are described in this paper. Most data and conclusions are from independent users at BGI (Beijing Genomics Institute) who have extensive first-hand experience with these typical NGS systems.

Before discussing the NGS systems, we briefly review the history of DNA sequencing. In 1977, Frederick Sanger developed a DNA sequencing technology based on the chain-termination method (also known as Sanger sequencing), and Walter Gilbert developed another sequencing technology based on chemical modification of DNA and subsequent cleavage at specific bases. Because of its high efficiency and low radioactivity, Sanger sequencing was adopted as the primary technology in the “first generation” of laboratory and commercial sequencing applications [2]. At that time, DNA sequencing was laborious and required radioactive materials. After years of improvement, Applied Biosystems introduced the first automatic sequencing machine (the AB370) in 1987, adopting capillary electrophoresis, which made sequencing faster and more accurate. The AB370 could detect 96 bases at a time and 500 K bases a day, and its read length could reach 600 bases. The current model, the AB3730xl, can output 2.88 M bases per day, and its read length has been able to reach 900 bases since 1995. Having emerged in 1998, the automatic sequencing instruments and associated software based on capillary sequencing machines and Sanger sequencing technology became the main tools for the completion of the Human Genome Project in 2001 [3]. This project greatly stimulated the development of powerful novel sequencing instruments to increase speed and accuracy while simultaneously reducing cost and manpower. In addition, the X-prize accelerated the development of next-generation sequencing (NGS) [4]. NGS technologies differ from the Sanger method in their massively parallel analysis, high throughput, and reduced cost. Although NGS makes genome sequences readily available, the subsequent data analysis and biological interpretation are still the bottleneck in understanding genomes.
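To put the capillary-era throughput figures above in perspective, the following minimal sketch estimates how many instrument-days a single pass over a genome would take at the quoted daily outputs. The genome size of 3 Gb (roughly human-sized) and the function name are assumptions introduced for illustration; they do not come from the text.

```python
def days_for_single_pass(genome_bases: float, bases_per_day: float) -> float:
    """Instrument-days needed to read `genome_bases` once (1x coverage)."""
    if bases_per_day <= 0:
        raise ValueError("bases_per_day must be positive")
    return genome_bases / bases_per_day


# Daily outputs quoted in the text.
DAILY_OUTPUT = {
    "AB370": 500e3,      # 500 K bases per day
    "AB3730xl": 2.88e6,  # 2.88 M bases per day
}

# Assumption for illustration only: a genome of about 3 Gb.
GENOME_BASES = 3e9

for name, rate in DAILY_OUTPUT.items():
    days = days_for_single_pass(GENOME_BASES, rate)
    print(f"{name}: about {days:,.0f} instrument-days for 1x coverage")
```

Under these assumptions a single instrument would need roughly 6,000 days (AB370) or about 1,040 days (AB3730xl) for one pass, which illustrates why higher-throughput instruments were sought.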

Following the Human Genome Project, the 454 system was launched by 454 in 2005, and Solexa released the Genome Analyzer the next year, followed by SOLiD (Sequencing by Oligo Ligation Detection) from Agencourt. These are the three most typical massively parallel sequencing systems of next-generation sequencing (NGS), and they share good performance in throughput, accuracy, and cost compared with Sanger sequencing (shown in Table 1(a)). The founder companies were later purchased by other companies: in 2006 Agencourt was purchased by Applied Biosystems, and in 2007 454 was purchased by Roche, while Solexa was purchased by Illumina. After years of evolution, these three systems exhibit better performance and have their own advantages in terms of read length, accuracy, applications, consumables, manpower requirements, informatics infrastructure, and so forth. The comparison of these three systems is the focus of the later part of this paper (also see Tables 1(a), 1(b), and 1(c)).

TABLE 1: (a) Advantage and mechanism of sequencers. (b) Components and cost of sequencers. (c) Application of sequencers.

(a)

| Sequencer | 454 GS FLX | HiSeq 2000 | SOLiDv4 | Sanger 3730xl |
|---|---|---|---|---|
| Sequencing mechanism | Pyrosequencing | Sequencing by synthesis | Ligation and two-base coding | Dideoxy chain termination |
| Read length | 700 bp | 50SE, 50PE, 101PE | 50 + 35 bp or 50 + 50 bp | 400-900 bp |
| Accuracy | 99.9%* | 98% (100PE) | 99.94% (*raw data) | 99.999% |
| Reads | 1 M | 3 G | 1200~1400 M | |
| Output data/run | 0.7 Gb | 600 Gb | 120 Gb | 1.9~84 Kb |
| Time/run | 24 hours | 3~10 days | 7 days for SE, 14 days for PE | 20 mins~3 hours |
| Advantage | Read length, fast | High throughput | Accuracy | High quality, long read length |
| Disadvantage | Error rate with polybase more than 6, high cost, low throughput | Short read assembly | Short read assembly | High cost, low throughput |

(b)

| Sequencers | 454 GS FLX | HiSeq 2000 | SOLiDv4 | 3730xl |
|---|---|---|---|---|
| Instrument price | Instrument $500,000, $7000 per run | Instrument $690,000, $6000/(30x) human genome | Instrument $495,000, $15,000/100 Gb | Instrument $95,000, about $4 per 800 bp reaction |
| CPU | 2* Intel Xeon X5675 | 2* Intel Xeon X5560 | 8* processor 2.0 GHz | Pentium IV 3.0 GHz |
| Memory | 48 GB | 48 GB | 16 GB | 1 GB |
| Hard disk | 1.1 TB | 3 TB | 10 TB | 280 GB |
| Automation in library preparation | Yes | Yes | Yes | No |
| Other required device | REM e system | cBot system | EZ beads system | No |
| Cost/million bases | $10 | $0.07 | $0.13 | $2400 |

(c)

Sequencers: 454 GS FLX, HiSeq 2000, SOLiDv4, 3730xl
Resequencing: Yes, Yes
De novo: Yes, Yes, Yes
Cancer: Yes, Yes, Yes
Array: Yes, Yes, Yes, Yes
High GC sample: Yes, Yes, Yes
Bacterial: Yes, Yes, Yes
Large genome: Yes, Yes
Mutation detection: Yes, Yes, Yes, Yes

(1) All data are taken from daily average performance runs at BGI. The average daily sequence data output at BGI is about 8 Tb when about 80% of the sequencers (mainly HiSeq 2000) are running.

(2) The reagent cost of 454 GS FLX Titanium is calculated based on the sequencing of 400 bp; the reagent cost of HiSeq 2000 is calculated based on the sequencing of 200 bp; the reagent cost of SOLiDv4 is calculated based on the sequencing of 85 bp.

(3) HiSeq 2000 is more flexible in sequencing types like 50SE, 50PE, or 101PE.

(4) SOLiD has high accuracy especially when coverage is more than 30x, so it is widely used in detecting variations in resequencing, targeted resequencing, and transcriptome sequencing. Lanes can be independently run to reduce cost.
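The per-base costs and per-run figures in Table 1 can be combined into rough per-project estimates. The sketch below is one hedged way to do this arithmetic: the cost and run figures are copied from Table 1(a) and 1(b), while the 3 Gb genome size, the 30x coverage default, the choice of the longest listed run time, and all function and variable names are assumptions made for illustration.

```python
# Figures copied from Table 1; genome size, coverage, and the choice of
# run duration are illustrative assumptions, not values from the table.
COST_PER_MB_USD = {              # Table 1(b): cost per million bases
    "454 GS FLX": 10.0,
    "HiSeq 2000": 0.07,
    "SOLiDv4": 0.13,
    "3730xl": 2400.0,
}
RUNS = {                         # Table 1(a): (output in Gb per run, days per run)
    "454 GS FLX": (0.7, 1.0),    # 24 hours
    "HiSeq 2000": (600.0, 10.0), # assumes the upper end of 3~10 days
    "SOLiDv4": (120.0, 14.0),    # assumes the 14-day paired-end run
}


def reagent_cost_usd(sequencer: str, genome_gb: float = 3.0,
                     coverage: float = 30.0) -> float:
    """Reagent cost of sequencing `genome_gb` gigabases at `coverage`-fold depth."""
    megabases = genome_gb * 1000.0 * coverage
    return megabases * COST_PER_MB_USD[sequencer]


def daily_output_gb(sequencer: str) -> float:
    """Average output per instrument-day, spreading one run over its duration."""
    output_gb, days = RUNS[sequencer]
    return output_gb / days


for name in COST_PER_MB_USD:
    print(f"{name}: about ${reagent_cost_usd(name):,.0f} for 30x of a 3 Gb genome")
for name in RUNS:
    print(f"{name}: about {daily_output_gb(name):.1f} Gb per instrument-day")
```

Under these assumptions the HiSeq 2000 estimate comes to about $6,300, which is of the same order as the $6000 per 30x human genome listed in Table 1(b). The reagent costs themselves depend on the read lengths given in note (2), so the results should be read as order-of-magnitude figures only.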