
CLUSTERING OF RELEVANT GO TERMS AND DIFFERENTIAL EXPRESSION

The transcripts annotated in the +N and -N transcriptomes were first classified based on Gene Ontology (GO) terms. In the +N and -N datasets, respectively, 6,846 and 7,473 transcripts were classified into 306 and 218 broader GO term categories in accordance with the Gene Ontology Consortium [26]. An enrichment analysis of the broader GO terms was performed using the modified Fisher's Exact test in Blast2GO to quantitatively compare the distribution of differentially enriched GO terms between the +N case and the entire data set (Figure 4A), and between the -N case and the entire data set (Figure 4B). The functional categories enriched under +N were distinctly different from those enriched under the -N condition. In the +N case (Figure 4A), functional categories linked to carbon fixation, photosynthesis, protein machinery, and cellular growth were highly enriched compared to the -N condition, reflecting the higher growth rate, higher cell mass, and increased chlorophyll content observed in +N. Under -N conditions, genes associated with carboxylic acid and lipid biosynthetic processes, NADPH regeneration, the pentose-phosphate pathway, phospholipid metabolic process, and lipid transport showed greater transcript enrichment than the overall dataset (Figure 4B). These enriched GO terms correlated directly with the observed increase in lipid accumulation in -N cells. Other major categories identified as significantly expressed under the -N condition included the synthesis of value-added products such as terpenoids, pigments, and vitamins, as well as cellular response to nitrogen starvation, nitrate metabolic process, and nitrate assimilation (Figure 4B). Genes involved in the latter three functional categories were expressed exclusively in the nitrogen-limited cells.


FIGURE 4: Overrepresentation analysis of selected significant GO terms. (A) contains results for +N versus the full dataset and (B) contains results for -N versus the full dataset.
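The idea behind a GO term enrichment test can be illustrated with a minimal sketch. It computes a plain one-sided Fisher's exact p-value (the upper tail of the hypergeometric distribution) for one GO term in a condition subset versus the full dataset. This is not Blast2GO's modified test, and the function name and the counts in the example are hypothetical.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact p-value (hypergeometric upper tail).

    k: transcripts in the condition subset annotated with the GO term
    n: size of the condition subset
    K: transcripts in the full dataset annotated with the GO term
    N: size of the full dataset
    """
    upper = min(n, K)
    total = comb(N, n)
    # Probability of seeing k or more annotated transcripts by chance
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical counts, for illustration only
p_value = fisher_enrichment_p(k=40, n=500, K=60, N=2000)
print(f"p = {p_value:.3g}")
```

A small p-value suggests that the term is overrepresented in the subset. In practice a multiple-testing correction across all GO terms would also be needed.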

GC DEPTH DISTRIBUTION

As Figure 1 shows, the GC depth distribution is more even for Ion Torrent. With Ion Torrent, the sequencing depth stays similar across GC contents from 63% to 73%. With HiSeq 2000, by contrast, the average sequencing depth is 4x at 60% GC content but falls to 3x at 70% GC content.
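A GC depth distribution of this kind can be approximated by binning genome windows by GC content and averaging the depth in each bin. The sketch below assumes the input is a list of (window sequence, mean depth) pairs; the function names, the bin width, and the example windows are illustrative and do not come from the original analysis.

```python
from collections import defaultdict

def gc_content(seq):
    """Return the GC content of a sequence as a percentage."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    return 100.0 * gc / len(seq) if seq else 0.0

def mean_depth_by_gc(windows, bin_width=5):
    """Average sequencing depth per GC bin.

    windows: iterable of (sequence, mean_depth) pairs (hypothetical input).
    Returns {bin_start_percent: mean_depth}.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for seq, depth in windows:
        bin_start = int(gc_content(seq) // bin_width) * bin_width
        sums[bin_start][0] += depth
        sums[bin_start][1] += 1
    return {b: total / count for b, (total, count) in sorted(sums.items())}

# Toy windows, for illustration only
example = [("GGCC", 4.0), ("GCGC", 2.0), ("ATAT", 1.0)]
print(mean_depth_by_gc(example))
```

A flat profile across bins indicates little GC bias, while a drop in depth at high GC content corresponds to the pattern described for HiSeq 2000.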

Ion Torrent has already released the Ion 314 and 316 chips and planned to launch the Ion 318 chip in late 2011. The chips differ in the number of wells, so chips with more wells yield more data within the same sequencing time. The Ion 318 chip enables the production of >1 Gb of data in 2 hours. Read length is expected to increase to >400 bp in 2012.

NITROGEN, BIOMASS AND BIOMOLECULE ANALYSIS

The nitrate concentration of the culture media was determined daily by passage through a 0.2 µm pore-size filter and analysis on an ion chromatograph equipped with conductivity detection [50]. Microalgae growth was monitored daily by measuring the optical density of the cultures at 730 nm (OD730) using a spectrophotometer (HP 8453, Hewlett Packard, Palo Alto, CA, USA). Biomass samples for analysis of cellular constituents (starch, proteins, chlorophyll, and lipids) and for extraction of total RNA were harvested on day 11 by centrifugation at 10,000 g for 5 min at 4°C. Cell pellets were snap-frozen in liquid nitrogen and immediately transferred to -80°C until further analysis. The dry cell weight (DCW) of cultures was determined by filtering an aliquot of each culture on pre-weighed 0.45 µm pore-size filters and drying the filters at 90°C until constant weight was reached. For analysis of starch content, 10^9 cells ml^-1 were suspended in deionized water in 2 ml screw-cap tubes containing 0.3 g of 0.5 mm glass beads, and disrupted by two cycles of bead-beating at 4,800 oscillations per minute for 2 min, followed by three freeze/thaw cycles. The suspension was then incubated in a boiling water bath for 3 min and autoclaved for 1 hour at 121°C to convert starch granules into a colloidal solution. After samples were cooled to 60°C, cell debris was removed by centrifugation at 4,000 g for 5 min. The concentration of starch in the supernatant was measured enzymatically using the Sigma Starch Assay Kit (amylase/amyloglucosidase method, Sigma-Aldrich, Saint Louis, MO, USA) according to the manufacturer's instructions. Chlorophyll a and b were measured by the N,N'-dimethylformamide method and calculated from spectrophotometric absorption measurements at 603, 647, and 664 nm, as previously reported [51,52]. The total protein content of cells was determined with minor modifications to the original Bradford method [53] as described in [54]. Starch, chlorophyll, and protein measurements were performed at least in triplicate, and averages and standard deviations are reported as a percent of DCW.

The total lipid content of the cells was determined using a modified Bligh and Dyer method utilizing 2:1 chloroform:methanol [55]. To determine the profile of fatty acids, lipid samples were transesterified [56] and the resulting fatty acid methyl esters (FAME) were analyzed using a liquid chromatography-mass spectrometer (Varian 500-MS, 212-LC pumps, Agilent Technologies, Santa Clara, CA, USA) equipped with a Waters normal phase Atlantis® HILIC silica column (2.1 x 150 mm, 3 µm pore size) (Waters, Milford, MA, USA) and atmospheric pressure chemical ionization [56]. Identification was based upon the retention time and the mass-to-charge ratio of standard FAME mixtures. The sum of FAME was used as a proxy for TAG content [22].

GENE IDENTIFIER CONVERSION

Due to the existence of several protein identifier types (FM3.1, FM4, Au5, Au10.2), different identifiers are associated with an individual protein within the Chlamydomonas genome. In order to extend annotations from one identifier type to another, matching protein identifiers are deduced by sequence similarity filtering for mutual best hits between identifiers using BLAST. Matching identifiers with 100% sequence coverage are kept, and the rest of the mutual best hits are filtered to include only those proteins with matches with at least 75% coverage. Potential ambiguities involving proteins similar to multiple other proteins are resolved by considering only the reciprocal best hit from the BLAST query in the opposite direction. The information derived by this analysis is used to convert gene identifiers between different types, which allows the Algal Annotation Tool to work with multiple protein identifier types.
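The matching procedure described above can be sketched as a reciprocal best hit filter with a coverage threshold. The sketch assumes BLAST results have already been parsed into (query, subject, bit score, coverage percent) tuples and that hits are ranked by bit score; the tuple layout, the function names, and the identifiers in the example are hypothetical rather than taken from the tool itself.

```python
def best_hits(hits, min_coverage=75.0):
    """Map each query to its best-scoring subject above the coverage cutoff.

    hits: iterable of (query, subject, bitscore, coverage_percent) tuples.
    """
    best = {}
    for query, subject, score, coverage in hits:
        if coverage < min_coverage:
            continue  # drop matches below the coverage threshold
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a, min_coverage=75.0):
    """Keep only pairs that are each other's best hit in both directions."""
    forward = best_hits(a_vs_b, min_coverage)
    reverse = best_hits(b_vs_a, min_coverage)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}

# Hypothetical identifiers and scores, for illustration only
a_vs_b = [
    ("FM4_1", "Au5_9", 200.0, 100.0),
    ("FM4_1", "Au5_3", 150.0, 90.0),
    ("FM4_2", "Au5_3", 180.0, 80.0),
    ("FM4_3", "Au5_7", 300.0, 60.0),
]
b_vs_a = [
    ("Au5_9", "FM4_1", 200.0, 100.0),
    ("Au5_3", "FM4_2", 180.0, 80.0),
    ("Au5_7", "FM4_3", 300.0, 60.0),
]
mapping = reciprocal_best_hits(a_vs_b, b_vs_a)
print(mapping)
```

The resulting dictionary can serve as a lookup table for converting identifiers from one type to the other. Pairs below the coverage cutoff, such as the third query here, are left unmapped.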

CULTIVATION AND LIPID EXTRACTION PROPERTIES OF MICROALGAE

High lipid productivity is not the only factor that should be considered early during strain selection. Outdoor cultivation should determine whether the selected microalgae are robust enough to withstand variable local climatic conditions and whether they can dominate a culture. This is particularly important for open pond systems where other algae strains, grazers or viruses may easily contaminate the culture. For this purpose, many phycologists recommend the use of local dominant species, even if their lipid productivity may not be as high as other species [43].

TABLE 4: Examples of potential microalgae species for omega-3 production [48]. Eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA) are given as % of total fatty acids.

Isochrysis galbana: 0.9
Nannochloropsis sp.: 30.1
Chaetoceros calcitrans: 34
Tetraselmis suecica: 6.2
Chaetoceros muelleri: EPA 12.8; DHA 0.8
Pavlova salina: EPA 19.1; DHA 1.5
Skeletonema costatum: EPA 40.7; DHA 2.3
Porphyridium cruentum: 30.7
Crypthecodinium cohnii: 30
Chroomonas salina: EPA 12.9; DHA 7.1
Chaetoceros constrictus: EPA 18.8; DHA 0.6
Tetraselmis viridis: 6.7

Harvesting capability is another important feature of microalgae with biodiesel potential. Harvesting or dewatering can be best achieved through settling, flocculation or froth flotation [49,50]. For example, many microalgae settle under adverse conditions, and this can be tested under small scale conditions [51]. Lipid extraction efficiency from microalgae depends on the residual water content after drying and, in particular, on the structure of the cell wall. For example, Nannochloropsis sp. is regarded as a highly productive microalga with strong potential for large-scale biodiesel production [43], but ideally requires pretreatment to open up its highly rigid cell walls for higher lipid extraction efficiency.

9.3 CONCLUSIONS

Development of biodiesel production from microalgae is an important step towards addressing the limitations posed by current first-generation biodiesel crops. Microalgae, once developed for commercial biodiesel production, may offer many economic and environmental advantages. Biodiesel production from microalgae is currently in the research phase, but is being developed to commercial scale in many countries. Finding promising microalgae for commercial cultivation is multifaceted and challenging because particular microalgae strains have different requirements in terms of nutrient intake, environmental and culturing conditions, and lipid extraction technology. However, the diversity of lipid-producing microalgae species is one of the major advantages of this group of organisms and is likely to lead to the selection of suitable algae crops for algal biodiesel production in different regions. A combination of conventional and modern techniques is likely the most efficient route from isolation to large-scale cultivation (Figure 2). Careful initial analyses and far-sighted selection of microalgae, with a view towards downstream processing and large-scale production with potential value-added products, are an important prerequisite for domesticating and developing algae crops for biodiesel production.

FATTY ACID BIOSYNTHESIS PATHWAY IS UP-REGULATED AND THE β-OXIDATION PATHWAY IS REPRESSED UNDER NITROGEN-LIMITING CONDITIONS

The majority of genes governing fatty acid biosynthesis were identified as being overexpressed in nitrogen-limited cells, as shown at the global metabolic pathway level and in the fatty acid biosynthesis module. The fold-changes and abundances of identified transcripts for the components of fatty acid biosynthesis at the gene level are presented in Figure 5A. The first step in fatty acid biosynthesis is the conversion of acetyl-CoA into malonyl-CoA by addition of carbon dioxide. This reaction is the first committed step in the pathway and is catalyzed by Acetyl-CoA Carboxylase (ACCase). While the gene encoding ACCase was repressed under the -N condition, the biotin-containing subunit of ACCase, biotin carboxylase (BC), was significantly up-regulated in response to nitrogen starvation. The BC catalyzes the ATP-dependent carboxylation of the biotin subunit and is part of the heteromeric ACCase that is present in the plastid, the site of de novo fatty acid biosynthesis [27]. To proceed with fatty acid biosynthesis, malonyl-CoA is transferred to an acyl-carrier protein (ACP) by the action of malonyl-CoA ACP transacylase (MAT). This step is followed by a round of condensation, reduction, dehydration, and again reduction reactions catalyzed by beta-ketoacyl-ACP synthase (KAS), beta-ketoacyl-ACP reductase (KAR), beta-hydroxyacyl-ACP dehydrase (HAD), and enoyl-ACP reductase (EAR), respectively. The genes coding for MAT, KAS, HAD, and EAR were up-regulated, whereas the KAR encoding gene was repressed in -N cells. The synthesis ceases after six or seven cycles when the number of carbon atoms reaches sixteen (C16:0-[ACP]) or eighteen (C18:0-[ACP]). ACP residues are then cleaved off by the thioesterases oleoyl-ACP hydrolase (OAH) and Acyl-ACP thioesterase A (FatA), generating the end products of fatty acid synthesis (i.e. palmitic (C16:0) and stearic (C18:0) acids). Genes coding for these thioesterases, i.e. FatA and OAH, were overexpressed in -N cells. The up-regulation of these thioesterase encoding genes, as previously reported in E. coli and the microalga P. tricornutum, is associated with reducing the feedback inhibition that partially controls the production rate of fatty acid biosynthesis


FIGURE 5: Differential expression of genes involved in (A) fatty acid biosynthesis; (B) triacylglycerol biosynthesis; (C) β-oxidation; and (D) starch biosynthesis. Pathways were reconstructed based on the de novo assembly and quantitative annotation of the N. oleoabundans transcriptome. (A) Enzymes include: ACC, acetyl-CoA carboxylase (EC: 6.4.1.2); MAT, malonyl-CoA ACP transacylase (EC: 2.3.1.39); KAS, beta-ketoacyl-ACP synthase (KAS I, EC: 2.3.1.41; KAS II, EC: 2.3.1.179; KAS III, EC: 2.3.1.180); KAR, beta-ketoacyl-ACP reductase (EC: 1.1.1.100); HAD, beta-hydroxyacyl-ACP dehydrase (EC: 4.2.1.-); EAR, enoyl-ACP reductase (EC: 1.3.1.9); AAD, acyl-ACP desaturase (EC: 1.14.19.2); OAH, oleoyl-ACP hydrolase (EC: 3.1.2.14); FatA, Acyl-ACP thioesterase A (EC: 3.1.2.-); Δ12D, Δ12(ω6)-desaturase (EC: 1.14.19.6); Δ15D, Δ15(ω3)-desaturase (EC: 1.14.19.-); (B) Enzymes include: GK, glycerol kinase (EC: 2.7.1.30); GPAT, glycerol-3-phosphate O-acyltransferase (EC: 2.3.1.15); AGPAT, 1-acyl-sn-glycerol-3-phosphate O-acyltransferase (EC: 2.3.1.51); PP, phosphatidate phosphatase (EC: 3.1.3.4); DGAT, diacylglycerol O-acyltransferase (EC: 2.3.1.20); and PDAT, phospholipid:diacylglycerol acyltransferase (EC: 2.3.1.158); (C) Enzymes include: ACS, acyl-CoA synthetase (EC: 6.2.1.3); ACOX1, acyl-CoA oxidase (EC: 1.3.3.6); ECH, enoyl-CoA hydratase (EC: 4.2.1.17); HADH, 3-hydroxyacyl-CoA dehydrogenase (EC: 1.1.1.35); ACAT, acetyl-CoA C-acetyltransferase (EC: 2.3.1.16, 2.3.1.9); (D) Enzymes include: PGM, phosphoglucomutase (EC: 5.4.2.2); AGPase, ADP-glucose pyrophosphorylase (EC: 2.7.7.27); SS, starch synthase (EC: 2.4.1.21); BE, α-1,4-glucan branching enzyme (EC: 2.4.1.18); and HXK, hexokinase (EC: 2.7.1.1). Starch catabolism enzymes include: α-AMY, α-amylase (EC: 3.2.1.1); O1,6G, oligo-1,6-glucosidase (EC: 3.2.1.10); β-AMY, β-amylase (EC: 3.2.1.2); and SPase, starch phosphorylase (EC: 2.4.1.1). Enzymes of ethanol fermentation via pyruvate include: PDC, pyruvate decarboxylase (EC: 4.1.1.1); and ADH, alcohol dehydrogenase (EC: 1.1.1.1). The enzymes aceE, pyruvate dehydrogenase E1 component (EC: 1.2.4.1); aceF, pyruvate dehydrogenase E2 component (EC: 2.3.1.12); and pdhD, dihydrolipoamide dehydrogenase (EC: 1.8.1.4), transform pyruvate into acetyl-CoA. Key enzymes are shown with an asterisk (*) next to the enzyme abbreviations, and dashed arrows denote reaction(s) for which the enzymes are not shown. All presented fold changes are statistically significant, q value < 0.05.

[7,8], and results in the overproduction of fatty acids [9]. It has also been suggested that an increase in FatA gene expression and the associated acyl-ACP hydrolysis may aid in increased fatty acid transport from the chloroplast to the endoplasmic reticulum, where TAG assembly occurs [10,28]. Finally, to supply reducing equivalents via NADPH to power fatty acid biosynthesis, genes encoding enzymes of the pentose phosphate pathway were strongly up-regulated in the -N condition (Table 2).

The altered expression of genes associated with the generation of double bonds in fatty acids reflects the observed increase in the proportion of unsaturated fatty acids (Figure 1D) and the enrichment of C18:1 during nitrogen limitation. The acyl-ACP desaturase (AAD), which introduces one double bond into C16:0/C18:0, and delta-15 desaturase, which converts C18:2 to C18:3, were significantly up-regulated in the -N case, whereas the delta-12 desaturase, which catalyzes the formation of C18:2 from C18:1, was repressed during nitrogen limitation.

Under nitrogen limitation, 10 of the 13 genes associated with fatty acid degradation (α- and β-oxidation pathways for saturated and unsaturated acids) were significantly repressed. Figure 5C shows the typical β-oxidation pathway for saturated fatty acids, while Table 3 lists expression levels for additional peroxisomal genes associated with fatty acid oxidation that are not shown in Figure 5C. Before undergoing oxidative degradation, fatty acids are activated through esterification to Coenzyme A. The activation reaction is catalyzed by acyl-CoA synthetase (ACSL), which was up-regulated in -N cells. The acyl-CoA enters the β-oxidation pathway and undergoes four enzymatic reactions in multiple rounds. The first three steps of the pathway (oxidation, hydration, and again oxidation of acyl-CoA) are catalyzed by acyl-CoA oxidase (ACOX1), enoyl-CoA hydratase (ECH), and hydroxyacyl-CoA dehydrogenase (HADH), respectively. In the last step of the pathway, acetyl-CoA acetyltransferase (ACAT) catalyzes the cleavage of one acetyl-CoA, yielding a fatty acyl-CoA that is 2 carbons shorter than the original acyl-CoA. The cycle continues until all the carbons are released as acetyl-CoA. The expression levels of ECH and HADH were unchanged, whereas the genes encoding ACOX1 and ACAT, which catalyze the first and last reactions in the cycle, were significantly repressed in -N cells.
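The arithmetic of the cycle described above can be made concrete with a short sketch. It assumes a saturated fatty acyl-CoA with an even number of carbons, for which each round removes two carbons and the final round splits a four-carbon unit into two acetyl-CoA; the function name is illustrative.

```python
def beta_oxidation_yield(n_carbons):
    """Return (cycles, acetyl_coa) for a saturated even-chain fatty acyl-CoA.

    Each cycle shortens the chain by 2 carbons; the last cycle cleaves a
    4-carbon acyl-CoA into two acetyl-CoA molecules.
    """
    if n_carbons < 4 or n_carbons % 2 != 0:
        raise ValueError("expects an even chain length of at least 4 carbons")
    acetyl_coa = n_carbons // 2
    cycles = acetyl_coa - 1
    return cycles, acetyl_coa

for name, carbons in [("palmitic acid (C16:0)", 16), ("stearic acid (C18:0)", 18)]:
    cycles, acetyl = beta_oxidation_yield(carbons)
    print(f"{name}: {cycles} cycles, {acetyl} acetyl-CoA")
```

For palmitic acid this gives 7 cycles and 8 acetyl-CoA, and for stearic acid 8 cycles and 9 acetyl-CoA. Unsaturated and odd-chain fatty acids need additional enzymes and are outside this sketch.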

TABLE 2: N. oleoabundans genes involved in the pentose phosphate pathway

Pentose phosphate pathway Log2FC

Phosphogluconate dehydrogenase (decarboxylating) (PGD, EC: 1.1.1.44) -1.13

Glucose-6-phosphate dehydrogenase (G6PD, EC: 1.1.1.49) -1.41

Transketolase (tktA, EC: 2.2.1.1) 2.55

Transaldolase (talA, EC: 2.2.1.2) -0.66

6-phosphofructokinase (PFK, EC: 2.7.1.11) -0.45

Gluconokinase (gntK, EC: 2.7.1.12) 0.10

Ribokinase (rbsK, EC: 2.7.1.15) 0.11

Ribose-phosphate diphosphokinase (PRPS, EC: 2.7.6.1) -0.10

Gluconolactonase (GNL, EC: 3.1.1.17) -0.67

6-phosphogluconolactonase (PGLS, EC: 3.1.1.31) 0.07

Fructose-bisphosphatase (FBP, EC: 3.1.3.11) -0.24

Fructose-bisphosphate aldolase (fbaB, EC: 4.1.2.13) 0.17

Ribulose-phosphate 3-epimerase (RPE, EC: 5.1.3.1) -0.11

Ribose-5-phosphate isomerase (rpiA, EC: 5.3.1.6) -0.34

Glucose-6-phosphate isomerase (GPI, EC: 5.3.1.9) -1.21

Phosphoglucomutase (pgm, EC: 5.4.2.2) -0.83

Negative Log2FC values represent up-regulation under nitrogen limitation. All presented fold changes are statistically significant, q value < 0.05.
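Because the sign convention in these tables is easy to misread, the following sketch converts a Log2FC value into a linear fold change and a direction. It follows the convention stated in the table footnote (negative values mean up-regulation under nitrogen limitation); the function name is illustrative.

```python
def describe_log2fc(log2fc):
    """Convert a Log2FC value into (linear fold change, direction under -N).

    Convention from the table footnote: negative Log2FC values represent
    up-regulation under nitrogen limitation.
    """
    fold = 2 ** abs(log2fc)
    if log2fc < 0:
        direction = "up"
    elif log2fc > 0:
        direction = "down"
    else:
        direction = "unchanged"
    return fold, direction

# Values taken from Table 2: G6PD (-1.41) and transketolase (2.55)
for gene, value in [("G6PD", -1.41), ("tktA", 2.55)]:
    fold, direction = describe_log2fc(value)
    print(f"{gene}: {fold:.2f}-fold {direction} under -N")
```

A Log2FC of -1 therefore corresponds to a two-fold increase in transcript abundance under nitrogen limitation.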

TABLE 3: N. oleoabundans genes involved in catabolic pathways related to peroxisomal fatty acid oxidation, lysosomal lipases, and the regulation of autophagy

Enzyme encoding gene Log2FC

Peroxisome

α-oxidation

2-hydroxyacyl-CoA lyase 1 (HACL1, EC: 4.1.-.-) 0.35

Unsaturated fatty acid β-oxidation

Peroxisomal 2,4-dienoyl-CoA reductase (DECR2, EC: 1.3.1.34) 0.21

Δ(3,5)-Δ(2,4)-dienoyl-CoA isomerase (ECH1, EC: 5.3.3.-) -0.27

ATP-binding cassette, subfamily D (ALD), member 1 (ABCD1) 0.25

Long-chain acyl-CoA synthetase (ACSL, EC: 6.2.1.3) 0.25

Other oxidation

Peroxisomal 3,2-trans-enoyl-CoA isomerase (PECI, EC: 5.3.3.8) 0.59

Carnitine O-acetyltransferase (CRAT, EC: 2.3.1.7) 0.30

NAD+ diphosphatase (NUDT12, EC: 3.6.1.22) 0.47

Glycerolipid metabolism

Triacylglycerol lipase (EC: 3.1.1.3) 0.33

Acylglycerol lipase (MGLL, EC: 3.1.1.23) -0.13

Glycerophospholipid metabolism

Phospholipase A1 (plda, EC: 3.1.1.32) -1.26

Phospholipase A2 (PLA2G, EC: 3.1.1.4) -0.31

Phospholipase C (plcc, EC: 3.1.4.3) -0.10

Lysosome Lipases

Lysosomal acid lipase (LIPA, EC: 3.1.1.13) -0.48

Lysophospholipase III (LYPLA3, EC: 3.1.1.5) 0.20

Regulation of autophagy

Unc51-like kinase (ATG1, EC: 2.7.11.1) -0.53

5′-AMP-activated protein kinase, catalytic alpha subunit (snrk1, PRKAA) -0.05

Vacuolar protein 8 (VAC8) 0.13

Beclin 1 (BECN1) -0.59

Phosphatidylinositol 3-kinase (VPS34, EC: 2.7.1.137) -1.26

Phosphoinositide-3-kinase, regulatory subunit 4, p150 (VPS15, EC: 2.7.11.1) 0.11

Autophagy-related protein 3 (ATG3) 0.11

Autophagy-related protein 4 (ATG4) -0.16

Autophagy-related protein 5 (ATG5) -0.27

Autophagy-related protein 7 (ATG7) 0.17

Autophagy-related protein 8 (ATG8) -0.50

Autophagy-related protein 12 (ATG12) -0.58

Negative log2 fold change (Log2FC) values represent up-regulation under nitrogen limitation. All presented fold changes are statistically significant, q value < 0.05.

MISEQ FROM ILLUMINA

MiSeq, which still uses SBS technology, was launched by Illumina. It integrates cluster generation, SBS, and data analysis in a single instrument and can go from sample to answer (analyzed data) within a single day (in as few as 8 hours). Nextera, TruSeq, and Illumina's reversible terminator-based sequencing-by-synthesis chemistry were used in this instrument. Its strengths are high data integrity and a broad range of applications, including amplicon sequencing, clone checking, ChIP-Seq, and small genome sequencing. It is also flexible, supporting runs from single 36 bp reads (120 MB output) up to 2 x 150 bp paired-end reads (1-1.5 GB output). Owing to its significantly longer reads, the resulting data perform better in contig assembly than HiSeq data (data not shown). MiSeq sequencing results are shown in Table 3, and PGM is compared with MiSeq in Table 4.

TABLE 3: MiSeq 150PE data.

Sample | GC | Q20 | Q30
Human HPV | 33.57; 33.62 | 98.26; 95.52 | 93.64; 88.52
Bacteria | 61.33; 61.43 | 90.84; 83.86 | 78.46; 69.04

(1) The data in the table includes both read 1 and read 2 from paired-end sequencing.

(2) GC represents the GC content of libraries.

(3) The Q20 value is the average Q20 of all bases in a read, that is, the proportion of bases with an error probability of no more than 1 in 100. The Q30 value is the average Q30 of all bases in a read, that is, the proportion of bases with an error probability of no more than 1 in 1,000.
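The relationship between a quality score and an error probability follows the standard Phred definition, Q = -10 log10(P). The sketch below computes the error probability for a given score and the fraction of bases in a FASTQ quality string at or above a threshold. It assumes the Phred+33 ASCII encoding; the function names and the example quality string are illustrative.

```python
def phred_to_error_prob(q):
    """Error probability implied by a Phred quality score Q = -10*log10(P)."""
    return 10 ** (-q / 10)

def fraction_at_least(quality_string, threshold, offset=33):
    """Fraction of bases whose Phred score is >= threshold.

    Assumes Phred+33 ASCII encoding of the FASTQ quality line.
    """
    scores = [ord(char) - offset for char in quality_string]
    return sum(score >= threshold for score in scores) / len(scores)

# Toy quality string: 'I' = Q40, '5' = Q20, '+' = Q10
quality = "II5+"
print(phred_to_error_prob(20))          # error probability at Q20
print(fraction_at_least(quality, 20))   # share of bases at Q20 or better
print(fraction_at_least(quality, 30))   # share of bases at Q30 or better
```

Applied to every read in a run and expressed as a percentage, this kind of calculation yields Q20 and Q30 values comparable to those reported in the table.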

TABLE 4: The comparison between PGM and MiSeq.

 | PGM | MiSeq
Output | 10 MB-100 MB | 120 MB-1.5 GB
Read length | ~200 bp | Up to 2 x 150 bp
Sequencing time | 2 hours for 1 x 200 bp | 3 hours for 1 x 36 bp single read; 27 hours for 2 x 150 bp paired-end read
Sample preparation time | 8 samples in parallel, less than 6 hrs | As fast as 2 hrs, with 15 minutes hands-on time
Sequencing method | Semiconductor technology with a simple sequencing chemistry | Sequencing by synthesis (SBS)
Potential for development | Various parameters (read length, cycle time, accuracy, etc.) | Limited factors, mainly flow cell surface size, insert sizes, and how tightly clusters can be packed
Input amount | µg | ng (Nextera)
Data analysis | Off instrument | On instrument

RNA EXTRACTION, CONSTRUCTION OF CDNA LIBRARIES AND DNA SEQUENCING

To control for cell synchronization, cells for the +N and -N conditions were harvested at the same time of day. Total RNA was extracted and purified separately from each of the two nitrogen-replete and the two nitrogen-limited cultures using the RNeasy Lipid Tissue Mini Kit (Qiagen, Valencia, CA, USA). The quality of purified RNA was determined on an Agilent 2100 bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). Isolation of mRNA from total RNA was carried out using two rounds of hybridization to Dynal oligo(dT) magnetic beads (Invitrogen, Carlsbad, CA, USA). Aliquots from mRNA samples were used for construction of the cDNA libraries using the mRNA-Seq Kit supplied by Illumina (Illumina, Inc., San Diego, CA, USA). Briefly, the mRNA was fragmented in the presence of divalent cations at 94°C, and subsequently converted into double-stranded cDNA following first- and second-strand cDNA synthesis using random hexamer primers. After polishing the ends of the cDNA using T4 DNA polymerase and Klenow DNA polymerase for 30 min at 20°C, a single adenine base was added to the 3' ends of cDNA molecules. Illumina mRNA-Seq Kit specific adaptors were then ligated to cDNA 3' ends. Next, the cDNA was PCR-amplified for 15 cycles, amplicons were purified (QIAquick PCR purification kit, Qiagen Inc., Valencia, CA, USA), and the size and concentration of the cDNA libraries were determined on an Agilent 2100 bioanalyzer. Each of the four cDNA libraries (two nitrogen-depleted and two nitrogen-replete) was layered on a separate Illumina flow cell and sequenced at the Yale University Center for Genome Analysis using Illumina HiSeq 100 bp single-end sequencing. An additional lane was dedicated to sequencing PhiX control libraries to provide internal calibration and to optimize base calling. The sequence data produced in this study can be accessed at NCBI's Sequence Read Archive with the accession number SRA048723.

WEB-BASED INTERFACE AND UPDATES

The web interface of the Algal Functional Annotation Tool consists of a set of portals that give access to the different types of analyses available. Results are shown within expandable/collapsible HTML tables that display annotation information along with the statistical results of the analysis. When expanded, the results table shows which gene identifiers contain a specific annotation along with further information regarding matching gene identifiers and BLAST E-values. Updates to the Algal Functional Annotation Tool are semi-automated using a set of Perl scripts that parse and process updated flat files from the various integrated annotation databases at regular intervals. Currently, functional data from the primary annotation databases is set to be updated every 4 months.

COMPARISON OF NEXT-GENERATION SEQUENCING SYSTEMS

LIN LIU, YINHU LI, SILIANG LI, NI HU, YIMIN HE, RAY PONG, DANNI LIN, LIHUA LU, and MAGGIE LAW

10.1 INTRODUCTION

Deoxyribonucleic acid (DNA) was demonstrated to be the genetic material by Oswald Theodore Avery in 1944. Its double helical structure, composed of four bases, was determined by James D. Watson and Francis Crick in 1953, leading to the central dogma of molecular biology. In most cases, genomic DNA defines the species and the individual, which makes the DNA sequence fundamental to research on the structures and functions of cells and to decoding the mysteries of life [1]. DNA sequencing technologies can help biologists and health care providers in a broad range of applications such as molecular cloning, breeding, finding pathogenic genes, and comparative and evolutionary studies. DNA sequencing technologies ideally should be fast, accurate, easy to operate, and cheap. In the past thirty years, DNA sequencing technologies and applications have undergone tremendous development and act as the engine of the genome era, which is characterized by vast amounts of genome data and, subsequently, a broad range of research areas and multiple applications. It is necessary to look back on the history of sequencing technology development to review the NGS systems (454, GA/HiSeq, and SOLiD), to compare their advantages and disadvantages, to discuss the various applications, and to evaluate the recently introduced PGM (personal genome machines) and third-generation sequencing technologies and applications. All of these aspects will be described in this paper. Most data and conclusions are from independent users who have extensive first-hand experience with these typical NGS systems at BGI (Beijing Genomics Institute).

Before discussing the NGS systems, we briefly review the history of DNA sequencing. In 1977, Frederick Sanger developed a DNA sequencing technology based on the chain-termination method (also known as Sanger sequencing), and Walter Gilbert developed another sequencing technology based on chemical modification of DNA and subsequent cleavage at specific bases. Because of its high efficiency and low radioactivity, Sanger sequencing was adopted as the primary technology in the "first generation" of laboratory and commercial sequencing applications [2]. At that time, DNA sequencing was laborious and radioactive materials were required. After years of improvement, Applied Biosystems introduced the first automatic sequencing machine (namely AB370) in 1987, adopting capillary electrophoresis, which made sequencing faster and more accurate. AB370 could detect 96 bases at a time and 500 K bases a day, and the read length could reach 600 bases. The current model, AB3730xl, has been able to output 2.88 M bases per day with read lengths of up to 900 bases since 1995. From 1998, automatic sequencing instruments and associated software based on capillary sequencing machines and Sanger sequencing technology became the main tools for the completion of the human genome project in 2001 [3]. This project greatly stimulated the development of powerful novel sequencing instruments to increase speed and accuracy while simultaneously reducing cost and manpower. The X-prize also accelerated the development of next-generation sequencing (NGS) [4]. The NGS technologies differ from the Sanger method in their massively parallel analysis, high throughput, and reduced cost. Although NGS makes genome sequences readily available, the subsequent data analysis and biological interpretation remain the bottleneck in understanding genomes.

Following the human genome project, the 454 system was launched by 454 in 2005, and Solexa released the Genome Analyzer the next year, followed by SOLiD (Sequencing by Oligo Ligation Detection) from Agencourt. These are the three most typical massively parallel sequencing systems in next-generation sequencing (NGS), and they share good performance in throughput, accuracy, and cost compared with Sanger sequencing (shown in Table 1(a)). These founder companies were then purchased by other companies: in 2006 Agencourt was purchased by Applied Biosystems, and in 2007, 454 was purchased by Roche, while Solexa was purchased by Illumina. After years of evolution, these three systems exhibit better performance and have their own advantages in terms of read length, accuracy, applications, consumables, manpower requirements, informatics infrastructure, and so forth. The comparison of these three systems is discussed in the later part of this paper (also see Tables 1(a), 1(b), and 1(c)).

TABLE 1: (a) Advantage and mechanism of sequencers. (b) Components and cost of sequencers. (c) Application of sequencers.

(A)

Sequencer | 454 GS FLX | HiSeq 2000 | SOLiDv4 | Sanger 3730xl
Sequencing mechanism | Pyrosequencing | Sequencing by synthesis | Ligation and two-base coding | Dideoxy chain termination
Read length | 700 bp | 50SE, 50PE, 101PE | 50 + 35 bp or 50 + 50 bp | 400-900 bp
Accuracy | 99.9%* | 98% (100PE) | 99.94% *raw data | 99.999%
Reads | 1 M | 3 G | 1200~1400 M |
Output data/run | 0.7 Gb | 600 Gb | 120 Gb | 1.9~84 Kb
Time/run | 24 Hours | 3~10 Days | 7 Days for SE, 14 Days for PE | 20 Mins~3 Hours
Advantage | Read length, fast | High throughput | Accuracy | High quality, long read length
Disadvantage | Error rate with polybase more than 6, high cost, low throughput | Short read assembly | Short read assembly | High cost, low throughput

(B)

Sequencers | 454 GS FLX | HiSeq 2000 | SOLiDv4 | 3730xl
Instrument price | Instrument $500,000, $7000 per run | Instrument $690,000, $6000/(30x) human genome | Instrument $495,000, $15,000/100 Gb | Instrument $95,000, about $4 per 800 bp reaction
CPU | 2 x Intel Xeon X5675 | 2 x Intel Xeon X5560 | 8 x processor 2.0 GHz | Pentium IV 3.0 GHz
Memory | 48 GB | 48 GB | 16 GB | 1 GB
Hard disk | 1.1 TB | 3 TB | 10 TB | 280 GB
Automation in library preparation | Yes | Yes | Yes | No
Other required device | REM e system | cBot system | EZ beads system | No
Cost/million bases | $10 | $0.07 | $0.13 | $2400

(C)

Sequencers

454 GS FLX HiSeq 2000 SOLiDv4

3730xl

Resequencing

Yes

Yes

De novo

Yes Yes

Yes

Cancer

Yes Yes

Yes

Array

Yes Yes

Yes

Yes

High GC sample

Yes Yes

Yes

Bacterial

Yes Yes

Yes

Large genome

Yes Yes

Mutation detection

Yes Yes

Yes

Yes

(1) All the data is taken from daily average performance runs in BGI. The average daily sequence data output is about 8 Tb in BGI when about 80% sequencers (mainly HiSeq 2000) are running.

(2) The reagent cost of 454 GS FLX Titanium is calculated based on the sequencing of 400 bp; the reagent cost of HiSeq 2000 is calculated based on the sequencing of 200 bp; the reagent cost of SOLiDv4 is calculated based on the sequencing of 85 bp.

(3) HiSeq 2000 is more flexible in sequencing types like 50SE, 50PE, or 101PE.

(4) SOLiD has high accuracy especially when coverage is more than 30x, so it is widely used in detecting variations in resequencing, targeted resequencing, and transcriptome sequencing. Lanes can be independently run to reduce cost.
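To put the cost per million bases in Table 1 into perspective, the following sketch estimates the reagent cost of sequencing a genome to a given coverage on each platform. The per-megabase costs come from the table; the genome size of about 3 Gb for human, the 30x coverage, and the function name are assumptions made for illustration, and the estimate ignores instrument, labor, and analysis costs.

```python
# Cost per million bases (USD), taken from the Cost/million bases row of Table 1
COST_PER_MB_USD = {
    "454 GS FLX": 10.0,
    "HiSeq 2000": 0.07,
    "SOLiDv4": 0.13,
    "Sanger 3730xl": 2400.0,
}

def cost_for_coverage(cost_per_mb, genome_size_bp, coverage):
    """Estimated reagent cost to sequence a genome at the given coverage."""
    megabases = genome_size_bp * coverage / 1e6
    return megabases * cost_per_mb

HUMAN_GENOME_BP = 3_000_000_000  # assumed approximate human genome size
COVERAGE = 30                    # assumed target coverage

for platform, cost_per_mb in COST_PER_MB_USD.items():
    total = cost_for_coverage(cost_per_mb, HUMAN_GENOME_BP, COVERAGE)
    print(f"{platform}: ${total:,.0f}")
```

Under these assumptions the HiSeq 2000 estimate comes to roughly $6,300, which is in line with the figure of about $6000 per 30x human genome listed for that instrument.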