
THE THIRD GENERATION SEQUENCER

As next-generation sequencing sees wider use and continued refinement, third-generation sequencing is emerging with new approaches to sequencing. Third-generation sequencing has two main characteristics. First, PCR is not needed before sequencing, which shortens DNA preparation time. Second, the signal is captured in real time: whether it is fluorescent (PacBio) or an electric current (Nanopore), the signal is monitored during the enzymatic reaction that adds nucleotides to the complementary strand.

Single-molecule real-time (SMRT) sequencing is the third-generation method developed by Pacific Biosciences (Menlo Park, CA, USA), which uses a modified enzyme and direct observation of the enzymatic reaction in real time. An SMRT cell consists of millions of zero-mode waveguides (ZMWs), each embedded with only one set of enzyme and DNA template that can be observed during the whole process. During the reaction, the enzyme incorporates a nucleotide into the complementary strand and cleaves off the fluorescent dye previously linked to that nucleotide. The camera inside the machine then captures the signal as a movie, in real time [19]. This yields not only the fluorescent signal but also the signal difference over time, which may be useful for predicting structural variance in the sequence and is especially useful in epigenetic studies such as DNA methylation [22].

Compared with second-generation sequencers, the PacBio RS (the first sequencer launched by PacBio) has several advantages. First, sample preparation is very fast; it takes 4 to 6 hours instead of days. It also needs no PCR step during preparation, which reduces the bias and error caused by PCR. Second, the turnaround is fast; runs finish within a day. Third, the average read length is 1300 bp, which is longer than that of any second-generation sequencing technology. Although the throughput of the PacBio RS is lower than that of second-generation sequencers, the technology is quite useful for clinical laboratories, especially for microbiology research. One published study used the PacBio RS to investigate the Haitian cholera outbreak [19].


FIGURE 2: Sequencing of a fosmid DNA using the Pacific Biosciences sequencer. With increasing coverage, the accuracy can exceed 97%. The figure was constructed from BGI's own data.

We ran a de novo assembly of a fosmid DNA sample from oyster with the PacBio RS in standard sequencing mode (using LPR chemistry and SMRT cells instead of the newer FCR chemistry and SMRT cells). A SMRTbell template with a mean insert size of 7500 kb was prepared and run in one SMRT cell, and a 120-minute movie was taken. After post-QC filtering, 22,373,400 bp in 6754 reads (average 2,566 bp) were obtained, with an average read score of 0.819. The coverage is 324x, with a mean read score of 0.861 and high accuracy (~99.95%). The result is shown in Figure 2.

Nanopore sequencing is another third-generation sequencing method. A nanopore is a tiny biopore with a diameter on the nanometer scale [23]; such pores occur as protein channels embedded in lipid bilayers, where they facilitate ion exchange. Because of this biological role, any particle moving through the pore disrupts the voltage across the channel. The core concept of nanopore sequencing is to thread a single-stranded DNA molecule through an α-haemolysin (αHL) pore. αHL, a 33 kD protein isolated from Staphylococcus aureus [20], self-assembles into a heptameric transmembrane channel [23]. It can tolerate an extraordinary voltage of up to 100 mV with a current of 100 pA [20]. This property supports its role as the building block of the nanopore. In nanopore sequencing, an ionic flow is applied continuously, and current disruption is detected by standard electrophysiological techniques. Readout relies on the size difference among the deoxyribonucleoside monophosphates (dNMPs): each dNMP produces a characteristic current modulation that allows it to be discriminated. The ionic current resumes once the trapped nucleotide has passed entirely through the pore.

Nanopore sequencing offers a number of advantages over existing commercial next-generation sequencing technologies. First, it can potentially reach read lengths >5 kbp at a speed of 1 bp/ns [19]. Second, base detection requires no fluorescent tag. Third, except for the exonuclease used to hold the ssDNA and cleave nucleotides [24], enzymes are largely unnecessary in nanopore sequencing [22]. This implies that nanopore sequencing is less sensitive to temperature throughout the sequencing reaction, so reliable results can be maintained. Fourth, instead of sequencing DNA during polymerization, single DNA strands are sequenced through the nanopore by strand depolymerization. Hence, hands-on time for sample preparation, such as cloning and amplification steps, can be shortened significantly.

ABILITY TO RE-RUN ANALYSIS FOR SUBSETS OF GENES

Once a gene list is supplied and enrichment results have been returned, the subset of genes that contain a particular annotation may be isolated and re-run through the tool as a separate, smaller gene list. This allows users to select a particularly interesting group of functionally related genes and isolate them to see whether they are also enriched for other functional terms. It also allows the user to prune large gene lists into more focused lists of functionally similar genes and to remove some of the inherent noise associated with high-throughput experimental techniques and their resulting gene lists. This feature may be accessed by expanding the enrichment results of a particular annotation and choosing to re-run the analysis using only that subset of proteins. From this step, users may select which database types to query for enrichment (e.g., pathway, ontology, protein family).
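The re-run workflow can be sketched in a few lines. This is a minimal sketch, assuming enrichment is scored with a one-sided hypergeometric test (the passage above does not name the statistic); the gene identifiers, annotation terms, and function names are hypothetical and are not part of the tool.

```python
from math import comb

def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def enrich(gene_list, annotations, background):
    """Return {term: p-value} for every term hit by at least one listed gene."""
    background = set(background)
    genes = set(gene_list) & background
    results = {}
    for term, members in annotations.items():
        members = set(members) & background
        hits = genes & members
        if hits:
            results[term] = hypergeom_pvalue(
                len(hits), len(genes), len(members), len(background)
            )
    return results

def subset_for_term(gene_list, annotations, term):
    """Keep only the listed genes that carry the chosen annotation."""
    return [g for g in gene_list if g in annotations[term]]

# Hypothetical toy data: 20 background genes and two annotation terms.
background = [f"g{i}" for i in range(1, 21)]
annotations = {
    "sulfur metabolism": {"g1", "g2", "g3", "g4"},
    "cysteine biosynthesis": {"g1", "g2", "g9"},
}
gene_list = ["g1", "g2", "g3", "g10", "g11"]

first_pass = enrich(gene_list, annotations, background)
# Isolate the genes annotated with one term and analyze them again.
subset = subset_for_term(gene_list, annotations, "sulfur metabolism")
second_pass = enrich(subset, annotations, background)
```

The second pass uses the same background as the first, so only the query list shrinks; whether the tool itself keeps the background fixed is not stated in the text.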

NITROGEN LIMITATION AFFECTS THE NITROGEN-ASSIMILATORY PATHWAY AT THE TRANSCRIPTOME LEVEL

We identified a number of genes that encode components of the nitrogen assimilatory pathway (Table 4). Genes that encode enzymes catalyzing the reduction of NO3- to NH4+ and the biosynthesis of nitrogen-carrying amino acids were strongly expressed under nitrogen limitation [32,33]. Along with the pentose phosphate pathway, these genes were among the most up-regulated genes in -N cells of N. oleoabundans. The increased expression of these genes was consistent with their role in nitrogen uptake and assimilation, and with the nitrogen-limited growth environment from which the cells were derived.

TABLE 4: N. oleoabundans genes involved in nitrogen assimilation

Nitrogen assimilation                                   Log2FC
High affinity nitrate transporters                      -4.4
Ammonium transporters                                   -2.8
Nitrate reductase (NR, EC: 1.7.1.1)                     -3.8
Ferredoxin-nitrite reductase (NiR, EC: 1.7.7.1)         -3.9
Glutamine synthetase (GS, EC: 6.3.1.2)                  -2.3
Glutamate synthase (NADH) (GOGAT, EC: 1.4.1.13-14)      -1.4
Glutamate synthase (Ferredoxin) (EC: 1.4.7.1)           0.27
Glutamate dehydrogenase (GDH, EC: 1.4.1.3)              0.89
Aspartate aminotransferase (aspat, EC: 2.6.1.1)         -2.3
Asparagine synthetase (AS, EC: 6.3.5.4)                 -1.5

Negative Log2FC values represent up-regulation under nitrogen limitation. All presented fold changes are statistically significant, q value < 0.05.
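Because the sign convention in Table 4 is inverted relative to common usage, a short conversion may help when reading it. The sketch below turns each Log2FC into a linear fold change and a direction; it assumes that positive values mean down-regulation under nitrogen limitation (the table note states only the negative case), and the function name is hypothetical.

```python
# Log2FC values copied from Table 4; negative means up-regulated under -N.
TABLE4_LOG2FC = {
    "High affinity nitrate transporters": -4.4,
    "Ammonium transporters": -2.8,
    "Nitrate reductase": -3.8,
    "Ferredoxin-nitrite reductase": -3.9,
    "Glutamine synthetase": -2.3,
    "Glutamate synthase (NADH)": -1.4,
    "Glutamate synthase (Ferredoxin)": 0.27,
    "Glutamate dehydrogenase": 0.89,
    "Aspartate aminotransferase": -2.3,
    "Asparagine synthetase": -1.5,
}

def fold_change_under_limitation(log2fc):
    """Return (direction, linear fold change) using Table 4's sign convention."""
    if log2fc < 0:
        direction = "up"
    elif log2fc > 0:
        direction = "down"  # assumed mirror of the stated convention
    else:
        direction = "unchanged"
    return direction, 2 ** abs(log2fc)

for gene, value in TABLE4_LOG2FC.items():
    direction, fold = fold_change_under_limitation(value)
    print(f"{gene}: {fold:.1f}-fold {direction} under -N")
```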

454 GS FLX TITANIUM SOFTWARE

GS RunProcessor is the main component of the GS FLX Titanium software. It handles image background normalization, signal location correction, cross-talk correction, signal conversion, and sequencing data generation. After each run, GS RunProcessor produces a series of files, including SFF (standard flowgram format) files. SFF files contain the basecalled sequences and corresponding quality scores for all individual, high-quality (filtered) reads, and they can be viewed directly on the screen of the GS FLX Titanium system. Using the GS De Novo Assembler, GS Reference Mapper, and GS Amplicon Variant Analyzer provided with the GS FLX Titanium system, SFF files can be used in many ways and converted into FASTQ format for further data analysis.
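To make the target of that conversion concrete, the sketch below writes one FASTQ record from a basecalled sequence and its quality scores. It is an illustration only: it assumes the Phred+33 (Sanger) quality encoding, it does not parse SFF files, and it is not the converter shipped with the GS software. The function name and read identifier are hypothetical.

```python
def to_fastq(read_id, sequence, phred_scores):
    """Format one read as a four-line FASTQ record (Phred+33 assumed)."""
    if len(sequence) != len(phred_scores):
        raise ValueError("sequence and quality scores must have equal length")
    # Each Phred score becomes one printable character: chr(score + 33).
    quality = "".join(chr(score + 33) for score in phred_scores)
    return f"@{read_id}\n{sequence}\n+\n{quality}\n"

record = to_fastq("read1", "ACGT", [40, 30, 20, 10])
print(record, end="")
```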

DISCUSSION OF NGS APPLICATIONS

Fast progress in DNA sequencing technology has brought a substantial reduction in costs and a substantial increase in throughput and accuracy. With more and more organisms being sequenced, a flood of genetic data is inundating the world every day. Progress in genomics has been moving steadily forward thanks to a revolution in sequencing technology. Additionally, other types of large-scale studies, in exomics, metagenomics, epigenomics, and transcriptomics, have all become reality. These studies not only provide knowledge for basic research but also afford immediate practical benefits. Scientists across many fields are using these data to develop better-thriving crops and livestock, higher crop yields, and improved diagnostics, prognostics, and therapies for cancer and other complex diseases.

BGI is on the cutting edge of translating genomics research into molecular breeding and disease association studies, in the belief that agriculture, medicine, drug development, and clinical treatment will eventually enter a new stage built on a more detailed understanding of the genetic components of all organisms. BGI is primarily focused on three projects. (1) The Million Species/Varieties Genomes Project aims to sequence a million economically and scientifically important plants, animals, and model organisms, including different breeds, varieties, and strains. This project is best represented by our sequencing of the genomes of the giant panda, potato, macaque, and others, along with multiple resequencing projects. (2) The Million Human Genomes Project focuses on large-scale population and association studies that use whole-genome or whole-exome sequencing strategies. (3) The Million Eco-System Genomes Project has the objective of sequencing the metagenome and cultured microbiome of several different environments, including microenvironments within the human body [25]. Together they are called the 3M project.

The following part briefly summarizes each of these application areas: de novo sequencing, mate-pair, whole genome or target-region resequencing, small RNA, transcriptome, RNA-seq, epigenomics, and metagenomics.

In DNA de novo sequencing, a library with an insert size below 800 bp is defined as a DNA short-fragment library, and it is usually applied in de novo and resequencing research. Skovgaard et al. [26] applied a combination of WGS (whole-genome sequencing) and genome copy number analysis to identify mutations that suppress the growth deficiency imposed by excessive initiations from the E. coli origin of replication, oriC.

Mate-pair library sequencing is significantly beneficial for de novo sequencing, because the method can reduce gap regions and extend scaffold length. Reinhardt et al. [27] developed a novel method for de novo genome assembly by analyzing data from high-throughput short-read sequencing technology. They assembled genomes into large scaffolds at a fraction of the traditional cost and without using a reference sequence. The assembly of one sample yielded an N50 scaffold size of 531,821 bp, with >75% of the predicted genome covered by scaffolds over 100,000 bp.
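The two assembly statistics quoted above can be computed from a list of scaffold lengths. The sketch below uses the usual definition of N50 (the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the assembled bases). The function names are hypothetical, and the coverage fraction is taken against the assembly total rather than the predicted genome size used in the cited study.

```python
def n50(lengths):
    """Length L such that scaffolds of length >= L hold at least half the bases."""
    if not lengths:
        raise ValueError("no scaffold lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length

def fraction_in_long_scaffolds(lengths, min_length):
    """Share of assembled bases found in scaffolds longer than min_length."""
    total = sum(lengths)
    long_bases = sum(length for length in lengths if length > min_length)
    return long_bases / total

# Hypothetical scaffold lengths in bp.
scaffolds = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(scaffolds))
print(fraction_in_long_scaffolds([150_000, 50_000], 100_000))
```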

Whole genome resequencing determines the complete DNA sequence of an organism's genome, including all chromosomal DNA, in a single experiment and aligns it to the reference sequence. Mills et al. [28] constructed a map of unbalanced SVs (genomic structural variants) based on whole genome DNA sequencing data from 185 human genomes with the SOLiD platform; the map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications [28]. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysis of their origin and functional impact [28].

Whole genome resequencing is an effective way to study functional genes, but its high cost and massive data volume are the main problems for most researchers. Target region sequencing is one solution. Microarray capture is a popular approach to target region sequencing; it uses hybridization to arrays containing synthetic oligonucleotides that match the target DNA sequence. Gnirke et al. [29] developed a capture method that uses RNA "baits" to capture target DNA fragments from the "pond" and then uses the Illumina platform to read out the sequence. About 90% of uniquely aligning bases fell on or near bait sequence; up to 50% lay on exons proper [29].

Fehniger et al. used two platforms, Illumina GA and ABI SOLiD, to define the miRNA transcriptomes of resting and cytokine-activated primary murine NK (natural killer) cells [30]. A unique bioinformatics pipeline identified 302 known and 21 novel mature miRNAs from small RNA libraries of NK cells. These miRNAs are expressed over a broad range and exhibit isomiR complexity, and a subset is differentially expressed following cytokine activation; these findings support the identification of miRNAs by the Illumina GA and SOLiD instruments [30].

The transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA, and other noncoding RNA, produced in one cell or a population of cells. In recent years, next-generation sequencing has been used to study the transcriptome, in place of the DNA microarray technology used in the past. Adamidi et al. [31] designed an efficient sequencing strategy for the S. mediterranea transcriptome. The catalog of assembled transcripts and the identified peptides in this study dramatically expand and refine planarian gene annotation, as demonstrated by the validation of several previously unknown transcripts with stem cell-dependent expression patterns.

RNA-seq is a new method in RNA sequencing for studying mRNA expression. It is similar to transcriptome sequencing in sample preparation, except for the enzyme used. To estimate the technical variance, Marioni et al. [32] analyzed kidney RNA samples on both the Illumina platform and Affymetrix arrays. Additional findings, such as low-expressed genes, alternative splice variants, and novel transcripts, were obtained on the Illumina platform. Bradford et al. [33] compared RNA-seq library data from the SOLiD platform with Affymetrix Exon 1.0ST arrays and found a high degree of correspondence between the two platforms in terms of exon-level fold changes and detection. The greatest detection correspondence was seen when the background error rate in RNA-seq was extremely low. The difference between RNA-seq and transcriptome sequencing is not as obvious on SOLiD as on Illumina.

There are two kinds of epigenetic application: chromatin immunoprecipitation and methylation analysis. Chromatin immunoprecipitation (ChIP) is an immunoprecipitation technique used to study the interaction between protein and DNA in a cell; it allows histone modifications to be located at specific positions in the genome. Based on next-generation sequencing technology, Johnson et al. [34] developed a large-scale chromatin immunoprecipitation assay to identify motifs, especially noncanonical NRSF-binding motifs. The data display sharp resolution of binding position (±50 bp), which is important for inferring new candidate interactions, together with high sensitivity and specificity (ROC (receiver operator characteristic) area >0.96) and statistical confidence (P < 10^-4). The other important epigenetic application is DNA methylation analysis. In vertebrates, DNA methylation typically occurs at CpG sites, where methylation converts cytosine to 5-methylcytosine. Chung presented whole methylome sequencing on the SOLiD platform to study the difference between two bisulfite conversion methods (in solution versus in gel) [35].

World-class genome projects include the 1000 Genomes Project, the human ENCODE project, and the Human Microbiome Project (HMP), to name a few. BGI takes an active role in these and many more ongoing projects, such as the 1000 Animal and Plant Genome project, the MetaHIT project, the Yanhuang project, LUCAMP (Diabetes-associated Genes and Variations Study), ICGC (international cancer genome project), the Ancient human genome, the 1000 Mendelian Disorders Project, the Genome 10K Project, and so forth [25]. These international collaborative genome projects have greatly enhanced genomics research and its applications in healthcare and other fields.

To manage multiple projects, including large and complex ones with up to tens of thousands of samples, a superior and sophisticated project management system is required, one that handles information processing from the very beginning of sample labeling and storage through library construction, multiplexing, sequencing, and informatics analysis. Research-oriented bioinformatics analysis and follow-up experiments are not included. Although the adoption of automation has greatly reduced human intervention in bioexperiments, all other procedures carried out by people still have to be managed. BGI has developed the BMS system and a cloud service for efficient information exchange and project management. Behavior management mainly follows the Japanese 5S on-site model. Additionally, BGI has passed the ISO9001 and CSPro (authorized by Illumina) QC systems and is currently taking the CLIA (Clinical Laboratory Improvement Amendments) and ASHI (American Society for Histocompatibility and Immunogenetics) tests. A quick, standard, and open feedback system guarantees an efficient troubleshooting pathway and high performance. One example is an instrument design failure of the TruSeq v3 flowcell that resulted in bubbles (defined by Illumina as the "bottom-middle-swatch" phenomenon) and random N calls in reads. This potentially harms sequencing quality, GC composition, and throughput. It affects not only the small area where the bubble sits, which yields N calls, but also the focus of nearby areas, including the whole swatch and the adjacent swatch. Filtering parameters have to be determined to ensure quality raw data for bioinformatics processing. Led by the NGS tech group, joint meetings were called to analyze and troubleshoot this problem, to discuss strategies that minimize its effect on cost and project time, to construct communication channels, and to statistically summarize compensation, in order to provide the best project management strategies at the time. Some reagent QC examples are summarized in Liu et al. [36].

BGI is establishing its cloud services. Combined with advanced NGS technologies offering multiple choices, a plug-and-run informatics service is handy and affordable. A series of software tools is available, including BLAST, SOAP, and SOAP SNP for sequence alignment, and pipelines for RNA-seq data. SNP calling programs such as Hecate and Gaea are also about to be released. Big-data studies from the whole spectrum of life and biomedical sciences can now be shared and published in a new journal, GigaScience, cofounded by BGI and BioMed Central. It has a novel publication format: each piece of data links to a standard manuscript publication, with an extensive database that hosts all associated data, data analysis tools, and cloud-computing resources. The scope covers not just omic-type data and the fields of high-throughput biology currently served by large public repositories but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology, and other new types of large-scale sharable data.

EXPANDED ANNOTATION COVERAGE

The methods described to compensate for the incomplete annotation coverage of Chlamydomonas reinhardtii genes resulted in the addition of a vast number of unique annotations to the genome. While there is a strong overlap between pre-existing annotations and those assigned by inference, many new terms have also been added. The annotations derived by orthology, however, are not mixed with the annotations attained directly, both to decrease the possibility of false positive associations of functional terms that may distort the analysis and to permit a comparison with the functional terms derived directly from the Chlamydomonas annotation.

STARCH SYNTHESIS UNDER NITROGEN LIMITATIONS

While several genes associated with the preparatory steps in starch synthesis are up-regulated in the -N case, the genes encoding the key enzymes AGPase and starch synthase were repressed (Table 1). The degradative side of starch metabolism, specifically α-amylase, which hydrolyzes starch to glucose, was also strongly repressed during nitrogen limitation. When coupled to the increased but still overall low starch contents in the -N case (Table 3), these findings suggest that the -N cells accumulated starch by repressing starch degradation. It is also notable that pyruvate kinase (log2FC = -0.21) and the three-enzyme pyruvate dehydrogenase complex for converting glucose to acetyl-CoA (to supply fatty acid synthesis) were up-regulated during nitrogen limitation (Figure 5D).

AB SOLID SYSTEM

SOLiD (Sequencing by Oligo Ligation Detection) was purchased by Applied Biosystems in 2006. The sequencer adopts two-base sequencing based on ligation sequencing. On a SOLiD flowcell, the libraries are sequenced by ligation of 8-base probes, each of which contains a ligation site (the first base), a cleavage site (the fifth base), and one of 4 different fluorescent dyes (linked to the last base) [10]. The fluorescent signal is recorded while the probe is hybridized to the complementary template strand and disappears when the probe's last 3 bases are cleaved off. The sequence of the fragment can be deduced after 5 rounds of sequencing using ladder primer sets.
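Two-base encoding can be illustrated with a small color-space encoder and decoder. The sketch assumes the commonly described convention in which each of the four dye colors (0 to 3) is the XOR of 2-bit codes for adjacent bases (A=0, C=1, G=2, T=3); that mapping is not given in the text above. It also assumes the first base is known, for example from the adapter. The function names are hypothetical.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"

def encode_colors(sequence):
    """Encode each adjacent pair of bases as one color in 0..3."""
    codes = [BASE_TO_BITS[base] for base in sequence]
    return [left ^ right for left, right in zip(codes, codes[1:])]

def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and the color calls."""
    current = BASE_TO_BITS[first_base]
    bases = [first_base]
    for color in colors:
        current ^= color  # each color gives the step from one base to the next
        bases.append(BITS_TO_BASE[current])
    return "".join(bases)

colors = encode_colors("ATGGC")
print(colors)
print(decode_colors("A", colors))
```

Because every base contributes to two adjacent colors, a true single-base variant changes two consecutive colors, whereas an isolated color error changes every decoded base after it; this property is what makes measurement errors distinguishable from real variants.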

The read length of SOLiD was initially 35 bp, and the output was 3 G of data per run. Owing to the two-base sequencing method, SOLiD could reach a high accuracy of 99.85% after filtering. At the end of 2007, ABI released the first SOLiD system. In late 2010, the SOLiD 5500xl sequencing system was released. From SOLiD to SOLiD 5500xl, ABI released five upgrades in just three years. The SOLiD 5500xl achieved an improved read length, accuracy, and data output of 85 bp, 99.99%, and 30 G per run, respectively. A complete run can be finished within 7 days. The sequencing cost is about $40 × 10^-9 per base, estimated from reagent use only by BGI users. However, the short read length and the restriction to resequencing applications remain its major shortcomings [13]. Applications of SOLiD include whole genome resequencing, targeted resequencing, transcriptome research (including gene expression profiling, small RNA analysis, and whole transcriptome analysis), and epigenomics (such as ChIP-Seq and methylation). Like other NGS systems, SOLiD's computational infrastructure is expensive and not trivial to use; it requires an air-conditioned data center, a computing cluster, personnel skilled in computing, a distributed memory cluster, fast networks, and a batch queue system. The operating system used by most researchers is GNU/Linux. Each SOLiD sequencer run takes 7 days and generates around 4 TB of raw data. More data will be generated after bioinformatics analysis. This information is listed and compared with other NGS systems in Tables 1(a), 1(b), and 1(c). Automation can be used in library preparation, for example, the Tecan system, which integrates a Covaris A and Roche 454 REM e system [14].

ALGAL FUNCTIONAL ANNOTATION TOOL: A WEB-BASED ANALYSIS SUITE TO FUNCTIONALLY INTERPRET LARGE GENE LISTS USING INTEGRATED ANNOTATION AND EXPRESSION DATA

DAVID LOPEZ, DAVID CASERO, SHAWN J. COKUS, SABEEHA S. MERCHANT, and MATTEO PELLEGRINI

11.1 BACKGROUND

Next-generation sequencers are revolutionizing our ability to sequence the genomes of new algae efficiently and in a cost-effective manner. Several assembly tools have been developed that take short read data and assemble it into large continuous fragments of DNA. Gene prediction tools are also available that identify coding structures within these fragments. The resulting transcripts can then be analyzed to generate predicted protein sequences. The function of these protein sequences is subsequently determined by searching for close homologs in protein databases and transferring the annotation between the two proteins. While some versions of this data processing pipeline have become commonplace in genome projects, the resulting functional annotation is typically fairly minimal and includes only limited biological pathway information and protein structure annotation. In contrast, the integration of a variety of pathway, function, and protein databases allows for the generation of much richer and more valuable annotations for each protein.

A second challenge is the use of these protein-level annotations to interpret the output of genome-scale profiling experiments. High-throughput genomic techniques, such as RNA-seq experiments, produce measurements of large numbers of genes relevant to the biological processes being studied. In order to interpret the biological relevance of these gene lists, which commonly range in size from hundreds to thousands of genes, the members must be functionally classified into biological pathways and cellular mechanisms. Traditionally, the genes within these lists are examined using independent annotation databases to assign functions and pathways. Several of these annotation databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [1], MetaCyc [2], and Pfam [3], include a rich set of functional data useful for these purposes.

However, researchers presently must explore these different knowledge bases separately, which requires a substantial amount of time and effort. Furthermore, without systematic integration of annotation data, it may be difficult to arrive at a cohesive biological picture. In addition, many of these annotation databases were designed to accommodate a single gene search, a methodology not optimal for functionally interpreting the large lists of genes derived from high-throughput genomic techniques. Thus, while modern genomic experiments generate data for many genes in parallel, their output must often still be analyzed on a gene-by-gene basis across different databases. This fragmented analysis approach presents a significant bottleneck in the pipeline of biological discovery.

One approach to solving this problem is integrating information from multiple annotation databases and providing access to the combined biological data from a single comprehensive portal that is equipped with the proper statistical foundations to effectively analyze large gene lists. For example, the DAVID database integrates information from several pathway, ontology, and protein family databases [4]. Similarly, Ingenuity Pathway Analysis (IPA) provides an integrated knowledge base derived from published literature for the human genome [5]. The integrated functional information and annotation terms are then assigned to lists of genes and, for some analyses, enrichment tests are performed to determine which biological terms are overrepresented within the group of genes. By combining the information found in a number of knowledge bases and performing the analysis on lists of genes, these tools permit the efficient processing of high-throughput genomic experiments and thus expedite the process of biological discovery. However, most of these integrated databases have been developed for the analysis of well-annotated and thoroughly studied organisms, and are lacking for many newly genome-enabled organisms.

One large group of organisms for which integrated functional databases are lacking is the algae. The algae constitute a branch in the plant kingdom, although they form a polyphyletic group as they do not include all the descendants of their last common ancestor. As many as 10 algal genomes have been sequenced, including those of a red alga and several chlorophyte algae, with several more in the pipeline [6-11]. Algal genomic studies have provided insights into photosymbiosis, evolutionary relationships between the different species of algae, as well as their unique properties and adaptations. Recently, there has been a renewed interest in the study of algal biochemistry and biology for their potential use in the development of renewable biofuels (reviewed in [12]). This has promoted the study of varied biochemical processes in diverse algae, such as hydrogen metabolism, fermentation, lipid biosynthesis, photosynthesis, and nutrient assimilation [13-20]. One of the most studied algae is Chlamydomonas reinhardtii. It has a sequenced genome that has been assembled into large scaffolds that are placed on to chromosomes [6]. For many years, Chlamydomonas has served as a reference organism for the study of photosynthesis, photoreceptors, chloroplast biology, and diseases involving flagellar dysfunction [21-25]. Its transcriptome has recently been profiled by RNA-seq experiments under various conditions of nutrient deprivation [[26,27], unpublished data (Castruita M., et al.)].

While Chlamydomonas has been extensively characterized experimentally, annotation of its genome is still approximate. Although KEGG categorizes some C. reinhardtii gene models into biological pathways, other databases, such as Reactome [28], do not directly provide information for proteins of this green alga. Complicating the analysis of Chlamydomonas genes is the fact that there are two assemblies of the genome in use (version 3 and version 4) and multiple sets of gene models have been developed that are catalogued under diverse identifiers: Joint Genome Institute (JGI) FM3.1 protein IDs for the version 3 assembly, and JGI version FM4 protein IDs and Augustus version 5 IDs for the version 4 assembly [11,29]. The differences between these assemblies are significant; for example, the version 3 assembly contains 1,557 continuous segments of sequence while the fourth version contains 88. Although the version 3 assembly is superseded by version 4, users presently access version 3 because of its richer user-based functional annotations. In addition, other sets of gene predictions have been generated using a variety of additional data, including ESTs and RNA-seq data, to more accurately delineate start and stop positions and improve upon existing gene models. One such gene prediction set is Augustus u10.2. As such, a variety of gene models from different assemblies are being used simultaneously by researchers, presenting complications in genomics studies. To facilitate the analysis of Chlamydomonas genome-scale data, we developed the Algal Functional Annotation Tool, which provides a comprehensive analysis suite for functionally interpreting C. reinhardtii genes across all available protein identifiers. This web-based tool provides an integrative data-mining environment that assigns pathway, ontology, and protein family terms to proteins of C. reinhardtii and enables term enrichment analysis for lists of genes. Expression data for several experimental conditions are also integrated into the tool, allowing the determination of overrepresented differentially expressed conditions. Additionally, a gene similarity search tool allows genes with similar expression patterns to be identified based on expression levels across these conditions.

11.2 CONSTRUCTION AND CONTENT

EXAMPLE: SULFUR-RELATED GENES

Using a filtered list of C. reinhardtii genes derived from transcriptome sequencing of the green alga under sulfur-depleted conditions [26], the Algal Functional Annotation Tool found enrichment for annotations related to sulfur metabolism, cysteine and methionine metabolism, and sulfur compound biosynthesis. For each annotation, the results may be expanded to reveal the genes containing that particular annotation. Furthermore, there is significant overlap between terms directly assigned to C. reinhardtii proteins and those inferred from A. thaliana orthology. Visualization of the sulfur metabolism KEGG pathway shows that a majority of the enzymes involved in this biological process are in the sample list, and the reactions they catalyze may be seen on the pathway map. The results for any enrichment analysis may be downloaded as a tab-delimited text file. Taking a gene found to be associated with the KEGG pathway 'Sulfur metabolism' by this enrichment analysis (JGI v. 3 ID 206154) as a starting input into the gene similarity search tool, the genes corresponding to sulfate transporter, methionine synthase reductase, and cysteine dioxygenase were found within the top 15 results using the correlation metric between log counts.