ALGAL FUNCTIONAL ANNOTATION TOOL: A WEB-BASED ANALYSIS SUITE TO FUNCTIONALLY INTERPRET LARGE GENE LISTS USING INTEGRATED ANNOTATION AND EXPRESSION DATA

DAVID LOPEZ, DAVID CASERO, SHAWN J. COKUS, SABEEHA S. MERCHANT, and MATTEO PELLEGRINI

11.1 BACKGROUND

Next-generation sequencers are revolutionizing our ability to sequence the genomes of new algae efficiently and cost-effectively. Several assembly tools have been developed that take short-read data and assemble them into large continuous fragments of DNA. Gene prediction tools are also available that identify coding structures within these fragments. The resulting transcripts can then be analyzed to generate predicted protein sequences. The functions of these protein sequences are subsequently determined by searching for close homologs in protein databases and transferring the annotation between the two proteins. While some version of this data processing pipeline has become commonplace in genome projects, the resulting functional annotation is typically minimal and includes only limited biological pathway information and protein structure annotation. In contrast, integrating a variety of pathway, function and protein databases allows the generation of much richer and more valuable annotations for each protein.

A second challenge is the use of these protein-level annotations to interpret the output of genome-scale profiling experiments. High-throughput genomic techniques, such as RNA-seq experiments, produce measurements of large numbers of genes relevant to the biological processes being studied. In order to interpret the biological relevance of these gene lists, which commonly range in size from hundreds to thousands of genes, the members must be functionally classified into biological pathways and cellular mechanisms. Traditionally, the genes within these lists are examined using independent annotation databases to assign functions and pathways. Several of these annotation databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [1], MetaCyc [2], and Pfam [3], include a rich set of functional data useful for these purposes.

At present, however, researchers must explore these different knowledge bases separately, which requires a substantial amount of time and effort. Furthermore, without systematic integration of annotation data, it may be difficult to arrive at a cohesive biological picture. In addition, many of these annotation databases were designed to accommodate a single-gene search, a methodology that is not optimal for functionally interpreting the large lists of genes derived from high-throughput genomic techniques. Thus, while modern genomic experiments generate data for many genes in parallel, their output must often still be analyzed on a gene-by-gene basis across different databases. This fragmented analysis approach presents a significant bottleneck in the pipeline of biological discovery.

One approach to solving this problem is to integrate information from multiple annotation databases and to provide access to the combined biological data through a single comprehensive portal equipped with the proper statistical foundations to analyze large gene lists effectively. For example, the DAVID database integrates information from several pathway, ontology, and protein family databases [4]. Similarly, Ingenuity Pathway Analysis (IPA) provides an integrated knowledge base derived from published literature for the human genome [5]. The integrated functional information and annotation terms are then assigned to lists of genes and, for some analyses, enrichment tests are performed to determine which biological terms are overrepresented within the group of genes. By combining the information found in a number of knowledge bases and analyzing whole lists of genes at once, these tools permit the efficient processing of high-throughput genomic experiments and thus expedite the process of biological discovery. However, most of these integrated databases have been developed for the analysis of well-annotated and thoroughly studied organisms, and comparable resources are lacking for many organisms whose genomes have only recently been sequenced.
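
The enrichment tests mentioned above can be illustrated with a short sketch. This section does not say which statistical test such tools use; the example below assumes an upper-tail hypergeometric test, a common choice for term enrichment. The function names, gene identifiers, and term assignments are hypothetical and serve only as an illustration.

```python
from math import comb


def enrichment_pvalue(population_size, annotated_in_population,
                      list_size, annotated_in_list):
    """Upper-tail hypergeometric probability P(X >= annotated_in_list).

    Assumption: enrichment is scored with a hypergeometric test; the text
    above does not specify the statistic actually used by any tool.
    """
    total = comb(population_size, list_size)
    upper = min(list_size, annotated_in_population)
    tail = sum(
        comb(annotated_in_population, i)
        * comb(population_size - annotated_in_population, list_size - i)
        for i in range(annotated_in_list, upper + 1)
    )
    return tail / total


def rank_terms(gene_list, term_to_genes, background):
    """Return (term, hits, p-value) tuples sorted by ascending p-value."""
    genes = set(gene_list) & background
    results = []
    for term, members in term_to_genes.items():
        members = members & background
        hits = len(genes & members)
        if hits == 0:
            continue  # a term with no hits cannot be overrepresented
        p = enrichment_pvalue(len(background), len(members), len(genes), hits)
        results.append((term, hits, p))
    return sorted(results, key=lambda r: r[2])


# Hypothetical toy data: ten background genes and two annotation terms.
background = {f"g{i}" for i in range(1, 11)}
term_to_genes = {
    "photosynthesis": {"g1", "g2", "g3"},
    "lipid biosynthesis": {"g4", "g5", "g6", "g7"},
}
ranked = rank_terms(["g1", "g2", "g3", "g4"], term_to_genes, background)
for term, hits, p in ranked:
    print(term, hits, round(p, 4))
```

In practice the p-values would also be corrected for the number of terms tested; that step is omitted here for brevity.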

One large group of organisms for which integrated functional databases are lacking is the algae. The algae constitute a branch in the plant kingdom, although they form a polyphyletic group rather than a single clade. As many as 10 algal genomes have been sequenced, including those of a red alga and several chlorophyte algae, with several more in the pipeline [6-11]. Algal genomic studies have provided insights into photosymbiosis, the evolutionary relationships between the different species of algae, and their unique properties and adaptations. Recently, there has been renewed interest in the study of algal biochemistry and biology because of their potential use in the development of renewable biofuels (reviewed in [12]). This has promoted the study of varied biochemical processes in diverse algae, such as hydrogen metabolism, fermentation, lipid biosynthesis, photosynthesis and nutrient assimilation [13-20]. One of the most studied algae is Chlamydomonas reinhardtii. Its genome has been sequenced and assembled into large scaffolds that are placed onto chromosomes [6]. For many years, Chlamydomonas has served as a reference organism for the study of photosynthesis, photoreceptors, chloroplast biology and diseases involving flagellar dysfunction [21-25]. Its transcriptome has recently been profiled by RNA-seq experiments under various conditions of nutrient deprivation ([26,27]; Castruita M., et al., unpublished data).

While Chlamydomonas has been extensively characterized experimentally, annotation of its genome is still approximate. Although KEGG categorizes some C. reinhardtii gene models into biological pathways, other databases, such as Reactome [28], do not directly provide information for proteins of this green alga. Complicating the analysis of Chlamydomonas genes is the fact that two assemblies of the genome are in use (version 3 and version 4) and that multiple sets of gene models have been developed and catalogued under diverse identifiers: Joint Genome Institute (JGI) FM3.1 protein IDs for the version 3 assembly, and JGI version FM4 protein IDs and Augustus version 5 IDs for the version 4 assembly [11,29]. The differences between these assemblies are significant; for example, the version 3 assembly contains 1,557 continuous segments of sequence, while the version 4 assembly contains 88. Although the version 3 assembly is superseded by version 4, users presently still access version 3 because of its richer user-based functional annotations. In addition, other sets of gene predictions have been generated using a variety of additional data, including ESTs and RNA-seq data, to delineate start and stop positions more accurately and improve upon existing gene models. One such gene prediction set is Augustus u10.2. Researchers are therefore using a variety of gene models from different assemblies simultaneously, which complicates genomics studies.

To facilitate the analysis of Chlamydomonas genome-scale data, we developed the Algal Functional Annotation Tool, which provides a comprehensive analysis suite for functionally interpreting C. reinhardtii genes across all available protein identifiers. This web-based tool provides an integrative data-mining environment that assigns pathway, ontology, and protein family terms to proteins of C. reinhardtii and enables term enrichment analysis for lists of genes. Expression data for several experimental conditions are also integrated into the tool, allowing users to determine in which conditions the genes of a list are overrepresented among the differentially expressed genes. Additionally, a gene similarity search tool allows genes with similar expression patterns to be identified based on expression levels across these conditions.
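
The idea behind an expression-based gene similarity search can be sketched as follows. This section does not name the similarity measure the tool uses; the example assumes Pearson correlation between expression profiles, and the gene identifiers and expression values are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Returns 0.0 when either profile is constant, since the correlation
    is undefined in that case.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator if denominator else 0.0


def most_similar(query, profiles, top=5):
    """Rank all other genes by correlation with the query gene's profile."""
    query_profile = profiles[query]
    scored = [
        (gene, pearson(query_profile, profile))
        for gene, profile in profiles.items()
        if gene != query
    ]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:top]


# Hypothetical expression levels across four experimental conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # same pattern as geneA, scaled
    "geneC": [4.0, 3.0, 2.0, 1.0],   # opposite pattern
    "geneD": [1.0, 1.0, 1.0, 1.0],   # constant expression
}
for gene, score in most_similar("geneA", profiles, top=3):
    print(gene, round(score, 3))
```

Correlation compares the shape of a profile rather than its magnitude, so genes that respond to the same conditions are grouped together even when their absolute expression levels differ.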

11.2 CONSTRUCTION AND CONTENT