CAT and dbCAN

As more bioenergy-related plant and microbial genomes and environmental metagenomes are sequenced, there is an urgent need for automated CAZyme annotation. Although such annotation will not match the accuracy of the expert annotation in CAZyDB, it is expected to be much faster, and users can control the annotation process themselves. Moreover, all newly sequenced genomes now rely on generic protein domain/family databases such as Pfam (Finn et al., 2006), InterPro (Hunter et al., 2009), and the Conserved Domain Database (CDD; Marchler-Bauer et al., 2009) for automatic genome annotation. Annotation from these databases is often too general to pinpoint the exact function, which still has to be determined experimentally. Nevertheless, most genome annotators remain interested in such genome-scale annotation, as it gives a quick summary of what the genome encodes, how large the gene families are, and how they compare with those of other genomes.

In fact, even CAZyDB’s manual annotation of newly sequenced genomes (assigning proteins to existing CAZyme families) is unlikely to be 100% correct. Because every new genome contains a high percentage of proteins that have not been studied experimentally, manual curation still relies largely on additional bioinformatics analyses, such as BLAST searches against public protein sequence databases (e.g., UniProt (Bairoch et al., 2005)) and domain databases (e.g., Pfam), followed by inspection of the top matches.

With this in mind, automated CAZyme annotation remains very useful, particularly for a quick, general overview of how many and which CAZymes a newly sequenced genome encodes. Using the annotated CAZyme proteins and the classification scheme of CAZyDB as the foundation, two bioinformatics efforts have been published since 2010. Both support automated CAZyme annotation of a protein sequence dataset predicted from a genome or metagenome.

The CAZyme Analysis Toolkit (CAT) (Park et al., 2010) allows a BLAST search against CAZyme proteins annotated by CAZyDB as well as a Pfam domain-based search. The simple BLAST search cannot accurately annotate multidomain proteins, which are prevalent among CAZymes. The Pfam domain-based search can solve this problem. The Pfam domains it uses are either listed by CAZyDB on the CAZyme family Web pages or identified as corresponding to a CAZyme family by an association rule built by CAT. However, only 142 (46%) of the more than 300 CAZyme families are linked to Pfam domains by CAZyDB. In fact, many of the Pfam domains were originally created after CAZyDB.

We recently developed dbCAN (Yin et al., 2012) to define a signature domain model for every CAZyme family. In addition to the 142 CAZyme families annotated with a Pfam domain, we associated other CAZyme families with functional domains in CDD, a broader and more general protein domain database. In this way, we were able to find a CDD domain for 248 CAZyme families. For the remaining families, we performed literature curation by reading the relevant biochemical papers that CAZyDB links to these families. Finally, we extracted the domain regions from all member proteins annotated in CAZyDB and built a multiple sequence alignment (MSA) for each CAZyme family. These MSAs were further processed and represented as hidden Markov models (HMMs), statistical models widely used in bioinformatics to represent protein sequence alignments, e.g., by Pfam.
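The domain-extraction step described above can be illustrated with a minimal Python sketch. It is not the dbCAN pipeline itself: the function names, the toy sequences, and the domain coordinates are all hypothetical, and coordinates are assumed to be 1-based and inclusive. The per-family FASTA text it produces would then be passed to an alignment program and to an HMM builder such as hmmbuild from HMMER; the text does not name the alignment program that was used.

```python
from collections import defaultdict


def extract_domain_regions(sequences, annotations):
    """Group domain subsequences by family.

    sequences:   dict mapping protein ID -> amino acid sequence
    annotations: iterable of (protein_id, family, start, end) tuples,
                 with 1-based inclusive coordinates (an assumption)
    """
    regions = defaultdict(list)
    for protein_id, family, start, end in annotations:
        seq = sequences[protein_id]
        if not (1 <= start <= end <= len(seq)):
            raise ValueError(f"bad coordinates for {protein_id}: {start}-{end}")
        header = f"{protein_id}/{start}-{end}"
        regions[family].append((header, seq[start - 1:end]))
    return dict(regions)


def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text."""
    lines = []
    for header, seq in records:
        lines.append(f">{header}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy data: protA carries two domains, so it contributes to two families.
sequences = {
    "protA": "MKTAYIAKQRQISFVKSHFSRQ",
    "protB": "MSDNGELLKKAAWWQ",
}
annotations = [
    ("protA", "GH5", 3, 10),
    ("protA", "CBM3", 15, 22),
    ("protB", "GH5", 2, 9),
]

regions = extract_domain_regions(sequences, annotations)
gh5_fasta = to_fasta(regions["GH5"])
print(gh5_fasta)
```

Slicing out the domain region, rather than using the full-length protein, keeps a multidomain protein from contaminating the alignment of one family with residues that belong to another.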

As of June 2012, dbCAN has 320 HMMs representing 317 CAZyme families and three cellulosome modules. We provide all of these HMMs freely to the public so that users can run hmmscan, the domain-based search tool of the HMMER 3.0 package (hmmer.org), to annotate their genomes or metagenomes on a local computer, exactly as they would for Pfam, InterPro, or CDD annotation. For users who do not know how to run hmmscan on a Linux PC, we offer a Web server (http://csbl.bmb.uga.edu/dbCAN/annotate.php) to which sequences can be submitted for annotation. The 320 CAZyme family-specific HMMs are our key contribution to the carbohydrate research community and ideally should be included in general protein domain/family databases such as Pfam in the future.
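For local annotation, hmmscan can write a per-domain table with its `--domtblout` option, for example `hmmscan --domtblout hits.dm dbCAN.hmm proteins.faa` after the HMM file has been indexed with hmmpress (the three file names here are hypothetical). The following Python sketch shows one way such a table might be parsed and filtered. The E-value and coverage cutoffs are illustrative assumptions, not values taken from the text, and the three result lines are invented toy data.

```python
from typing import Iterable, Iterator, List, NamedTuple


class DomainHit(NamedTuple):
    family: str          # target (HMM) name
    protein: str         # query (sequence) name
    i_evalue: float      # independent E-value of this domain
    hmm_coverage: float  # fraction of the HMM covered by the alignment
    ali_from: int        # alignment start on the protein
    ali_to: int          # alignment end on the protein


def parse_domtblout(lines: Iterable[str]) -> Iterator[DomainHit]:
    """Yield one DomainHit per data line of an hmmscan --domtblout table."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        f = line.split(maxsplit=22)  # field 22 is the free-text description
        hmm_len = int(f[2])
        hmm_from, hmm_to = int(f[15]), int(f[16])
        yield DomainHit(
            family=f[0],
            protein=f[3],
            i_evalue=float(f[12]),
            hmm_coverage=(hmm_to - hmm_from + 1) / hmm_len,
            ali_from=int(f[17]),
            ali_to=int(f[18]),
        )


def filter_hits(hits: Iterable[DomainHit],
                max_evalue: float = 1e-5,      # assumed cutoff
                min_coverage: float = 0.3      # assumed cutoff
                ) -> List[DomainHit]:
    """Keep hits that pass both the E-value and HMM coverage cutoffs."""
    return [h for h in hits
            if h.i_evalue <= max_evalue and h.hmm_coverage >= min_coverage]


# Invented toy table in the domtblout column layout.
toy_table = """\
# target name  accession  tlen  query name  accession  qlen  ...
GH5.hmm - 270 protA - 600 1.2e-50 170.1 0.1 1 1 3.0e-53 2.5e-50 169.0 0.1 5 260 40 300 38 305 0.95 -
CBM3.hmm - 90 protA - 600 0.5 3.2 0.0 1 1 0.01 0.9 2.5 0.0 10 25 500 515 498 520 0.70 -
CBM3.hmm - 90 protB - 200 1e-20 70.0 0.0 1 1 1e-22 2e-20 69.0 0.0 1 88 100 190 99 192 0.97 -
"""

all_hits = list(parse_domtblout(toy_table.splitlines()))
kept = filter_hits(all_hits)
for hit in kept:
    print(hit.protein, hit.family, hit.ali_from, hit.ali_to)
```

Because each line of the table describes one domain, a multidomain protein naturally receives one assignment per domain, which a single best BLAST hit cannot provide.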

In addition to the Web server, dbCAN also provides a database in which precomputed CAZyme homologs from a number of protein databases are shown on the Web. In particular, we used the 320 dbCAN HMMs to scan public metagenome datasets such as NCBI env-nr, CAMERA (Seshadri et al., 2007), JGI metagenomes (Markowitz et al., 2012), human gut metagenomes (MetaHIT) (Qin et al., 2010), and cow rumen metagenomes (Hess et al., 2011), as well as plant (Goodstein et al., 2012), bacterial, and fungal genomes.

Tests on Arabidopsis thaliana (a plant) and C. thermocellum (a bacterium), using CAZyDB as the positive set, suggest that the automated CAZyme annotation achieves fairly good accuracy (A. thaliana: sensitivity = 96.3%, precision = 78.8%, and average = 87.6%; C. thermocellum: sensitivity = 99.3% and precision = 89.4%). In particular, the sensitivity is over 95% for both organisms, meaning that dbCAN annotation rarely misses true CAZyme proteins.
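The two measures reported above can be computed from a set of predicted CAZymes and a reference (positive) set, as in the following minimal Python sketch. The protein IDs are invented toy data, and taking the "average" to be the arithmetic mean of sensitivity and precision is an assumption, although it is consistent with the A. thaliana figures, since (96.3 + 78.8) / 2 = 87.55.

```python
def sensitivity_precision(predicted, reference):
    """Return (sensitivity, precision) for two sets of protein IDs.

    sensitivity = TP / (TP + FN): share of reference CAZymes recovered
    precision   = TP / (TP + FP): share of predictions in the reference set
    """
    tp = len(predicted & reference)   # predicted and in the reference set
    fp = len(predicted - reference)   # predicted but not in the reference set
    fn = len(reference - predicted)   # in the reference set but missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


# Toy example: 4 reference CAZymes, 5 predictions, 3 of them correct.
reference = {"p1", "p2", "p3", "p4"}
predicted = {"p1", "p2", "p3", "p5", "p6"}

sens, prec = sensitivity_precision(predicted, reference)
average = (sens + prec) / 2  # assumed definition of the reported "average"
print(f"sensitivity={sens:.3f} precision={prec:.3f} average={average:.3f}")
```

A precision below the sensitivity, as in both test organisms, means that the extra predictions outnumber the missed proteins. Some of those extra predictions may be genuine CAZymes not yet listed in the reference set, so precision measured this way can underestimate the true value.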

FOLy Database

Inspired by CAZyDB, Levasseur et al. developed a new database named FOLy for the classification of ligninases in fungi (Levasseur et al., 2008), as these enzymes are critical for breaking down lignin in biomass but are not included in CAZyDB. Like CAZyDB, FOLyDB started from biochemically characterized proteins or structures and recruited homologs from the GenBank, UniProt, and PDB databases. Based on sequence similarity, three lignin oxidase families and seven lignin-degrading auxiliary enzyme families were created, each containing biochemically characterized proteins together with their sequence homologs. FOLyDB likewise features expert manual curation of newly published literature, so that more characterized proteins can be included to create new families and populate the database. Like CAZyDB, it is not designed for automated genome annotation, but BLAST and Pfam domain-based searches against the proteins annotated in FOLyDB have been widely used to annotate newly sequenced genomes for ligninases.