CAZy Database

For bioenergy research, CAZymes are arguably the most important class of enzymes. Since the early 1990s, the CAZyDB team has classified and annotated CAZyme proteins from GenBank, UniProt and PDB into protein families. It is the original database, having defined over 300 CAZyme protein families over the past 20 years, and the most comprehensive one, providing high-quality manual annotation extracted from the literature (Cantarel et al., 2009). Its Web site is updated every few weeks, mainly by assigning new proteins from public databases to existing CAZyme families through sequence similarity searches, or by creating new families when newly biochemically characterized CAZyme proteins (supported by published papers) do not belong to any existing family. The functional annotation (e.g. known activities) of a family is also updated when relevant literature appears.

TABLE 6.2 Bioenergy-Related Databases

| Category | Database | Description | References |
|---|---|---|---|
| General | CAZyDB | General carbohydrate-active enzyme database; a classification scheme with five classes (GH, GT, CE, PL and CBM) and over 300 families; supported by the biochemical literature; links to proteins in GenBank and UniProt and to structures in PDB; subfamilies for selected families; Enzyme Commission annotation and biochemically curated function annotation | (Cantarel et al., 2009) |
| General | CAT | CAZyme analysis toolkit; allows CAZyme annotation of user-submitted data using BLAST (basic local alignment search tool) and Pfam-based search | (Park et al., 2010) |
| General | dbCAN | Web server for automated CAZyme annotation; allows submission of predicted protein sequences from newly sequenced genomes and metagenomes and returns a table and a graphical diagram showing the matched CAZyme domains; users can also download HMMs representing all CAZyme domains to perform annotation locally by running HMMER3 (hmmer.org) | (Yin et al., 2012) |
| Plant biomass formation | Survey of databases for cell wall synthesis | A paper that reviewed nine public databases | (Cao et al., 2010) |
| Plant biomass formation | XTH World | Xyloglucosyl transferase gene nomenclature; gene structure and literature in Arabidopsis, rice and tomato | |
| Plant biomass formation | MAIZEWALL | Cell wall gene catalogue, expression data and literature in maize | (Guillaumie et al., 2007) |
| Plant biomass formation | coreCarb | PlaNet family tool; Arabidopsis CAZyme proteins; sequence, expression, regulon (coexpressed genes) and phylogeny data | (Mutwil et al., 2009) |
| Plant biomass formation | GolgiP | Web server for prediction of Golgi-localized proteins | (Chou et al., 2010) |
| Plant biomass formation | Purdue cell wall genomics | General classification scheme for plant cell wall synthesis, with mutant and spectrotype information; also includes lignin synthesis and modification genes, NDP-sugar synthesis genes, signaling proteins, etc.; phylogeny for gene families in Arabidopsis, rice, maize and sorghum; literature | (Yong et al., 2005) |
| Plant biomass formation | UC-Riverside cell wall navigator | Classification scheme for plant cell wall synthesis proteins similar to that of the Purdue cell wall database, but without lignin-related and signaling proteins; includes sequence, literature and microarray expression data; primarily for Arabidopsis and rice, but also more generally from UniProt | (Girke et al., 2004) |
| Plant biomass formation | Stanford cellulose | Designed for the CesA (cellulose synthase) superfamily and homologs; deprecated web site | (Richmond and Somerville, 2000) |
| Plant biomass formation | Rice GT | Part of the rice phylogenomics database; rice GT protein phylogeny, sequence, expression, mutants, orthologs, BLAST | (Cao et al., 2008) |
| Plant biomass formation | Csl families | Web site supplemental to (Yin et al., 2009); protein sequences, alignments and phylogeny of the CesA superfamily in fully sequenced plant and algal genomes | (Yin et al., 2009) |
| Plant biomass formation | pDAWG | CAZymes in fully sequenced plant and algal genomes based on a search against CAZyDB; phylogeny, predicted subcellular localization and protein-protein interaction data; BLAST | (Mao et al., 2009) |
| Plant biomass formation | PPI for cell wall | Protein-protein interaction graphs for cell wall-related proteins in Arabidopsis | (Zhou et al., 2010a) |
| Plant biomass formation | PlaNet | General coexpression database for seven plant organisms and comparison among them; coexpression based on highest reciprocal rank and clustering using the Heuristic Cluster Chiseling Algorithm (HCCA, Mutwil et al.); query gene-centered display of the coexpression network | (Mutwil et al., 2011) |
| Plant biomass formation | Cell wall coexpression database | Biclustering coexpression analysis of cell wall-related genes from the Purdue cell wall genomics database; coexpression modules and graphs generated using Cytoscape (Shannon et al., 2003); cis-regulatory elements identified in promoter regions of genes of the same module | (Wang et al., 2012) |
| Plant biomass formation | ATTED | General coexpression database and predicted cis-regulatory elements for Arabidopsis and rice; mutual rank based; also identifies conserved coexpression links and refers to protein-protein interaction data | (Obayashi et al., 2009) |
| Biomass degradation | GAS db | Glycosyl hydrolase AnnotationS (GAS) database; GH data identified from UniProt and JGI metagenomes based on CAZyDB and Pfam searches; features graphical domain diagrams and comparison between two selected bacteria | (Zhou et al., 2010b) |
| Biomass degradation | FOLyDB | Fungal enzymes for lignin degradation; 10 families of lignin oxidases and auxiliary enzymes; proteins from GenBank, UniProt and PDB | (Levasseur et al., 2008) |
| Biomass degradation | PeroxiBase | General peroxidase database including peroxidases (EC 1.11.1.x) from over 1000 organisms; lignin-related peroxidases are a subset of the database | (Fawal et al., 2012) |
| Biomass degradation | LccED | General database of laccases and their homologs in the multicopper oxidase superfamily | (Sirim et al., 2011) |
| Misc | Biofuel feedstock genomic resource (BFGR) | Database of 54 plant organisms with sequenced genomes or a significant amount of EST (expressed sequence tag) data; integrated data including expression, orthologs and paralogs, pathway prediction and functional information | (Childs et al., 2012) |
| Misc | BESC-KB | Knowledgebase for the Bioenergy Science Center of DOE; a web portal to a number of computational tools and databases dedicated to bioenergy research and developed within the center | (Syed et al., 2012) |
| Misc | Pathway-genome database of poplar | Populus trichocarpa metabolic pathways generated automatically through Pathway Tools; currently the NDP-sugar biosynthetic pathways have been manually curated by experts | (Nag et al., 2012) |
| Misc | JGI IMG/M | Joint Genome Institute’s integrated microbial genomes and metagenomes web site | (Markowitz et al., 2012) |
| Misc | Phytozome | JGI’s plant genome web site; currently most sequenced plant genomes are available on this web site | (Goodstein et al., 2012) |

Currently the database comprises five classes of protein families: in addition to GHs and GTs, there are three other classes, carbohydrate esterases (CEs), polysaccharide lyases (PLs) and carbohydrate-binding modules (CBMs). As mentioned above, GTs build polysaccharides or glycoconjugates, while GHs break them down. CEs and PLs also break down carbohydrate molecules, but use different mechanisms or cleave different chemical bonds. CBMs, as the name indicates, are structural modules that recognize and bind different sugars. The five classes currently contain 94 GT families, 131 GH families, 16 CE families, 22 PL families and 64 CBM families. Each class also has an unclassified family, which holds proteins annotated as belonging to that class but not assignable to any of its existing families. Each family is named with the class abbreviation followed by a sequential number, e.g. GT2. Note that such a name does not indicate the biochemical activity of a family, because families are defined solely by sequence similarity: in many cases one family contains proteins characterized with different biochemical activities. Recent efforts from the CAZyDB team suggest that further dividing families into subfamilies could be useful, since a subfamily may contain proteins with the same activity (Stam et al., 2006; Lombard et al., 2010; Aspeborg et al., 2012).
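The naming convention described above (class abbreviation followed by a sequential number) is simple enough to handle programmatically. The following is a minimal Python sketch, not part of CAZyDB itself: the function name `parse_family` and the dictionary `FAMILY_COUNTS` are illustrative assumptions, and the sketch deliberately ignores unclassified families and subfamily names, whose exact labels are not given in the text. The per-class counts are the ones quoted above.

```python
import re

# Family counts per class as quoted in the text (hypothetical variable name).
FAMILY_COUNTS = {"GT": 94, "GH": 131, "CE": 16, "PL": 22, "CBM": 64}

# A family name is a class abbreviation followed by a sequential number.
# Unclassified families and subfamilies are intentionally not covered here.
_FAMILY_RE = re.compile(r"^(GH|GT|CE|PL|CBM)(\d+)$")


def parse_family(name):
    """Split a family name such as 'GT2' into its class and number."""
    match = _FAMILY_RE.match(name.strip().upper())
    if match is None:
        raise ValueError(f"not a recognized CAZyme family name: {name!r}")
    return match.group(1), int(match.group(2))


# The number carries no functional meaning; it only orders families in a class.
family_class, family_number = parse_family("GT2")

# Summing the per-class counts gives the total number of families.
total_families = sum(FAMILY_COUNTS.values())
```

Because the number is assigned sequentially, the parsed result identifies a family but says nothing about its biochemical activity, which must be looked up in the family's annotation.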

CAZyDB’s annotation has also evolved over the past 20 years. Among the 327 CAZyme families as of December 2012, 10 are deprecated: they were created during the lifetime of CAZyDB but later deleted, because they were shown to be unrelated to carbohydrate metabolism or for other reasons. However, to keep the existing nomenclature unchanged, these family names remain in the system and are marked as deleted on their Web pages. Other examples include the CE10 family, whose Web page has not been updated since 2002 because, after the family was created, most CE10 members were shown not to take carbohydrates as substrates; and CBM33, which was thought to be a carbohydrate-binding module but was later shown likely to be an oxidase family.
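The policy of keeping deleted family names in place, so that later families never reuse a number, can be illustrated with a small data structure. This is only a sketch under stated assumptions: the class `FamilyRegistry`, its methods, and the placeholder class abbreviation "XX" are all hypothetical and do not reflect how CAZyDB is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class FamilyRegistry:
    """Toy registry for one enzyme class; family numbers are never reused."""

    class_name: str
    status: dict = field(default_factory=dict)  # number -> "active" or "deleted"

    def create(self):
        # The next number follows the highest ever issued, deleted or not.
        number = max(self.status, default=0) + 1
        self.status[number] = "active"
        return f"{self.class_name}{number}"

    def delete(self, number):
        # Mark the family as deleted instead of removing its entry.
        if number not in self.status:
            raise KeyError(number)
        self.status[number] = "deleted"

    def active(self):
        return [
            f"{self.class_name}{n}"
            for n in sorted(self.status)
            if self.status[n] == "active"
        ]


registry = FamilyRegistry("XX")  # "XX" is a placeholder, not a real class
created = [registry.create() for _ in range(3)]
registry.delete(2)
newest = registry.create()
```

In this sketch the fourth family is named XX4 rather than XX2, so names cited in older literature keep their original meaning even after a family is withdrawn.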

For a decade, CAZyDB has provided an HTML page for each family that lists member proteins and associated functional information. In recent updates, CAZyDB added a Web page for each genome, listing the GenBank protein accession numbers of that genome together with the CAZyme family assignment of each protein; this set is termed the "CAZyome" of an organism. So far, almost 2400 genomes, spanning eukaryotes, prokaryotes and viruses, are annotated in CAZyDB. This genome-scale annotation of CAZyme proteins is reported to be semiautomatic (Coutinho and Henrissat, 2011): an automated, domain module-based search is performed first on the back end, followed by manual curation to remove false positives and add false negatives. The process is accurate and of high quality, but time consuming, because the curation is manual and requires expert knowledge. Indeed, such genome annotation can only be done in collaboration with the CAZyDB team, and the process is often invisible to, and outside the control of, users such as the people who sequenced the genome. Over the past 10 years, the CAZyDB team has provided expert CAZyme annotation for dozens of genome sequencing projects, leading to many collaborative genome annotation papers.
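The two-stage workflow described above, an automated search followed by manual correction, can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the text does not say how the automated search is scored, so the E-value cutoff, the function names, and the protein accessions (`PROT_A` and so on) are all hypothetical.

```python
from collections import Counter


def automated_assignments(hits, max_evalue=1e-5):
    """Stage 1: keep domain hits at or below an E-value cutoff (assumed value)."""
    return {
        (protein, family)
        for protein, family, evalue in hits
        if evalue <= max_evalue
    }


def curate(assignments, false_positives, false_negatives):
    """Stage 2: apply a curator's removals and additions to the automated set."""
    return (set(assignments) - set(false_positives)) | set(false_negatives)


def cazyome(assignments):
    """Group curated (protein, family) pairs by protein accession."""
    table = {}
    for protein, family in sorted(assignments):
        table.setdefault(protein, []).append(family)
    return table


# Hypothetical domain hits: (protein accession, family, E-value).
hits = [
    ("PROT_A", "GH5", 1e-40),
    ("PROT_A", "CBM2", 1e-12),
    ("PROT_B", "GT2", 1e-3),  # too weak to pass the automated step
    ("PROT_C", "CE10", 1e-20),
]

auto = automated_assignments(hits)
curated = curate(
    auto,
    false_positives={("PROT_C", "CE10")},  # removed by the curator
    false_negatives={("PROT_B", "GT2")},  # added back by the curator
)
genome_cazyome = cazyome(curated)
family_counts = Counter(f for fams in genome_cazyome.values() for f in fams)
```

The sketch makes the cost structure visible: the first stage is a cheap filter, while the sets passed to `curate` stand in for expert decisions that cannot be automated, which is why the process does not scale to arbitrary user genomes.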