December 14, 2021
For bioenergy research, CAZymes are the most important enzymes. The CAZyDB team has been classifying and annotating CAZyme proteins from GenBank, UniProt and PDB into protein families since the early 1990s. CAZyDB is the original database, having defined over 300 CAZyme protein families over the past 20 years, and the most comprehensive one, providing high-quality manual annotation extracted from the literature (Cantarel et al., 2009). Its Web site is updated every few weeks, mainly by assigning new proteins from public databases to existing CAZyme families through sequence similarity search, or by creating new families when newly biochemically characterized CAZyme proteins (supported by published papers) do not belong to any existing family. The functional annotation (e.g. known activities) of a family is also updated when relevant literature appears.
TABLE 6.2 Bioenergy-Related Databases (Continued)

| Database | Description | References |
|---|---|---|
| Cell wall coexpression database | Biclustering coexpression analysis of cell wall-related genes from Purdue cell wall genomics database; coexpression modules and graphs generated using Cytoscape (Shannon et al., 2003); cis-regulatory elements identified in promoter regions of genes of the same module | (Wang et al., 2012) |
| ATTED | General coexpression database and predicted cis-regulatory elements for Arabidopsis and rice; mutual rank based; also identified conserved coexpression links and referred to protein-protein interaction data | (Obayashi et al., 2009) |
| Category: Biomass degradation | | |
| GAS db | Glycosyl hydrolase AnnotationS (GAS) database; GH data identified from UniProt and JGI metagenomes based on CAZyDB and Pfam search; features graphical domain diagrams and comparison between two selected bacteria | (Zhou et al., 2010b) |
| FOLyDB | Fungal enzymes for lignin degradation; 10 families of lignin oxidases and auxiliary enzymes; proteins from GenBank, UniProt and PDB | (Levasseur et al., 2008) |
| PeroxiBase | General peroxidase database including peroxidases (EC 1.11.1.x) from over 1000 organisms; lignin-related peroxidases are a subset of the database | (Fawal et al., 2012) |
| LccED | General database of laccases and their homologs in the multicopper oxidase superfamily | (Sirim et al., 2011) |
| Category: Misc | | |
| Biofuel feedstock genomic resource (BFGR) | Database of 54 plant organisms with sequenced genomes or a significant amount of EST (expressed sequence tag) data; integrated data including expression, ortholog and paralog, pathway prediction, and functional information | (Childs et al., 2012) |
| BESC-KB | Knowledgebase for the Bioenergy Science Center of DOE; a web portal to a number of computational tools and databases dedicated to bioenergy research and developed within the center | (Syed et al., 2012) |
| Pathway-genome database of poplar | Populus trichocarpa metabolic pathways generated automatically through Pathway Tools; currently the NDP-sugar biosynthetic pathways have been manually curated by experts | (Nag et al., 2012) |
| JGI IMG/M | Joint Genome Institute’s integrated microbial genomes and metagenomes web site | (Markowitz et al., 2012) |
| Phytozome | JGI’s plant genome web site; currently most sequenced plant genomes are available on this web site | (Goodstein et al., 2012) |

Currently the database comprises five classes of protein families: in addition to GHs and GTs, there are three other classes, carbohydrate esterases (CEs), polysaccharide lyases (PLs) and carbohydrate-binding modules (CBMs). As mentioned above, GTs build polysaccharides or glycoconjugates, while GHs break them down. CEs and PLs also break down carbohydrate molecules, but use different mechanisms or cut different chemical bonds. CBMs, as the name indicates, are structural modules that recognize and bind different sugars. Currently the five classes contain 94 GT families, 131 GH families, 16 CE families, 22 PL families and 64 CBM families. Each class also has an unclassified family, which holds proteins that are annotated as belonging to the class but cannot be assigned to any existing family in it. Each family is named with the class name followed by a sequential number, e.g. GT2. Note that such a name does not indicate any biochemical activity of the family. The reason is that these families are defined solely on the basis of sequence similarity: in many cases one family contains proteins characterized with different biochemical activities. Recent efforts from the CAZyDB team suggest that further classifying families into subfamilies could be useful, as a subfamily may contain proteins with the same activity (Stam et al., 2006; Lombard et al., 2010; Aspeborg et al., 2012).
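The naming scheme and the family counts quoted above can be illustrated with a minimal Python sketch. The regular expression, the function name `parse_family` and the dictionary `FAMILY_COUNTS` are assumptions introduced here for illustration; they are not part of CAZyDB, and the sketch does not handle the unclassified families or subfamily names.

```python
import re

# Family counts per class as quoted in the text (December 2012).
FAMILY_COUNTS = {"GT": 94, "GH": 131, "CE": 16, "PL": 22, "CBM": 64}

# A family name is the class abbreviation followed by a sequential number.
FAMILY_NAME = re.compile(r"^(GT|GH|CE|PL|CBM)(\d+)$")


def parse_family(name):
    """Split a family name such as 'GT2' into (class, number)."""
    match = FAMILY_NAME.match(name)
    if match is None:
        raise ValueError(f"not a CAZyme family name: {name!r}")
    return match.group(1), int(match.group(2))


# The name carries the class and a serial number only, not an activity.
family_class, family_number = parse_family("GT2")

# Summing the per-class counts gives the total number of families.
total_families = sum(FAMILY_COUNTS.values())
```

The total of the five per-class counts is 327, which matches the number of families reported for December 2012.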
CAZyDB’s annotation has also evolved over the past 20 years. Among the 327 CAZyme families as of December 2012, 10 are deprecated: they were created during the lifetime of CAZyDB but later deleted, either because they were shown to be unrelated to carbohydrate metabolism or for other reasons. To keep the existing nomenclature unchanged, these family names remain in the system but are marked as deleted on their Web pages. Other examples include the CE10 family, whose Web page has not been updated since 2002 because, after the family was created, most CE10 members were shown not to take carbohydrates as substrates, and CBM33, which was thought to be a carbohydrate-binding module but was later shown likely to be an oxidase family.
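The policy of keeping deleted family names in the nomenclature can be sketched as a small registry in which a family is marked as deleted rather than removed, so that its number is never reused. This is only an illustration under stated assumptions: the `Family` class, the two helper functions and the three-entry registry are hypothetical, and the example does not reflect which CAZyDB families were actually deleted.

```python
from dataclasses import dataclass


@dataclass
class Family:
    name: str
    status: str = "active"  # "active" or "deleted"
    note: str = ""


def delete_family(registry, name, reason):
    """Mark a family as deleted but keep its name in the registry."""
    registry[name].status = "deleted"
    registry[name].note = reason


def next_number(registry, cls):
    """Next sequential number for a class; deleted numbers are not reused."""
    numbers = [
        int(name[len(cls):])
        for name in registry
        if name.startswith(cls) and name[len(cls):].isdigit()
    ]
    return max(numbers, default=0) + 1


# Hypothetical registry contents, for illustration only.
registry = {name: Family(name) for name in ("GH1", "GH2", "GH3")}
delete_family(registry, "GH3", "hypothetical: not related to carbohydrates")

# GH3 stays in the registry, so a new GH family would be numbered 4.
new_number = next_number(registry, "GH")
```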
For a decade, CAZyDB has provided an HTML page for each family, listing its member proteins and associated functional information. In recent updates, CAZyDB added a Web page for each genome that lists the GenBank protein accession numbers of that genome together with the CAZyme family assignment of each protein; this set is termed the "CAZyome" of an organism. So far, almost 2400 genomes, spanning eukaryotes, prokaryotes and viruses, have been annotated in CAZyDB. This genome-scale annotation of CAZyme proteins is reported to be semiautomatic (Coutinho and Henrissat, 2011): an automated module-based search is run first in the backend, and manual curation then removes false positives and adds false negatives. The process is accurate and yields high-quality annotation, but it is time consuming because the curation is done manually and requires expert knowledge. Indeed, such genome annotation can only be done in collaboration with the CAZyDB team, and the process is often invisible to, and outside the control of, the users, e.g. the people who sequenced the genome. Over the past 10 years, the CAZyDB team has carried out expert CAZyme annotation for dozens of genome sequencing projects, which has led to many collaborative genome annotation papers.
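The two-step annotation described above, an automated search followed by manual curation, can be sketched as follows. This is not CAZyDB's actual pipeline, which is not public in this text; the function `curate`, the data layout and the protein accessions `P1` to `P3` are hypothetical and serve only to show how curation decisions turn automated hits into a CAZyome.

```python
def curate(automated_hits, false_positives, false_negatives):
    """Apply manual curation decisions to automated family assignments.

    automated_hits:  dict mapping accession -> set of family names
    false_positives: set of (accession, family) pairs to remove
    false_negatives: set of (accession, family) pairs to add
    Returns the curated CAZyome as a dict accession -> set of families.
    """
    cazyome = {acc: set(fams) for acc, fams in automated_hits.items()}
    for acc, fam in false_positives:
        cazyome.get(acc, set()).discard(fam)
    for acc, fam in false_negatives:
        cazyome.setdefault(acc, set()).add(fam)
    # Proteins left without any family are not part of the CAZyome.
    return {acc: fams for acc, fams in cazyome.items() if fams}


# Hypothetical output of the automated module-based search.
automated = {"P1": {"GH5", "CBM2"}, "P2": {"CE10"}}

# Hypothetical curator decisions.
cazyome = curate(
    automated,
    false_positives={("P2", "CE10")},
    false_negatives={("P3", "GT2")},
)
```

Keeping the curator decisions as explicit sets separate from the automated hits means the automated step can be rerun without losing the manual work, which is one plausible reason to structure such a pipeline this way.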