CAT and dbCAN

08.08.2015 Bioenergy Research: Advances and Applications

As more bioenergy-related plant and microbial genomes, as well as environmental metagenomes, are sequenced, there is an urgent need for automated CAZyme annotation. Although such annotation will not be as accurate as the expert annotation in CAZyDB, it is expected to be much faster, and users can control the annotation process themselves. Moreover, all newly sequenced genomes now rely on generic protein domain/family databases such as Pfam (Finn et al., 2006), InterPro (Hunter et al., 2009), and the Conserved Domain Database (CDD; Marchler-Bauer et al., 2009) for automatic genome annotation. Annotation from these databases is often too general to pinpoint the exact function, which still has to be determined experimentally. Nevertheless, most genome annotators remain interested in such genome-scale annotation because it gives a quick summary of what the genome encodes, how large the gene families are, and how they compare with those of other genomes.

In fact, even CAZyDB’s manual annotation of newly sequenced genomes (assigning proteins to existing CAZyme families) is unlikely to be 100% correct. Because every new genome contains a high percentage of proteins that have not been studied experimentally, manual curation still relies largely on additional bioinformatics analysis, such as BLAST searches against public protein sequence databases (e.g. UniProt (Bairoch et al., 2005)) and domain databases (e.g. Pfam), followed by inspection of the top matches.

With this in mind, automated CAZyme annotation remains very useful, particularly for a quick, general overview of how many and which CAZymes a newly sequenced genome encodes. Using the annotated CAZyme proteins and the classification scheme in CAZyDB as the foundation, two bioinformatics efforts have been published since 2010. Both support automated CAZyme annotation of a protein sequence dataset predicted from a genome or metagenome.

The CAZyme Analysis Toolkit (CAT) (Park et al., 2010) allows a BLAST search against CAZyme proteins annotated by CAZyDB as well as a Pfam domain-based search. The simple BLAST search cannot accurately annotate the prevalent multidomain CAZyme proteins; the Pfam domain-based search can solve this problem. These Pfam domains are either listed by CAZyDB on the CAZyme family Web pages or linked to a CAZyme family through an association rule built by CAT. However, CAZyDB links only 142 (46%) of its more than 300 CAZyme families to Pfam domains. In fact, many of the Pfam domains were originally created after CAZyDB.

We recently developed dbCAN (Yin et al., 2012) to define a signature domain model for every CAZyme family. In addition to the 142 CAZyme families already annotated with a Pfam domain, we associated other CAZyme families with functional domains in CDD, a broader and more general protein domain database. In this way, we were able to find a CDD domain for 248 CAZyme families. For the remaining families, we performed literature curation by reading the relevant biochemical papers that CAZyDB links to these families. Finally, we extracted the domain regions from all member proteins annotated in CAZyDB and built a multiple sequence alignment (MSA) for each CAZyme family. These MSAs were further processed and represented as hidden Markov models (HMMs), statistical models widely used in bioinformatics (e.g. by Pfam) to represent protein sequence alignments.
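The domain-extraction step described above can be illustrated with a minimal Python sketch. It is not the dbCAN pipeline itself: the function names, the input layout and the use of 1-based inclusive coordinates are assumptions made for illustration only.

```python
def extract_domain_regions(proteins, regions):
    """Cut each annotated domain region out of its parent protein.

    proteins: dict mapping protein ID -> full amino-acid sequence
    regions:  list of (protein_id, start, end) tuples, assumed to use
              1-based inclusive coordinates (an assumption of this sketch)
    Returns a list of (header, subsequence) pairs.
    """
    records = []
    for protein_id, start, end in regions:
        sequence = proteins[protein_id]
        if not (1 <= start <= end <= len(sequence)):
            raise ValueError(f"bad coordinates for {protein_id}: {start}-{end}")
        header = f"{protein_id}/{start}-{end}"
        # Convert 1-based inclusive coordinates to a Python slice.
        records.append((header, sequence[start - 1:end]))
    return records


def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text with wrapped lines."""
    lines = []
    for header, sequence in records:
        lines.append(f">{header}")
        for i in range(0, len(sequence), width):
            lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"
```

The resulting FASTA text for one family would then be aligned with an MSA program and converted to an HMM, for example with hmmbuild from the HMMER package; the source does not state which tools were used for these two steps.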

As of June 2012, dbCAN has 320 HMMs representing 317 CAZyme families and three cellulosome modules. We provide all of these HMMs freely to the public so that users can run the domain-based tool hmmscan from the HMMER 3.0 package (hmmer.org) to annotate their genomes/metagenomes on a local computer, exactly as they would for Pfam, InterPro or CDD annotation. For users who do not know how to run hmmscan on a Linux PC, we offer a Web server (http://csbl.bmb.uga.edu/dbCAN/annotate.php) where sequences can be submitted for annotation. The 320 CAZyme family-specific HMMs are our key contribution to the carbohydrate research community and ideally should be included in general protein domain/family databases such as Pfam in the future.
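As a rough illustration of local annotation, the sketch below parses the per-domain table that hmmscan writes with its `--domtblout` option, for instance after a run such as `hmmscan --domtblout out.domtbl dbCAN.hmm proteins.faa` (the three file names are hypothetical, and the HMM file must first be indexed with hmmpress). The E-value cutoff of 1e-5 and the function name are assumptions of this sketch, not thresholds taken from the text above.

```python
from collections import defaultdict


def parse_domtblout(lines, max_evalue=1e-5):
    """Collect per-domain hits from hmmscan --domtblout text.

    Returns {protein_id: [(family, ali_from, ali_to, i_evalue), ...]},
    with each protein's domains sorted by position along the sequence.
    The default cutoff is an illustrative assumption.
    """
    hits = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if len(fields) < 22:
            continue  # not a complete domain-table row
        family = fields[0]            # target name: the HMM (family model)
        protein_id = fields[3]        # query name: the protein sequence
        i_evalue = float(fields[12])  # independent E-value of this domain
        ali_from = int(fields[17])    # alignment start on the protein
        ali_to = int(fields[18])      # alignment end on the protein
        if i_evalue <= max_evalue:
            hits[protein_id].append((family, ali_from, ali_to, i_evalue))
    for domains in hits.values():
        domains.sort(key=lambda d: d[1])
    return dict(hits)
```

Because every domain hit is kept with its own coordinates, a multidomain protein is reported with all of its matching family models rather than only a single best match.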

In addition to the Web server, dbCAN also provides a database in which precomputed CAZyme homologs from a number of protein databases are shown on the Web. In particular, using the 320 dbCAN HMMs, we scanned public metagenome datasets such as NCBI env-nr, CAMERA (Seshadri et al., 2007), JGI metagenomes (Markowitz et al., 2012), human gut metagenomes (Meta-HIT) (Qin et al., 2010) and cow rumen metagenomes (Hess et al., 2011), as well as plant (Goodstein et al., 2012), bacterial and fungal genomes. Tests on Arabidopsis thaliana (plant) and C. thermocellum (bacterium) using CAZyDB as the positive set suggest that the automated CAZyme annotation achieves fairly good accuracy (A. thaliana: sensitivity = 96.3%, precision = 78.8% and average = 87.6%; C. thermocellum: sensitivity = 99.3% and precision = 89.4%). In particular, the sensitivity is over 95% for both organisms, meaning that dbCAN annotation tends not to miss true CAZyme proteins.
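The text does not spell out how these figures were computed. The sketch below assumes the standard definitions (sensitivity = true positives / reference positives, precision = true positives / predictions) and takes "average" to be the arithmetic mean of the two, which is consistent with the A. thaliana numbers (96.3 and 78.8 average to about 87.6). The function name and input layout are hypothetical.

```python
def annotation_accuracy(predicted, reference):
    """Compare predicted CAZyme protein IDs with a reference (positive) set.

    predicted: iterable of protein IDs called CAZymes by the automated method
    reference: iterable of protein IDs annotated as CAZymes in the reference
    Returns sensitivity, precision and their arithmetic mean as fractions.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    # Sensitivity: share of reference CAZymes that were recovered.
    sensitivity = true_positives / len(reference) if reference else 0.0
    # Precision: share of predictions that are in the reference set.
    precision = true_positives / len(predicted) if predicted else 0.0
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "average": (sensitivity + precision) / 2,
    }
```

Note that under these definitions a prediction missing from the reference set counts against precision even if it is a genuine but not yet curated CAZyme, so precision measured this way may understate the true value.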

FOLy Database

Inspired by CAZyDB, Levasseur et al. developed a new database named FOLy for the classification of ligninases in fungi (Levasseur et al., 2008), as these enzymes are critical for breaking down lignin in biomass but are not included in CAZyDB. Like CAZyDB, FOLyDB started from biochemically characterized proteins or structures and recruited homologs from the GenBank, UniProt and PDB databases. Based on sequence similarity, three lignin oxidase families and seven lignin-degrading auxiliary enzyme families were created, each containing biochemically characterized proteins together with their sequence homologs. FOLyDB likewise features expert manual curation of newly published literature to add characterized proteins, create new families and populate the database. Like CAZyDB, it is not designed for automated genome annotation, but BLAST and Pfam domain-based searches against the annotated proteins in FOLyDB have been widely used to annotate newly sequenced genomes for ligninases.
