FUTURE PERSPECTIVES

As a perspective for the future development of bioenergy-related databases, we ask: what do we need from newly developed databases? Nucleic Acids Research has published a prestigious annual Database Special Issue for the past 20 years. Most databases published there for a particular class of proteins, such as plant TFs (Guo et al., 2008), peroxidases (Fawal et al., 2012), transporters (Saier et al., 2006) and peptidases (Rawlings et al., 2012), provide the following data or functionalities: (1) a general classification of the targeted protein families, manually collected references, and a list of characterized proteins curated from the literature and/or predicted member proteins; (2) secondary data derived from further in-depth bioinformatics analysis, such as computer-based functional annotation (e.g. Gene Ontology or protein domain annotation), sequence alignments, phylogenetic trees and predicted protein structures; (3) a simple Web-based BLAST search against the sequence database and a text search using keywords; (4) long-term maintenance, with regular updates as new data become available; and (5) ample documentation such as help, FAQ or tutorial pages. These could be considered as criteria for a good protein family database.

Although the plant biomass formation-related databases listed in Table 6.2 are all very useful, none of them has sufficiently integrated the various functional omics data. Biologists working on a model plant often want to use these data to study their genes of interest, e.g. to search fully sequenced plant genomes for orthologs, to mine transcriptome data (microarray and RNA-seq) for expression profiles or coexpressed genes, and to scan upstream regions for candidate cis-regulatory motifs. At present, all these analyses have to be done with individual bioinformatics tools or servers, which often require expert knowledge to run or to interpret the results. In addition, many of the databases are outdated and none of them includes all CWR genes. For example, Purdue's database is an excellent resource, but it lacks many of the newly characterized CWR genes, such as members of the NAC, WRKY and MYB TF families shown to control lignin synthesis; many of the newly characterized CAZyme families such as GT43, GT61 and GT75; transporters for NDP-sugars and monolignols; miRNAs; and DUF (domain of unknown function) families. It also provides little annotation data and no search functionality.

Therefore, future plant CWR gene databases should aim to include all experimentally characterized CWR genes from any organism, together with the associated sequences and functional descriptions collected from the published literature, e.g. those listed in Table 6.1. Such a list of characterized genes could be highly useful for annotating sequenced bioenergy plants such as switchgrass, poplar, maize, sorghum and Eucalyptus grandis. The CWR gene repertoires for these organisms will be highly valuable for the bioenergy research community, as researchers are trying to select candidate CWR genes to knock down or overexpress when developing transgenic plants in these model organisms. Gene families for CWR genes and other extensive secondary bioinformatics data should also be included in the databases, particularly phylogeny (used to identify orthologs among homologs), predicted cis-regulatory elements, conserved coexpression network modules of known CWR genes, expression profiles, lists of coexpressed genes including noncoding RNAs, genomic location, gene neighborhood, epigenomics, protein-protein interactions, structures, subcellular locations, single-nucleotide polymorphisms, indels, etc. Similar databases should also be developed for plant CAZymes and should include the bioinformatics-derived data types listed above. This is needed because CAZyDB currently covers only 2 (A. thaliana and Oryza sativa) of the more than 40 sequenced plant and green algal genomes, not to mention the additional incomplete genomes and transcriptomes (ESTs and RNA-seq data).
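To make the data types listed above more concrete, the following is a minimal sketch of how one entry in such a CWR gene database might be organized, together with the kind of simple keyword search mentioned earlier. It is an illustration only: the class name, field names, gene identifiers and descriptions are hypothetical placeholders, not the schema of any existing database.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CWRGeneRecord:
    """Hypothetical record for one cell wall-related (CWR) gene."""
    gene_id: str                         # placeholder identifier, not a real accession
    species: str                         # e.g. "switchgrass", "poplar"
    family: str                          # e.g. "NAC", "GT43"
    description: str = ""                # functional description from the literature
    experimentally_characterized: bool = False
    references: List[str] = field(default_factory=list)         # literature citations
    orthologs: List[str] = field(default_factory=list)          # from phylogeny
    coexpressed_genes: List[str] = field(default_factory=list)  # may include noncoding RNAs
    cis_elements: List[str] = field(default_factory=list)       # predicted motifs
    subcellular_location: Optional[str] = None


def search(records: List[CWRGeneRecord], keyword: str) -> List[CWRGeneRecord]:
    """Case-insensitive keyword search over gene family and description."""
    kw = keyword.lower()
    return [
        r for r in records
        if kw in r.family.lower() or kw in r.description.lower()
    ]


# Placeholder entries for illustration only; they are not real annotations.
records = [
    CWRGeneRecord("gene_0001", "switchgrass", "NAC",
                  "placeholder: TF controlling lignin synthesis",
                  experimentally_characterized=True),
    CWRGeneRecord("gene_0002", "poplar", "GT43",
                  "placeholder: glycosyltransferase"),
]

hits = search(records, "nac")
print([r.gene_id for r in hits])
```

Keeping curated fields (description, references) separate from predicted ones (orthologs, coexpressed genes, cis-elements) in a layout like this would let the predicted data be regenerated during routine updates without touching the manually curated entries.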

References