Switchgrass Draft Genome Sequencing Efforts

With the per-base cost of sequencing in rapid decline, draft genome sequencing is becoming increasingly feasible in complex eukaryotes. A high-quality sequence-based reconstruction of an organism's genome is an invaluable tool for deciphering the underlying biology and for comparing genomes at the population level or between species. As of this writing, 41 plant genera have a reference sequence deposited on the Phytozome genome browser (www.phytozome.net) (Goodstein et al. 2012; Table 1), with only two reported attempts at polyploid genomes: bread wheat (Brenchley et al. 2012) and switchgrass (unpublished).

At present, a Panicum virgatum version 0.0 preliminary release of genotype AP13 is available via Phytozome (www.phytozome.net/panicumvirgatum.php). The draft dataset represents 15-fold coverage of the estimated 1.6 Gbp genome and is a contig-only assembly (summarized in Table 2) of approximately 1,358 Mbp arranged in ~410k contigs (N50 of 4.2 kb; 83,229 contigs). A total of 65,878 protein-coding loci were identified, 4,193 of which are suspected to show splice variation. A subset of contigs that aligned to Setaria italica (foxtail millet) coding sequence was aligned to the foxtail millet genome and is referenced as such on the Phytozome site. These data represent the first de novo assembly of the switchgrass genome and provide a conduit for gene discovery and analysis of the effective gene space, yet they underscore the need for new technologies and approaches for deciphering large and complex genomes.

Recent reports of hybrid approaches that combine second- and third-generation sequencing technologies, such as Illumina’s HiSeq and the Pacific Biosciences RS single-molecule sequencer (PacBio correction and assembly) (Koren et al. 2012), suggest that longer, accurate reads are becoming possible and that scaffolding and super-scaffolding efforts can be augmented by this approach. However, in a polyploid, short reads from the Illumina sequencer may not accurately correct a long read (5-10 kb) from a PacBio sequencer with the correct subgenome placement in regions of the genome that share high sequence identity. The approach may not be sensitive enough to distinguish sequencing errors from subgenome-specific SNPs. Another emerging technology, Moleculo (http://www.illumina.com/technology/moleculo-technology.ilmn), creates long synthetic reads: genomic DNA is fragmented into 10 kb segments, clonally amplified, sheared, marked with a unique barcode, sequenced with Illumina technology, and assembled with proprietary bioinformatics. This approach holds promise for accurate reconstruction of longer reads and a better chance of proper subgenome placement. A more costly but traditional approach is to pursue physical mapping and minimal tile path sequencing

Table 1. Plant genomes sequenced to date

Species                         Common name              Reference
Manihot esculenta               Cassava                  Unpublished
Ricinus communis                Castor bean              Chan et al. 2010
Linum usitatissimum             Flax                     Wang et al. 2012
Populus trichocarpa             Poplar                   Tuskan et al. 2006
Medicago truncatula             Barrel medic             Young et al. 2011
Phaseolus vulgaris              Common bean              Unpublished
Glycine max                     Soybean                  Schmutz et al. 2010
Cucumis sativus                 Cucumber                 Huang et al. 2009
Prunus persica                  Peach                    Unpublished
Malus domestica                 Apple                    Velasco et al. 2010
Fragaria vesca                  Strawberry               Shulaev et al. 2010
Arabidopsis thaliana            Thale cress              Arabidopsis Genome Initiative 2000
Arabidopsis lyrata              Lyre-leaved rock cress   Hu et al. 2011
Capsella rubella                Shepherd's purse         Unpublished
Brassica rapa                   Turnip                   Wang et al. 2011
Thellungiella halophila         Thellungiella            Oh et al. 2010
Carica papaya                   Papaya                   Ming et al. 2008
Gossypium raimondii             Cotton                   Paterson et al. 2012
Theobroma cacao                 Cacao                    Argout et al. 2010
Citrus sinensis                 Sweet orange             Xu et al. 2012
Citrus clementina               Clementine               Unpublished
Eucalyptus grandis              Eucalyptus               Myburg et al. 2011
Vitis vinifera                  Grapevine                Jaillon et al. 2007
Solanum tuberosum               Potato                   Xu et al. 2011
Solanum lycopersicum            Tomato                   Tomato Genome Consortium 2012
Mimulus guttatus                Monkey flower            Unpublished
Aquilegia coerulea              Colorado blue columbine  Unpublished
Sorghum bicolor                 Sorghum                  Paterson et al. 2009
Zea mays                        Maize                    Messing et al. 2004
Setaria italica                 Foxtail millet           Zhang et al. 2012
Panicum virgatum                Switchgrass              Unpublished
Oryza sativa                    Rice                     Yu et al. 2002
Brachypodium distachyon         Purple false brome       Brachypodium Initiative 2010
Selaginella moellendorffii      Spikemoss                Banks et al. 2011
Physcomitrella patens           Moss                     Rensing et al. 2008
Chlamydomonas reinhardtii       Green alga               Li et al. 2003
Volvox carteri                  Volvox                   Prochnik et al. 2010
Coccomyxa subellipsoidea C-169  Microalgae               Blanc et al. 2012
Micromonas pusilla CCMP1545     Phytoplankton            Unpublished
Micromonas pusilla RCC299       Phytoplankton            Unpublished
Ostreococcus lucimarinus        Algae                    Palenik et al. 2007

Table 2. Current status of the switchgrass genome initiative

Genome                   1,358 Mbp in 410,030 contigs
N50                      83,229 contigs > 4.2 kb
                         1,601 scaffolds > 50 kb in size
Loci                     65,878
Alternative transcripts  4,193
Primary transcripts      47,302 complete genes

using the hierarchical BAC-by-BAC approach supplemented with second-generation sequencing. With this approach, it would be prudent to assess the ability to readily separate homeologous genomic segments. The future of the reference genome sequence for switchgrass is uncertain, but as sequencing and advanced capture technologies evolve, we will be better positioned to unravel and understand more about the composition and arrangement of the switchgrass genome.
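
For readers unfamiliar with the N50 statistic quoted for the draft assembly (4.2 kb, with 83,229 contigs), the following minimal Python sketch shows how such a figure is conventionally derived from a list of contig lengths. It assumes the standard definition: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly, and the number of contigs in that set is often called L50. The function name `n50_l50` and the toy contig lengths are hypothetical illustrations and are not taken from the switchgrass dataset or from the Phytozome pipeline.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp.

    Assumes the conventional definition: sort contigs from longest to
    shortest and accumulate lengths until at least half of the total
    assembly length is covered.
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # 'length' is the N50; 'count' is the number of contigs needed (L50)
            return length, count


# Toy contig lengths in bp (hypothetical, for illustration only)
toy_contigs = [8000, 6000, 5000, 4200, 3000, 2000, 1000, 800]
n50, l50 = n50_l50(toy_contigs)
print(f"N50 = {n50} bp, L50 = {l50} contigs")
```

Read this way, the reported values would mean that the 83,229 longest contigs, each at least 4.2 kb long, account for about half of the 1,358 Mbp assembly.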