DE NOVO TRANSCRIPTOME ASSEMBLY, ANNOTATION, AND EXPRESSION

To produce statistically reliable and comparable RNA-Seq data, cDNA library construction and sequencing were performed for each of the duplicate +N reactors and each of the duplicate -N reactors. Over 88 million raw sequencing reads were generated and subjected to quality-score- and length-based trimming, resulting in a high-quality (HQ) read data set of 87.09 million sequences (average Phred score of 35) with an average read length of 77 bp. Using a multiple k-mer based de novo transcriptome assembly strategy (k-mers 23, 33, 63, and 83) [23], HQ reads were assembled into 56,550 transcripts with an average length of 1,459 bp and a read coverage of 1,444× (Figure 2C). The generated transcripts were searched against the National Center for Biotechnology Information's (NCBI) nonredundant and plant RefSeq databases [24], and the majority of transcripts showed significant matches to closely related green microalgae species (Figure 2A, B), including C. variabilis (~85% of all transcripts), C. reinhardtii (~2.6%), and V. carteri (~3.4%) (Figure 2A). With additional annotation using KEGG services and Gene Ontology (GO), a total of 23,520 transcripts were associated with at least one GO term, and 4,667 transcripts were assigned enzyme commission (EC) numbers. Overall, 14,957 transcripts had KO identifiers and were annotated as putative genes and protein families. This assembly provided a reliable, well-annotated transcriptome for downstream RNA-Seq data analysis.
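As an illustration of what a multiple k-mer strategy works with, the following minimal sketch decomposes a single read into k-mers at the four sizes quoted above. It is not the assembler used in the study, which this section does not name; the function names `kmers` and `kmer_counts_by_size` and the synthetic read are hypothetical and chosen only for the example.

```python
def kmers(read, k):
    """Return every overlapping substring of length k in the read."""
    # A read of length L yields L - k + 1 k-mers, or none if L < k.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def kmer_counts_by_size(read, sizes=(23, 33, 63, 83)):
    """Count how many k-mers one read contributes at each k-mer size."""
    return {k: len(kmers(read, k)) for k in sizes}


# Synthetic 77 bp read (the average HQ read length reported above).
example_read = "ACGT" * 19 + "A"
counts = kmer_counts_by_size(example_read)
print(counts)
```

A 77 bp read contributes fewer k-mers as k grows and none at k = 83, so under this simplified view only reads longer than the average would take part at the largest k-mer size.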

Following transcriptome assembly and annotation, HQ reads obtained from each experimental condition were individually mapped to the generated assembly to determine transcript abundances as RPKM values. To determine fold-change differences between +N and -N transcripts, non-normalized read counts were fed into the DESeq package (v1.5.1), and variance and mean dependencies were accounted for [25]. Based on the negative binomial distribution model used in the DESeq package, 25,896 of the 56,550 non-redundant transcripts were up-regulated under the -N condition. Plotting transcript fold-change levels shows a high correlation among the biologically replicated sequencing runs, as indicated by Euclidean distances (Figure 2D). Overall, 15,987 transcripts had significant differential regulation (q < 0.05; Figure 3A). A complete table of fold changes with significance levels for all genes assessed is presented in Additional file 3.
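The two quantities used in this paragraph can be sketched in a few lines. The first function applies the standard RPKM definition (reads per kilobase of transcript per million mapped reads). The second computes Benjamini-Hochberg adjusted p-values, on the assumption that the q-values reported here are of that kind; the text does not state the adjustment method. This is an illustrative sketch, not the DESeq implementation, and DESeq itself was given raw counts rather than RPKM values. All function names and numbers below are hypothetical.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical transcript: 100 reads, 1,000 bp long, 1 million mapped reads.
example_rpkm = rpkm(100, 1000, 1_000_000)

# Hypothetical p-values for four transcripts.
example_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [q < 0.05 for q in example_q]
print(example_rpkm, example_q, significant)
```

RPKM corrects for transcript length and sequencing depth, which makes abundances comparable across transcripts, whereas count-based models such as the negative binomial expect unnormalized counts as input.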

We further investigated the alignment of HQ reads to the reference genomes of C. reinhardtii and V. carteri in order to improve and extend our transcriptomic analysis to the detection of splicing events and alternative isoform formation (Figure 3B, C). Although the majority of annotated orthologs were identified from these closely related microalgae species, very poor mapping (i.e., <5% of reads) was observed between the RNA-Seq data of N. oleoabundans and the genomes of C. reinhardtii and V. carteri. As a result, the number of transcripts annotated and evaluated for differential expression was suboptimal, and the genomes of these most closely related organisms were not used for gene expression analysis.

FIGURE 2: De novo assembly and mapping results. (A, B) Top-hit species distribution for BLASTX matches for the N. oleoabundans transcriptome; (C) Cumulative transcript length (bp) frequency distribution of the N. oleoabundans transcriptome assembly; (D) Heat map demonstrating the top 100 most differentially expressed genes in the biological replicates of +N and -N conditions.

FIGURE 3: (A) MvA plot contrasting gene expression levels between the -N and +N scenarios based on reads mapped to the N. oleoabundans transcriptome. The x-axis represents the mean expression level at the gene scale, and the y-axis represents the log2 fold change from -N to +N. Negative fold changes indicate up-regulation of -N genes. Lighter gray dots are genes that are significant at a false discovery rate of 5%; (B) MvA plot for reads mapped to the C. reinhardtii genome; and (C) MvA plot for reads mapped to the V. carteri genome.