DISCUSSION OF NGS APPLICATIONS

Rapid progress in DNA sequencing technology has substantially reduced costs and substantially increased throughput and accuracy. With more and more organisms being sequenced, a flood of genetic data is generated every day. Progress in genomics has moved steadily forward thanks to this revolution in sequencing technology. Additionally, other types of large-scale studies in exomics, metagenomics, epigenomics, and transcriptomics have all become reality. These studies not only provide knowledge for basic research but also afford immediate practical benefits. Scientists across many fields are using these data to develop better-thriving crops with higher yields, improved livestock, and improved diagnostics, prognostics, and therapies for cancer and other complex diseases.

BGI is on the cutting edge of translating genomics research into molecular breeding and disease association studies, in the belief that agriculture, medicine, drug development, and clinical treatment will eventually enter a new stage through a more detailed understanding of the genetic components of all organisms. BGI is primarily focused on three projects. (1) The Million Species/Varieties Genomes Project aims to sequence a million economically and scientifically important plants, animals, and model organisms, including different breeds, varieties, and strains. This project is best represented by our sequencing of the genomes of the giant panda, potato, macaque, and others, along with multiple resequencing projects. (2) The Million Human Genomes Project focuses on large-scale population and association studies that use whole-genome or whole-exome sequencing strategies. (3) The Million Eco-System Genomes Project has the objective of sequencing the metagenome and cultured microbiome of several different environments, including microenvironments within the human body [25]. Together, these are called the 3M project.

The following paragraphs briefly summarize each of these applications: de novo sequencing, mate-pair sequencing, whole-genome or target-region resequencing, small RNA, transcriptome, RNA-seq, epigenomics, and metagenomics.

In DNA de novo sequencing, a library with an insert size below 800 bp is defined as a DNA short-fragment library; it is commonly used in de novo sequencing and resequencing research. Skovgaard et al. [26] combined WGS (whole-genome sequencing) with genome copy number analysis to identify mutations that suppress the growth deficiency imposed by excessive initiations from the E. coli origin of replication, oriC.

Mate-pair library sequencing is significantly beneficial for de novo sequencing, because the method can reduce gap regions and extend scaffold length. Reinhardt et al. [27] developed a novel method for de novo genome assembly by analyzing data from high-throughput short-read sequencing technology. They assembled genomes into large scaffolds at a fraction of the traditional cost and without using a reference sequence. The assembly of one sample yielded an N50 scaffold size of 531,821 bp, with >75% of the predicted genome covered by scaffolds over 100,000 bp.
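The two assembly statistics quoted above can be illustrated with a minimal sketch. It assumes the common definition of N50 (the length of the scaffold at which the running total of lengths, sorted from longest to shortest, first reaches half of the total assembly length). The function names, the toy scaffold lengths, and the 2 Mb genome size are hypothetical and are not taken from [27].

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first scaffold that pushes the running total to half of the
        # assembly length defines the N50.
        if running >= half_total:
            return length


def fraction_in_long_scaffolds(lengths, min_length, genome_size):
    """Fraction of a predicted genome size covered by scaffolds >= min_length."""
    return sum(l for l in lengths if l >= min_length) / genome_size


# Toy scaffold lengths in bp (made-up values for illustration only).
toy_scaffolds = [600_000, 400_000, 250_000, 150_000, 80_000, 20_000]
toy_genome_size = 2_000_000  # hypothetical predicted genome size in bp

print(n50(toy_scaffolds))
print(fraction_in_long_scaffolds(toy_scaffolds, 100_000, toy_genome_size))
```

A larger N50 indicates that more of the assembly sits in long scaffolds, which is why mate-pair libraries that bridge gaps tend to raise it.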

Whole-genome resequencing determines the complete DNA sequence of an organism’s genome, including all chromosomal DNA, at a single time and aligns it to the reference sequence. Mills et al. [28] constructed a map of unbalanced SVs (genomic structural variants) based on whole-genome DNA sequencing data from 185 human genomes on the SOLiD platform; the map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications [28]. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysis of their origin and functional impact [28].

Whole-genome resequencing is an effective way to study functional genes, but its high cost and massive data volume are the main problems for most researchers. Target-region sequencing addresses these problems. Microarray capture is a popular approach to target-region sequencing; it uses hybridization to arrays containing synthetic oligonucleotides that match the target DNA sequence. Gnirke et al. [29] developed a capture method that uses RNA “baits” to capture target DNA fragments from the “pond” and then uses the Illumina platform to read out the sequence. About 90% of uniquely aligning bases fell on or near bait sequence; up to 50% lay on exons proper [29].

Fehniger et al. used two platforms, Illumina GA and ABI SOLiD, to define the miRNA transcriptomes of resting and cytokine-activated primary murine NK (natural killer) cells [30]. Using a unique bioinformatics pipeline, they identified 302 known and 21 novel mature miRNAs from small RNA libraries of NK cells. These miRNAs are expressed over a broad range and exhibit isomiR complexity, and a subset is differentially expressed following cytokine activation; these observations emerged from miRNA identification on both the Illumina GA and SOLiD instruments [30].

The transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA, and other noncoding RNA, produced in one cell or a population of cells. In recent years, next-generation sequencing has been used to study the transcriptome, whereas DNA microarray technology was used in the past. Adamidi et al. [31] designed an efficient strategy for sequencing the S. mediterranea transcriptome. The catalog of assembled transcripts and the identified peptides in this study dramatically expand and refine planarian gene annotation, as demonstrated by the validation of several previously unknown transcripts with stem cell-dependent expression patterns.

RNA-seq is a new RNA sequencing method for studying mRNA expression. Its sample preparation is similar to that of transcriptome sequencing, except for the enzyme used. To estimate the technical variance, Marioni et al. [32] analyzed kidney RNA samples on both the Illumina platform and Affymetrix arrays. Additional findings, such as low-expressed genes, alternative splice variants, and novel transcripts, were obtained on the Illumina platform. Bradford et al. [33] compared data from an RNA-seq library on the SOLiD platform with Affymetrix Exon 1.0ST arrays and found a high degree of correspondence between the two platforms in terms of exon-level fold changes and detection. The greatest detection correspondence was seen when the background error rate in RNA-seq was extremely low. The difference between RNA-seq and transcriptome sequencing is less obvious on SOLiD than on Illumina.

There are two kinds of epigenetic applications: chromatin immunoprecipitation and methylation analysis. Chromatin immunoprecipitation (ChIP) is an immunoprecipitation technique used to study the interaction between protein and DNA in a cell; it can also locate histone modifications at specific positions in the genome. Based on next-generation sequencing technology, Johnson et al. [34] developed a large-scale chromatin immunoprecipitation assay to identify motifs, especially noncanonical NRSF-binding motifs. The data display sharp resolution of binding position (±50 bp), which, together with high sensitivity and specificity (ROC (receiver operator characteristic) area >0.96) and statistical confidence (P < 10⁻⁴), is important for inferring new candidate interactions. Another important epigenetic application is DNA methylation analysis. In vertebrates, DNA methylation typically occurs at CpG sites, where methylation converts cytosine to 5-methylcytosine. Chung presented whole-methylome sequencing on the SOLiD platform to study the difference between two bisulfite conversion methods (in solution versus in gel) [35].
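The logic behind bisulfite-based methylation calling can be sketched in a few lines. Bisulfite treatment converts unmethylated cytosine to uracil, which is read as T after amplification, whereas 5-methylcytosine is protected and is still read as C. The sketch below assumes a single forward-strand read already aligned to the reference without gaps; the function name and the toy sequences are hypothetical, and this is not the pipeline used in [35].

```python
def call_cpg_methylation(reference, read):
    """Classify each CpG cytosine covered by a gap-free, forward-strand read.

    Returns a dict mapping the 0-based position of each CpG cytosine in the
    reference to "methylated", "unmethylated", or "ambiguous".
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have the same length")
    ref = reference.upper()
    rd = read.upper()
    calls = {}
    for i in range(len(ref) - 1):
        if ref[i:i + 2] != "CG":
            continue  # only cytosines in a CpG context are examined
        if rd[i] == "C":
            calls[i] = "methylated"      # protected from conversion
        elif rd[i] == "T":
            calls[i] = "unmethylated"    # converted by bisulfite
        else:
            calls[i] = "ambiguous"       # sequencing error, SNP, or N
    return calls


# Toy example (made-up sequences): CpG cytosines at positions 1, 5, and 9.
toy_reference = "ACGTTCGACCGA"
toy_read = "ACGTTTGATTGA"  # position 1 stays C; the other cytosines read as T
print(call_cpg_methylation(toy_reference, toy_read))
```

In practice, many reads cover each site and the methylation level is estimated as the fraction of reads that retain a C, which is also how the efficiency of different conversion methods can be compared.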

World-class genome projects include the 1000 Genomes Project, the human ENCODE project, and the Human Microbiome Project (HMP), to name a few. BGI takes an active role in these and many more ongoing projects, such as the 1000 Animal and Plant Genome project, the MetaHIT project, the Yanhuang project, LUCAMP (Diabetes-associated Genes and Variations Study), ICGC (International Cancer Genome Consortium), the Ancient human genome, the 1000 Mendelian Disorders Project, the Genome 10K Project, and so forth [25]. These international collaborative genome projects have greatly advanced genomics research and its applications in healthcare and other fields.

To manage multiple projects, including large and complex ones with up to tens of thousands of samples, a superior and sophisticated project management system is required to handle information processing from the very beginning of sample labeling and storage through library construction, multiplexing, sequencing, and informatics analysis. Research-oriented bioinformatics analysis and follow-up experimental processes are not included. Although the adoption of automation has greatly reduced human intervention in experiments, all other procedures carried out manually still have to be managed. BGI has developed the BMS system and a cloud service for efficient information exchange and project management. Behavior management mainly follows the Japanese 5S on-site model. Additionally, BGI has passed the ISO9001 and CSPro (authorized by Illumina) QC systems and is currently taking the CLIA (Clinical Laboratory Improvement Amendments) and ASHI (American Society for Histocompatibility and Immunogenetics) tests. A quick, standardized, and open feedback system guarantees an efficient troubleshooting pathway and high performance. One example is an instrument design failure of the TruSeq v3 flowcell that resulted in bubble appearance (defined by Illumina as the “bottom-middle-swath” phenomenon) and random N in reads. This potentially compromises sequencing quality, GC composition, and throughput. It affects not only the small area where the bubble is located, resulting in N base calls, but also the focus of nearby areas, including the whole swath and the adjacent swath. Filtering parameters have to be determined to ensure quality raw data for bioinformatics processing. Led by the NGS tech group, joint meetings were called to analyze and troubleshoot this problem, to discuss strategies that minimize its effect on cost and project time, to construct a communication channel, and to statistically summarize compensation, in order to provide the best project management strategy for this situation. Some reagent QC examples are summarized in Liu et al. [36].
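One simple form that the filtering parameters mentioned above could take is a cap on the fraction of N base calls per read. The following is a minimal sketch of that idea only; the function names, the toy reads, and the 10% threshold are hypothetical and do not reproduce BGI's actual filtering criteria.

```python
def n_fraction(seq):
    """Fraction of bases in a read that were called as N (empty reads count as 1.0)."""
    if not seq:
        return 1.0
    return seq.upper().count("N") / len(seq)


def filter_reads(reads, max_n_fraction=0.1):
    """Split (name, sequence) pairs into kept and dropped read names.

    max_n_fraction is an illustrative threshold, not a recommended value.
    """
    kept, dropped = [], []
    for name, seq in reads:
        if n_fraction(seq) <= max_n_fraction:
            kept.append(name)
        else:
            dropped.append(name)
    return kept, dropped


# Toy reads (made-up sequences for illustration only).
toy_reads = [
    ("read1", "ACGTACGTAC"),  # no N
    ("read2", "ACGTNCGTAC"),  # 1 of 10 bases is N
    ("read3", "NNNNACGTNN"),  # 6 of 10 bases are N
    ("read4", ""),            # empty read
]
kept, dropped = filter_reads(toy_reads)
print(kept, dropped)
```

A real pipeline would typically combine such a rule with base-quality thresholds and, for a localized defect such as a bubble, with the tile or swath coordinates recorded for each read.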

BGI is establishing its cloud services. Combined with a choice of advanced NGS technologies, a plug-and-run informatics service is handy and affordable. A range of software is available, including BLAST, SOAP, and SOAPsnp for sequence alignment and pipelines for RNA-seq data. SNP-calling programs such as Hecate and Gaea are also about to be released. Big-data studies from the whole spectrum of life and biomedical sciences can now be shared and published in a new journal, GigaScience, cofounded by BGI and BioMed Central. It has a novel publication format: each piece of data links to a standard manuscript publication with an extensive database that hosts all associated data, data analysis tools, and cloud-computing resources. The scope covers not just omic-type data and the fields of high-throughput biology currently serviced by large public repositories but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology, and other new types of large-scale sharable data.