Welcome back to the Peking University MOOC "Bioinformatics: Introduction and Methods". I am Ge Gao from the Center for Bioinformatics, Peking University. Let's start this week's topic. This week, building on what we have learned, we'll take non-coding RNA as an example to show you how to further explore biological problems with the transcriptome data produced by RNA-Seq or other transcriptome sequencing technologies.
First, we'll briefly introduce the relevant background. As we mentioned before, a transcriptome can be considered a snapshot of the gene expression profile of cells at a particular moment. Therefore, transcriptome research usually involves two aspects, qualitative and quantitative. The former identifies all the expressed transcripts. The latter determines the expression level of each transcript. In the last lesson, we focused on RNA-Seq data analysis and briefly introduced how to achieve both based on deep sequencing data.
However, getting the expression information is only the first step of a transcriptome study. In order to understand the regulation of gene expression in an organism, we also need to carry out further data mining on the expression profiles to discover new biological knowledge. Specifically, by comparing multiple sets of data we can call differentially expressed genes. By clustering and classifying genes based on their expression patterns, we can discover the biological molecules that affect specific biological characteristics. We can also analyze pathways at the overall level. In this section, we will mainly focus on the identification of differentially expressed genes and on clustering methods. We'll introduce pathway analysis in subsequent lessons.
In this process, we need to use statistical inference frequently. Probability theory starts from a known population and studies the samples drawn from it. Statistics goes the other way: it uses a sample to infer the properties of the population.
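To make the idea of inferring from samples a little more concrete, here is a minimal sketch in Python of one way to ask whether a single gene is differentially expressed between two conditions. It is only an illustration under stated assumptions: the function names and the expression values are hypothetical, and the exact permutation test is used here only because it is easy to follow. It is not the method covered in this course, and real RNA-Seq studies normally rely on dedicated count-based models and on correction for testing many genes at once.

```python
from itertools import combinations
from math import log2
from statistics import mean


def log2_fold_change(control, treated):
    """Log2 ratio of the mean expression in treated versus control samples."""
    return log2(mean(treated) / mean(control))


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of group means.

    Every way of splitting the pooled values into groups of the original
    sizes is enumerated, so this is only practical for a few replicates.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        n_splits += 1
        # Small tolerance so the observed split itself is always counted.
        if diff >= observed - 1e-12:
            extreme += 1
    return extreme / n_splits


# Hypothetical normalized expression values of one gene (4 replicates each).
control = [10.1, 11.3, 9.8, 10.6]
treated = [20.5, 22.1, 19.7, 21.4]

print("log2 fold change:", round(log2_fold_change(control, treated), 3))
print("permutation p-value:", round(permutation_p_value(control, treated), 4))
```

With four replicates per condition there are only 70 possible splits, so the smallest p-value this test can report is 2/70. This is one reason why the number of replicates matters when we infer properties of the population from a few samples.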
In this inference process, we often need to refer to known biological knowledge and to the results of preliminary analysis, and to improve the analysis iteratively. In fact, this process is often referred to as data mining based on statistical learning. Unlike most standardized data processing procedures, in data mining the existing biological knowledge, also called domain knowledge, is crucial for data processing, for the choice of algorithm and model, and even for parameter setting. In the follow-up sections, we will take the identification and functional annotation of non-coding RNA as an example to demonstrate the relevant data mining methods for transcriptome data.
As we mentioned in the last lesson, the transcriptome contains both the mRNAs we are already familiar with and the non-coding RNAs discovered in recent years, such as microRNAs (miRNAs) and long non-coding RNAs (lncRNAs). These RNA transcripts act together to regulate a number of important physiological processes such as cell growth, development and apoptosis. Non-coding RNA refers to an RNA molecule that exercises its biological function in the form of RNA, with no need for translation. The corresponding DNA region in the genome is often called a non-coding RNA gene, or RNA gene for short.
The ribosomal RNAs (rRNAs) and tRNAs discovered in early studies maintain basic metabolic processes in cells. Therefore, they are constantly expressed in various cells, tissues, and organs, and function as housekeeping genes. The non-coding RNAs discovered in the past ten years mostly function in the transcriptional and/or translational regulation of other genes. These non-coding RNAs regulate their targets through various mechanisms, which typically results in an expression pattern that is organ-, tissue-, or cell-specific.
Also, non-coding RNAs are widespread in the genome. The American ENCODE project has discovered that the human genome contains not only protein-coding genes but also abundant non-coding genes.
Over 80% of genomic DNA can be transcribed into RNA. It has thus been estimated that there are about 30 thousand non-coding genes in the human genome, comparable to the number of coding genes. Therefore, the discovery and study of non-coding elements in the genome was listed first in the review by Science of scientific progress in the first 10 years of the 21st century.
These non-coding regulatory RNAs are widespread, from plants to animals and humans. They exert important regulatory functions in various physiological and pathological conditions. For example, microRNAs are usually about 21 to 23 nucleotides long. The mature microRNAs (mature miRNAs) are produced when the protein Dicer processes the pre-miRNAs, which have a stem-loop structure. These mature miRNAs recognize their specific target RNAs by base-pairing and down-regulate their expression. In this way, they can regulate specific biological processes.
The miRNAs play a key regulatory role in the initiation and development of various tumors. They can also be used as markers for the diagnosis and progression of disease. Therefore, miRNAs have been adopted by several pharmaceutical companies as clinical targets for various diseases, such as tumors, heart disease, AIDS, and herpes. A series of drugs have been developed, some of which have completed or are currently undergoing clinical trials.
Non-coding RNAs are, however, far more than "small" RNAs such as the miRNAs. Long non-coding RNAs (lncRNAs) are usually defined as longer than 200 nucleotides, and they can be thousands or even tens of thousands of bases long. Also, these lncRNAs are similar to protein-coding mRNAs in that they can have multiple exons, alternative splicing, and a polyA tail. Similar to miRNAs, lncRNAs also play a role in the regulation of various physiological and pathological processes.
For example, the long non-coding RNA Xist is a key regulatory factor for X-inactivation (X chromosome inactivation). By binding to one of the two X chromosomes in female cells, Xist initiates X-inactivation, which results in dosage compensation of the X chromosome between males and females. Another example is the non-coding RNA SCA8, located on the antisense strand of the protein-coding gene KLHL1 in human. A variant in this RNA has been reported to correlate directly with spinocerebellar ataxia.
As RNA-Seq and other high-throughput techniques are widely applied, it has been estimated that there are more than 2000 lncRNAs in the human genome. Although studies have shown their role in the regulation and tuning of various biological processes in the cell during development and stress, the biological functions and mechanisms of most lncRNAs are still unknown. This leaves us with two questions. First, how many non-coding RNAs, especially long non-coding RNAs, are there in the genome? Second, what are their functions? We will discuss them in detail in the following unit.
Good, we have had a brief introduction to the basic ideas of transcriptome data mining and some background knowledge of non-coding RNAs. Starting from the next unit, we will explain the relevant data mining methods step by step.