Welcome back to the Peking University MOOC "Bioinformatics: Introduction and Methods". I am Ge Gao from the Center for Bioinformatics, Peking University. Let's continue this week's topic. This week, building on what we have learned so far, we’ll take a real biological problem as an example to demonstrate how bioinformatics methods and experimental techniques can be applied to address biological problems.
As mentioned before, the generation of new genes is an important source of evolutionary novelty. In evolutionary studies, we usually divide homologous genes into orthologs and paralogs according to their origins. The former are produced by speciation events. The latter are the product of gene duplication. Generally, orthologs in different species have similar functions. This is exactly the basis on which we can use model organisms to study human biology. However, the paralogs produced by gene duplication usually have different functions. This is because sequence variation between the duplicates accumulates as evolution progresses. This sequence variation further leads to functional differentiation, and ultimately generates new genes. Therefore, studying the relationship between sequence variation in duplicates and their functional differentiation will not only help to clarify the biological function of different residues in genes, but also provide a basis for subsequent functional studies and bioengineering research. It also provides an important clue for understanding key issues in evolutionary biology, such as the mechanism by which new genes originate.
Morphologically, the early development of organisms is highly conserved. There are obvious morphological and structural similarities among embryos from different species. In such a conserved process of early development, are there new genes produced by gene duplication? If there are, how do these new genes function?
Next, we’ll apply bioinformatics methods to explore these questions. To answer them, we first compare multiple species to find the duplicated genes. Specifically, we compare all the sequences in an all-against-all way. Because this step needs to run fast, we use BLAST first, then apply a dynamic programming algorithm to the similar sequences found by BLAST to obtain an accurate pairwise alignment, and calculate the final similarity score. On this basis, we can use the neighbor-joining (NJ) method to construct a gene tree for each cluster of mutually similar genes. Then we compare the gene tree with the species tree to distinguish orthologs from paralogs.
Then we need to use expression data to screen for the duplicated genes involved in early development. Here we use the specialized high-throughput expression databases ArrayExpress and GEO. Please note that, as the expression-related sequencing data in NCBI SRA also has a copy in GEO, we don’t need to check the SRA database separately.
Next, we need to further screen for duplicated genes whose functions have differentiated, based on functional annotation. In other words, in the previous stage we found duplicated genes involved in early development, and now we want to find the duplicated gene pairs whose members have diverged in function. For a better screen, we will not only use Gene Ontology and KEGG, which are based on large-scale studies, but also refer to the MGI database and the OMIM database, which are based on small-scale experiments. MGI (Mouse Genome Informatics) is for mouse. It includes a large amount of phenotypic data from gene knock-out experiments. OMIM (Online Mendelian Inheritance in Man) is a collection of relationships between human genes and human genetic diseases, compiled by the Institute of Genetic Medicine at Johns Hopkins University. Such information can further help to screen for related genes. After this step of screening, we get the final list. We can get the final pipeline by concatenating the steps above.
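To make the dynamic programming step above more concrete, here is a minimal sketch of a global pairwise alignment (the Needleman-Wunsch algorithm) together with a simple percent-identity score. It only illustrates the idea and is not the code used in the pipeline described in this lecture. The function names and the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration; alignments of real protein sequences would normally use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b.

    Returns (score, aligned_a, aligned_b). The scoring values are
    illustrative assumptions, not the ones used in the lecture.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue in both rows."""
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)
```

In a pipeline like the one described above, a function of this kind would be applied only to the candidate pairs reported by BLAST, since filling the full table takes time proportional to the product of the two sequence lengths and would be too slow for an all-against-all comparison.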
Finally, after running the pipeline, we found seven duplicated gene pairs that are most likely to be involved in early development and to have differentiated in function. At the top of the list is the de novo DNA methyltransferase DNMT3. DNMT3 is responsible for determining which regions of the genome are methylated, which leads to gene silencing in these regions. It is one of the core regulatory factors in epigenetics. Its loss leads to serious disorders in early development and causes a variety of diseases. DNMT3 is thus a hot topic in cell biology research, and it led to our collaboration with Professor Gang Pei’s research group. In this course, we’re honored to invite Professor Gang Pei to introduce the relevant research background from the biological point of view. Please go on to watch the next video.