[MUSIC] Welcome back to the Peking University MOOC, Bioinformatics: Introduction and Methods. I'm Liping Wei from the Center for Bioinformatics at Peking University. Earlier in this MOOC, we learned about the exciting new developments in next-generation sequencing technologies. We can now sequence one person's whole genome for about $3,000 in about a day. These personal genomes hold great promise for the future of personalized medicine. However, each person's genome has about three million single nucleotide variations, as well as many other types of genetic variations. So how do we predict the functional effects of these genetic variations? This is the subject of this week's lectures. In the first unit, let's take a close look at this problem.
Let's first take a look at a real-life example. On May 14th, 2013, the New York Times published an article titled My Medical Choice. Movie star Angelina Jolie revealed that she has a mutation in her BRCA1 gene and that her mother died early from breast cancer. So, to reduce her own risk of getting breast cancer, she made the drastic decision to have both of her breasts surgically removed. This revelation sent shock waves all over the world. Many people wondered and argued about whether she had made the right choice. So do you think Angelina made the right decision to remove her breasts? Please take a moment to really think about this question, as her decision is a complicated one that touches upon many core issues in human genetics. We have created a short online survey about this. Please fill in the survey with your honest opinion, and later we will share the anonymous survey results with all of you.
The core of Angelina Jolie's decision touches upon an important bioinformatics question. Given that she has a genetic mutation in BRCA1, what is the conditional probability that she will develop breast cancer? Even with her mutation, there is a chance that she may remain cancer free. These two probabilities add up to one.
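Written in symbols, the statement that the two conditional probabilities add up to one looks like this. The event labels are introduced here only for illustration and are not notation from the lecture slides.

```latex
P(\text{cancer} \mid \text{BRCA1 mutation}) + P(\text{no cancer} \mid \text{BRCA1 mutation}) = 1
```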
Before we talk about how to calculate these numbers, let's first look at where our genetic variations come from. We all know that we inherit genetic variations from our parents. If our mother carries a particular mutant allele, there is a 50% chance that we inherit it from her. Each of us also has 70-something de novo single-nucleotide mutations that neither of our parents has. Many severe childhood neurological disorders, such as Rett syndrome, are caused by de novo mutations that disrupt critical genes. The rest of us should consider ourselves lucky that our de novo mutations weren't so damaging. Each of us also has somatic mutations that accumulate during cell divisions. An extreme example of this is cancer, where somatic mutations cause cells to grow uncontrollably. In this MOOC we only focus on germline mutations, not somatic mutations. Next year, we will add lectures on somatic mutations.
There are many types of genetic variations in the human genome. Different genetic variations may have very different effects on the genome and on the phenotype. The most severe is chromosomal aneuploidy, meaning that a person has one or more extra copies of whole chromosomes, or is missing chromosomes. The best-known example is Down syndrome, which is caused by three copies of chromosome 21, or Trisomy 21, and which is almost always de novo. We never hear about Trisomy 1 or Trisomy 2 in humans, because these chromosomes are so large and contain so many genes that having abnormal copies of them is embryonically lethal.
Another type of microscopic or submicroscopic genetic variation in the human genome is structural variation, or SV. Structural variations include deletions, where a segment of a chromosome is missing, and duplications. Duplications include tandem duplications, where a segment of a chromosome is duplicated right next to the original copy, as well as interspersed duplications, where a segment of a chromosome is duplicated to somewhere else in the genome.
Deletions and duplications are often called copy number variations, or CNVs. There are two additional types of insertions. Many mobile elements are inserted in each of our genomes, and novel sequences, such as viral genomes, are sometimes inserted in our genomes. Finally, two other types of structural variation that disrupt the genome but do not cause genomic imbalance are inversions, where a segment of a chromosome is inverted, and translocations, where a segment of a chromosome is moved to somewhere else in the genome.
Short insertions and deletions of one or a few, up to a thousand or several thousand nucleotides are sometimes called indels. Indels may happen in intergenic or intronic regions, or they may happen within protein-coding regions. Within protein-coding regions, if an indel involves three nucleotides or a multiple of three, it will only add or delete one or several codons without causing a frameshift. Otherwise, it may cause a frameshift that results in a drastic change in the protein sequence. For example, here is an original DNA sequence and the amino acid sequence it encodes. The transcriptional and translational machinery will read the first three nucleotides, C A T, to make a histidine, the next three nucleotides, T C A, to make a serine, and then C A C to make a histidine. However, if the first nucleotide, C, is deleted, it causes a frameshift. Now the translational machinery would read A T T to make an isoleucine, C A C to make a histidine, and A C G to make a threonine. You see, a frameshift completely changes the amino acid sequence. Often it also causes a premature stop codon. The new protein may not be stable enough to exist at all.
At a smaller scale, but with much higher frequency, there are single nucleotide variations, or SNVs. On average, in a person's genome, there are about 3 million SNVs, roughly equivalent to 1 SNV in every 1,000 nucleotides.
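The frameshift example described in this passage can be reproduced with a short Python sketch. The ten-nucleotide sequence `CATTCACACG` is an assumption, reconstructed from the codons read aloud in the lecture; whatever follows on the slide is unknown. The helper names `translate` and `delete_base` are hypothetical, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G base order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate coding DNA codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def delete_base(dna: str, index: int) -> str:
    """Return the sequence with the nucleotide at a 0-based index removed."""
    return dna[:index] + dna[index + 1:]


# Assumed sequence: only the codons read aloud in the lecture are known.
original = "CATTCACACG"
shifted = delete_base(original, 0)  # delete the leading C

print(translate(original))  # HSH: histidine, serine, histidine
print(translate(shifted))   # IHT: isoleucine, histidine, threonine
```

Deleting one nucleotide shifts every downstream codon boundary, so all three amino acids change here even though only one base was removed.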
SNVs that are known to affect functions and phenotypes the most are located in the promoter regions or protein-coding regions of genes. Variations in UTR and intronic regions sometimes also affect function. Most variants are located in intergenic regions between genes. Some of them fall within non-coding RNAs that are transcribed but not translated.
SNVs within coding regions tend to have larger effects than other variations, and they have been studied the most. In the more severe cases, an SNV can cause a premature stop codon that terminates a protein early. In the example shown here, a cytosine nucleotide is changed to thymine. As a result, the codon C A G, which used to encode the amino acid glutamine, becomes T A G, which is a stop codon, resulting in premature termination of the protein. This SNV is called a nonsense SNV. In the second example, an adenine nucleotide is changed to cytosine. As a result, the codon C A T, which used to encode the amino acid histidine, becomes C C T, which encodes proline. This SNV is called a non-synonymous or missense SNV. Because the genetic code is degenerate, many SNVs do not cause amino acid changes. They are called synonymous, same-sense, or silent SNVs. Some SNVs at or near splice junctions may affect splicing. And finally, some SNVs change a stop codon to a codon encoding an amino acid, resulting in a lengthening of the protein, which may have altered or disrupted stability, structure, and function.
Because a nonsense SNV causes premature termination of a protein, it is usually predicted to be damaging, even though there are exceptions where paralogous proteins or alternative pathways can compensate for the loss of a protein. Synonymous, intronic, and intergenic variations are often ignored. However, according to GWAS studies, 88% of trait-associated variations of weak effect are non-coding.
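The coding SNV categories named in this passage (nonsense, missense, synonymous, and stop-loss) can be told apart by comparing the amino acid before and after the change. The sketch below is a minimal illustration, assuming the standard genetic code and a single substitution inside one in-frame codon. `classify_snv` is a hypothetical helper, not a function from any tool mentioned in this course, and it does not handle splice-site or non-coding variants.

```python
from itertools import product

# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(ref_codon: str, position: int, alt_base: str) -> str:
    """Classify a substitution at a 0-based position within one codon."""
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid (or stop stays stop)
    if alt_aa == "*":
        return "nonsense"     # amino acid codon becomes a stop codon
    if ref_aa == "*":
        return "stop-loss"    # stop codon becomes an amino acid codon
    return "missense"         # one amino acid replaced by another


print(classify_snv("CAG", 0, "T"))  # glutamine -> stop: nonsense
print(classify_snv("CAT", 1, "C"))  # histidine -> proline: missense
print(classify_snv("CAT", 2, "C"))  # histidine -> histidine: synonymous
print(classify_snv("TAG", 0, "C"))  # stop -> glutamine: stop-loss
```

The first two calls correspond to the two examples in the lecture; the last two are added here to cover the remaining categories.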
Although individually their functional effects may not be as obvious, because these regions are so large, their total effects cannot be neglected, especially on mild traits. However, they remain under-studied, and better methods are still needed. Hopefully in a year or two, some methods may be mature enough for us to teach here.
Most research so far has focused on missense SNVs. A big reason is that known deleterious mutations are enriched in missense mutations. About 50% of all known mutations in Mendelian disorders are missense mutations. However, it is important to note that there might be ascertainment bias here, because an important discovery tends to attract more research in the same direction. Even so, many missense mutations clearly have important functional roles. They are the focus of this week's lectures.
However, not all missense SNVs cause phenotypic changes. For instance, BRCA1 was the first gene associated with breast cancer, in 1990, based on linkage analysis of large pedigrees with early-onset familial breast cancer. BRCA1 has a total of 238 known missense mutations: 163 are present only in patients, 62 are present only in healthy persons, and 13 are present in both patients and healthy persons. Furthermore, even missense variations seen only in patients are not all causal.
If we look more broadly, analysis of the whole genomes of over 1,000 healthy individuals in the 1000 Genomes Project revealed that, on average, a healthy individual carries over 3 million SNPs, over 361,000 indels, almost 16,000 deletions, over 400 duplications, and almost 5,000 mobile element insertions. Within protein-coding regions, on average, a healthy individual carries large deletions that disrupt about 150 genes, over 1,000 stop-gain SNPs, 77 stop losses, over 900 small frameshift indels, over 700 small in-frame indels, nearly 70,000 non-synonymous SNPs, and 60,000 synonymous SNPs. So the questions are: what features differentiate disease-causing variants from neutral ones?
How can we predict whether a variation is disease-causing? Unlike sequence alignment and sequence database search, the questions here remain largely unsolved. There is still a lot of active research going on, including your TA's own research. So maybe in a year or two we will be teaching his new method, or yours.
Let's use the last two slides of unit one to look at the nomenclature. First, when the minor allele has a frequency of less than 1% in the general population, we usually call it a mutation. Otherwise, it is usually called a polymorphism. Sometimes a cutoff of 5% is used, but you get the idea. Mutations and polymorphisms together are called variations or variants.
People may have different things in mind when they talk about the functional or phenotypic effects of human genetic variations. Often people are referring to disease-causing versus normal. In evolutionary terms, they may be thinking about deleterious, meaning causing a reduction in fitness, versus neutral, meaning causing no change in fitness. Sometimes the phenotypic differences are personal trait differences, such as height, curliness of hair, etc. Sometimes the effects of genetic variations are studied in animal models or cell lines, and changes in animal phenotypes or cellular phenotypes, such as cell growth, are reported. Often changes in protein function, such as enzymatic activity, and in protein structure, such as folding and stability, are studied.
These effects at different levels are correlated. It is important to keep in mind that these correlations are statistical and stochastic, not deterministic. Observed protein functional and structural changes, or cellular and animal model changes, do not always lead to phenotypic changes. On the other hand, please keep in mind that your experimental studies give you observations, not the truth. For instance, if you do not observe functional changes in your experiments on a genetic variation, it does not necessarily mean that it has no phenotypic effect.
No functional assay is 100% comprehensive or accurate. So we have to look at this question from a statistical perspective. Genetic variations that change protein structure are more likely to cause protein function changes, which are more likely to cause cellular and animal phenotypic changes, which are more likely to be associated with diseases that reduce fitness or change personal traits.
Finally, I'd like to mention that in these lectures we focus on human genetic variations. However, most of the concepts and methods can also be applied to other organisms. Over the past 40 years, hundreds of millions of genetic variations have been identified. In the next unit, we will take a look at the main databases of genetic variations. I look forward to seeing you then. [MUSIC]