Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods". I am Ge Gao from the Center for Bioinformatics, Peking University. In the next few months, I will teach this MOOC with my colleague, Dr. Liping Wei. Now let's start our second topic: Sequence Alignment. First let's review some basic concepts. Bioinformatics as a young interdisciplinary field has enjoyed rapid development in the past
twenty years. Numerous handy software tools and databases have been developed and made available on the Internet for the biological research community. According to the statistics on the Canadian website http://bioinformatics.ca, as of July 2013 there were over 1400 online bioinformatics tools freely available to biologists. Most of them have friendly user interfaces and detailed tutorials. Users can access them easily via web browsers and analyze their data with only a few clicks. So, with so many useful online tools already available, why do we need a course that teaches "principles" and "methods" rather than just "usages of tools"? The reason is simple: computers are not biologists. They cannot understand the biological problem you want to solve. These tools are just programs that operate on input data through predefined pipelines, all of which were designed under certain assumptions. If your input data and/or the problem you want to solve are not consistent with these assumptions, mistakes will occur. This is a news piece published in 2006 in Science magazine. Two data columns were mistakenly swapped in a computer program, leading to the retraction of five papers, including three in Science, one in PNAS, and one in Journal of Molecular Biology. This was very sad. But maybe they used obscure software, so it was difficult to catch the mistake? Maybe it would be safe to use popular software that is well validated? This is another news item, published in Genome Biology some time ago, reporting another case. The authors of the problematic paper in this case used BLAST, one of the most widely used tools in bioinformatics, which we will talk about next week, and reported the discovery of adenylyl cyclases in plants. This "discovery" was published in Nature but retracted one year later due to incorrect conclusions drawn from bad analyses. Therefore, it is dangerous to use the tools blindly without a solid understanding of their assumptions and principles.
We want to teach you in this MOOC not only the popular bioinformatics tools, but also the underlying principles and "ideas" so that you can take advantage of these powerful tools better while avoiding possible pitfalls. Or, as this Siemens commercial says, "Know the principle. Use the power. This is how." You can use the power better only if you know the principle. We will introduce each method from the following aspects. First of all, the "Biology". What is the biological problem this method tries to address? Why do we need this method? Secondly, the "Data". What input data and parameters are needed to run this method? The third part, Modeling, will tell you how the biological problem can be formulated into a computational problem. Last but not least, we will discuss the algorithm itself, its performance, and its constraints and
limitations. Now let's look at the problem of sequence alignment. Let's first look at the biological question behind sequence alignment, that is, how similar are two genes or proteins? Can we tell by comparing their sequences of nucleotides or amino acids? This simple-looking idea is actually very useful in biological research. Why? This is Pairwise Sequence Alignment, the alignment between two sequences. There are several tools to choose from. We will choose the first one. Let's click it. Okay, now we can see this page. It looks simple enough. The tip tells us to fill in two protein sequences. Let's fill in the blanks with the sequences of human haemoglobin subunit alpha and subunit
beta. OK, this is how it looks after we filled in the blanks. Please note that the greater-than sign in the first line is followed by the name of the sequence. For example, the name for the alpha subunit is HBA_HUMAN. The name for the beta subunit is HBB_HUMAN. Starting from the second line we have the protein sequence, written in the one-letter codes of the 20 amino acids.
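To make the input format just described more concrete, here is a minimal Python sketch of how such text could be read: a line starting with the greater-than sign gives the name, and the following lines give the sequence. The function name `parse_fasta` is hypothetical, and the two sequences are only short fragments from the beginning of each chain, used for illustration, not the full input shown on the page.

```python
def parse_fasta(text):
    """Parse FASTA-style text into a dict mapping sequence name -> sequence.

    Hypothetical helper for illustration; it is not part of the web tool.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the name is the first word after the '>' sign.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence line: append to the most recent record.
            records[name] += line
    return records


# Short fragments only (assumption: truncated for illustration).
example = """>HBA_HUMAN
MVLSPADKTN
VKAAWGKVGA
>HBB_HUMAN
MVHLTPEEKS
AVTALWGKVN
"""

records = parse_fasta(example)
for seq_name, seq in records.items():
    print(seq_name, len(seq), seq)
```

Note that the sequence may be split over several lines; the sketch simply joins them until the next greater-than sign appears.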
Simple enough. Let's just ignore Step 2 for now because it says that the default settings will fulfill the needs of most users. Now let's press the Submit button! Now the result is here! Let's take a look. The results are a little complex, so we will check them one by one. Let's look at the bottom half first. You can see that in addition to the two input sequences, there is an extra line between them that describes the alignment itself; it is called a "markup line" or "alignment string". Let's take a closer look at this extra line. This "markup line" consists mainly of vertical bars, colons, dots, and spaces. It is easy to see that the vertical bars represent alignments of identical residues, such as the M-to-M and V-to-V alignments. What about the colons and dots? Clearly they cannot denote alignments of identical residues, since those are marked by vertical bars; the residues they mark are different. Let's take a look at the first colon, between S and T. It denotes a substitution of S with T. The colons and dots are used to denote the level of similarity between two aligned residues that are not identical. The colons denote aligned residues that are similar, whereas the dots denote those that are not that similar. Specifically, the similarity between a pair of amino acids is evaluated using the substitution
matrix, which in this case is the BLOSUM62 matrix shown at the upper-right corner. For example, the score of substituting S with T in BLOSUM62 is 1, which is greater than 0, so you see a colon here. The score of substituting A with E is -1, which is less than 0, so a dot is used. That's how the colons and dots are used. Not difficult, right? Let's look at the result more carefully. You can see that all the substitutions of S with T and all the substitutions of T with S are denoted by colons. In fact, the substitution matrix is symmetric with respect to the diagonal. In other words, the substitution of S with T and the substitution of T with S have the same score. The direction of substitution doesn't matter. You can also see that all the substitutions of T with S have the same score, which means that the substitution scores are related, and only related, to the two residues
involved. The substitution matrix is context-free. You can see that the first substitution of T with S is preceded by K, while the second is
preceded by L. Their scores, however, are the same, and both have a colon displayed [in the markup line]. In fact, the substitution score of a pair of aligned residues is independent of other pairs of
residues. These seemingly trivial features are in fact very important, as we will see later. Finally let's look at the gaps. From the view of evolution, gaps denote insertions and/or deletions of genomic fragments during the course of evolution, often called "indels". An insertion in one sequence can be regarded as a deletion in the other sequence. Indels often have some effects on the function of sequences. So gaps in an alignment usually receive negative scores, called the "gap penalty". Since an insertion or deletion event often involves multiple residues, a gap often has a length of more than one residue. This is different from substitutions. The gap penalty is often implemented as a linear combination of a gap opening penalty and a gap extension penalty, which are usually given different
penalty scores. Let's take the penalty score for the second gap as an example. As suggested by the formula at the lower-right corner, opening a new gap receives a penalty score of 10, and extending it by one more residue receives a penalty score of 0.5, so the total penalty score is 10.5. As for the last gap, its length is five, so the penalty score is 10 + 0.5 × (5 - 1) = 12. Finally, subtracting the sum of gap penalties from the sum of substitution scores gives you the final score, 292.5, as shown in the result marked with a red line. Some students might wonder why there is a score of 0.5. The reason is that it is the penalty for extending a gap, rather than opening a new gap. We have used this example to illustrate some basic ideas involved in the simplest pairwise
alignment. Here are several summary questions. They are not assignments, but you are encouraged to think about them and discuss your answers and ideas in the online forum. That's all for Unit 1. In Unit 2 we will illustrate how to use algorithms to do such sequence alignment. Thank you! See you at the next unit!
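As a supplement for those who want to try the scoring ideas from this unit by hand, here is a minimal Python sketch of the markup line and the gap penalty. It is only an illustration under stated assumptions, not the code of the web tool: the function names are hypothetical, the matrix is a tiny hand-copied excerpt of BLOSUM62 (only the pairs needed here), the markup rule is the one described in this unit (bar for identical residues, colon for a positive score, dot otherwise), every gap is penalized, and the two aligned rows are a made-up toy example rather than the haemoglobin result.

```python
GAP_OPEN = 10.0    # gap opening penalty quoted in this unit
GAP_EXTEND = 0.5   # gap extension penalty quoted in this unit

# Tiny excerpt of BLOSUM62 (assumption: only the pairs used below).
BLOSUM62_EXCERPT = {
    ("M", "M"): 5, ("V", "V"): 4, ("S", "S"): 4, ("T", "T"): 5,
    ("A", "A"): 4, ("E", "E"): 5, ("S", "T"): 1, ("A", "E"): -1,
}


def gap_penalty(length, gap_open=GAP_OPEN, gap_extend=GAP_EXTEND):
    """Penalty for one gap: open once, then extend (length - 1) times."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


def score_pair(a, b):
    """Symmetric lookup: the direction of the substitution does not matter."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def markup_line(row1, row2):
    """Build the markup line: '|' identical, ':' score > 0, '.' otherwise."""
    chars = []
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            chars.append(" ")          # gap column
        elif a == b:
            chars.append("|")
        else:
            chars.append(":" if score_pair(a, b) > 0 else ".")
    return "".join(chars)


def alignment_score(row1, row2):
    """Sum of substitution scores minus the sum of gap penalties."""
    total = 0.0
    gap_run = 0       # length of the gap currently being read
    gap_row = None    # which row (1 or 2) holds the current gap
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            current = 1 if a == "-" else 2
            if gap_run and current != gap_row:
                total -= gap_penalty(gap_run)   # gap switched rows
                gap_run = 0
            gap_row = current
            gap_run += 1
        else:
            if gap_run:
                total -= gap_penalty(gap_run)   # gap just ended
                gap_run = 0
            total += score_pair(a, b)
    if gap_run:
        total -= gap_penalty(gap_run)           # gap at the very end
    return total


# Toy aligned rows (made up for illustration).
row1 = "MVSA--T"
row2 = "MVTESAT"
print(row1)
print(markup_line(row1, row2))
print(row2)
print("gap of length 2:", gap_penalty(2))
print("gap of length 5:", gap_penalty(5))
print("alignment score:", alignment_score(row1, row2))
```

Because each column is scored independently of its neighbours, the total is just a sum over columns plus one penalty per gap, which is the property emphasized earlier in this unit.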