Hello everybody. Welcome to the second week of this Coursera course, Experimental Methods in Systems Biology. This week we're going to be covering mRNA sequencing, which involves the technology of high-throughput, or next-generation, sequencing. First, I'll go through an outline of what these lectures will entail. They will be broken down into three separate parts of PowerPoint slide-based lectures, and then we're going to go into the lab and see how some of these mRNA sequencing and next-generation sequencing experiments actually take place in a practical setting.
So first, I will talk a little bit about the purpose of mRNA sequencing and why one does these experiments. Then I'll talk about some of the first-generation technologies used to look at transcriptomes, mainly microarrays, and a little bit about the sequencing that was used to obtain the first human genome, and then the revolutions that happened in high-throughput sequencing: the second- and now third-generation sequencing technologies that exist. Then I'll go into much greater detail on a technology used by the company Illumina. It's a second-generation technology, so we'll go through all of the steps and the theory behind how Illumina next-generation sequencing works. Lastly, we'll talk a little bit about how you quantify the resulting data that you get from an mRNA sequencing experiment. And then, as I mentioned, we'll go into the lab and actually see an Illumina next-generation mRNA sequencing experiment go on.
So, first of all, what is mRNA-seq, or mRNA sequencing? Essentially, it allows you to quantify the transcriptome, that is, to look at the levels of all the expressed transcripts in your sample. When I say quantification, it has to be qualified with the fact that there are some PCR steps involved in this technology.
So, if you remember from week one, we talked about how PCR can distort and amplify the levels of transcripts that were originally present in your samples. We have to remember that caveat when we're dealing with RNA sequencing data. mRNA sequencing also allows you to quantify splice variants: if you have a gene which is being expressed but can be differentially spliced into different isoforms, mRNA sequencing allows you to identify the ratios of those spliced isoforms. It also allows you to identify new splice variants, and we'll look into how mRNA sequencing actually allows you to do such things.
mRNA sequencing wasn't the first way to look at the entire transcriptome. This was first done by a technology called the microarray, which we went over a little bit in week one, but I'll just remind you here a little bit about what it does, because it's still in use quite a bit and is a viable alternative to mRNA sequencing in some contexts. So it's important to understand how it works and what its advantages and disadvantages are. One of the major advantages is that microarray-based technologies to measure the transcriptome are still currently cheaper than an mRNA sequencing experiment, although, as I'll explain on the next slide, this may not be the case for very long. One disadvantage of microarrays is that there are lots of different ways of doing a microarray. The picture you see here on the left is a slide-based one, there are bead-based ones, every manufacturer has different probe sets on their microarrays, et cetera. Although there's lots of bioinformatics infrastructure to allow comparisons between these different technologies, it can still be a bit tedious and difficult to compare across platforms, especially quantitatively.
Another disadvantage of microarrays is that you have to already know what you're looking for, because you have to know what probes to put down on the microarray. Therefore it can be hard to identify novel transcripts or splice variants, something which mRNA sequencing can do in a very straightforward manner. Furthermore, quantifying alternative splicing can be difficult. It's not impossible, because if you have the right probes on your microarray this can be done, but again, you already have to know what you're looking for. In the long run, the pure sequence data provided by mRNA sequencing will probably prove more reliable and easier to share than microarray data, but at the current time there's still a choice as to which one, microarrays or RNA sequencing, you should use for your experiment.
So, this is some data that I showed last week, but I'll show it again just to show how rapidly the cost of doing high-throughput sequencing is coming down over time. Right now, in 2014, we're almost at the level where you can sequence an entire genome for $1,000, and I'm sure that in just a few more years we'll reach that target based on the current trajectory. So even though microarrays are currently cheaper, most likely in a few years mRNA sequencing will be extremely inexpensive and will probably be the data standard for looking at the transcriptome. So even though it might be more expensive now, if your data are in the format of mRNA sequencing data, then you might be in a better position in a few years, when it most likely takes over from arrays.
So, in order to understand mRNA sequencing, it's important to understand the evolution of sequencing technologies. The first-generation sequencing technologies that were developed are the so-called Sanger sequencing methods.
These methods are based on a particular chemistry, which is called chain-termination chemistry. The sequencing reaction happens as follows. You have a template whose sequence you're interested in figuring out, and you need a sequencing primer, which is there to start the DNA polymerase reaction of copying that template. The reaction happens, of course, in the 5' to 3' direction, and you have in your reaction DNA polymerase but also a mixture of fluorescently labeled nucleotides. These are not your typical bases; they are so-called dideoxy bases, which, when they happen to be incorporated into a strand, cannot be further elongated by the DNA polymerase, because they are missing the reactive group on their end that is needed to continue the phosphate backbone. So at the end of this reaction, what you are left with is a mixture of elongated products which differ in length by one base, and each different size has a different fluorescence on it according to whichever dideoxy base got incorporated at the point where elongation stopped. Each base, A, T, C, or G, has a unique color, which can then be read out using capillary electrophoresis and looking at the fluorescence of each particular size.
Just a few properties, benefits, and drawbacks of this method of sequencing. First of all, the throughput is very low. It's one strand at a time, and you need to have a primer, so you need to know a little bit about the strand that you're trying to sequence. You do get a very long read length, usually around 700 base pairs or so. The chemistry, as I mentioned, is this chain-termination chemistry, where you have to stop the sequencing reaction in order to figure out which was the last base that was incorporated.
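To make the chain-termination idea a bit more concrete, here is a minimal Python sketch of the reaction and readout. It is only an illustration under stated assumptions, not a model of any real instrument: the function names `sanger_fragments` and `read_by_length`, the dye colors, and the 10% termination probability are all made up for this example, the template is written in the order the polymerase reads it, and no incorporation errors are simulated.

```python
import random

# Dye color per dideoxy terminator (illustrative assumption, not real dye chemistry).
DYE = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sanger_fragments(template, n_molecules=5000, p_terminate=0.1, seed=0):
    """Simulate chain-terminated copies of `template`.

    Assumption: `template` is written in the order the polymerase reads it,
    so the new strand is simply the base-by-base complement.
    Each molecule is extended one base at a time; with probability
    `p_terminate` the incorporated base is a labeled dideoxy base, which
    stops elongation. Returns a list of (fragment_length, terminal_base).
    """
    rng = random.Random(seed)
    fragments = []
    for _ in range(n_molecules):
        for position, template_base in enumerate(template, start=1):
            if rng.random() < p_terminate:
                fragments.append((position, COMPLEMENT[template_base]))
                break  # a dideoxy base cannot be extended any further
    return fragments


def read_by_length(fragments):
    """Mimic the electrophoresis readout.

    Fragments are grouped by length (the separation step), and the most
    common terminal base at each length is called as the base there.
    """
    by_length = {}
    for length, base in fragments:
        by_length.setdefault(length, []).append(base)
    calls = []
    for length in sorted(by_length):
        bases = by_length[length]
        calls.append(max(set(bases), key=bases.count))
    return "".join(calls)


if __name__ == "__main__":
    template = "TACGGATCCA"
    read = read_by_length(sanger_fragments(template))
    print(read)
    print([DYE[base] for base in read])  # the color seen at each fragment size
```

Note that the readout only works because the fragments are first sorted by length; the sequence is recovered from the order of sizes, with the color telling you which base ended each size.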
Sanger sequencing is very, very accurate, because you're dealing with large populations of each size of oligo. You have a very good idea of which base was incorporated even if a few errors are made. DNA polymerases are very accurate, so you're going to have lots of the right color incorporated there. The main drawback is that it's very slow and expensive on a genome scale. Because you can only do one strand at a time, you can imagine that trying to apply the 700 base pair reaction across the genome would be very time-consuming and expensive.
One of the major ideas to address this throughput problem of first-generation Sanger sequencing was to do it in parallel: instead of doing one at a time, let's do many, many DNA strands at once, and also introduce some kind of automation to the process. So people did this. They tried 96- and 384-well plate formats for Sanger sequencing, and of course they were successful in doing this. As I mentioned on the previous slide, people are also using automated capillary electrophoresis, which enables another level of automation of Sanger sequencing. But this increase still wasn't really sufficient to have a major impact on sequencing abilities. Even with all of these processes, it still took 13 years to complete the first human genome sequencing project, and it cost $2.7 billion. So this was clearly a problem if you wanted to sequence even just a few people.
So people just tried to do essentially more of the same: let's increase the number of wells per location. Instead of using a 384-well plate, let's go down to micro- and nano-well plates. Let's use microfabrication technologies to run more and more of these reactions in parallel. And that's fine; people could do this too. Really, the main drawback was the chemistry. We needed a more innovative chemistry in order to move from this chain-termination chemistry to something different.
And the reason for that is that with this chain-termination chemistry, you have several labeled DNA strands, and in order to observe which bases were incorporated into each strand, you need to separate them by length. And the smaller you go, the harder it gets to actually separate these by length. So we needed a new way of sequencing that didn't depend on separation by length. And that's where the major innovation for next-generation sequencing came in, which I'll talk about in the next lecture.
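Looking back at the throughput problem described in this lecture, a rough back-of-the-envelope calculation shows why plate-based parallelization alone did not go far enough. The sketch below uses assumptions that are not from the lecture: a human genome of roughly 3 billion base pairs, a single pass over the genome with no overlap between reads (real projects need many-fold coverage), and one 700 base pair read per well. The function names are hypothetical.

```python
GENOME_BP = 3_000_000_000  # assumed approximate human genome size
READ_BP = 700              # read length quoted in the lecture


def reads_needed(genome_bp, read_bp, coverage=1):
    """Number of reads needed to cover the genome `coverage` times (rounded up)."""
    return (genome_bp * coverage + read_bp - 1) // read_bp


def plate_runs_needed(n_reads, wells_per_plate):
    """Number of plate runs needed if each well yields one read (rounded up)."""
    return (n_reads + wells_per_plate - 1) // wells_per_plate


if __name__ == "__main__":
    n_reads = reads_needed(GENOME_BP, READ_BP)
    print(n_reads)                          # 4285715 reads for one pass
    print(plate_runs_needed(n_reads, 96))   # 44643 runs of a 96-well plate
    print(plate_runs_needed(n_reads, 384))  # 11161 runs of a 384-well plate
```

Even under these generous assumptions, going from 96 to 384 wells only cuts the number of plate runs by a factor of four, which is why a change in chemistry rather than more wells was needed.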