Okay, hello. What we want to talk about now will probably take us more than one session. We're going to talk about ECC, error correction code memory, using Hamming code in particular for SECDED: single error correction, double error detection. We're going to start out with why we need this, because it's somewhat complicated and you might be wondering why we need it [LAUGH], or why we should care, I guess [LAUGH].

So let's talk first about where this fits in the overall memory subsystem for a real-time embedded system. We've seen this slide before, but I like to use it to show you what we're talking about, which is the working memory here. That of course has a lot of important information in it while the machine is up and running. Your boot image gets transferred there out of NVRAM, and then you normally jump to that image and execute it. You've got DMA buffer pools for I/O. You've got, of course, your code, your stack, and your data in here. So there's a lot of important information, and it's your working copy.

If the code is somehow corrupted, you could issue a bad instruction, and one bad instruction could be fatal to your system. In fact, that sort of thing has happened. I don't know that anyone truly knows what happened to the spacecraft called Mars Observer, but it may be that a single command caused it to spin out of control and be lost. The same thing would be true for data. Data could be something like how long an engine burns, so it could be something pretty important. You might get lucky and it might not be anything too important, but it's a roll of the dice.

So if data is corrupted somehow, in other words bits get flipped or modified, generally this is a problem, and this is what ECC is all about. Most often in general-purpose computing, we don't worry about this too much. Enterprise systems do worry about this.
So financial systems and databases usually incorporate error correction code memory, as do embedded systems and mission-critical systems. Really, the two big users of error correction code memory are mission-critical embedded systems and enterprise systems. That's the main thing I wanted to point out here: where it fits in the overall system.

The other thing is that the working memory is usually the largest volatile memory that we have, much larger than, say, cache. And cache really is a copy of what's in the main memory anyway, so we could probably recover from a cache corruption more readily, although we probably do want to at least protect cache with something like simple parity. So maybe just add a parity bit here, odd or even parity. Usually it's not full ECC for cache.

We'd really like mission-critical systems to have no single point of failure anywhere in the design. All mission-critical systems should be tolerant to any sort of single fault, and should ideally fail safe on a double fault. So remember those goals as we move forward through the various methods to protect our system from failure.

Okay, so what's the problem? Why would we worry about memory getting corrupted? I mean, what would cause that? Well, almost all the root causes are environmental. And it turns out there's one that we don't think about too much, namely that the Earth has a magnetic field. All over the Earth we have this geomagnetic field, which you're probably aware of, and there are field lines that come in and out of the Earth. Where they re-enter the Earth, basically at the magnetic north and magnetic south poles, we have the northern lights (and the southern lights at the other pole). Maybe I'll review why that's true, but that's part of this story [LAUGH].

So what happens is the sun is out here somewhere, right? Here's the sun, way far away. Here's the Earth. And the sun also has a magnetic field.
But its magnetic field is giant, so giant that it's basically like a big spiral of arms coming out in all different directions, and some of them actually intersect with the Earth. Protons and electrons come out on these magnetic field lines from the sun, and we call that the solar wind, right? Some of those protons and electrons actually intersect with our geomagnetic field and they get trapped. So there are protons and electrons trapped by our geomagnetic field all over. And the northern lights, in fact, are where they come into the atmosphere. We've got a thin atmosphere up here. This of course is not to scale [LAUGH]. The interaction of the protons and electrons with the atmosphere creates the northern lights.

This certainly isn't a science class, and there is probably a class that has much better detail on exactly how this works. But I did work on this myself for the space station, not coming up with the model, but taking a model that was implemented in Fortran IV and re-implementing it in C++ and Ada, so I learned quite a bit about it.

It turns out that once the radiation gets trapped here, it stays trapped and it spirals around these field lines. And believe it or not, it bounces back and forth along each field line that a particle gets trapped on. The particle comes in with a certain velocity, gets trapped, and starts spiraling. As it spirals toward the poles, the field lines become closer together and the field gets stronger, and that's what causes it to bounce. So some of the particles bounce before they interact with the atmosphere. That's why the northern lights kind of come and go, right? It depends on how much solar wind activity there is, how much input there is into this system from the sun. Basically it's a dynamic system of geomagnetically trapped charged particles.
So the particles have to be charged in order to get trapped in the geomagnetic field. And it turns out this is fortunate, because the field also shields the Earth from direct proton and electron radiation.

Of course, one of the issues is what happens if you put a satellite up here outside the atmosphere. Here's your satellite [LAUGH], going around the Earth. It's basically going through the magnetic field around the Earth, and it's going through all sorts of proton and electron flux, much more so than you would find on the surface of the Earth, because the surface is shielded not only by the geomagnetic field but also by the atmosphere, which tends to attenuate this kind of radiation. It turns out that even the altitude of, say, a commercial jet aircraft is high enough to matter: there's more charged particle radiation at 30,000 feet than there is on the surface. And it's not zero on the surface either, so there is a problem with it on the surface too.

But generally we don't worry about it unless you're building an enterprise system where you just can't tolerate any possibility of memory corruption while the system is running. With most PCs, if something goes wrong and we get a blue screen or it freezes up, we just reboot, right? That's why we don't worry about having ECC on lower-cost systems that aren't mission critical or enterprise.

There is another kind of source too, namely background cosmic radiation, which includes things like iron nuclei, about the heaviest particles commonly found in it. Cosmic radiation is energetic enough that it can get through the atmosphere and typical shielding. Any of these particles, especially the charged ones, can hit your series of bits, your 01000011, and cause one of them to flip, so it goes from a 1 to a 0 or from a 0 to a 1. And it's repairable. We can flip it back, so it's an upset, a single event upset or SEU. It's not what's called a latch-up.
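To make the upset idea concrete, here is a minimal Python sketch, not something from the slides. It models an upset as an XOR on one bit of the 01000011 pattern and checks it with the single even-parity bit mentioned earlier for cache. The helper names `flip_bit` and `even_parity` are made up for illustration.

```python
def flip_bit(word: int, bit: int) -> int:
    """Model a single event upset by toggling one bit with XOR."""
    return word ^ (1 << bit)


def even_parity(word: int) -> int:
    """Return the parity bit that makes the total count of 1s even."""
    return bin(word).count("1") % 2


stored = 0b01000011              # the example bit pattern from the lecture
parity = even_parity(stored)     # computed once, when the word is written

# A particle flips bit 6 from 1 to 0.
upset = flip_bit(stored, 6)
detected = even_parity(upset) != parity

# The upset is repairable: flipping the same bit again restores the word.
repaired = flip_bit(upset, 6)

# Two flips in the same word cancel out in the parity check,
# so a single parity bit misses this double-bit error.
double = flip_bit(flip_bit(stored, 0), 1)
missed = even_parity(double) == parity
```

Parity alone says that something flipped, but not which bit, and it misses the double flip entirely. That gap is what motivates a stronger code such as SECDED.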
So latch-ups are more serious. Latch-ups are where the gate is actually damaged. If you remember back to digital design, if you've had digital design, we've got JK flip-flops and other flip-flops, which are basically feedback digital circuits that can hold the value presented to them, a one or a zero. That's what we build registers out of, and even simple memories like SRAM, though DRAM is more complicated. SRAM and DRAM can be affected by this, as well as registers and cache, so really anything that is a working memory of any sort could have SEUs occur and cause bit flips because of this environment.

So that's why this occurs: cosmic rays and charged particles cause SEUs. There are also potentially issues with EMI and EMC, so just the electromagnetic interference and compatibility environment, which could be challenging if you're under the hood of a car or anywhere there are other electrical components. All these environmental factors could cause a single-bit error, and so a single-bit error is fairly likely to happen. The interesting thing is that double-bit errors are far less likely. And triple-bit, quadruple-bit, and other multi-bit errors beyond double errors are very unlikely.

So the question is, why is that, right? Let me find a place to draw here. Can I erase this? [LAUGH] Well, let me just draw on the title slide, what the heck. Say I had a chain link fence, which is a lot of diagonal elements, right? Here we go, our rough chain link fence. Suppose I sit there and drive golf balls at the chain link fence, and let's say the golf ball is substantially smaller than the openings in the fence, so most of them are going to go through, right? I'm going to drive that golf ball at that chain link fence, and a lot of times it's going to go through.
But we know what's going to happen: eventually one of them is going to hit a link, right? It could hit right on the center of where two links cross, but probably more often it's going to hit a link like here or over here. So we can think of memory as the chain link fence and the charged particles as the golf balls.

Let's say that when one of these hits the fence, it somehow impacts the fence so that we can't trust that link anymore. If I keep hitting golf balls at that kind of a background rate, which is basically what's going to happen when you're orbiting the Earth or flying at altitude, or even on the surface of the Earth, it's very unlikely that I'm going to hit the same link twice in a row, right? Even within the same set of 10 or 20 or 100 golf balls, I'm probably going to hit other links, or have a whole bunch of them just go through as well.

Computing this probability would be harder, so this is just a qualitative argument here. The probability of hitting any one single link, if it were uniform, depends on how many links are in the whole fence, how often I hit golf balls, and all sorts of things like that. But what we would say is that the chances of hitting this link a second time are low, at least for a while. Eventually I will hit that same link again if I keep hitting golf balls over a long period of time, but we have an interesting capability. With SECDED, with ECC, we can basically repair the link. Let's say this link got hit here and it somehow gets damaged. Well, we can come in and restore it, right? And if it gets hit a second time later, we restore it again. So the real problem is if it gets hit twice in a row without us coming in and repairing it in between; then it's going to break, right? Well, maybe the analogy doesn't quite work there, because the bit doesn't actually break.
What happens is that if we get multiple SEUs in the same word or the same byte, we're going to have a double-bit error. And we can't repair a double-bit error. All we can do, to complete the analogy, is basically rebuild the entire fence. So hopefully that analogy works for you.

That's just to introduce the problem. The problem comes from charged particles: the solar wind and the radiation trapped by the geomagnetic field. It's a little bit of a problem on the surface of the Earth. It's a bigger problem at altitude. It's an even bigger problem in orbit, especially for really high-altitude orbiting satellites, like geosynchronous versus low Earth orbit. It's due to protons and electrons. It's also due to background cosmic radiation. And bits can be flipped by electromagnetic interference or compatibility issues within the system.

So we have a number of sources, and the good news is they're just upsets, so they can be repaired, if we can come up with an algorithm that can detect that a bit was flipped and then an algorithm that can determine which bit was flipped. The correction is then simply to flip it back. Thank you very much.
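The fence argument above was only qualitative. As a rough sketch of the numbers, and assuming every upset lands on a uniformly random word (a simplification, like the uniform fence), the chance that some word collects two or more upsets is a birthday-problem calculation. The function name `p_double_hit`, the memory size, and the upset counts below are all hypothetical values chosen for illustration.

```python
def p_double_hit(num_words: int, hits: int) -> float:
    """Probability that at least one word takes two or more upsets.

    Assumes each upset lands on a uniformly random word. Two hits on the
    same bit of a word are counted as a double hit too, which slightly
    overstates the risk.
    """
    p_all_distinct = 1.0
    for i in range(hits):
        p_all_distinct *= (num_words - i) / num_words
    return 1.0 - p_all_distinct


NUM_WORDS = 1_000_000   # hypothetical memory size, in ECC-protected words
TOTAL_UPSETS = 1000     # hypothetical number of upsets over a mission

# No repair: all 1000 upsets accumulate in memory.
no_repair = p_double_hit(NUM_WORDS, TOTAL_UPSETS)

# Repair after every 10 upsets: 100 independent intervals of 10 upsets.
per_interval = p_double_hit(NUM_WORDS, 10)
with_repair = 1.0 - (1.0 - per_interval) ** (TOTAL_UPSETS // 10)

print(f"no repair:   {no_repair:.4f}")
print(f"with repair: {with_repair:.4f}")
```

Under these made-up numbers, letting upsets accumulate gives roughly a 39% chance of a double hit somewhere, while repairing every 10 upsets brings it down to roughly half a percent. That is the point of repairing the link before it gets hit again.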