I just want to briefly discuss hash tables, because they are closely related to words and birthday problems, but we won't spend a lot of time on this in lecture; there's more information in the book. So, we talked just briefly about the balls-and-urns model, and of course this is very heavily studied in classical combinatorics. We asked: what's the probability that no urn has more than one ball? What's the probability that no urn is empty? You might want to know how many empty urns there are, how many urns have k balls, and so forth. These so-called occupancy problems are well studied, dating back many decades.

The classical result comes straight from the binomial distribution, and we looked at this briefly when we talked about asymptotics. The probability that a given value occurs k times in a random word of length N is

$$\binom{N}{k}\left(\frac{1}{M}\right)^k\left(1-\frac{1}{M}\right)^{N-k}.$$

That is, your value occurs k times, each with probability 1/M, and the other N - k letters are not your value, each with probability 1 - 1/M. Equivalently, given N balls thrown into M urns, this is the probability that a given urn has k balls. Then we looked at the Poisson approximation, $e^{-\alpha}\alpha^k/k!$, which works in the situation where N/M = α is fixed and k is a small constant. We showed how to come up with that approximation when we talked about asymptotics, and it's a very good approximation. For example, take α = 10, so N = 10,000 and M = 1,000: a plot of the binomial distribution is centered at 10, where the peak is, and the Poisson approximation for the same case is nearly identical to it. The Poisson approximation is a little easier to work with, and that's why it's so often used. (A short numeric check of this appears below, just after the hashing overview.)

So, an application: hashing algorithms. People in computer science are very familiar with these algorithms; I'll briefly describe them for people in math who may not have seen them. They are very well covered in books on algorithms, like our Algorithms book. The idea is that we want an efficient way to put key-value pairs in a symbol table, and to search the table for the pair corresponding to a given key; we talked about this for binary search trees. The hashing strategy is to come up with a function that maps each key to a "random" value between 0 and M - 1, and then to develop a collision strategy: what to do when two keys hash to the same value. The basic algorithms that we'll talk about are separate chaining, where we essentially represent each urn with a linked list, and linear probing, where we just use an array that has room for only one key per position, and on a collision we scan to the right for an empty spot. We'll talk briefly about both of those algorithms. The model that we use is the so-called Uniform Hashing Assumption: the hash function maps each key to a random value between 0 and M - 1. And it's been shown to be not so difficult, for typical types of keys, to get hash functions that reasonably approximate this uniform assumption.

So, that's the setup. Hashing with separate chaining is really easy to implement. There's a diagram from our Algorithms book that shows a hash table of size 5: the keys are letters, and the hash values are supposed to be random values between 0 and 4, so the sequence of hash values is just a random word.
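To see how good the approximation is, here is a small self-contained check (illustrative code, not from the lecture or the book) that prints the exact binomial occupancy probability next to the Poisson approximation for α = 10, that is, N = 10,000 and M = 1,000:

```java
public class OccupancyCheck {
    // Exact binomial occupancy probability: C(N,k) (1/M)^k (1 - 1/M)^(N-k),
    // computed in log space to avoid overflow for N = 10000.
    static double binomial(int N, int M, int k) {
        double logP = 0.0;
        for (int i = 0; i < k; i++)                  // accumulate log C(N,k)
            logP += Math.log(N - i) - Math.log(i + 1);
        logP += k * Math.log(1.0 / M);
        logP += (N - k) * Math.log(1.0 - 1.0 / M);
        return Math.exp(logP);
    }

    // Poisson approximation: e^(-alpha) alpha^k / k!
    static double poisson(double alpha, int k) {
        double logP = -alpha + k * Math.log(alpha);
        for (int i = 1; i <= k; i++) logP -= Math.log(i);
        return Math.exp(logP);
    }

    public static void main(String[] args) {
        int N = 10000, M = 1000;                     // alpha = N/M = 10
        for (int k = 5; k <= 15; k++)
            System.out.printf("k=%2d  binomial=%.6f  poisson=%.6f%n",
                              k, binomial(N, M, k), poisson((double) N / M, k));
    }
}
```

Near the peak at k = 10 the two columns agree closely, which is the sense in which the two curves in the plot are nearly identical.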
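And here is a minimal sketch of hashing with separate chaining (again illustrative code, not the textbook's implementation), assuming keys whose hashCode() behaves like a random map into 0..M-1:

```java
import java.util.LinkedList;

// Minimal separate-chaining symbol table: M linked lists (the urns); each
// key (ball) goes onto the list given by its hash value.
public class ChainingST<Key, Value> {
    private class Node {
        Key key; Value val;
        Node(Key key, Value val) { this.key = key; this.val = val; }
    }

    private final int M;                      // number of lists (urns)
    private final LinkedList<Node>[] lists;   // one list per urn

    @SuppressWarnings("unchecked")
    public ChainingST(int M) {
        this.M = M;
        lists = (LinkedList<Node>[]) new LinkedList[M];
        for (int i = 0; i < M; i++) lists[i] = new LinkedList<>();
    }

    // Map a key to 0..M-1 (stands in for the uniform hashing assumption).
    private int hash(Key key) { return (key.hashCode() & 0x7fffffff) % M; }

    public void put(Key key, Value val) {
        for (Node n : lists[hash(key)])
            if (n.key.equals(key)) { n.val = val; return; }  // update in place
        lists[hash(key)].add(new Node(key, val));            // new key
    }

    public Value get(Key key) {
        for (Node n : lists[hash(key)])       // scan the key's list (urn)
            if (n.key.equals(key)) return n.val;
        return null;                          // search miss
    }
}
```

The cost of a search is the number of nodes examined on one list, so the analysis comes down to how the list lengths, that is, the occupancy distribution, behave.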
We store each key in the list indexed by its hash value, so those lists are the urns, and the keys with their associated values are the balls. That's easy to implement, and of course we're going to be interested in how long these lists are. So again, this is just the balls-into-urns model applied to hashing with separate chaining.

What we want to know is this: when we search for an item, we have to go through all the balls in its urn, so we want to know the average number of balls per urn. Well, that's obvious: there are N balls and M urns, so the average is N/M balls per urn. And actually, that doesn't depend on anything, so it's not very helpful at all. If all your balls fall in one urn, your average number of balls per urn is still N/M; the throws don't even have to be random. So that observation is not much help. We do know the probability that a given urn has k balls (that's the occupancy distribution), and generally we try to keep the number of urns large enough that the number of balls per urn is a small constant like α. But what the programmer wants to know is something about the chance that the balls are distributed evenly, and we have to do a little more math to provide that answer.

Here's what a programmer might say. Suppose I make sure M is big enough that N/M is less than some constant α (and it turns out to be not difficult to do that). Then the average number of probes for searches is going to be less than α; I know that. But what's the chance that some search will use way too many probes, say 5α probes, under the uniform hashing assumption? That's actually not too difficult to calculate. We know the probability that a given urn has k balls, so we sum that probability for all k ≥ 5α, and the trick is to use Stirling's formula to bound k!. If you do that, you get a sum that converges very rapidly: the probability that a given list has 5α or more keys comes out to something like $e^{-\alpha}(e/5)^{5\alpha}$, and even after multiplying by the M lists, for α = 10 that's on the order of $10^{-15}$, a number with 15 zeros. So you can say to the programmer that if you take α = 10, this is extremely, extremely unlikely to happen. And that's the kind of information the programmer needs. So that's hashing with separate chaining: a typical calculation using the classical occupancy distribution and some of the asymptotics that we did in Lecture 4.

The other hashing method is called linear probing. In this method, we throw the balls into the urns, but we only leave room for one ball per urn. Everything's fine until we get a collision, two balls landing in the same urn, and we know from the birthday problem that that's going to happen relatively soon. When it happens, we just skip to the right until we find an empty urn. You don't want to do this as the table starts to get full, as we will see; still, we're going to want an analysis that tells us the average number of probes when we use this algorithm. It's a relatively easy algorithm to implement as well, found in standard algorithms books.
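Here is a minimal sketch of linear probing in the same style (illustrative code again), with a probe counter added so you can check the analysis empirically; it omits resizing and deletion, and it assumes the table never fills completely:

```java
// Minimal linear-probing symbol table: an array with room for one key per
// position; on a collision, scan right (wrapping around) to an empty spot.
public class ProbingST<Key, Value> {
    private final int M;               // table size (number of urns)
    private final Key[] keys;
    private final Value[] vals;
    public int probes;                 // probes used by the most recent get()

    @SuppressWarnings("unchecked")
    public ProbingST(int M) {
        this.M = M;
        keys = (Key[]) new Object[M];
        vals = (Value[]) new Object[M];
    }

    private int hash(Key key) { return (key.hashCode() & 0x7fffffff) % M; }

    public void put(Key key, Value val) {      // assumes table is not full
        int i = hash(key);
        while (keys[i] != null && !keys[i].equals(key))
            i = (i + 1) % M;                   // skip right past occupied spots
        keys[i] = key; vals[i] = val;
    }

    public Value get(Key key) {
        probes = 0;
        for (int i = hash(key); keys[i] != null; i = (i + 1) % M) {
            probes++;
            if (keys[i].equals(key)) return vals[i];   // search hit
        }
        probes++;                              // probe that hit the empty spot
        return null;                           // search miss
    }

    public static void main(String[] args) {   // fill to alpha = 1/2, measure
        int M = 100000, N = M / 2;
        ProbingST<Integer, Integer> st = new ProbingST<>(M);
        java.util.Random rnd = new java.util.Random();
        int[] ks = new int[N];
        for (int j = 0; j < N; j++) st.put(ks[j] = rnd.nextInt(), j);
        long total = 0;
        for (int k : ks) { st.get(k); total += st.probes; }
        System.out.println("average probes per search hit: " + (double) total / N);
    }
}
```

With the table half full, the measured average is a small constant, and it blows up if you let α approach 1; that is exactly the behavior the analysis quantifies.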
So that is the key question: what is the average number of probes to find one of the keys, if you use this algorithm, when you run your program? Well, this was proven over 50 years ago by Knuth to be

$$\sum_{k\ge 0}\frac{N}{M}\cdot\frac{N-1}{M}\cdots\frac{N-k+1}{M},$$

and the proof is given in ten pages in the textbook. It really was a landmark result. People were thinking this was too difficult a problem to really address, and Knuth was able to show this result. If you let the table get full, if you let N get to M, this sum is essentially the Ramanujan Q-function, which is about $\sqrt{\pi N/2}$. And if you don't let the table get full, which we don't in practice (say, don't let it get more than half full), then it's about $1/(1-\alpha)$, which at α = 1/2 is about two probes. So in practice it's a very important result.

Now, I just want to tell a little bit of a story. Knuth's books have four volumes now, with more planned, and in Volume 3 there's one footnote. Knuth taught that good writers don't use footnotes in technical writing, but he said he couldn't resist putting in this one footnote. It says that when he formulated this derivation, it was the first nontrivial algorithm he had ever analyzed satisfactorily, and that it had a strong influence on the structure of these books. It really is where the analysis of algorithms began: when Knuth solved this problem.

Now, when Philippe and I wrote the Introduction to the Analysis of Algorithms textbook, and we were learning about analytic combinatorics, it seemed reasonable that we should be able to analyze linear probing with the symbolic method. So we couldn't resist putting in one footnote in our book, either, and that one said that we don't know the answer to this exercise: we couldn't figure out how to use the symbolic method to analyze linear probing. The answer seemed so simple, and so similar to the birthday and coupon-collector problems, that we should have been able to get it. Really, we put it in as a challenge to students and researchers: we ought to be able to do this one.

It took some years, but eventually (you can read about this in the second edition of the book) it turned out that this problem has very deep connections to properties of combinatorial objects like random graphs, the gambler's-ruin problem, and path lengths in trees, among many other classic structures, and it's explained by a distribution called the Airy law. There's a fascinating amount of mathematics behind it. Knuth was one of the people who solved it, and Philippe and some co-authors solved it as well. Although we still don't know the answer to this exercise exactly and precisely, we know quite a bit, so there's actually still a footnote left in the second edition. But linear probing is worth studying to see both the potential and the limitations of what we know about the analysis of algorithms at this point.

One of the foundations mentioned here is properties of mappings, and that's what we're going to talk about next.