Next we're going to talk about the analysis of path length in binary trees, and this is an important problem to study because that's the quantity we use to predict performance in the analysis of the binary search algorithm that I talked about before. So, here's the definition that I gave before for a binary tree, and for level and depth. When we talk about path length, what we want to compute is the average length of a path from the root to a node in the tree, say an internal node. To do that, we compute the total internal path length, and then dividing by N gives that average cost. And to compute the total, you just count the number of nodes at each level. So at level 1 there's 2 nodes, at level 2 there's 4, at level 3 there's 3, at level 4 there's 1, at level 5 there's 1, and you add it all up, weighting each node by its level. That's called the internal path length of a binary tree: it's the sum, over all internal nodes, of the length of the path from the root to that node, or equivalently the sum over k of k times the number of nodes at level k. For binary trees we also talk about external path length, and that's just the same thing for external nodes. That's also a quantity of interest in studying the binary search algorithm. Now, there are relationships among these quantities that are easy to prove. It's all recursive, and there's a bit of notation needed to talk about these things, so let's look at some simple relationships among them. So we're talking about binary trees, and we have a size function, and a different size function for external nodes. The left and right subtrees we'll indicate with tL and tR. IPL stands for internal path length, and XPL stands for external path length. So that's all the notation, and again, these are simple quantities to define, as on the previous slide. So here are some recursive relationships. One is that the number of internal nodes in a tree is the number of internal nodes on the left, plus the number of internal nodes on the right, plus one for the root. So that's fine.
For external nodes, it's just that the root is not an external node. So the number of external nodes in the tree is the number of external nodes on the left plus the number of external nodes on the right. Internal path length, now this is an interesting one. If you want the internal path length of the whole tree, you take the internal path length on the left and the internal path length on the right and add those together, and then you're off by one on every single node except the root. So if you add |t| − 1, you're adding that missing link for all those nodes: ipl(t) = ipl(tL) + ipl(tR) + |t| − 1. So that's the recursive relationship for internal path length, and for external path length it's similar: external path length on the left, plus external path length on the right, but now you're off by one for every external node, because you didn't take into account the links that connect the subtrees to the root. Those are simple, but very useful, recursive relationships. And from those relationships, you can prove easy things, like the number of external nodes is equal to the number of internal nodes plus 1, and that's just by induction. Just plug in from the formula: the number of external nodes is the sum of the external nodes in the two subtrees; by the inductive hypothesis, that's equal to (the internal nodes of tL plus 1) plus (the internal nodes of tR plus 1); and then using the recursive relationship for internal nodes, we're left with the number of internal nodes plus 1. There are lots of ways to prove that. Here's another one: external path length is equal to internal path length plus 2 times the number of internal nodes in the tree. And again, that's inductive, right from the simple recursive relationships that we talked about, and again there are lots of ways to prove it. I'm just using these as an exercise for getting used to the idea of path length, and the relationship between path length in the subtrees and path length in the whole tree. That is critical to the analysis that we are going to do.
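These recursive relationships are easy to check directly. Here's a small Python sketch (my own illustration, not code from the lecture) that represents a binary tree as nested pairs and verifies both identities, "external = internal + 1" and "xpl = ipl + 2N", on an example tree:

```python
# A binary tree is either None (an external node) or a pair (left, right)
# of binary trees; internal nodes carry no data, since only shape matters.

def internal_nodes(t):
    return 0 if t is None else 1 + internal_nodes(t[0]) + internal_nodes(t[1])

def external_nodes(t):
    return 1 if t is None else external_nodes(t[0]) + external_nodes(t[1])

def ipl(t):
    # internal path length: ipl(t) = ipl(tL) + ipl(tR) + |t| - 1
    return 0 if t is None else ipl(t[0]) + ipl(t[1]) + internal_nodes(t) - 1

def xpl(t):
    # external path length: add one link for each of the |t| + 1 external nodes
    return 0 if t is None else xpl(t[0]) + xpl(t[1]) + external_nodes(t)

# a small example: root, a left child, and that child's right child
t = ((None, (None, None)), None)
assert external_nodes(t) == internal_nodes(t) + 1
assert xpl(t) == ipl(t) + 2 * internal_nodes(t)
```

The two assertions are exactly the identities proved by induction above.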
Okay, so here's our first problem. We have a random binary tree, that's a binary tree where all tree shapes are equally likely, each with probability 1/T_N, where T_N is the Nth Catalan number. What's the average path length of a random binary tree? That's some kind of description of the shapes of the trees that we saw: how far are all the nodes from the root, and how far is an average node from the root? That's roughly what we're asking. So, these are the quantities that we're going to work with. Q_{Nk} is the number of trees with N nodes and internal path length k. T_N is the number of trees, that's the Catalan numbers, and the cumulated cost C_N is the total internal path length over all trees. So we get the average by dividing the cumulated cost by the number of trees. That's the way that we compute average values of parameters. So this is for two: in this case, both of the trees with two nodes have internal path length 1, so the total cumulated path length is 2 and the average internal path length is 1. That's still not so interesting. Now this is a little more interesting. There's five trees with three nodes. Four of them have internal path length 3: 0 for the root, 1 for the node below, and 2 more for the next one, so that's 3, and four trees have that structure. One of them has internal path length 2: its two non-root nodes are both just one from the root. So Q_{32} = 1: there's one tree of size 3 with internal path length 2, and there's four trees of size 3 with internal path length 3. And T_3 = 5, so the total is 14 and the average internal path length is 14/5 = 2.8. So those are the figures for three, and here's the corresponding figures for four. The total path length over all of these trees is 74, and there's 14 of them, so the expected internal path length is 74/14, about 5.3, for trees of size 4.
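These small cases are easy to reproduce by brute force. The sketch below (my own check, with made-up names `trees` and `ipl`) enumerates every binary tree shape of a given size and tallies the internal path lengths, reproducing the totals 2, 14, and 74 and the averages 1, 2.8, and about 5.3:

```python
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def trees(n):
    """All binary tree shapes with n internal nodes, as nested pairs."""
    if n == 0:
        return (None,)
    return tuple((left, right)
                 for k in range(n)
                 for left in trees(k)
                 for right in trees(n - 1 - k))

def ipl(t, depth=0):
    """Internal path length: sum of the depths of all internal nodes."""
    if t is None:
        return 0
    return depth + ipl(t[0], depth + 1) + ipl(t[1], depth + 1)

for n in range(2, 5):
    ts = trees(n)
    assert len(ts) == comb(2 * n, n) // (n + 1)   # Catalan number T_n
    total = sum(ipl(t) for t in ts)
    print(n, len(ts), total, round(total / len(ts), 4))
```

Running it prints the rows for n = 2, 3, 4 with totals 2, 14, 74.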
And then you can take that and divide by 4 again to get about how far an average node is from the root. So those are the quantities that we're after. It's always good to do small cases like this to make sure that you have a good fix on the exact quantities that you're analyzing. Expected path length of a random binary tree: here's the analytic-combinatorics derivation, and these are the things we've defined so far. The counting GF is just the generating function for the Catalan numbers, and that's all the information we need about the Catalan numbers. The cumulative cost generating function Q(z) is the sum, over all trees, of the internal path length times z to the tree size. That's also equal to the sum on N, and on k, of k times the number of trees with path length k, and so forth, but from now on we're just going to go with cumulative counting. And we know that the average is going to be the coefficient of z^N in Q(z) divided by the coefficient of z^N in T(z). Now, the coefficient of z^N in Q(z) is the total internal path length over all trees of size N. If you divide that by the number of trees of size N, then you get the average internal path length. And that's the cumulative counting method that we're going to use to analyze this quantity. So now what we have to do is analyze that cumulative cost generating function, and that's what we'll do next. We have the counting GF and we have the cumulative GF, and we're going to use a recursive relationship, slightly different but pretty similar to the one that we just talked about: the internal path length of a tree is equal to the internal path length of its two subtrees plus the number of nodes in the two subtrees. And from that definition we can immediately write down this decomposition. The first case is for the empty tree, and the other one is for the root.
But if we have that function and we look at this decomposition, ipl(t) is exactly equal to that, and the size of t is exactly equal to that, and so we have a double sum for the cumulative cost. With that decomposition, or looking the other way, that construction, we can now rearrange terms. You can see, for example, that summing ipl(tL) z^{|tL|} against z^{|tR|} over all pairs of subtrees gives Q(z)T(z); that's one of the terms we find in there. And the other terms, where the sizes |tL| and |tR| come into effect, give zT′(z)T(z). The formula has those four terms, and using these two equations it immediately reduces to the simple equation Q(z) = 2zT(z)(Q(z) + zT′(z)), relating the counting generating function and the cumulative generating function. The recursive equation that relates those two generating functions, by an extremely simple argument. It's actually possible to develop this symbolically, and we'll talk about derivations like that in Part Two, but it's worthwhile having the explicit decomposition in front of us to see how directly we get to the equation that matters, which is the equation relating the generating functions that we can then solve and analyze. Okay, just to get it all on one slide, I'll put those steps down here. So now we know that Q(z) = 2zT(z)(Q(z) + zT′(z)). We know what T and T′ are; the only thing we don't know is Q, so we can solve for Q. T(z), remember, that's the Catalan generating function, and we know it explicitly, and T′(z) is just its derivative, which has two terms in it. But there is one thing that simplifies calculations a lot: 1 − 2zT(z) is just the square root of 1 − 4z, so that takes care of the denominator. There's still a little bit of algebra in multiplying these things together, but there's lots of cancellation, because the square root of 1 − 4z times the square root of 1 − 4z is just 1 − 4z. So I'm not going to pretend otherwise, I'm going to omit some of the algebra, but the final result is pretty simple.
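Working through that algebra (this closed form is my own reconstruction from the equation above; the transcript doesn't display it) gives Q(z) = (1 − 3z − (1 − z)√(1 − 4z)) / (z(1 − 4z)), whose coefficients come out to Q_N = 4^N − C(2N+2, N+1) + C(2N, N). A quick numerical sanity check against the small cases:

```python
from math import comb, pi, sqrt

def total_ipl(n):
    # coefficient of z^n in Q(z): total internal path length over all
    # binary trees with n nodes (my derived formula; leading term 4^n)
    return 4**n - comb(2 * n + 2, n + 1) + comb(2 * n, n)

def catalan(n):
    return comb(2 * n, n) // (n + 1)

# reproduces the totals from the small cases worked earlier
assert [total_ipl(n) for n in (2, 3, 4)] == [2, 14, 74]

# the average Q_n / T_n approaches n * sqrt(pi * n), slowly
n = 10000
avg = total_ipl(n) / catalan(n)
print(avg / (n * sqrt(pi * n)))   # ratio approaching 1
```

The assertion matches the enumeration, and the printed ratio creeps toward 1 as n grows, which is the asymptotic result derived next.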
So that's an explicit equation for the cumulative cost generating function. And what do we need? The coefficient of z^N in that is the total internal path length over all the trees of size N. And just looking at it, you can see that it's going to be about 4^N; the next term is lower in magnitude. So take all those binary trees and add up all their internal path lengths and you get about 4^N, surprisingly simple. It's not exactly 4^N, but it's asymptotic to 4^N, and that's why we wanted to be precise before, because if you take that and divide it by the Catalan numbers, all you're left with is N√(πN). We divided two huge quantities, but since we were working with precise results, we get a precise answer: the average internal path length of a random binary tree is asymptotic to N√(πN). And that's a very accurate estimate of the average internal path length. We could get it exactly, but doing it asymptotically is very compelling, because it's a simple and unexpected result. So now let's look at the same thing for random binary search trees. That result explains, or starts to explain, the shape of a random Catalan tree: the nodes go down on the order of the square root of N, and in a 10,000-node tree the square root is 100, so an average node goes down a pretty good distance, a couple of hundred levels. What about binary search trees? Well, here's the same calculation for binary search trees, but now we're counting permutations that result in a binary search tree with N nodes and internal path length k. And now it's a little easier, because the counting sequence is N factorial, and we know that. But still, the cumulated cost works the same way; it's just that we have to count the number of permutations. So in this case, some of the trees have internal path length 4, 5, and 6, as before, but the weighting is different. So all of these twelve permutations give 4.
There's only four of them that give 5, that's these four, and the remaining eight give 6. So the total internal path length of the trees that you get from all of the 24 permutations is 116. If you divide that by 24, you get about 4.83, which is less than what you get for Catalan trees, because the balanced shapes are weighted much more heavily. So that's the same setup, and now we can use analytic combinatorics in the same way to analyze the path length of binary search trees and get a comparison showing us the difference between these two models. So we start as usual, but now we have permutations, and our size is the length of the permutation. But now, when we say ipl, we mean the ipl of a permutation, which doesn't make sense by itself; what we mean is the internal path length of the BST you get from that permutation by inserting it into an initially empty tree. And again, the number of permutations of size N is N factorial, and that saves us a step in the calculation, as you'll see. The cumulated cost is the total ipl of binary search trees built from all permutations. So now our counting EGF is a simple one, the basic EGF for permutations: it's just 1/(1 − z). We have an N factorial involved, that's our number of permutations, so that cancels out. Our cumulative cost EGF uses the same symbols as before, except now we have permutations, not trees, and ipl has that different meaning. And so to get the expected internal path length of a binary search tree built from a random permutation, we're going to take N! times the coefficient of z^N in that, divided by N!, and those N factorials cancel out. So just treating it as an ordinary generating function gives us the divide-by-N-factorial for free. It's just a little calculational trick, because we're using exponential generating functions, which means that we divide by N factorial.
But our probability space is permutations, so we want to divide by N factorial anyway, so it serves both purposes and saves us the step of dividing. We just look for the coefficient of z^N in C(z). So, next, we have to develop a generating function equation for that C(z), using the recursive definition of the binary tree structure, according to the way that we construct trees. Same basic steps as before; it's just that we're going to come up with a different equation, because we have a different structure. So again, this is just a summary. What we're after is an equation that C(z) has to satisfy. And what we're going to do is use the relationship that we talked about before that identifies all the permutations that lead to the same tree. So that decomposition takes a permutation p to the pair pL, pR, and counts all the permutations that give rise to the same tree, and ipl(p) satisfies the same kind of recursive relationship as before. We can express everything there in terms of pL and pR, because of the decomposition that we already talked about. And this one's even simpler than the other one. There's just one trick that often works with exponential generating functions. The (|pL| + |pR| + 1) factorial in the denominator causes a little inconvenience in simplifying this formula, but we have z to that same power. So what we do is differentiate, to cancel out that one factor. Then you'll see there's a (|pL| + |pR|) factorial on the bottom, and we can combine that with the binomial coefficient. So this is a trick that often works with exponential generating functions: differentiate, cancel the (|pL| + |pR|) factorial against the binomial coefficient, and you're left with |pL| factorial times |pR| factorial. And now you've got a very simple double sum that completely decomposes and separates. The sum of z^{|p|}/|p|! is just P(z), and P′(z) is the one that takes care of the terms carrying a factor of the size, since those involve (|pL| − 1) factorial, so we use these equations.
And then again, we get a very simple differential equation for the cumulative cost exponential generating function for path length in BSTs. The decomposition is important, it's got to be precise, it's got to be correct, but once you have that decomposition, you automatically get a relationship between the cumulative cost EGF and the counting EGF that has to be satisfied. And solving for C(z), just simplifying and substituting in for P(z), gives a very simple equation. You might recognize that equation: we saw it in the first lecture, actually. That's the equation that came up in solving the quicksort recurrence; I guess it came up in the generating functions lecture, pretty much the same equation. So, just with that observation, we can solve it, because it's a first-order differential equation that has a simple integrating factor and is not difficult to solve. So now we can summarize the analysis of expected path length in a BST built from a random permutation on this slide. Start with the definitional equation for the cumulative cost generating function, substitute in the construction or the decomposition, differentiate to simplify, substitute to get further simplification, and that gives a differential equation whose solution is C(z) = (2/(1 − z)²) ln(1/(1 − z)) − 2z/(1 − z)². Those are elementary series that we can now expand to find the coefficients, as we did before, to show that the average internal path length in a binary search tree is asymptotic to 2N ln N. So there's a log N factor: if you divide by N, it's about 2 ln N to the average node in a binary search tree, versus order square root of N in a random Catalan tree. Since the same equation comes up, and people who have had algorithms courses know that there's a correspondence between quicksort and binary search trees, that explains it: we could have analyzed binary search trees just by taking advantage of this correspondence.
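Expanding those series gives an exact coefficient as well; the standard closed form (stated here as my reconstruction, since the transcript only gives the asymptotics) is 2(N + 1)H_N − 4N, where H_N is the harmonic number. A brute-force check over all permutations of small sizes confirms it:

```python
from itertools import permutations
from math import factorial

def bst_ipl(p):
    # internal path length of the BST built by inserting p into an empty
    # tree, via the recursion ipl(p) = ipl(pL) + ipl(pR) + |p| - 1
    if len(p) <= 1:
        return 0
    left  = [x for x in p[1:] if x < p[0]]
    right = [x for x in p[1:] if x > p[0]]
    return bst_ipl(left) + bst_ipl(right) + len(p) - 1

def expected_ipl(n):
    # the exact average 2(n+1)H_n - 4n from the generating function
    h = sum(1.0 / k for k in range(1, n + 1))
    return 2 * (n + 1) * h - 4 * n

for n in range(1, 8):
    avg = sum(bst_ipl(p) for p in permutations(range(n))) / factorial(n)
    assert abs(avg - expected_ipl(n)) < 1e-9
print(expected_ipl(4))   # 4.8333..., matching the 24-permutation example
```

For n = 4 this averages over all 24 permutations and recovers the 4.83 computed earlier, and 2(N + 1)H_N − 4N is asymptotic to 2N ln N, as claimed.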
That is, in quicksort, the first entry in the permutation is the partitioning element, and the smaller ones and larger ones get put on either side. And then after the partitioning, you sort the two sides independently. That's pretty much the same as in a binary search tree: the first one is the root, and then you build the left and the right independently. So, you can show that the average number of compares for quicksort is exactly the average external path length of a BST built from a random permutation. And that's an interesting correspondence to know and to take advantage of. Now, this same approach works for a lot of parameters of trees, and there are exercises in the text that compute the number of leaves of trees and other things like that. Finding the height of trees is much more intricate. It's a different approach, because the height doesn't break down as nicely, but it's an interesting problem, because it's a natural thing to want to know how far the furthest node is from the root in a tree. That tells us more about the shape of the tree. So the height derivation is described in the text. And actually, for both binary search trees and Catalan trees, the height was an open problem for quite a while. So just to summarize what we know about the shapes of these two different tree structures: we looked at random binary trees and binary search trees built from random permutations, and those are typical shapes of those trees. The average path length, that's the average distance to a node in a random tree: for random binary trees, it's about √(πN); for BSTs built from random permutations, it's about 2 ln N. And again, there's some description in the book, although both of these derivations are quite intricate: the height is now known, for random binary trees, to be about twice that average, and for random BSTs, it's a little more than twice, a constant times ln N, where the constant is about 4.31.
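That quicksort correspondence is easy to see empirically. In the sketch below (again my own illustration), quicksort partitions on the first element and counts one compare per non-pivot element at each stage; under that accounting, the count comes out exactly equal to the internal path length of the BST built from the same permutation. The lecture's statement in terms of external path length corresponds to charging N + 1 rather than N − 1 compares per partitioning stage, using xpl = ipl + 2N:

```python
from random import sample

def quicksort_compares(a):
    # sort a, partitioning on the first element; count one compare
    # per non-pivot element at each partitioning stage
    if len(a) <= 1:
        return 0, list(a)
    pivot = a[0]
    left  = [x for x in a[1:] if x < pivot]
    right = [x for x in a[1:] if x > pivot]
    cl, sl = quicksort_compares(left)
    cr, sr = quicksort_compares(right)
    return len(a) - 1 + cl + cr, sl + [pivot] + sr

def bst_insert(t, key):
    # tree as (key, left, right) triples, None for an empty tree
    if t is None:
        return (key, None, None)
    k, l, r = t
    return (k, bst_insert(l, key), r) if key < k else (k, l, bst_insert(r, key))

def bst_ipl(p):
    # build the BST by successive insertion, then sum the node depths
    t = None
    for key in p:
        t = bst_insert(t, key)
    def ipl(t, d=0):
        return 0 if t is None else d + ipl(t[1], d + 1) + ipl(t[2], d + 1)
    return ipl(t)

p = sample(range(1000), 100)          # a random permutation of 100 keys
compares, sorted_p = quicksort_compares(p)
assert sorted_p == sorted(p)
assert compares == bst_ipl(p)         # same count, exactly, every time
```

The equality is not approximate: each quicksort call corresponds to a subtree, and its non-pivot elements are exactly the nodes one level deeper, so the compare counts sum to the total of all node depths.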
And, again, there are lots of things that you might want to study about trees, and I've only really talked in detail about path length. But you can find plenty of examples in the book that show that the same approach works for many other tree parameters as well, and we certainly don't have time to talk about all of them in this lecture. That's a summary of the study of path length in trees.