Next we're going to talk about strings and tries. These are combinatorial objects that have really come under scrutiny because of computational applications, but they're very well suited to the analytic techniques that we've been talking about. Again, we're in the second half of the class, and this is lecture eight. We're going to look at these combinatorial objects mostly as unlabeled classes, so we'll be talking about OGFs. And again, we'll give some examples in lecture, but there are many more in chapter eight of the book.

So the first thing we're going to talk about is bitstrings with restrictions. We did some examples of this as our motivating examples for analytic combinatorics in lecture five. But what we'll talk about today is answering questions like this. Here's a number of random bitstrings of 70 or 80 bits, and I've highlighted the first occurrence of 000 in each bitstring. So a natural question is: what's the expected wait time? That is, how many random bits do we need to see before we get to 000 in a random bitstring? Some of these bitstrings might have no 000 at all; certainly, if you look at shorter ones, like this distance, then there'd be no 000. So the question is: what's the probability that you don't have 000 in an N-bit random bitstring? Questions like these actually have lots of important practical applications, where the zeroes and ones correspond to electrical signals, and certain kinds of devices can't tolerate long strings of offs, for example.

So let's review what we talked about for the symbolic method for bitstrings. It's one of the very simplest examples of analytic combinatorics. How many binary strings are there with N bits? Of course, there are 2^N. We can get that with the symbolic method by constructing the class B of all binary strings as a sequence of 0 and 1 bits. Then our normal symbolic transfer takes Z_0 + Z_1 to 2z, and a sequence of anything to 1 over 1 minus that. So that tells us that the OGF that enumerates the number of binary strings is B(z) = 1/(1 - 2z), and the coefficient of z^N in that is 2^N. That's our basic construction for binary strings using the symbolic method. And then we had an alternate way of expressing the same thing: if you say that a bitstring is either empty, or it's a bit followed by a bitstring, a recursive definition, then you get to the same result.

And this one generalizes to help us answer questions like: how many N-bit strings have no two consecutive 0s? Again, using the same reasoning, we can say that a bitstring with no two consecutive 0s is either empty, or it's a single 0, or it's a 1 or a 01 followed by a bitstring with no two consecutive 0s. That's the natural construction for this class of combinatorial objects. And that translates through the normal translation theorem to this OGF equation: B_00(z) = 1 + z + (z + z^2) B_00(z). Then we can solve that, and we get a ratio of two polynomials for the generating function, B_00(z) = (1 + z)/(1 - z - z^2). And that's a classic example in asymptotic analysis. The coefficient of z^N in that function is, in this case, exactly a Fibonacci number. But in general, for any such ratio of polynomials, you look at the dominant root of the polynomial in the denominator, and that gives an asymptotic expression for the coefficient of z^N in the generating function. So simply by finding the appropriate root of the polynomial, we get the asymptotics.
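Since this root-based asymptotic estimate is the template for everything that follows, it's worth checking it numerically once. Here's a minimal Java sketch (my own code, not the lecture's; the class name and cutoff are arbitrary) that computes the no-two-consecutive-0s counts from the Fibonacci-type recurrence implied by the construction and compares them with the estimate c times phi^N, where c = phi^2/sqrt(5) comes from the standard Fibonacci asymptotics.

    // Sketch: exact counts of N-bit strings with no two consecutive 0s versus
    // the asymptotic estimate c * phi^N. Class name and cutoff are arbitrary.
    public class NoTwoZeros {
        public static void main(String[] args) {
            int max = 30;
            long[] a = new long[max + 1];
            a[0] = 1;                         // the empty string
            a[1] = 2;                         // "0" and "1"
            for (int n = 2; n <= max; n++)
                a[n] = a[n-1] + a[n-2];       // prepend 1, or prepend 01, to a shorter valid string
            double phi = (1 + Math.sqrt(5)) / 2;
            double c = phi * phi / Math.sqrt(5);
            for (int n = 5; n <= max; n += 5)
                System.out.printf("N=%d  exact=%d  asymptotic=%.1f%n", n, a[n], c * Math.pow(phi, n));
        }
    }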
And for any such ratio of polynomials, the coefficient is going to be of the form a constant times the dominant root to the Nth power, unless there happen to be multiple roots. So in this case, it's the golden ratio, phi, to the Nth power, and the constant is also explicitly determined. That's the example that we did in chapter five.

So the first thing we want to do today is extend that: how many binary strings have no runs of P consecutive 0s? That's going to extend in a natural way, very much like the derivation that I just did. The construction is that a string with no runs of P consecutive 0s is a string of 0s of length less than P, followed by either the empty string or a 1 followed by a string with no runs of P consecutive 0s. So that's just a generalization of the construction that I did on the last slide. And again, that immediately translates: a string of 0s of length less than P translates to 1 + z + ... + z^(P-1), the empty string translates to 1, and the rest translates to z B_P(z). Solving that equation gives B_P(z) = (1 - z^P)/(1 - 2z + z^(P+1)), again a ratio of two polynomials. And again, the coefficient of z^N in that is asymptotic to a constant times the dominant root to the Nth power, where that dominant root is something we can calculate, and the constant is also explicitly available. So this question of no runs of P consecutive 0s comes down to finding the dominant root associated with the denominator polynomial 1 - 2z + z^(P+1). This is using the Sage mathematical system to just find those roots, and you can use any system that you want. I didn't calculate the constants, but anyway, those are the results.

Now let's look at what we can infer from that result. So this is a summary of where we were for consecutive 0s. The OGF, S_P(z), is the sum over N of the number of bitstrings of length N with no runs of P consecutive 0s, times z^N. And we have an explicit formula for that OGF: (1 - z^P)/(1 - 2z + z^(P+1)). Now this is similar to the situation we had with permutations: we can convert that into a probability by evaluating at an appropriate value of the argument. If you look at S_P(z/2), then the coefficient of z^N becomes the number of such bitstrings divided by 2^N. That is the probability that a random bitstring of length N has no runs of P 0s. So we get probabilities just by evaluating with the argument z/2.

So now if we take z = 1, that's the sum over N of the number of bitstrings of length N with no runs of P 0s, divided by 2^N. The number of bitstrings of length N with no runs divided by 2^N is the probability that the first N bits have no runs of P 0s, and that's the same as the probability that the position of the end of the first run of P 0s is bigger than N. Summing those probabilities over all N gives exactly the average position of the end of the first run of P 0s. So that's the average wait time. This argument gives us these two theorems. The first one: the probability that an N-bit random bitstring has no run of P 0s is the coefficient of z^N in S_P evaluated at z/2 (that's the second line here), and, going from the solution on the previous slide, it's asymptotic to a constant times the dominant root divided by 2, raised to the Nth power. And the other one: the expected wait time is just our generating function evaluated at one-half.
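Here's a rough Java sketch (my own, not the lecture's Sage computation; class and method names are mine) that finds the relevant root of 1 - 2z + z^(P+1) by bisection, estimates the probability of no run of P 0s in N random bits, and evaluates the generating function at one-half to get the expected wait time.

    // Sketch: for each P, locate the smallest positive root of 1 - 2z + z^(P+1),
    // use it to estimate the probability of no run of P 0s in N random bits
    // (ignoring the constant factor), and evaluate S_P(1/2) for the wait time.
    public class RunRoots {
        static double f(double z, int p) { return 1 - 2*z + Math.pow(z, p + 1); }

        public static void main(String[] args) {
            int n = 100;
            for (int p = 2; p <= 6; p++) {
                double lo = 0.5, hi = 0.99;          // the smallest positive root lies in this interval for p >= 2
                for (int i = 0; i < 100; i++) {      // bisection: f is positive at lo, negative at hi
                    double mid = (lo + hi) / 2;
                    if (f(mid, p) > 0) lo = mid; else hi = mid;
                }
                double beta = 1 / lo;                // the dominant root (growth factor of the counts)
                double prob = Math.pow(beta / 2, n); // approximate probability of no run of p 0s in n bits
                double wait = (1 - Math.pow(0.5, p)) / (1 - 2*0.5 + Math.pow(0.5, p + 1));  // S_P(1/2)
                System.out.printf("P=%d  root=%.6f  P[no run in %d bits] ~ %.2e  wait=%.0f%n",
                                  p, lo, n, prob, wait);
            }
        }
    }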
If you evaluate that generating function at one-half, you'll see that the 1 - 2z term vanishes, so all that's left in the denominator is one-half to the P + 1, and the numerator is 1 minus one-half to the P. So the wait time becomes 2^(P+1) - 2. That's using the symbolic method to get an explicit version of the generating function, and then evaluating that generating function to get the analysis of the properties of consecutive 0s in a random bitstring.

Now, just to summarize for small values of P, which are usually what's of interest. The generating function when P = 1 is (1 - z)/(1 - 2z + z^2). So the probability of no 0 in a random bitstring of size N is one-half to the N: for N = 10 that's about 0.001, for N = 100 it's about 10^-30, and the average wait time is 2. For two 0s, we can compute the roots explicitly (those are the values of the roots), and it turns out that the probability of no run of two 0s in 10 bits is about 0.14; in 100 bits it's about 10^-9, and the wait time for the first run of two 0s is 6. For three 0s, having no run is becoming more and more likely: for 10 bits it's about an even chance that there's no run of three 0s, while for 100 bits it's still fairly unlikely. For four 0s, in 10 bits, three quarters of the time you're not going to have four 0s in a row, and you have to wait 30 bits on average to get your first run of four 0s; for 100 bits, it's still pretty unlikely that you won't have four 0s in a row. For five 0s, now there's almost a 90% chance that you're not going to have five consecutive 0s in 10 random bits, a 20% chance that you're not going to have five consecutive 0s in 100 random bits, and the wait time is 62. And then for six 0s, it's almost certain in 10 bits, there's an even chance in a random string of 100 bits that you won't have six consecutive 0s, and the wait time is over 100; it's 126. So those are facts that we can derive from this analysis. And actually, this type of statistic is one way to test whether a set of strings is random.

In fact, it's always worthwhile to validate these types of mathematical results, and in situations like this, when it's so easy to write code to validate them, we should go ahead and do it. So this is a Java program that takes a size and a number of trials as arguments. What it's going to do (this one actually reads the bits from standard input, so we separate producing the bits from analyzing them) is read W bits at a time from standard input. So, as in that example, W was 75, and there were a bunch of strings of 75 bits that were supposed to be random. Then for each P it will check whether there are P consecutive 0s, and it'll just print out the empirical probabilities. And this is just the code to check for P consecutive 0s. So it's a very straightforward program that reads in a bunch of bitstrings and then prints out, over all those bitstrings, the empirical probability that you find one, two, three, four, five, up to P consecutive 0s. For the example that I used on the first slide, this is the data that it gives. Actually, those are the first lines of 10,000 bitstrings of size 100, and for those bitstrings these are the empirical probabilities that runs of one, two, three, four, five, and six 0s occurred or not. And those compare very favorably with what we just saw was predicted by the theory. So we can say that those bits pass this test for randomness, or we can think of this as validating the theory that we just derived.
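The lecture shows the actual program on the slides; here's a rough sketch in the same spirit, with my own class name and the assumption that the bitstrings arrive on standard input as lines of '0' and '1' characters (the real program's interface may differ).

    // Sketch of a validation program: read bitstrings, one per line, from
    // standard input, and print for each p the empirical probability that a
    // string has no run of p consecutive 0s. Class name and input format are
    // assumptions, not the lecture's.
    import java.util.Scanner;

    public class RunCheck {
        // does s contain p consecutive '0' characters?
        static boolean hasRun(String s, int p) {
            int count = 0;
            for (int i = 0; i < s.length(); i++) {
                count = (s.charAt(i) == '0') ? count + 1 : 0;
                if (count >= p) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            int maxP = 6;
            long trials = 0;
            long[] noRun = new long[maxP + 1];
            Scanner in = new Scanner(System.in);
            while (in.hasNextLine()) {
                String s = in.nextLine().trim();
                if (s.isEmpty()) continue;
                trials++;
                for (int p = 1; p <= maxP; p++)
                    if (!hasRun(s, p)) noRun[p]++;
            }
            for (int p = 1; p <= maxP; p++)
                System.out.printf("p=%d  empirical P[no run] = %.4f%n", p, (double) noRun[p] / trials);
        }
    }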
The theory predicts, for example, that you have about a 45% chance of finding, say, six consecutive 0s in those random bitstrings. So that's a fine generalization of the simple example that we did for counting bitstrings that don't have two consecutive 0s.

So now the next thing that we want to do is consider other specified patterns. Say it's not 000 that we're interested in, but 001. So the wait time for 000, the one that I did before, is shown in red, and it averages about 18 for these strings; in blue, where you can hardly see it, are the wait times for 001. In the first line, the first occurrence of 001 starts at the ninth bit; in the second, it's at the fourth bit, and so forth. And this time, the average wait time for 001 in this set of bitstrings is only 6. It's much, much less. And now we're wondering, are these bitstrings really random? What's going on here? Well, the fact is that the pattern itself definitely is a factor in what the wait time is, and that's what we're going to look at next.

So we showed before that the probability that an N-bit random bitstring doesn't have four 0s in a row is about 0.96^N, and the expected wait time is 30. The question is, does the same thing hold for 0001, or any other pattern of four bits? And the answer is no. In every example that we did, 0001 occurs much earlier than 0000. So here's a little intuition behind that. Consider the first occurrence of three 0s in a row. The next bit could be either 0 or 1, so once we've found 000, the two patterns are equally likely to be completed. But the thing is, if we were looking for 0000 and we didn't find it, meaning the next bit is a 1 and we have 0001 in our bitstring, then we're going to have to wait at least four more bits before we can possibly find 0000. Whereas if we were looking for 0001 and we had a mismatch, it means there are four 0s in a row, and the very next bit could give us a match. So if we're going from left to right looking for these two patterns, it's not surprising that we're going to find 0001 first: once we find three 0s in a row, we have a good chance with the next bit, but if we're looking for four 0s in a row and the next bit is a 1, then we're going to have to wait a while before we even get another chance. That's the intuition behind why 0001 occurs much earlier.

But now we want the answer to this question: what's the probability that a random bitstring doesn't contain the pattern 0001? It's a little counterintuitive that the answer is not the same for every pattern of four bits, but it's definitely not, as we'll see. Or, what's the wait time for the first occurrence of 0001? What about other patterns, like 1110 or 1111, and so forth? The idea that it's the pattern itself that determines the probability of occurrence and the expected wait time has to do with a phenomenon called autocorrelation. That's what we're going to look at next. What we'll do is use the symbolic method to get explicit equations for the generating function for the number of bitstrings not containing a specified pattern, and it's remarkably easy to develop such an equation with the symbolic method. So we'll take any pattern p, a pattern of bits, and define S sub p to be the binary strings that do not contain that pattern at all, and T sub p to be the binary strings that end in p and have no other occurrence of p inside. So that's two well-defined sets of bitstrings.
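Before working through the constructions, the wait-time intuition above is easy to check empirically. Here's a quick simulation sketch (my own code; class and method names are made up) that records the position of the end of the first occurrence of each pattern in a stream of random bits, averaged over many trials. The averages come out near 30 for 0000 and near half that for 0001, which is what the analysis below explains.

    import java.util.Random;

    // Sketch: empirical average position of the end of the first occurrence
    // of a pattern in a random bitstring, for 0000 versus 0001.
    public class PatternWait {
        static double avgWait(String pattern, int trials, Random rnd) {
            long total = 0;
            for (int t = 0; t < trials; t++) {
                StringBuilder s = new StringBuilder();
                while (true) {
                    s.append(rnd.nextBoolean() ? '1' : '0');
                    int n = s.length();
                    if (n >= pattern.length()
                            && s.substring(n - pattern.length()).equals(pattern)) {
                        total += n;               // position of the end of the first occurrence
                        break;
                    }
                }
            }
            return (double) total / trials;
        }

        public static void main(String[] args) {
            Random rnd = new Random();
            int trials = 100000;
            System.out.printf("0000: %.2f%n", avgWait("0000", trials, rnd));
            System.out.printf("0001: %.2f%n", avgWait("0001", trials, rnd));
        }
    }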
So what we're going to do is have two different constructions that relate these classes of bitstrings. The first one is based on the following observations. These two sets are disjoint: T sub p has the pattern in it and S sub p does not, so there's no string that's in both of them. The empty string doesn't contain the pattern, so it's in S sub p. The other thing is that if we have a string in S sub p and we add one bit to it, either we get a string that's still in S sub p, or that bit completes the pattern and we get a string that's in T sub p, with that the only occurrence of the pattern. So based on those observations, we have a symbolic equation, or construction, that relates these classes of strings: S sub p + T sub p is either empty, or it's S sub p with a bit added. So that's the first equation.

Here's a second construction that relates these two combinatorial classes. It's based on the idea of what's called an autocorrelation polynomial, which is a property of the pattern. The idea is that we take the pattern, in black, and slide a copy of the pattern to the left over itself; whenever the trailing bits of the pattern match its leading bits, we record a term whose exponent is the number of positions shifted. So at the beginning, with no shift, the pattern matches itself, and we start out with a z^0 in the polynomial. Then we slide over one, two, three, four, five positions, and when we slide over five positions, the trailing bits match the leading bits of the pattern: the 1010 at the beginning matches the trailing four bits. We slid five positions, so we put in z^5. Then two more positions: the two trailing bits match the two leading bits, and that's z^7. So over all the possible slides, the only matches of trailing bits with leading bits are at positions zero, five, and seven. That defines the autocorrelation polynomial of the pattern: for this one, the autocorrelation polynomial is 1 + z^5 + z^7. And that's going to play a role in the development of another relationship between the two combinatorial classes that we just defined.

So this is the second construction. Remember, S sub p doesn't contain the pattern, and T sub p ends in the pattern but has no other occurrence of the pattern. The idea is: take any string in T sub p and add a tail corresponding to one of the shifts in the autocorrelation. For the first term the tail is null; for the second, if you add the five bits that were shifted off the end of the pattern, you get another occurrence of the pattern; and again, if you add the seven bits, you get an occurrence of the pattern. So in each of those three cases, one corresponding to each term in the autocorrelation polynomial, if you have a string in T sub p and you add that tail, then what you get is a string in S sub p followed by the pattern: knock off the occurrence of the pattern at the end, and what's left does not contain the pattern. So every time, the result is a string in S sub p followed by the pattern. So S sub p times the pattern equals T sub p times one tail for each term in the autocorrelation polynomial. That gives us the second combinatorial construction.

So we have those two constructions. One: if you add a bit to a string in S sub p, you get either a string in S sub p or a string in T sub p. The other: if you take a string in T sub p, then for every term in the autocorrelation polynomial you get a string in S sub p followed by the pattern. Those are the two constructions.
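Here's a small sketch (my own code) of that sliding computation: for each shift, check whether the trailing bits of the pattern match its leading bits and, if so, record the shift as an exponent. I'm using the 9-bit pattern 101001010, which is consistent with the 1 + z^5 + z^7 example just described; the exact pattern on the slide may differ.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: compute the autocorrelation polynomial of a bit pattern by
    // sliding the pattern over itself and recording the matching shifts.
    public class Autocorrelation {
        // returns the shifts k at which the trailing bits match the leading bits
        static List<Integer> correlate(String p) {
            List<Integer> terms = new ArrayList<>();
            int m = p.length();
            for (int k = 0; k < m; k++)
                if (p.substring(k).equals(p.substring(0, m - k)))   // last m-k bits versus first m-k bits
                    terms.add(k);
            return terms;
        }

        public static void main(String[] args) {
            String pattern = (args.length > 0) ? args[0] : "101001010";
            StringBuilder poly = new StringBuilder();
            for (int k : correlate(pattern))
                poly.append(poly.length() == 0 ? "" : " + ").append(k == 0 ? "1" : "z^" + k);
            System.out.println(pattern + ": " + poly);              // prints 1 + z^5 + z^7 for the default
        }
    }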
These two constructions are set up for the symbolic method; they use just the basic operations, so they immediately correspond to generating function equations. In the first one, Z_0 + Z_1 translates to 2z, and otherwise it corresponds directly, giving S_p(z) + T_p(z) = 1 + 2z S_p(z). In the second one, the pattern itself has OGF z^p (writing p for the length of the pattern), so S_p(z) z^p = T_p(z) c_p(z), and what's left over, c_p(z), is the autocorrelation polynomial itself. So now we have two equations in S_p and T_p that we can solve, and the autocorrelation polynomial of the pattern is part of the answer. Solving those two simultaneous equations gives S_p(z) = c_p(z)/(z^p + (1 - 2z) c_p(z)). This is the OGF for the class of binary strings that contain no occurrence of the specified pattern. Now for any particular pattern, we can plug in the autocorrelation polynomial, and again we have a ratio of two polynomials. We find the dominant root associated with the denominator, and our count is asymptotic to a constant, which we can compute explicitly, times that root to the Nth power. So this is a methodology that's going to work for any pattern at all.

So here's the end result for 4-bit patterns. Of all possible 4-bit patterns, it turns out there are actually only four different possibilities for the autocorrelation, and that classifies the 4-bit patterns by their autocorrelations. They all start with one, and those are the autocorrelations that can come up. Those lead to these different OGFs, and they all have different dominant roots. I didn't compute the constants here, but the relevant numbers are all pretty close to one. And you can see that this one, whose autocorrelation is one followed by zeroes, gives a slightly smaller number; when we raise that number to the 10th power, we get a smaller number, and at the 100th power it's going to be much, much smaller. It's interesting and maybe counterintuitive, but these are the results. You're going to wait about twice as long for four 0s or four 1s as you are for any one of those six patterns. Or another way to look at it is that 0000, four 0s, is 100 times more likely to be absent in a 100-bit string than these other patterns, 0001, 0110, 0111. So studying this table is maybe a good way to win some bar bets. It's kind of surprising that we can so completely characterize this problem with analytic combinatorics. That's the study of bitstrings with restrictions, and these studies extend in various ways.
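The wait-time argument from before carries over to any pattern: the expected wait is still the generating function evaluated at one-half, and at z = 1/2 the (1 - 2z) term drops out of the denominator of S_p(z), leaving 2^p times c_p(1/2) for a pattern of length p. Here's a sketch (my own code and class name) that tabulates that quantity for all sixteen 4-bit patterns; it reproduces the roughly two-to-one ratio, 30 for 0000 and 1111 versus 16 for the six patterns with no self-overlap.

    // Sketch: expected wait time for each 4-bit pattern, computed as
    // S_p(1/2) = 2^(pattern length) * c_p(1/2), where c_p is the
    // autocorrelation polynomial of the pattern.
    public class FourBitWaits {
        // autocorrelation polynomial evaluated at z = 1/2
        static double cAtHalf(String p) {
            int m = p.length();
            double sum = 0;
            for (int k = 0; k < m; k++)
                if (p.substring(k).equals(p.substring(0, m - k)))
                    sum += Math.pow(0.5, k);
            return sum;
        }

        public static void main(String[] args) {
            for (int x = 0; x < 16; x++) {
                String p = String.format("%4s", Integer.toBinaryString(x)).replace(' ', '0');
                double wait = Math.pow(2, p.length()) * cAtHalf(p);
                System.out.printf("%s  wait time %.1f%n", p, wait);
            }
        }
    }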