Now we're going to look at tries, which is a class of common material objects that not really has only come under study in recent years due to computer applications. They're not found in classical common but they're actually a very interesting rich combinatory analytic properties. So I'll spend some time really talking about what tries are and the application before I talk about the analysis. So when we look at the tries just as a binary tree, where the external nodes can be marked. The black ones are called void nodes and then the white ones are nonvoid. So you take a binary tree and mark some of the external nodes to be void. Now there's a rule and that is that you can never have two void nodes that are siblings of each other in a leaf. So that a sibling of a void node is not void, that's the rule, so that's what a trie is. Now that seems kind of arbitrary but you'll see when we look at applications, how these rules play a role. Actually for an exercise, I might give a recursive definition of what a trie is. So the most usual way we think of trie is as representing a set of good strings. Each trie corresponds to a set of bitstring where each nonvoid, external node represents one bitstring. And you get the big string by taking the path from the root to a node, taking 0 when you go left and 1 when you go right. For example, if we choose 00, left, left, right, right, left, we get down to that nonvoid external node. And we say that node represents the bitstring that we got that defines the path that we got there, 00110. Now over here, 1010, that nonvoid node represents the bitstream 1010. The path from the root to a node defines the bitstring. So now the void nodes, I have a different interpretation that derives right from this definition. For example, if we go right four times and then left, we come to a void node. What that means is that no string with that prefix is in the set of strings represented by this trie. The trie represent in this case one two, three, four, five, six, seven bitstrings and in the void node represent the prefix of all the bitstrings that aren't represented. So that's what we think of a try corresponds to set of bitstrings or another way to look at it, is it's all worked out here. So this trie represents as I said, seven different bitstrings, and those are shown at the top. Now on the left is the bitstrings from that set that start with 0, and trie on the right represent the bitstrings of that set start with 1 with the 1 script of. And so recursibly going down, that's another way to see the sets of bitstrings that are represented. Now this only works for sets of bitstrings are said to be prefix free. So that is no member of this set of bit strings is a prefix of another one in that set. We can handle that by using void and not void, internal nodes. But in applications, I'm going to talk about that are typical, it's okay to just work with prefix free sets. So for example, fixed length, all those bit strings are the same length then it's prefix free. Because they're all different, they're all the same length and they're different. You can't have one being prefixed not another, and that's a typical and useful application of tries. So this is a trie that represents those three strings and lots of applications of tries if you look in our algorithms book you'll find trie code for sorting. For simple tables with string keys and for suffix arrays which I'll refer to in a minute. They play a role in classic data compression algorithms Huffman code. And Lempel–Ziv–Welch compression and we'll look at the use of tries to understand decision making collision resolution leader election algorithms. They play a very important role nowadays in network systmes and bioinformatics, Internet search, all kinds of commercial data processing. Very important data structure that's often overlooked that's where I'm taking the time to talk about some of these applications to motivate the analysis because it's not so easy to analyze as we'll see. So here's the basic application which is for symbol tables. Trie represents a set of bitstrings so what we have is, just going from the definition. A search algorithm for determining whether a given bitstring is in the set represented by the trie. And the basic idea is if the leading bit of your key is 0, go to the left, if it's 1, go to the right then use the remainder of the string recursively. If you get to a void external node, it means that the one you're looking for is not in the set represented by the trie. If you get to a nonvoid external node and you're at the end of your bitstring, then you report success, you did find the key. So for example, let's say we're going to search for the bitstring 0011 in this tri. Start with a 0, go to the left. Next one's a 0, go to the left. Next one's a 1, go to the right. Next one's a 1, go to the right. We're at the end of our string and we're on a nonvoid external node. So that string is in the set of business strings represented by our trie. Let's look for 10110. So we start with a 1 and go to the right, 0 go to the left, 1 go to the right, 1 go to the right. We hit a void external node, so that string is not in the set represented by our trie. Very natural search algorithm have the trie represent a set of bitstring. Of course everything can be represented as a bitstring, so this is a natural algorithm for anything represented by in a computer. So of course we're going to want to for an algorithm like this, want to know what's the expected search time under reasonable model of randomness, that's a type of thing that we want to analyze. Now what about inserting new keys or a new bitstring into the set represented by the trie. Well what we'll do is we'll insert by searching until we get to a void external node. So if we wind up at a internal node or nonvoid external node that means that we'll have a prefix free violation and we can deal with that in some way. But insert a new key, it's going to wind up at a void external node. So to insert 01110, we go left for the 0, 1 right for the 1, and now we're at a void external node. So now what we want to do is for each remaining bit in our key that we want to insert we want to add a new internal node. And with one void external child and then the other one corresponding to our bit. So in this case our next bit is 1. So we put if the key start at 010 then it is not in the set of strings, but 011 that could be this one. Then we do it again for another one, and then the next bit is 0. So we put the void external node to the right and nonvoid to the left. So that's we would insert 01110 into this trial. Now there are variants where you just keep track of the tail in some way with pointers and people, and those are well studied, and there's lots of reasons to do that sort of thing. But the simplest version also is very effective and that's what we'll stick with right now. So that's a search algorithm and an insertion algorithm, that gives the basis for simple table using the trie data structure, which is a very useful algorithm. And then a natural question that probably occurred to many of you is, what about these void external nodes? It seems kind of a waste to have all these void external nodes there in this data structure. And so we're going to want to analyze how many there are to make sure that we understand how much space it trie it takes. But it's a very compact data structure and that analysis is certainly interesting and relevant in practice. So here's another application of tries, what we want to do is we have a given string, S. And just for an example, I'll use a genomic string made up of As, Cs, Ts, and Gs. And these things could be huge, it would be billions of letters. And we want to know is a particular substring in our string for this string, is ACCTA in there? And the answer is yes, starting at 0. What about CCT? Yeah, there's plenty of places where CCT occur in this string. What about TGA? No, there's no occurrence of TGA. So we have a specific string that's huge and we want to be able to quickly do substring search. There's all kinds of application in genomics where it's important to be able to do an operation like this quickly. So searching genomic data, and this is also useful in Internet search. When you do a Google search you not only get to the page that you're looking for, but you get context you get where the substring is in that page. And that uses data structure like this and many other applications. So the solution method that I'll talk about is the so-called suffix multiway trie. Generalizing the trie for this problem and so the idea is if you have given string well what were going to do is work with all the suffix of the string. So the original string, ACCTA GGCCT, we leave off the A, then we have CCTA, and so forth. So if the string is of size N, then we have N strings for all the suffixes. And we're going to treat those as different strings, and just insert them into the trie, that's called a suffix trie. You now notice that's prefix free, none of these is a prefix of another because they're all different lengths. It's a prefix for each set, and we have them all end with a character that isn't found anywhere else in the trie. So then the idea is that every internal node of this trie corresponds to some substring of our original string. So to answer the question, is x a substring of x? We use the characters of our query to traverse the trie. So for example, if we're looking for ACCTA then [COUGH] when we get to a nonvoid external node that's tells us that that one is in there and tells us what position it is so we built the trie AC. Now if we see something that starts with AC, then we look starting with position 0 and continue our search and this CCTA is there. If we encounter a void node then so for example, if CCT is there because we found that in the internal node. But something like TGA, we end up at a void node and router there. If we reach the end of the string then we're fine, if we hit a void node we're fine, where a node that is not there. So this is a very simple algorithm to answer this is x a substring of s question, and a fine application of tries. And again the number of void nodes and what does it mean to be a random trie and how long is the search and so forth. All of these kinds of questions are going to be important and relevant in practice. Here's another application of tries. Tries is a model for an algorithm, so called leader election algorithm. This is important in distributed systems, so the idea is you have a group of individuals. And they would need to elect a leader, and what they're going to do is each flip a coin. So it's than distributed, there could be a large number of them, they'll each flip a coin and we'll count 1s as winners and 0s as losers. So everybody that gets a one when they flip a coin survives for the next round and the ones that get zero are gone. So now we just worry about the ones that got ones again they each flip a coin. In this case, the two green ones get a zero, so they're the losers and they're eliminated and only the ones that got one continue to the next round. They each flip a coin, again two are eliminated and three are left, the ones that got ones. Each flip a coin, in this case, they all get one so they all advance to the next round. And now we have a void node, and then again nobody eliminated. Now they each flip a coin, and only the blue one survives, so that's the winner. So this is a very simple method for choosing a leader in a distributed manner among n people. Now there's a possible problem, and that is a procedure might fail. What if they all throw 0. Then in that case they are all losers, there is no winner, procedure might fail. So obviously, we are going to be interested in what's the chance of failure. Well it's the probability that the right most path in a random trie ends in a void node. And what do I mean by a random try? Well it turns out that the model that associates with this, and also works for symbol tables is, it's a trie that you get by inserting infinite length random bitstrings into an initially empty trie. So there's three diverse applications of the trie data structure for a simple table, for substring search and for distributed leader election. And all of this is to motivate studying this in a combinatorial structure. So for distributed leader, what I want to know is what's the number of rounds? Well it's the expected length of the rightmost path in a random trie, that's how many rounds? And then you're probability of success from that you could calculate how effective this method is going to be. So that's a brief description of tries and next we'll look at analysis and tries parameters.