We ended with the idea that behind every complex system there is a network that defines the components. What we've understood now is that the components of these networks are nodes. If you're from mathematics, you'll normally refer to these as vertices; nodes and vertices are the same thing. The same goes for the terminology graph versus network: a network is normally said to have nodes and links, whereas a graph has vertices and edges, but they mean the same thing. The interactions between those nodes, or vertices, are the links or the edges. To define the whole system, the whole graph, you only need to know two things: the set of nodes and the set of links. That defines our entire complex system.

Now, what I will do is start to give you an idea of the first kind of metrics that you can measure once you have a network. Let's say you've got your own data set now; you can think about what the nodes are, and you can think about the links. Then you can measure certain things about the network. As soon as you measure those things, they can already tell you something about how robust that network is, how quickly information might spread on it, how likely it is to have a cascading failure like the power system, and so on.

The first one is the degree, which you've come across. The degree of a node is the number of neighbors that it has, the number of immediate neighbors at level 1. The average degree of a graph, what do you think that is? Just the average over all nodes. You take the average of all the nodes. Then the degree distribution that was mentioned a few times this morning, did everyone understand what that was? Yeah. Exactly. If you take the degree distribution of this example and make it discrete, then you can say, for degrees 0, 1, and 2: there's one node with two links and there are two nodes with one link. That's it.
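To make that concrete, here is a minimal sketch in plain Python. The three-node network and its node names are invented for illustration; it is stored as an adjacency list, and the degree and discrete degree distribution are read straight off it:

```python
from collections import Counter

# Hypothetical toy network as an adjacency list (undirected):
# B is linked to both A and C, matching the example in the text.
graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
}

# Degree of a node = number of neighbors.
degree = {node: len(neighbors) for node, neighbors in graph.items()}
print(degree)  # {'A': 1, 'B': 2, 'C': 1}

# Degree distribution: how many nodes have each degree.
distribution = Counter(degree.values())
print(distribution)  # one node with degree 2, two nodes with degree 1
```

Dividing each count by the number of nodes turns the distribution into probabilities, which is the continuous-form version mentioned later.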
This axis is k and this is the number of nodes; you can change that to a probability as well. That's what we'll cover now. These are some examples. In this case, the degree of A is 1 and the degree of B is 4. Depending on the type of data you have and the type of network you're looking at, you might also have a directed network, where the relationship is not bidirectional. One example you can think about here is something like Facebook: you have links between your friends, and that's usually bidirectional, because if you're friends with me, I have to be friends with you on Facebook, although Facebook has changed this somewhat over time. But Twitter is very different. You can follow one way, so there you have a one-way relationship; in Twitter you don't have to follow the person that's following you.

In a directed network, you can mention degrees of different types. You can talk about the overall degree: the degree of C is still three because it has three edges. But now you can also define the in-degree and out-degree. The in-degree is the total number of links coming into the node, which is two in this case, and the out-degree is the total number of links going out from the node, which is one. Of course, the total degree should just be the sum of the in- and out-degree. That's what you expect. Then you normally have the terms source and sink. A source is a node that has no incoming links, and a sink is a node that has no outgoing links; everything is going into it. The power system is usually a directed network. Can anyone think of another type of directed network? Say again. Roads, yeah. Singapore is almost like that; if it's a one-way system, you could have that. Anything else? Water, gasoline? Yeah, pipelines are usually one way. Twitter? Yeah. Instagram is also a follower-based thing. What about biological networks? Yeah, because they're not necessarily two-way: a protein can regulate another protein, but not necessarily the other way around.
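As a sketch of the same bookkeeping in code, here is a hypothetical directed edge list invented to reproduce the numbers in the example: node C ends up with in-degree 2, out-degree 1, and total degree 3, and the sources and sinks fall out of the same counts:

```python
from collections import defaultdict

# Hypothetical directed edge list: (u, v) means a link from u to v.
edges = [("A", "C"), ("B", "C"), ("C", "D")]

in_degree = defaultdict(int)
out_degree = defaultdict(int)
nodes = set()
for u, v in edges:
    out_degree[u] += 1   # link going out of u
    in_degree[v] += 1    # link coming into v
    nodes.update([u, v])

# Total degree is the sum of in-degree and out-degree.
print(in_degree["C"], out_degree["C"], in_degree["C"] + out_degree["C"])  # 2 1 3

# A source has no incoming links; a sink has no outgoing links.
sources = sorted(n for n in nodes if in_degree[n] == 0)  # ['A', 'B']
sinks = sorted(n for n in nodes if out_degree[n] == 0)   # ['D']
```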
Again, depending on the type of problem and the data set that you're thinking about, you may represent it as a directed or an undirected network. Then the average degree is pretty straightforward: it's just the sum of all degrees divided by the total number of nodes. This is normally denoted ⟨k⟩; if you see that symbol anywhere the rest of the week, it just means the average degree. What about the expression on the right, 2L divided by N? That's another way you can calculate the average degree. L is the number of links, N is the number of nodes. Every link, every edge, has two endpoints, so it increases the total degree of the network by two. Multiply L by two and you have the total number of endpoints in the network; divide that by N, and that's the same as the average degree. Everyone gets it? Yeah.

Then of course, for the in- and out-degree, why does this hold? If you sum up all the in-degrees of the network and you sum all the out-degrees, then since every link has a start point and an endpoint, they should be the same: the total in-degree and the total out-degree are equal. It's the same idea here, but without the factor of two, because each link contributes only one endpoint to the in-degrees and one to the out-degrees.

That's your average degree. If the average degree of the network is very high, what does it mean for disease spread? It's going to be quicker. It's what? It's going to be quicker. It's going to be quicker. Because as soon as you put a disease on a node, every node will have a very, very high number of neighbors, so it will spread very, very quickly. That's the expectation. It's not always true, but all other things being equal, you'd expect it to be faster. Degree distribution: this is another network like I just showed here. You can do this in discrete form.
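Both identities are easy to sanity-check numerically. This sketch uses small invented graphs: an undirected one for ⟨k⟩ = 2L/N, and a directed edge list for the in/out-degree balance:

```python
from collections import Counter

# Undirected case: average degree equals 2L / N.
# Hypothetical 4-node graph with 4 edges (A-B, A-C, B-C, C-D).
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
N = len(graph)
endpoint_count = sum(len(nbrs) for nbrs in graph.values())  # counts each edge twice
L = endpoint_count // 2
avg_degree = endpoint_count / N
assert avg_degree == 2 * L / N  # here: 8 / 4 = 2.0

# Directed case: total in-degree equals total out-degree, because every
# link has exactly one start point and one end point.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
in_deg = Counter(v for u, v in edges)
out_deg = Counter(u for u, v in edges)
assert sum(in_deg.values()) == sum(out_deg.values()) == len(edges)
```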
In this case, 60 percent of the network has degree one, 10 percent has degree two, 20 percent has degree three, and 10 percent has degree four. The sum of these should be one. You can also do this in continuous form: if you know what type of network it is, you can actually write down an equation which describes the average degree and the degree distribution. That's a simple example. Then you start to get real data, and this is one of the hairballs that Peter was talking about. You look at that and you think, okay, what does that mean? Well, what you can take from it are these measures. The degree distribution is often plotted on a log-log scale, because you're looking for a power-law distribution, which is linear on the log scale. What you're seeing here are the hubs; that's the term for the nodes with a really, really high number of neighbors. These are the nodes here with a degree of 100; their frequency is very, very small, so there are only a few of them. That's the average degree and the degree distribution.

Something else that you can measure for the network is the average path length, or the diameter. For the average path length, what you can do is take every pair of nodes in the network and calculate the number of hops between them. Remember the Kevin Bacon and [inaudible] that I just explained. If you take all the scientific authors in the world and make a big network out of that, you can take the average path length, the average number of connections between all of them. The actual six degrees of separation, the Milgram experiment which Peter was referring to, the six handshakes away: that is a measure saying that the average path length of the entire human race is six. On average, I only have to go six hops to reach any other person in the world.
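Going back to the discrete distribution above: those percentages are just normalized counts. A sketch with an invented ten-node degree sequence, chosen to match the fractions in the text, shows the conversion to probabilities:

```python
from collections import Counter

# Hypothetical degree sequence for a 10-node network, chosen so that
# 60% have degree 1, 10% degree 2, 20% degree 3, 10% degree 4.
degrees = [1, 1, 1, 1, 1, 1, 2, 3, 3, 4]
N = len(degrees)

# p_k = fraction of nodes with degree k; these must sum to one.
p_k = {k: count / N for k, count in Counter(degrees).items()}
print(p_k)  # {1: 0.6, 2: 0.1, 3: 0.2, 4: 0.1}
assert abs(sum(p_k.values()) - 1.0) < 1e-12
```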
Think about the average path length as taking every pair of nodes in the network and working out how many hops you have to make. If we look at all your Facebook networks, we can work out how many hops away you are from every other person in the room, and we can work out the average length of that. If the average path length is short, what does it mean? Well, it means information will transmit very easily to the rest of the network. Obviously, if you're calculating shortest paths in a directed network, you have to take the arrows into consideration. Yeah. [inaudible] What we're talking about at the moment is completely unweighted edges, but you can then make models where the edges have weights. In the scientific co-authorship network, for example, you might increase the weight of an edge by the number of papers we published together, or the number of films we've starred in together, or the number of times I've mentioned you in one of my posts on Facebook. You can get this data and then estimate the likelihood of transmission in a more refined form.

I'll tell you another story about my students. They're very creative. One group was given the task of finding their own research question. One guy had kissed a girl that he shouldn't have kissed at a party. His experiment was to use his Facebook network to find out the probability that his girlfriend would find out about him having kissed this girl at the party. He then made a poster of this project, and they put it outside in the foyer. Then she found out. But there he was looking at things like: take these two people in my friendship network, how many times do they communicate with each other on Facebook? If it's a lot, then the probability of transmission is higher. Or you can assume a null model where everyone is equally likely to communicate. Or you can say, okay, if we have a lot of friends in common, then we're more likely to communicate than if we don't.
All of these things are possible. But what we're talking about now is simple equal nodes and simple equal edges, just looking at the structure and the topology of the network. That's the average path length. The other thing that you can measure is the network diameter. The diameter of a network is the longest shortest path. Does that make sense? It takes a bit of time to think about, but you can take every pair of nodes in the network and calculate the shortest path between them. What I'm looking for is the pair of nodes whose shortest path is the longest. That tells me how far the network spreads. If I have a network which looks like this, the diameter is the longest shortest path. The shortest path between adjacent nodes is just one hop. The longest shortest path is between node 2 and node 5, counting hops 1, 2, 3, 4. The larger the diameter of the network, the more spread out it is, and the longer it will take to get information or a disease across the entire network.

[inaudible] You're a bit ahead. It is different from betweenness centrality. Betweenness centrality is about how many shortest paths a node lies on: you calculate the shortest path between all pairs of nodes, then you take a single node and ask how many of those shortest paths that node lies on. What we're looking at here with the diameter is not [inaudible] betweenness centrality; it is not looking at how much traffic passes across a certain node or how connected it is to [inaudible]. You measure betweenness centrality on a node or an edge, and what it tells you is how many shortest paths that node or edge lies on. In other words: if I transmit a random message across the network, what's the likelihood that it will go through this node? If I'm a sniffer, or the FBI, or something like this, I want to be at a node with high betweenness, or an edge with high betweenness, because then I'm more likely to be able to intercept information.
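The diameter can be sketched with a breadth-first search from every node; the five-node path graph below is invented so that, as in the example, the longest shortest path is four hops:

```python
from collections import deque

def bfs_distances(graph, start):
    """Shortest-path length (in hops) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Hypothetical path graph: 1 - 2 - 3 - 4 - 5.
graph = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}

# Diameter = the longest of all the shortest paths.
all_dist = {u: bfs_distances(graph, u) for u in graph}
diameter = max(d for dist in all_dist.values() for d in dist.values())
print(diameter)  # 4, between the two end nodes of the path
```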
You've heard of this Tor network? It's a hidden network where criminals and freedom fighters communicate. You can look at it as a complex network and try to find the nodes with high betweenness centrality, and then you might have a chance of intercepting the information, but I'll come back to that later.

The diameter is quite easy to calculate, and so is the average path length. What you normally do for the average path length is the following. You sum the path lengths over all pairs of nodes: you calculate the distance from node A to nodes B, C, D, E, F, G, then from B to C, D, E, F, G, and so on. Then you normalize this by the factor 2L_max. L_max is a normalization factor that's used a lot; it represents the maximum number of links you can have in a network with n nodes, which is given by n times n minus 1 divided by 2. Do you all understand how that's calculated? No? It's pretty simple. If I have a network with five nodes, I can connect this one to four others. I can connect the next one to four as well, in principle, but then I'd repeat an edge. I can do the same thing here, so it's always n minus 1, and this is for every one of the n nodes. Then because I'm counting every edge twice, I divide by two, so it's n times n minus 1 divided by 2. I'm slowly, well, quickly running out of time. I don't think I need to show you this again; for reference, it's the same thing, a more explicit version with the average path length calculated.

We've got the degree distribution, the average path length, and the diameter. These are all metrics that you can measure from a network now. As soon as you load the network, you can measure these things and you can intuitively say something about how they affect dynamical processes on the network. There's another measure which is usually used, and then I'll stop, which is the clustering coefficient. The clustering coefficient tells you how clustered the network is.
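A sketch of that normalization, under the assumption that the sum runs over all ordered pairs (each pair counted in both directions, hence the 2 in 2·L_max). The four-node graph is invented for illustration:

```python
from collections import deque

def bfs_distances(graph, start):
    """Shortest-path length (in hops) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Hypothetical connected graph: triangle A-B-C with a tail C-D.
graph = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
N = len(graph)
L_max = N * (N - 1) // 2  # maximum possible number of links: 6 for N = 4

# Sum distances over all ordered pairs, then divide by 2 * L_max = N(N-1).
total = sum(d for u in graph for d in bfs_distances(graph, u).values())
avg_path_length = total / (2 * L_max)
print(avg_path_length)  # about 1.33 hops on average
```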
Every node has a clustering coefficient, and what it measures is how much my neighbors are connected to each other. If I have this node, which we call i, it has four neighbors and none of the neighbors are connected to each other. That means the node is not clustered. The definition of clustering is that your neighbors are also connected: like this, most of my neighbors are also connected. So what fraction of my neighbors are connected? Of all possible connections between my neighbors, how many do I actually see here? Anyone? Sorry, can you repeat that again? I can think about how many connections are possible between my neighbors. How many connections are possible between the neighbors of i here? I've got this one connected to this, this one connected to this, this one connected to this, and this one connected to this. No, one's missing. One missing? Yeah. Three missing. Three missing? Somewhere in between: this one and this one could be connected as well, and this one and this one could be connected. The only [inaudible]. The only what? [inaudible] No, it doesn't have to go through this; it's a direct connection between my neighbors. The edge can go around, that's no problem. This could be in 3D space, it could be any dimension. [inaudible]

Let me redraw this. I have a node i, and these are the neighbors of i. What I will do is make this a dashed line. Then I want to ask: how many connections are possible between j, k, l, and m? I know that these four are the neighbors of i, so I remove i from the equation, and then I have a small network of four nodes. How many possible links can I see between these? It's the same thing, right? Six. It's actually the same formula, N times N minus 1 over 2. But in this case the number is not N, because we're not talking about the whole network; we're talking about the neighbors of node i. It's actually k_i times k_i minus 1, divided by 2, where k_i is the degree of node i. In this case, it's 4 times 3 divided by 2, which is 6.
Now the clustering coefficient just says: what percentage of the possible neighborhood connections do I actually see in the graph? If I show you the original one again, this is node i. I only see four connections out of a possible six. That's 4 divided by 6, which is about 0.67, so that node has a clustering coefficient of about 0.67: two-thirds of its possible neighborhood connections are actually seen. Try again; I'll give you another one. I'll do a smaller one first. In this case, node i has a degree of 3, and I'll say that only this pair is connected; none of the other neighbors are connected. What's the clustering coefficient here? It should be a number between 0 and 1. I heard it, I think. Well, the total possible number of connections between my neighbors is 3 times 2 divided by 2, which is 3, and I actually see one of them. So it's 1 divided by 3: about 33 percent clustered. That's it.

If I write down a more general expression for this, I say I know that the degree of node i is k_i, and that in the real network I see e_i links between the neighbors. In the one I just showed you, k_i times k_i minus 1 divided by 2 equals 3, and e_i equals 1. Then the clustering coefficient is e_i divided by k_i times k_i minus 1 over 2, which is just 2 e_i divided by k_i times k_i minus 1. That gives you the clustering coefficient of a node, where e_i is the number of links you see between the neighbors. That's it, which is this. You can check that that's all correct. In this first case, we've got four neighbors, i is connected to all of them, and every neighbor is connected to every other neighbor, so all six links out of a possible six are realized. In this case, none of the neighbors are connected, so you've got 0 divided by 6. In this case, three are connected, so you've got 3 divided by 6, which is a half. I can then measure that for every node in the network and calculate the average. What does that tell me about the network? If the average clustering coefficient is high, what does it mean?
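The formula C_i = 2e_i / (k_i(k_i − 1)) is short enough to sketch directly. The graph below is invented to match the small example: node i has three neighbors, and only one of the three possible neighbor-neighbor links exists:

```python
def clustering(graph, i):
    """Local clustering coefficient C_i = 2 * e_i / (k_i * (k_i - 1)),
    where e_i counts the links between the neighbors of i."""
    neighbors = graph[i]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no possible neighbor links
    # Each neighbor-neighbor link is seen from both ends, so halve the count.
    e = sum(1 for u in neighbors for v in graph[u] if v in neighbors) // 2
    return 2 * e / (k * (k - 1))

# Hypothetical graph: node "i" has neighbors j, k, l, and only j and k
# are linked to each other.
graph = {
    "i": {"j", "k", "l"},
    "j": {"i", "k"},
    "k": {"i", "j"},
    "l": {"i"},
}
print(clustering(graph, "i"))  # 1 of 3 possible neighbor links, about 0.33
```

Averaging `clustering` over every node gives the average clustering coefficient discussed next.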
It's densely packed. A sparse network, by contrast, is one where you don't see many of the possible links realized; here the clustering coefficient is high. You can also look at how the clustering coefficient is distributed among the nodes. It could be that you have a core which is very, very clustered and a periphery which is not. If a disease starts at the periphery, it might not get further than the edge of the network; but if you put it in the highly clustered part, it will spread quickly. What I tell my students when they download their Facebook network and analyze it: if it's highly clustered, you're in trouble. Just don't misbehave, because people will find out rumors about you very, very quickly.