We're on to the code. I warn you now that if you don't have a powerful computer, you probably won't be able to keep up with me, but you should be able to at least start executing. I import the libraries as we did yesterday. Now, I use a few more things. This tqdm is important here because it gives you a progress bar when things are running. It will tell you what percentage of the experiments is done. Otherwise, you might be sat there for two days waiting for the thing to finish, which I did once. Now, I define some functions. Functions are a way, in any programming language, Python as well, to make encapsulated code that you can reuse again and again and again. I have three of them: fail, attack_degree, and attack_betweenness, and in the parentheses is a graph. What these functions will do is this: fail will randomly remove a node from the network G, attack_degree will remove the node with the highest degree from G, and attack_betweenness will remove the node with the highest betweenness from the graph (there's a sketch of these below). I'll use those things, so when you see those terms later on, you'll understand what they're doing. You can see there are comments after every line, so you can read through and understand exactly what each command does. Then I talk about the artificial networks, or the model networks, that we're working with, which are the Barabasi-Albert model and the random network model. Then I define some experiment parameters. In order to be able to compare the two types of network, you obviously have to make sure the networks are the same size. What we've got is two parameters for the two models. Can you remember what they are? Ignoring N. There's a parameter for the random network model, which was p, which is a probability. There's a parameter for the Barabasi-Albert, which is m, which is the number of links that you add for every new node. In order for that comparison to be somewhat fair, what we should do is create networks that end up with the same number of links. (An ER network has an expected pN(N-1)/2 links, while a BA network ends up with roughly mN links, so setting p ≈ 2m/(N-1) matches them.) Because if we're removing nodes, we're measuring path length, we're measuring the giant component, and if you've got a network with many more links, it's going to be more robust. That's why the BA m and the ER p values are set the way they are. Then I've got something called repetitions. What repetitions means is that in any simulation-based experiment, you don't do the thing once. In order to get confidence in your results and make sure that you actually are seeing a significant difference, you have to make some repetitions. Then what you will measure is these plots of robustness, and see if there is a significant difference between the robustness of the BA versus the ER networks. In this case, I'm going to be a bad scientist and do one repetition, because otherwise we'd still be watching the screen for a while. Then there's a value telling me where to stop on the x-axis: I'm going to keep removing nodes, in this case, until 95 percent of all the nodes have been removed, so I'm left with five percent. I could take that to 100 percent or whatever, but I want to at least see the pattern that's going on. Then all I do is convert that fraction to a number. If it's 95 percent and I have 1,000 nodes, I have 950 nodes that I have to remove, so the number of removals will be 950 in this case.
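Since the notebook itself isn't reproduced in this transcript, here is a minimal sketch of what those three removal functions might look like. The names follow the description above, but the exact bodies are an assumption, not necessarily the notebook's code:

```python
import random
import networkx as nx

def fail(G):
    # Failure: remove one node from G uniformly at random.
    node = random.choice(list(G.nodes()))
    G.remove_node(node)

def attack_degree(G):
    # Attack: remove the node with the highest degree.
    node = max(G.degree(), key=lambda nd: nd[1])[0]
    G.remove_node(node)

def attack_betweenness(G):
    # Attack: remove the node with the highest betweenness centrality.
    # Recomputing betweenness after every removal is what makes this slow.
    bc = nx.betweenness_centrality(G)
    node = max(bc, key=bc.get)
    G.remove_node(node)
```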
Now, what I do is I create my test networks. In NetworkX, I have this very simple command: nx.barabasi_albert_graph. It takes the network size, which in this case is 1,000; m, which is our parameter; and r which is, I have to remember. I'll tell you what r is in a while. Oh no, sorry, this is the number of repetitions that I'm going to create. The point is, if we're doing multiple repetitions, we're going to generate 10 networks which are random and 10 networks which are Barabasi-Albert; we're not just going to generate one network. We have to make sure the results hold across many different instances of the same kind of network. Let me make sure I run this. So now what I do is I take all the networks that I've just generated, these artificial networks, and I measure their statistics. I measure the path length, I measure the diameter, and I measure the fraction of nodes inside the giant component. These are all functions again that I will use later on. Whenever you see diameter and average path length, they give me back the diameter of the network G and the average path length. These are just utilities that I'll use later on. NetworkX does have those built in, but I found them a bit slow, so I re-implemented them myself and it was a bit faster. Okay, we now have some measures. Those are the statistics of the BA networks. What it's printing out is the diameter, which is six, the average path length, which is 3.4546, and the fraction of nodes currently inside the giant component, okay? Those are the three measures that we're interested in. I can do the same thing for the Erdos-Renyi, the random networks that I generated, and I see it's similar: I have diameter seven in this case, an average path length of 3.5, and obviously, initially, all the nodes are in the giant component. Now, this is the thing that's going to do all of the experiments. It's going to take all of the networks that I've just generated. In this case, I only generated two, because I only had one repetition, so I have one random network and one Barabasi-Albert network. Then I'm going to remove nodes one by one and measure the statistics at some frequency. I'm going to move along this graph removing nodes and measuring the average path length, the fraction of nodes inside the giant component, and the diameter. Now, I have a special option which allows me to only measure this every 10 removals, or 20 removals, or 30 removals. Why do I do that? Because what? To visualize it. Well, I want to measure it, that's the experiment, right? I want to track what's happening, but why don't I do it at every time step? There's no need. Well, there might be a need; if I could be bothered I would do it at every time step, but it takes a long time. In order to calculate the average path length, I have to measure the shortest path between every pair of nodes. If I have a network of size 1,000, that's 1,000 times 1,000 paths that I have to calculate. If I had 10 repetitions, that's 10 times, what did I say, 1,000 times 1,000, so that's one million; it's 10 times a million. Then I'd have to do it at every time step, and I said there are 950 time steps, so it's 950 times that big number I've just mentioned. It gets slow, it gets really slow, and there's no real way around that. Let's see, I don't know what my reporting frequency is. I think it's around here somewhere. I'll just try and do this. Here, so I measure every 20 removals. Every 20 removals I will measure the statistics. I'll show you the difference.
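As a sketch of how this fits together: the lecture uses its own, faster re-implementations of diameter and average path length, which aren't shown, so this illustration falls back on the slower NetworkX built-ins. The concrete values of m, and the p matched to it, are assumptions for illustration, not values taken from the notebook:

```python
from tqdm import tqdm
import networkx as nx

N = 1000                      # network size
BA_M = 2                      # links per new node (assumed value)
ER_P = 2 * BA_M / (N - 1)     # matched so both models end with ~the same number of links
R = 1                         # repetitions (one here, as in the lecture)

ba_nets = [nx.barabasi_albert_graph(N, BA_M) for _ in range(R)]
er_nets = [nx.erdos_renyi_graph(N, ER_P) for _ in range(R)]

def run_experiment(G, remove_fn, n_removals, report_every=20):
    # Remove nodes one at a time; record statistics every `report_every` removals.
    results = []
    for step in tqdm(range(n_removals)):
        remove_fn(G)
        if step % report_every == 0:
            # Path lengths are undefined on a disconnected graph,
            # so measure them on the giant component only.
            giant = G.subgraph(max(nx.connected_components(G), key=len))
            results.append({
                "removed": step + 1,
                "diameter": nx.diameter(giant),
                "avg_path_length": nx.average_shortest_path_length(giant),
                "giant_fraction": giant.number_of_nodes() / G.number_of_nodes(),
            })
    return results
```

Calling run_experiment(ba_nets[0].copy(), fail, 950) would then correspond to the failure run described here.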
This is what it's telling me now. This is one of the experiments that it's doing; it's got 13 percent of the way through. I'll be interested to see what your performance is. This is telling me how many iterations per second I get. Why is it getting faster? Because the network's getting smaller. There are fewer paths to calculate and the diameter is getting smaller. It will always be slowest at the beginning. Okay, all experiments are finished, that's good. Now we go into plotting the failures, so there's some code in here which will just plot the graphs. Let's see, yeah, it's done. Okay. Now we can look at our results. This is the diameter of the network. The diameter of the network initially was six and seven; remember, we're only measuring every 20 removals here. Counting along the axis, 20, 40, 60, 80, and so on, about 220 nodes were removed before we even affected the diameter. That's just pruning things from the network, and we're talking about failures here, not attacks. Then gradually we start to increase the path length, and then at some point we reach this maximum, and that's when we've got a very long strand of nodes. After that, we start to cut the stranded nodes in half and the whole thing starts to collapse. I wouldn't say there's conclusive evidence about anything here; it's hard to say which one is doing better. Average path length: the Barabasi-Albert is always slightly under, so you don't see the big impacts as much. But if we want the average path length to be short, like in the airline network, then the scale-free property, the scale-free structure, is actually helping us against the failures. Then this is our fraction of nodes inside the giant component, comparing the random network with the Barabasi-Albert. As we expected, so this is our fraction of nodes removed: by around 80 percent removal the thing has fallen apart entirely, but already at about 25 percent removal we start to see a deviation. The Barabasi-Albert network is a bit more robust when it comes to failures. Now we have the attacks. I do the same thing, but now instead of just randomly failing nodes, I'm targeting the ones with the highest degree. Did anyone manage to run this? Is anyone running faster than me? Yeah? No, I don't think so. Now we can plot the attacks as well. These are the attacks. What's the difference between the attacks and the failures? With the attacks, by 20 percent of all nodes removed, the Barabasi-Albert has basically deteriorated to a long string, or a few long strings; it won't be one, it might be a few. Then after that you start to attack these long strings and the thing falls apart. The diameter is peaking and then dropping, and there's really a big, significant difference between the scale-free and the random network. You see the same thing, you always get exactly the same pattern, for the average path length, which is what you would expect. Then this is the fraction of nodes inside the giant component. The BA, again, already by 24 percent, has completely fallen apart. Now, if you have a dataset and you look at the structure of that network, and you look at the degree distribution, you can immediately say something about whether it's robust and what it's robust to. Is it robust to attacks or is it robust to failures? Now we'll try the airline network. Did everyone manage to get that? You ran the whole thing and it's okay? Now, because I have these functions defined, I can quite easily run the same experiments on any real network.
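The plotting cells themselves aren't shown in the transcript. A minimal matplotlib sketch of the kind of plot being read here, assuming results in the format produced by the run_experiment sketch above, could be:

```python
import matplotlib.pyplot as plt

def plot_giant_fraction(ba_results, er_results, n_original):
    # x-axis: fraction of nodes removed; y-axis: fraction left in the giant component.
    for results, label in [(ba_results, "Barabasi-Albert"), (er_results, "Erdos-Renyi")]:
        x = [r["removed"] / n_original for r in results]
        y = [r["giant_fraction"] for r in results]
        plt.plot(x, y, marker="o", label=label)
    plt.xlabel("Fraction of nodes removed")
    plt.ylabel("Fraction of nodes in giant component")
    plt.legend()
    plt.show()
```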
Basically, if you've got your own data now and you want to assess it for robustness against attacks and failures, it's this piece of code that you need to use, that's it. That's the idea that I want you to try in the second half of today. We'll start to get your datasets out in preparation for next week, and see if we can start to run some experiments on your network. This one is taking a lot more time because we have about 4,000 nodes in this case and about 40,000 edges. But we can see what happens. Now my experiments for the airline network have finished. I can see I did the same thing: I took the airline network, I did random failures, where I randomly removed airports, and then I did targeted attacks, where I removed the airports with the highest degree. Let's see. Sometimes one of these cells might have changed, I don't know why, but it was the wrong format, so I had to change this to code. When it's code, it should be colored and it should have this little "In" marker there. I noticed that with a few people. That's what we get. You remember the diameter of our airline network was 13. That was the maximum: the longest shortest path that you have to make in order to travel between any two nodes in the network. Now I do the attacks, and I do the failures. The failures, remember, is where I just randomly remove airports. The good news is that if airports randomly close down, not a lot really changes in the diameter. Even by the time I've removed 80 percent, I never see the thing increasing above 15. However, if people carefully target the particular airports, those hubs, then already by closing only, what's this, five percent? I increase the diameter to 25. If the major travel hubs did close down, you would start to see a very significant disruption to your journey. This is the fraction of nodes inside the giant component. You see the same thing with the attacks: by 20 percent, well, by 18 percent. If you close down the top 18 percent most connected airports in the world, you have no airline network left. Hardly anything; the whole network is completely torn apart. There's a question about the Iceland volcano. That's a great experiment to try next. That actually affected a lot of Europe, I think. You could find out which airports were affected by that, which ones closed down, and then you could work out what the diameter of the network or the average path length was after the closure. It wouldn't be as disruptive as what I've just done, because I've targeted all the major hubs worldwide, all the major ones in the US and the major ones in Europe; 18 percent of 4,000 is still a reasonable number of airports that you have to close down. But note that here I measured less frequently, let me see: it's 200, so I only measured after every 200 removals. If you decrease that a bit, you get more accurate dynamics right at the beginning; you can just measure at every closure. We can try that. Let me try that. I'll measure at every single removal, and I don't want to do the failure, so let's just do the attack, which is this one. I would have to do a bit more coding if I wanted to remove a specific set of airports, but it would be possible. One way is to just edit the text file and remove all occurrences of those airports, but you could do it in code as well (there's a sketch of that below). That's essentially what I did, except I targeted by degree; those targets were all based on the highest degree. But what she's asking about is removing specific airport names.
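One way the named-airport removal could look in code. The filename matches the edgelist.csv used later in the session; the airport codes in the list are hypothetical placeholders:

```python
import networkx as nx

# Load the routes as a directed graph from the comma-separated edge list.
G = nx.read_edgelist("edgelist.csv", delimiter=",", create_using=nx.DiGraph())

# Hypothetical list of airports to close, by node label.
closed = ["FRA", "CDG", "AMS"]
G.remove_nodes_from(closed)   # silently skips any label that isn't present

# Then re-measure, for example on the undirected giant component as before.
UG = G.to_undirected()
giant = UG.subgraph(max(nx.connected_components(UG), key=len))
print(nx.average_shortest_path_length(giant))
```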
That's possible too. You just adapt the code: you say remove all nodes from this list. If you gave me the list, I could actually do it. We could try it. Well, let me try this. [inaudible]. This is telling me when I stop. That means I'm going to only run attacks for 10 percent. Ten percent of 4,000, so it's about 400 removals. For those 400 removals, I measure what happens after every single removal. Good. Three hundred and twenty-eight removals, actually. It tells me how many iterations. It's going to be slower, because at every step now I'm measuring all the path lengths. Let's see, it might take some time. In the meantime, has everyone downloaded Gephi? Who doesn't have Gephi? Three people, four, five, six. Did you try and fail, or did you just not try? The Internet was not working. The Internet's not working for anyone, or? [inaudible]. Well, why don't we take it in turns: once he's finished, he'll put his hand up, then somebody else can take the bandwidth, and we'll do it like that. It's finished; it was faster than I thought. Let's plot this now. The link is not correct [inaudible]. Which link? For Gephi, or? [inaudible]. Yeah, I had to go find it elsewhere. Gephi, or? Yeah. Let's see. Just try Google, search for Gephi, and then Download. Click "Download". That should be it. Then you'll get an installer for Windows or an installer for Mac. Just double-click it, and it's all done. Those of you who have Gephi, you can now open it. What I'll do is show you, first of all, what I managed to do this morning with the airline network, and then we'll try and recreate the same thing. Can you see that? Can you read the text? I can increase the text a little bit. Maybe you can see a little bit better now. We'll try and recreate this. What I did is I ran the layout, and I ran the modularity class, so it finds the clusters and then colors them differently. The layout and the modularity class work in a similar way. What they're looking for are groups which are highly interconnected but less connected to other groups. What you will find is that airports which share a lot of the connections between each other will be clustered together. Other airports might form another group which is also clustered together; they have a lot of flights between them, but fewer with the other group. What do you spot about this clustering? You've got the pink ones at the bottom, the green ones on the right, the blue ones at the top left, there's a black cluster in the middle, there's the orange cluster over here, and there's something very strange by itself over here. It's basically geographical. The green is Europe, more or less; we've got Charles de Gaulle, Frankfurt, Amsterdam, and it looks like we have some others. What about this one? It's the US. This makes sense, because within the US you've got a lot of internal flights, so there's going to be a high level of connectivity between the US airports; there are going to be more connections on average. If you follow a path from a US airport, on average you would likely hit another US airport rather than an international one. The blue ones: it's got Beijing, Hong Kong, and I think a few others; it seems to be somewhere towards the East. The orange ones: it's Sydney, it's Australia. For some reason, it clustered Australia separately. What else are we missing? The black ones are pretty special, I think. It's got Dubai, Doha, Singapore. These are international hubs that are really connecting everything up, and you see they actually reside in the middle of the network.
The algorithm distinguishes that, so it knows where they are. It's not doing anything geographical here itself; all it's looking for is the relationship between the links. From that relationship, in terms of internal flights, domestic flights, the international hubs, the connecting points and everything else, it automatically finds this structure. Now we'll try and recreate the same thing from scratch. It's actually not too difficult. Go to "File" and "Open". That's the first thing you need to do. The next link to try, this is just the new edge list. I hope it's a new one. Let me just check it. Yeah, it is. So bit.do/el for edge list, and then 2016. Yours downloaded without zero bytes? That's fine. The Gephi file is fine; that one can show zero bytes, that's okay. It's the edgelist.csv that matters. Does everyone have that, with no zero bytes? If you don't, then you can use this one. Let's try and see how we get on. Open Gephi. Go to "Open". The thing you're looking for, which I've now just moved, is called edgelist.csv. Bit.do, I'll write it down. Has everyone got to the edge list and found it? We're ready to open it in Gephi? If you've not already done so, you can click "Open" now. What Gephi will ask you is what format your network is in. It will try to guess some things, but you should tell it a little bit more information. If you look inside the CSV, all it is is that every line is two codes, the start point and the endpoint, and it's directed. You'll find lots of lines, and it's just that; that's all the file contains. The graph type that we have is directed, because we're following the direction: these are routes from A to B. Not A to B and vice versa; it's A to B. The sequence of the nodes is important. That's it, that's the only thing you have to change. We set this to be directed and then click "Okay". [inaudible]. What is what? [inaudible] You selected directed here? Once again: open the edge list, click "Directed". It's not undirected; it has to be directed. Click "Okay". Do you all get this? Yeah, most of you have this. There are generally a few things that you can do with Gephi. This top box over here allows you to change the appearance of the edges and the nodes. That means you can color them differently, you can change their size, you can change their width if it's an edge, and so on. You can make that color or size or width proportional to some variable, so it could be the degree, it could be betweenness, it could be whatever you like. That's what you do with this box in the top left. Then the bottom left here, that's your layout algorithms. They will take the network and try to spread it apart so that you can see things a bit more clearly. This is the context panel, which just tells you about the graph. It says it's got 3,000 nodes and 40,000 edges approximately, and it's a directed graph; that's because we told it it was a directed graph. Now, what you should normally do first is go to this Statistics panel. On the right you see Filters and Statistics; have the Statistics one open. In order for us to be able to color nodes according to degree, or betweenness, or whatever, Gephi first has to calculate those values. That's what the Statistics panel will do. First of all, I'm going to run average degree. You should do the same thing, and we can compare whose computer's faster. I think it's mine. I got it. You got it? I'm impressed.
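As a cross-check on Gephi's numbers, the same statistic can be computed back in NetworkX. A sketch, assuming the same edgelist.csv; note that a reported "average degree" can differ by a factor of two depending on whether in- and out-degrees are counted together:

```python
import networkx as nx

G = nx.read_edgelist("edgelist.csv", delimiter=",", create_using=nx.DiGraph())
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# Average total degree: each directed edge adds one in-degree and one out-degree.
avg_degree = sum(d for _, d in G.degree()) / G.number_of_nodes()
print("average degree:", avg_degree)
```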
The average degree is 24 approximately, so you can double-check that it's the same number we got in NetworkX. If you want to get back the report that it generated, you click on the little eye icon, and it will show you the degree distribution again. The other thing that we want to do is network diameter. Click "Run", and you say it's directed. While it's calculating the diameter, it will do something sensible: it will automatically calculate betweenness and closeness, because those are both things that require the shortest path calculation. As soon as you click "Okay", it will start to run. This will take a bit longer. I got my betweenness centrality and diameter. I get diameter 13, which is what we got in NetworkX as well, an average path length, and so on. The other thing I will calculate now is modularity. Modularity will try to identify how many groups exist and which nodes belong to which group. It's going to look at the interconnectedness between nodes, and it's going to put them in a group if they have a lot of links in common. I run modularity, click "Okay", and it does this. It says it identifies 26 communities. I only visualized about five or six major ones, but it's going to identify very, very small ones as well. It also tells you the reference for the algorithm that it's using to identify the communities, so you can go and read how it works. I could do other things if I wanted, but for now that's all I need: the average degree, the network diameter, and the modularity. Now I can start to make the graph look a bit nicer. The first thing I will do is pick one of the layout algorithms, and based on experience, this ForceAtlas2 is usually quite good. If you click on the information icon, it will tell you something about the quality and the speed at which it does the layout. You can also set some parameters here. There's an option to dissuade hubs; that means the big connected nodes will be pushed apart from the rest of the network so you can see them clearly. Prevent overlap means that nodes never overlap, so it will take the size of the node into account. I'll try this and just click "Run", and then something starts to happen. The layout algorithm is moving these nodes around and doing constraint satisfaction: it's trying to move nodes and see if that's better, move other nodes and see if that's better, and so on. It's now finding all the nodes which lie on the edge and pushing them out from the center. We'll see all the periphery nodes that don't have many connections get pushed out, because they're not very interesting, and what remains in the middle are the more connected nodes. This takes a bit of time. Then from the blackness we should see something start to emerge. Still working, very slowly. If it's very slow, or you're not very happy with the speed at which it's going, you can press "Stop" and try a different one. Another one that I know is good is this Yifan Hu Proportional. It's usually quite fast as well, so you can just check that. It says something about its speed and quality, but I'll try it. It seems to be reversing what I just did. I have a small disadvantage in that my other calculations are still running, so I'm running two heavy things on my machine; otherwise it would have been finished by now. Okay, that stopped. It's done something, but it's not very satisfactory. You see, that one stopped. If I run it again, it will do a little bit more, but then it will stop. I'll try another one.
Now I start to see something; this is incremental work. You apply one layout algorithm and it does something; when you apply another one, it will modify things from that position. The things I've run so far were ForceAtlas2, then Yifan Hu Proportional, then Yifan Hu by itself. Now, this wasn't scientific; I just picked three at random. This is not really a scientific procedure that I'm following; you just find things that might work. For certain networks, certain algorithms will work well and others won't. Now I'm thinking, okay, this is looking a little bit better, I can start to see some clusters. Let's try ForceAtlas2 again. Now I've separated it out; the ForceAtlas algorithm is actually working a little bit better. To recreate exactly what I created before is actually quite hard, as you can see. I think I'm okay with this for the time being, so I'm going to make it look a little bit cleaner. What I might also want to do, for some reason, is take out some of the nodes. If there's a giant component and some nodes which are disconnected, which can happen in your Facebook network, you might want to keep only the giant component, or only nodes which have a degree above X, or a betweenness centrality above X, or something else. That's what the filters do. The filters are here, and you have different sets of filters, so you can filter based on attributes. In your dataset, if the nodes have attributes like age, like sex, like whatever, you can then filter based on that particular attribute. You say I only want the males, or I only want people who are 20 or above, and so on. In this case, we have no attributes on our nodes; it's just the label. But what I will do is take the degree range and say I only want to visualize the airports which have above 50 flights, for example. In order to do that, I take this filter, I drag it down here, and I drop it. Then this gives me the degree range settings and I can filter on the minimum degree. It gives you a useful graph at the bottom, so I see which range of the distribution I'm filtering off. I want a bit more than that. Somewhere about there; you see this little thing at the bottom. This is all the nodes from degree 1 to degree 519. I'm going to keep this part of the distribution and ignore all of the other nodes below it. To apply that, I just click "Filter", and then I remove a lot of the network. I can rerun my layout algorithm now, and it behaves much better because it has fewer of the small nodes to deal with. Now what I want to do is color the nodes based on their class, their cluster. In order to do that, I go back to the appearance panel at the top left. This is size, this is color, so I pick the color. I choose an attribute, and in this case I will choose the modularity class; that is the group the node belongs to. You'll see it's already telling me which groups I have. I have a pink group, which is classed as group 2; that has 23 percent of the nodes. I have a green class, which is class 1, which is 17 percent; blue, which is 14 percent; black, which is 12 percent; and then a few smaller ones at seven percent, five percent, and five percent. What those clusters are, I don't know; this is what the algorithm has come up with. You might run a different clustering algorithm and it might categorize them differently. But generally what it's looking for is how many edges the nodes share in common with the other group members.
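Gephi's modularity report cites the algorithm it uses. If you want a comparable result in Python, recent NetworkX versions ship Louvain community detection; this is a rough analogue for illustration, not Gephi's exact implementation:

```python
import networkx as nx

# Read the edge list as an undirected graph; community detection is
# usually run on the undirected structure.
UG = nx.read_edgelist("edgelist.csv", delimiter=",")

# Louvain community detection (NetworkX >= 2.8); returns a list of node sets.
communities = nx.community.louvain_communities(UG, seed=42)
print(len(communities), "communities found")
for c in sorted(communities, key=len, reverse=True)[:5]:
    print(f"{len(c) / UG.number_of_nodes():.0%} of nodes")
```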
If I apply this now, you'll see that the colors actually reflect the layout, which is what we saw before. I have this one subgroup, the black subgroup, then I've got a blue one, a pink one, and the green one. I have a few small subgroups of red; this one actually belongs there; the orange ones; and this brown group over here. What I can now do is change the size of the nodes. I can change them according to their degree, which will help me see which are the important nodes in those groups. In order to do that, I click on this thing here. You can also hover over it; it will tell you: color, size, label color, label size. You could change the label size according to the degree, so ones which are big will have a bigger label, but that just gets very messy. I click on attribute again, I choose an attribute, and in this case I want to scale the size of the node according to its degree. Now, the min size and the max size are the number of pixels that it will take up. If I make this 600, it means the node with the biggest degree will be 600, and the one with the smallest will be one. You pick something which fits with what looks good; I guess 150 will be okay. It's a bit big. Now, this is where the layout algorithm can start to be beneficial again. I have this Prevent Overlap ticked; I rerun it, and it will separate them out. I still have a fairly dense network. I can try other layout algorithms, and I can change some of the parameters. You see this one, which is called Scaling: with a larger number, you're going to get a sparser graph. If I increase it to four and then click "Run" again, I get a slightly sparser graph. Let's make this even bigger, 10. I clicked "Stronger Gravity" by mistake; I don't want stronger gravity. There we go. Are you happy with that? Beautiful. Looks gorgeous. Looks okay. I can view the labels if I want by clicking this one down here. This is my edge thickness. I can change my font size by doing this, and so on. Then we can make a really nice picture at the end. Remember you have these three tabs in Gephi. One is the Data Laboratory, which will show you all the data. These are the nodes; it will tell you the indegree, the outdegree, the total degree, the closeness centrality, the betweenness centrality, the modularity class, and so on. I can sort this by different columns: I can sort it by indegree, I can sort it by label, whatever I want. I can also find a particular node. I have Frankfurt here. If I right-click on this, I can say "Select on Overview", and that will center the view on the node in the Overview. I go back here, and there I am with Frankfurt. Then, if I'm very careful with my mouse, I can take this: I click the hand to grab it, and I can drag it out, and say okay. Now I can zoom back in. The edge colors are indicating, I think, the destination node, what it's connected to. This is how Frankfurt is connected to the other clusters around the world: how it's connected to Europe, how it's connected to the US, and so on. You can see it's fairly distributed. From this you would be able to see whether it's a municipal airport mainly serving local flights, or an international airport. Then the final thing that you can do is click on "Preview". Preview is the thing that will generate the really nicely rendered graphs. I usually like the one with the black background. I click refresh and nothing happens. Try again. If you don't see the preview, go to Window and click "Preview". That's what I have.
The labels at the moment are proportional to the size of the node, so I'll change that. I pick bigger fonts, and then I have the following. Then I can save this as a PDF or as a PNG or whatever I like, and then post it on my Facebook, and everyone's impressed.