This series of tutorials demonstrates how to use the Proteome Comparison tool, one of the very first analysis tools developed in PATRIC. The tool uses BLAST, which stands for Basic Local Alignment Search Tool. With it, you can compare up to nine genomes to a single reference genome. To find the tool, go up to Services here, click the down arrow, and click on Proteome Comparison.

The Proteome Comparison tool uses bidirectional best BLAST hits. Say you have two genomes and a protein from each one. When you BLAST each protein against the other genome and each one's best hit is the other protein, that's what's called a bidirectional best BLAST hit. It's often used as evidence of orthology, and unidirectional BLAST hits are sometimes interpreted as paralogs.

The way the tool works is that you pick one reference genome, along with other parameters that I'll show you, and compare it with up to nine comparison genomes. This tells you which proteins in the reference genome the comparison genomes have best hits against. If a comparison genome has a unique set of proteins that are not in the reference genome, you won't see those. To see that kind of comparison, you would need to take that comparison genome, use it as the reference, and submit a separate job. I just want you to be totally aware that everything is in the context of the reference.

This particular tutorial is about setting the parameters for the job. In this parameters box, let's click here and see what the advanced parameters are. We have minimum coverage, minimum identity, and BLAST E-value. What does all this mean? Minimum coverage refers to the query coverage, and here it's set at 30 percent. The query coverage is a number that describes how much of the query sequence is covered by the target sequence.
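As a rough sketch of the query coverage definition just given, the percentage can be computed from the aligned span on the query and the query's full length. The function name and the 1-based, inclusive coordinates are assumptions made for illustration; this is not the tool's own code.

```python
def query_coverage(q_start: int, q_end: int, q_len: int) -> float:
    """Percent of the query protein covered by one alignment.

    Assumes 1-based, inclusive alignment coordinates on the query
    (an illustrative convention, not taken from the tool itself).
    """
    if q_len <= 0:
        raise ValueError("query length must be positive")
    aligned = abs(q_end - q_start) + 1
    return 100.0 * aligned / q_len


# A 200-residue reference protein aligned over residues 1-200 is fully covered.
full = query_coverage(1, 200, 200)

# The same protein aligned only over residues 51-110 (60 residues)
# sits right at a 30 percent minimum coverage threshold.
partial = query_coverage(51, 110, 200)
```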
If the target sequence in the comparison genome spans the entire length, so you've got your protein from the reference genome, you BLAST it against the protein from the comparison genome, and they are exactly the same length and align end to end, that's going to be 100 percent. The higher this value is, the better it is. You can see that we have it set here at 30 percent, which is low coverage. If I want to be more selective, I might take it up to something like 70 percent. However, if you're just fishing, or if you're looking for paralogs as well as orthologs, you might want to keep it at the default value of 30.

The next thing you can see here is the minimum percent identity. The percent identity is a number that describes how similar the query sequence is to the target sequence, meaning how many characters in each sequence are identical. Here the characters are amino acids. Remember, we're not talking about genes here, we're talking about proteins. If you have 100 amino acids in the protein from the reference genome and 100 amino acids in the protein from the comparison genome and they match identically, that's 100 percent identity. You can play with that number. I think we should be a bit more selective and dial it up to 70. But it depends on what you're fishing for. You might not want to be too selective.

Then the E-value. Anybody who's used BLAST knows that it gives you an E-value, and the lower the number is, the better it's supposed to be. E-value is actually the expected value: a number that describes how many matches you would expect by chance alone in whatever database you're BLASTing against. So lower is better for the E-value, while higher is better for coverage and identity. Let's see, I might want to take this to, well, let's do an exponent of minus 70. It depends on what I'm looking for. If you want to get rid of a lot of noise and drill in very deeply, go for that. If you're paralog hunting, relax it a bit.

Next, you have to specify an output folder.
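To make the three thresholds concrete, here is a minimal sketch of filtering a handful of hits. The `Hit` record, its field names, the `passes` helper, and every example value are invented for illustration, and the tool's internal filtering may differ in detail. The strict settings use the 70 percent coverage, 70 percent identity, and 1e-70 E-value mentioned in this tutorial; apart from the 30 percent coverage, the loose settings are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str        # protein ID in the reference genome (made up)
    target: str       # protein ID in the comparison genome (made up)
    coverage: float   # query coverage, in percent
    identity: float   # percent identity
    evalue: float     # BLAST E-value


def passes(hit: Hit, *, min_coverage: float, min_identity: float,
           max_evalue: float) -> bool:
    """Keep a hit only if it clears all three thresholds.

    Coverage and identity are minimums (higher is better);
    the E-value is a maximum (lower is better).
    """
    return (hit.coverage >= min_coverage
            and hit.identity >= min_identity
            and hit.evalue <= max_evalue)


# Invented hits, not real PATRIC output.
hits = [
    Hit("ref_1", "cmp_1", coverage=100.0, identity=98.0, evalue=1e-120),
    Hit("ref_2", "cmp_7", coverage=45.0, identity=35.0, evalue=1e-12),
    Hit("ref_3", "cmp_9", coverage=85.0, identity=72.0, evalue=1e-40),
]

loose = [h for h in hits
         if passes(h, min_coverage=30, min_identity=30, max_evalue=1e-5)]
strict = [h for h in hits
          if passes(h, min_coverage=70, min_identity=70, max_evalue=1e-70)]
```

With these made-up numbers, `ref_3` clears the stricter coverage and identity cutoffs but is dropped by the E-value cutoff, which shows that each threshold filters independently.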
If you have folders in your PATRIC workspace and you know what they are, you can just start typing, like if I had a Brucella analysis folder. Or you can click on this down arrow and it'll show the folders you have created recently. Or you can click on this folder here, which opens a pop-up window where you can create a new folder. I'll call it proteome comparison demonstration, create that folder, and click OK. Then I have to click it again to see the most recently created folder.

The last thing we have to do in the parameters is give the job a name. I like to choose names that are pretty specific because I use this tool a lot. It's one of my favorite tools in PATRIC. I give it a name that describes my reference genome, and I usually include a date so I know when I ran it. PATRIC tells me when I ran it, but the date helps me filter down. Being able to adjust these advanced parameters is something we've added recently, and I like to be able to see what I selected when I ran the job. So I put in the minimum coverage first, then the minimum identity, and then the E-value. This is the way I do it. You don't have to do it this way; you can name it however you want. But as you use more and more of PATRIC, I find that naming your jobs this way is a good thing to do, especially if your workspace is somewhat disorganized, which mine is. In the next video, we'll talk about choosing the reference genome.

For your first assignment, I want you to get ready to submit a job, but not actually submit it yet. First, like I've told you for every other section of this course, you need to create a folder in your workspace where you're going to store these jobs. You're going to create a number of these jobs. Don't worry, I'm not going to be stingy at the end. I'm going to give you lots of jobs to create, just like you always want. At the interface for the tool, I want you to open the advanced parameters.
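The naming habit described above (reference genome, date, then minimum coverage, minimum identity, and E-value) could be sketched as a small helper. The exact format, the `job_name` function, and the example date are assumptions for illustration only; any consistent scheme works.

```python
from datetime import date


def job_name(reference: str, run_date: date, min_coverage: int,
             min_identity: int, evalue: str) -> str:
    """Build a descriptive job name from the reference and parameters.

    The layout (underscores, 'cov'/'id'/'e' prefixes) is a made-up
    convention, not something the tool requires.
    """
    parts = [
        reference.replace(" ", "_"),
        run_date.isoformat(),
        f"cov{min_coverage}",
        f"id{min_identity}",
        f"e{evalue}",
    ]
    return "_".join(parts)


# Placeholder reference name and date, chosen only for the example.
name = job_name("Brucella reference", date(2024, 1, 15), 70, 70, "1e-70")
```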
We're going to be submitting different jobs based on the default parameters, but we're also going to be messing around with them a bit. Some of the questions we're going to be answering in the course of the assignments are: what happens when you increase the query coverage, or the sequence identity, which would be the minimum identity, or adjust the E-value? We've got the defaults and we'll be messing with those. What do you expect will happen? Another question I'd like you to be thinking about is this: as you ratchet up the stringency on those things, which of those parameters, query coverage, sequence identity, or E-value, is most likely to eliminate genes from the analysis? Other than creating the folder, this is just something for you to think about. So start thinking.
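While you think about those questions, it may help to have the bidirectional best BLAST hit idea from the start of this tutorial in front of you as code. The sketch below is a toy illustration, not the tool's implementation: the hit tuples are invented, the scores simply stand in for BLAST scores, and `best_hits` and `bidirectional_best_hits` are hypothetical helper names.

```python
def best_hits(hits):
    """Map each query to its highest-scoring target.

    `hits` is an iterable of (query, target, score) tuples.
    """
    best = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(ref_vs_cmp, cmp_vs_ref):
    """Return reference -> comparison pairs that are each other's best hit."""
    forward = best_hits(ref_vs_cmp)   # reference proteins as queries
    reverse = best_hits(cmp_vs_ref)   # comparison proteins as queries
    return {ref: cmp for ref, cmp in forward.items()
            if reverse.get(cmp) == ref}


# Invented hits: (query, target, score).
ref_vs_cmp = [
    ("refA", "cmp1", 500), ("refA", "cmp2", 120),
    ("refB", "cmp2", 300),
    ("refC", "cmp3", 80),
]
cmp_vs_ref = [
    ("cmp1", "refA", 495),
    ("cmp2", "refB", 310),
    ("cmp3", "refD", 200), ("cmp3", "refC", 75),
]

pairs = bidirectional_best_hits(ref_vs_cmp, cmp_vs_ref)
```

In this toy data, `refC` has `cmp3` as its best hit, but `cmp3` prefers `refD`, so that hit is only unidirectional and does not appear in `pairs`.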