Chapter 7 Phylogenetic markers

7.1 SISRS

In your main folder (cd back there if you need to) download the software we will use to analyze your data.

  • Type git clone https://github.com/SchwartzLabURI/SISRS.git

Because you only have to do this once don’t worry about details beyond the idea that you are downloading a program.

Now we need to work with scripts (just like before) to run each step of the analysis.

  • Move into the SISRS folder
  • Move into the scripts folder

Slurm is the name of the software used to keep track of all the jobs you submit.

  • Use nano to open slurm_submissions/1_submit.slurm.
  • Change all the lines marked CHANGE to be relevant to your work.
  • Exit nano and save the script.
  • Submit your first job as sbatch slurm_submissions/1_submit.slurm.
  • You should get an email when your job is done, but you can also check the status of the job using squeue.
  • Be sure the job completed correctly by checking that the folder you specified as your analysis output folder was created AND that the TaxonList.txt file within that folder has the list of your study species.

Continue to edit scripts and submit jobs, changing the information as with the first script. For the ntasks-per-node line use 36 processors any time that information is requested. Wait for each script to report that it has completed before continuing to the next step and do not continue if your recieve any error messages or if the noted files are not present.

  • slurm_submissions/2_submit.slurm | The output of this script can be found in each species folder in the Reads/TrimReads folder in your analysis folder. You should see the sames files as in your species folders in your data folder.

  • slurm_submissions/3_submit.slurm | The output of this script is a set of files in the Reads/SubsetReads folder in your analysis folder.

  • slurm_submissions/4_submit.slurm | The output of this script is a set of files in the Ray_Composite_Genome folder in your analysis folder. There must be a Contigs.fasta file in this folder.

  • slurm_submissions/5_submit.slurm | This script sets up the rest of the run. You can check that it completed correctly by looking for the phrase “==== Site list created:” in the output from the slurm run.

  • slurm_submissions/6_submit_array.slurm | This script produces information on where each genome read aligns to the reference. Each species folder in the SISRS_Run folder in your analysis folder should have a file that starts with the species name and ends in bam.

  • slurm_submissions/6b_submit_array.slurm | This script produces information on the sequence for each species. Each species folder in the SISRS_Run folder in your analysis folder should have a file that starts with the species name and ends in fa.

  • slurm_submissions/6c_submit_array.slurm | This script redoes the analysis of where each genome read aligns to the reference. Each species folder in the SISRS_Run folder in your analysis folder should have a file that starts with the species name and ends in bam. To determine whether this script ran and you aren’t looking at the results from step 6 check the date and time for the file.

  • slurm_submissions/6d_submit_array.slurm | This script produces information on the sequence for each species by site in the genome. Each species folder in the SISRS_Run folder in your analysis folder should have a file that starts with the species name and ends in LocList.

  • slurm_submissions/7_submit.slurm | For step 7 list several numbers for the missing flag in the range of approximately half of your species. This script produces an alignment of sites with shared ancestry that you can use to build your tree. You should see in your SISRS_Run folder in you analysis folder multiple files starting with alignment and ending with relaxed. To determine which file to use for building trees use head to look at the first line of each file that starts with alignment_pi_m and ends with _nogap.phylip-relaxed. You should use the file with the smallest m value (least amount of missing data) that also has a relatively large amount of total data (preferably >10000 sites).

7.1.1 What SISRS does

The software we have just used automates the process of identifying orthologous genetic markers. This means that we get regions of each genome from each species that are similar based on ancestry. For example, if one of these regions was a particular gene, we’d be able to say we have found the same gene for each species, and species that have a very similar sequence for this gene are closely related, while species that are different are much less closely related. We need lots of different regions of the genome because different parts of the genome can have different information. For instance, one gene may be nearly identical for a group of closely related species but different enough to help us with relationships among more distantly related species. I’ll refer to these useful pieces of the genome as loci from here on out, as a generic term for all the types of DNA that make up the genome. See Wikipedia for these different genomic elements.

We step through our analysis in the following way:

  1. Set up our analysis. This step makes all the folders we need for the analysis so we keep our data and results organized.

  2. Trim reads. Remember that when you get your data you don’t get the whole genome in one stretch. You actually get lots of little pieces of the genome. When the data comes off the sequencer the sequencer has the potential to make mistakes, especially at the end of the run and the reagents run out. We want to check each read for sloppy sequencing and get rid of any obvious errors.

  3. Get parts of the genome that are shared among species. We find these pieces by doing a genome assembly from all of our samples. A normal genome assembly would try to assembly one genome from the sequencing of one sample. Our approach will be to see what assembles from a small sample of each species combined. We expect that the only regions that assemble will be those that have information from multiple samples and are similar.

  4. Get each locus sequence for each species. Our assembly produced a bunch of loci that should be found in each of our species. But for each locus we only have one “average” sequence. Now we need to get the sequence for each locus for each species. To do this we need to go back to our original sequence data for each species and compare it to the sequence we have for each locus. By doing this we will produce the sequence for each locus for each species. As we go through this process we want to make sure we have good information, so if there is conflicting information at a site (for example if it is heterozygous) we won’t use it.

  5. Filter the dataset. Our whole collection of loci is a lot of data. At some sites all species are the same. Those sites don’t tell us about evolutionary history so we will remove them. We also don’t have information for all sites for all species, so when we lack a lot of information we’ll remove those as well.

  6. Put the data together. At this point you have an alignment of sites. Each row is a species. Each column is a particular site in the genome. You can see your output alignments in the SISRS_Run folder of your output folder. They will be labeled alignment_pi_m5 (or similar). You can view the first part of each file using head followed by the file name.