Documentation¶

Command Line Arguments¶

Genome Size Estimate (-gs, –genomeSize)
- Specify the approximate genome size estimate for group in basepairs (e.g. 3500000000 for primates)
Specifying starting data (-d, –data)
- Path to directory where reads are already split into folders by taxon (these can be linked files)
- Note 1: No spaces or special characters are allowed when naming taxon directories
- Note 2: If using -d option with pre-trimmed reads, you should also use the -trm flag, which tells SISRS to skip the trimming step (See SISRS_Small.zip for data structure)

Taxon ID File (-id)

Specify the taxon names with the TaxonIDs file. One taxon per line as follows:

> cat TaxonIDs
AotNan
CalJac
ColAng
GorGor
HomSap
HylMol
MacMul
MacNem
PanPan
PanTro
PapAnu
PapCyn

Output Directory (-dir,–directory)
- Path to desired SISRS output directory. DEFAULT: The directory preceding where the scripts are located
Trimmed Data (-trm,–trimmed)
- Specify if the data has already been trimmed
- DEFAULT: untrimmed
Processors (-p,–processors)
- Specify the number of available processors
- DEFAULT: 1
- Note: If running this on a multi-core machine, specify the number of processors per node here
Homozygosity Threshold (-thresh,–threshold)
- Specify the minimum site homozygosity for SISRS sites, must be between 0 and 1
- DEFAULT: 1 (SISRS sites have support for only a single base within taxa)
Minimum Coverage Threshold (-mr,–minread)
- Specify the minimum read coverage to call a SISRS site
- DEFAULT: 3 (Three reads required to call a site)
Missing Taxa Allowed (-ms,–missing)
- When creating the final SISRS alignment, specify the maximum number of missing taxa allowed per column
- You can give it a single number or give it a range of numbers, such as 0-6, and it will do a final alignment for 0, 1, 2, 3, 4, 5, and 6 missing taxa allowed per column
- It will also separate all the data out into folders labeled missing_(#)
- DEFAULT: 0 (Coverage for all taxa for all sites)
Existing Run This feature will auto detect if a previous SISRS run has been done based on the file structure and if specific files are present
1. Adding a Taxon (-aT,–addTaxon): Adding a new taxon to the previous SISRS run
2. Adding Additional Sequences (-aD,–addData): Adding a new file to existing taxa
Note: This relies on the -d/–data folder to specify were the new data is located

About the Scripts¶

This section will give a brief over view of what is needed to run each script. This will include the expected input (including flags) and it will also discuss the expected output.

Note: SISRS runs on an organized system in which you need to organize your data by taxon. Prior to running our software we ask that you make separate taxon folders for all taxa and move the data to the correct folder.

For examples please see the tutorial page.

sisrs_01_folder_setup¶

This script will start the setup of the folder structure needed to run SISRS
Required Arguments: -d/–date or -id
Optional Arguments: -trm/–trimmed

sisrs_02_read_trimmer¶

Running this script will:

Trim all reads in RawReads

Output trimmed reads to the TrimReads directory

Run FastQC on both trimmed and untrimmed datasets

Optional Arguments: -p,–processors

Other Trimming Options:

Trimming can certainly be done using your preferred methods, but this script is a wrapper around the BBDuk program (part of the BBMap Suite)

If your reads are already trimmed, ensure they are in the TrimReads directory and move on to Step 3 (Subsetting)

Multiple read files per taxon is fine, as are mixes of single-end and paired-end reads

Using our Wrapper:

Put all raw reads in (.fastq.gz format) in the appropriate taxon folder in RawReads

Raw data files are not modified or used after trimming, so avoid duplication if possible by making links to the fastq.gz files as opposed to copying or moving raw data

Note: Check FastQC output for trimmed data. If using ‘random’ DNA-seq data, be especially wary of high sequence duplication levels which could indicate non-random sequencing

Data will eventually be pooled, so best to remove low-quality data early to prevent it from being incorporated into the genome assembly

sisrs_03_read_subsetter¶

Running this script will:

SISRS identifies orthologs through a composite genome assembly step, where reads from all taxa are combined and assembled together

The subsampling step ensures that no one species dominates the assembly, and also limits the assembly of loci that are present in few species

Based on the requested depth [10*Genome Size Estimate/Number of Species], if a species has fewer total reads all reads will be sampled and the user will be notified with a warning message

For species that do have sufficient total coverage, an attempt is made to sample evenly from each read set

If certain read sets lack the required coverage, they are also sampled completely and the deficit is covered by a deeper read set

Expectations of data:

Single-end should all end in “_Trim.fastq.gz”

Paired-end reads should end in “_Trim_1.fastq.gz/_Trim_2.fastq.gz”

Reads must be in the appropriate TrimReads subfolder by Taxon

Required Arguments: -gs,–genomeSize

sisrs_04_ray_composite¶

This script provides:

A genome assembly script that wraps around Ray, which is fast but requires MPI even on one node. We plan to offer more assembly options in later releases. mpirun must be in your path

Optional Arguments: -p,–processors

sisrs_05_setup_sisrs¶

This script will:

Place all the files where they need to be for a SISRS run, including:
- Renaming scaffolds with SISRS_ at the front for downstream data handling
- Linking all trim reads to the SISRS analysis folders
- Indexing and processing the composite genome
- Creating SISRS runs scripts for each species

Optional Arguments: -p/–processors, -mr/–minread, and -thrs/–threshold

sisrs_06_run_sisrs¶

This script will:

sisrs_05_setup_sisrs generates a bash script in each taxon folder

This script will run them serially on one machine (one after another)

If multiple nodes are available, you likely want to skip this step as these scripts are independent and can be run in parallel (e.g. on an HPC machine or cluster as separate jobs)

Be sure to specify processors accordingly when running sisrs_05_setup_sisrs

Individual log files are created in each taxon folder

No Arguments needed

sisrs_07_output_sisrs¶