Chapter 4 Methods

Once you have settled on a group that you are interested in studying you need to find out if there are data to help you answer your research questions. We’ll be using genome sequence data to determine the species relationships in your group. Read the following to learn more about sequencing and molecular phylogenetics:

To acquire molecular data to compare the species in your group you either need to do a lot of sequencing, or find sequencing someone else has done. Becuase sequencing takes time and money we’ll try using data that is already available and just hasn’t been analyzed to answer the questions you are interested in.

  1. Go to the European Nucleotide Archive add background
  2. Select: Data type: Raw Reads
  3. Query: pick taxonomy (include subordinate)
  4. Add to query box: AND instrument_platform="ILLUMINA" AND (library_strategy="RNA-Seq" OR library_strategy="WGS")
  5. Fields: base_count, fastq_bytes, fastq_ftp, read_count, scientific_name, tax_id
  6. Click Download report tsv

Now you have a list of available data and relevant information to get it. It should look the same as what you saw in your browser. You can open this file in Excel or Google Sheets.

It might look like you have a lot of data, but it’s actually hard to tell from this page. You might have a lot of data for one or a few species and not enough to analyze your whole group. To figure out which data you will need sort this list of samples by scientific_name. For each species (scientific_name column) you will need about 10 times the amount of bases as the size of the genome (i.e. you should have on average about 10 sequence fragments covering each base in the genome). To figure out how many bases you need use Google and Google Scholar to find the size of the genome of one of your species. Now that you know how many bases you need select the samples you want (for a total of 10x coverage) and save those in a separate sheet.

The video on sequencing explains how the two ends of each genome fragment are sequenced. That means that most of your samples have two files to download. These are both listed in the fastq_ftp column. Highlight the fastq_ftp column, right-click, and select insert. Use Text-to-column in the Data menu to separate the column in two based on a semicolon.