Chapter 4 Getting Data

Once you have settled on a group that you are interested in studying, you need to find out whether there are data to help you answer your research questions. We’ll be using genome sequence data to determine the species relationships in your group. Read the following to learn more about sequencing and molecular phylogenetics:

To acquire molecular data to compare the species in your group, you either need to do a lot of sequencing or find sequencing someone else has done. Because sequencing takes time and money, we’ll try using data that are already available and just haven’t been analyzed to answer the questions you are interested in.

  1. Go to the European Nucleotide Archive (Learn more about ENA here)
  2. Select: Data type: Raw Reads (click next)
  3. Query: Select Taxonomy, Select filter: NCBI Taxonomy, Type a taxon name in the search box and select the correct group from the dropdown menu, select ‘include subordinate’
  4. Add to query box: AND instrument_platform="ILLUMINA" AND (library_strategy="RNA-Seq" OR library_strategy="WGS") (click next twice)
  5. Manually select fields: scientific_name, base_count, fastq_ftp (click the right arrow after each to put it in the Selected Fields box) (click search)
  6. Click Download report tsv
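
The same search can also be expressed as a single URL, which makes it easier to repeat later. The sketch below only builds that URL; it does not download anything. It assumes the ENA Portal API search endpoint and its `tax_tree(...)` query function, which takes an NCBI taxon ID rather than a name. The helper name `build_ena_url` is made up for this example, and 9443 is the NCBI taxon ID for Primates.

```python
from urllib.parse import urlencode

# Assumed endpoint of the ENA Portal API (not given in the steps above).
ENA_SEARCH = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_ena_url(taxon_id):
    """Build a search URL mirroring steps 2-6 for one NCBI taxon ID."""
    query = (
        f"tax_tree({taxon_id}) "  # the taxon plus its subordinate taxa
        'AND instrument_platform="ILLUMINA" '
        'AND (library_strategy="RNA-Seq" OR library_strategy="WGS")'
    )
    params = {
        "result": "read_run",  # raw reads
        "query": query,
        "fields": "scientific_name,base_count,fastq_ftp",
        "format": "tsv",
    }
    return ENA_SEARCH + "?" + urlencode(params)


# Example: Primates (NCBI taxon ID 9443).
url = build_ena_url(9443)
print(url)
```

Pasting the printed URL into a browser should return a report like the one from step 6, provided the assumptions above hold.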

Now you have a list of available data and relevant information to get it. It should look the same as what you saw in your browser. You can open this file in Excel or Google Sheets.

It might look like you have a lot of data, but it’s actually hard to tell from this list. You might have a lot of data for one or a few species and not enough to analyze your whole group. To figure out which data you will need, sort the list of samples by scientific_name. For each species (scientific_name column) you will need about 10 times as many bases as there are in the genome (i.e. on average, about 10 sequence fragments should cover each base in the genome). To figure out how many bases you need, use Google and Google Scholar to find the genome size of one of your species. For example, a genome of 3 billion bases calls for about 30 billion bases of sequence. Now that you know how many bases you need, select the samples you want for each species (for a total of 10x coverage) and save those in a separate sheet.
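
If you would rather not do the sorting and adding by hand, the selection can be sketched in Python. This is one possible approach, not a required step: for each species it picks the largest runs first until the total reaches 10 times the genome size. The function name `select_runs`, the tiny made-up table, and the genome size of 1000 bases are all placeholders for illustration; in practice you would read the downloaded report with `csv.DictReader(open(path), delimiter="\t")`.

```python
import csv
import io
from collections import defaultdict

TARGET_COVERAGE = 10  # the 10x rule of thumb


def select_runs(rows, genome_size, coverage=TARGET_COVERAGE):
    """Per species, pick the largest runs until coverage * genome_size is reached.

    Returns {species: (picked_rows, total_bases, has_enough)}.
    """
    needed = coverage * genome_size
    by_species = defaultdict(list)
    for row in rows:
        if row["base_count"]:  # skip runs with no base count reported
            by_species[row["scientific_name"]].append(row)

    selected = {}
    for species, runs in by_species.items():
        runs.sort(key=lambda r: int(r["base_count"]), reverse=True)
        picked, total = [], 0
        for run in runs:
            if total >= needed:
                break
            picked.append(run)
            total += int(run["base_count"])
        selected[species] = (picked, total, total >= needed)
    return selected


# Made-up report with the same columns as the ENA download.
demo_tsv = (
    "scientific_name\tbase_count\tfastq_ftp\n"
    "Species A\t6000\ta1_1.fastq.gz;a1_2.fastq.gz\n"
    "Species A\t5000\ta2_1.fastq.gz;a2_2.fastq.gz\n"
    "Species A\t1000\ta3_1.fastq.gz;a3_2.fastq.gz\n"
    "Species B\t3000\tb1_1.fastq.gz;b1_2.fastq.gz\n"
)
rows = list(csv.DictReader(io.StringIO(demo_tsv), delimiter="\t"))
result = select_runs(rows, genome_size=1000)  # placeholder genome size
for species, (picked, total, enough) in result.items():
    print(species, len(picked), total, "enough" if enough else "NOT enough")
```

Species flagged as not having enough data are the ones to drop or to look for again in the ENA.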

The video on sequencing explains how the two ends of each genome fragment are sequenced. That means that most of your samples have two files to download. Both are listed in the fastq_ftp column, separated by a semicolon. Highlight the fastq_ftp column, right-click, and select Insert. Then use Text to Columns in the Data menu to split the column in two at the semicolon.
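
The same split can be done in Python instead of a spreadsheet. This is a minimal sketch; the function name `split_fastq_ftp` and the file names are invented for the example. It returns a list rather than exactly two values because not every run is guaranteed to list two files.

```python
def split_fastq_ftp(fastq_ftp):
    """Split the fastq_ftp field into a list of file addresses.

    Paired-end runs normally give two entries; an empty field gives [].
    """
    return [part for part in fastq_ftp.split(";") if part]


# Invented file names, in the same style as the fastq_ftp column.
files = split_fastq_ftp("example_1.fastq.gz;example_2.fastq.gz")
print(files)
```

Checking `len(files)` for each row is a quick way to spot samples that do not have the expected pair of files.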

Once you are satisfied with the species you can work with, use Google to find a species that is related to, but not included in, your group. For example, if you are studying Primates you might select a rabbit as a related species. Go back to the ENA, find data for this species, and add the appropriate information to your spreadsheet.

4.1 Example data and trees

Before we get into working with large data, let’s take a look at a smaller, more manageable dataset.

  • Download the program Geneious. For this class you can use the free trial or the university site license (see https://statfiles.uri.edu/Geneious/License%20Server.png).
  • Open Geneious
  • Select the Alignments folder
  • Select the CoxIII CDS alignment (you’ll see it below)
  • Right click and select Tree
  • Experiment with different tree building options

Look at the tree produced by this method.

  • What species are the most distantly related? Is this tree accurate according to what you know?
  • Select the ancestor of the group that should be most distantly related to all other groups. Click root. How does this change your understanding of the tree and relationships?
  • Select a node in the tree and click swap siblings. Does this change the tree?
  • Go back to the alignment and find sites
    • that have no information
    • with information that matches the tree
    • with information that contradicts the tree - how would you interpret this?
  • Repeat your analysis. This time click Resample Tree. How did this change the tree? Note the numbers on each node, which are a measure of the confidence that the descendants of that node share this common ancestor.
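
Two ideas from the list above can be sketched in a few lines of Python. This is an illustration on a made-up four-sequence alignment, not a description of how Geneious works internally. `classify_site` labels a column as constant (no information), variable, or informative, where informative uses the common parsimony definition: at least two different bases each appear in at least two sequences. Whether an informative site matches or contradicts a tree still depends on which species share the bases. `bootstrap_alignment` shows the general idea behind resampling: build a new alignment of the same length by drawing columns at random with replacement, then (in a real analysis) rebuild the tree from each resampled alignment and count how often each group reappears.

```python
import random


def alignment_columns(alignment):
    """Turn a list of equal-length sequences into a list of column strings."""
    return ["".join(col) for col in zip(*alignment)]


def classify_site(column):
    """Return 'constant', 'variable', or 'informative' for one column."""
    bases = [b for b in column.upper() if b in "ACGT"]  # ignore gaps and ambiguity codes
    counts = {b: bases.count(b) for b in set(bases)}
    if len(counts) <= 1:
        return "constant"
    shared = [b for b, n in counts.items() if n >= 2]
    return "informative" if len(shared) >= 2 else "variable"


def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement, keeping the alignment length."""
    cols = alignment_columns(alignment)
    sampled = [rng.choice(cols) for _ in cols]
    return ["".join(row) for row in zip(*sampled)]


# Made-up alignment: four sequences, four sites.
alignment = ["AAGA", "AAGC", "AGAT", "AGAG"]
site_types = [classify_site(c) for c in alignment_columns(alignment)]
print(site_types)

resampled = bootstrap_alignment(alignment, random.Random(0))
print(resampled)
```

The numbers shown on the nodes after resampling are typically reported as the percentage of resampled trees that contain that group.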