Introduction to Bioinformatics: Tree Practical

Authors: Jeremy Levy, Natasha Glover, Leonardo Martins

03/09/18

In this tutorial, we will construct phylogenetic trees. A phylogenetic tree is a representation of genealogical relationships among species, genes, populations, or even individuals. Such trees are used in a wide range of fields, from astrobiology to forensics to the evolution of languages. We are interested in building them from sequence data in order to learn more about the relationships between a particular set of species.

In Part 1, we will retrieve the biological data required to build our trees. In Part 2, we will prepare the data and then build our trees. In Part 3, we will evaluate our trees and interpret them.

Part 1: Retrieving our data

In order to construct phylogenetic trees, we compare sets of genes belonging to different taxa. We are working under the assumption that the taxa we are studying are related by evolution. Therefore, we need to ensure that the genes we use to construct our trees are orthologous: that is, they have diverged from a common ancestral gene through speciation events.

OMA groups are groups of sequences that are all orthologous to one another, and can be found in the OMA Browser (http://omabrowser.org/).

Retrieve the following groups from the OMA Browser: 188449 (fingerprint: GTRKKHA), 189801 (fingerprint: FLELWDA), 657678 (fingerprint: MITAVEC).

Tip: click on the question mark icon next to the group id to get a description of each of the tabs.

1) What phylogeny are these groups from?

2) What is the description/function of each of the groups?

3) How many other OMA groups share homology with each of the groups?

4) How many sequences are there in each orthologous group?

5) Are they protein or DNA sequences?

Tip: download the FASTA sequences by clicking on the Download button at the top right-hand corner of the list.

6) You will find that the sequence headers are split into multiple parts, separated by | characters. What do these tags signify?

7) Long sequence headers are often truncated when aligning sequences or using tree-building software. Write a script that converts each sequence header to the five-character species code only.

Tip: Sequence headers are found only on lines beginning with the '>' symbol, and the species code is contained within the OMA identifier.

Note: A mapping between the species code and the ID can be found on the OMA browser (http://omabrowser.org/cgi-bin/gateway.pl?f=InfoMatrix).
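One possible approach to question 7 is sketched below in Python. It assumes that the OMA identifier is the first |-separated field of each header and that it begins with the five-character species code (for example HUMAN12345). Check this against the file you downloaded. The function names and the example headers are made up for illustration.

```python
def shorten_header(line):
    """Reduce a FASTA header line to '>' plus the five-character species code."""
    if not line.startswith(">"):
        return line  # sequence lines are left untouched
    # Assumption: the OMA identifier is the first '|'-separated field
    # and starts with the five-character species code.
    oma_id = line[1:].split("|")[0].strip()
    return ">" + oma_id[:5]


def shorten_fasta(lines):
    """Apply shorten_header to every line of a FASTA file."""
    return [shorten_header(line.rstrip("\n")) for line in lines]


# Hypothetical headers, only to show the expected input shape.
example = [
    ">HUMAN12345 | OMA000001 | hypothetical description",
    "MKTAYIAKQR",
    ">YEAST00042 | OMA000001 | hypothetical description",
    "MKSAYLAKQR",
]
for line in shorten_fasta(example):
    print(line)
```

To process a real file, pass an open file object to shorten_fasta and write the returned lines to a new file. Note that if a group contained two sequences from the same species, the shortened headers would no longer be unique.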

Part 2: Building trees

In order to construct phylogenetic trees, we must first align the sequences. This allows us to compare sequences site by site.
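To make the idea of a site-by-site comparison concrete, the minimal sketch below computes the proportion of differing sites (the p-distance) between two aligned sequences. The function name p_distance and the toy sequences are made up for illustration. Skipping columns where either sequence has a gap is one common convention, but it is an assumption here and not something the tools below are guaranteed to do.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a, seq_b):
        # Assumption: columns with a gap in either sequence are ignored.
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Toy aligned sequences of equal length (not real data).
print(p_distance("MKT-YIAK", "MRTAYLAK"))
```

Without an alignment, position i in one sequence need not correspond to position i in another, so a count like this would be meaningless.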

Many multiple sequence alignment tools are available, a number of which can be found on the EBI website: http://www.ebi.ac.uk/Tools/msa/.

1) Use an online sequence aligner to align the sequences. Which output format should you choose?

2) Construct a phylogenetic tree using your aligned sequences. Tools can be found on the Vital-IT website (http://embnet.vital-it.ch/raxml-bb/) or on the EBI website (http://www.ebi.ac.uk/Tools/phylogeny/). RAxML on Vital-IT can take anywhere from 10 minutes to over an hour, depending on the Group and the model. On the EBI website, try building two trees by selecting first UPGMA and then NJ in the clustering options.

Alternatively, you can try the alignment and tree inference tools available online at http://www.phylogeny.fr/, in particular PhyML as an alternative to RAxML if RAxML is taking too long.

Tip: If building RAxML/PhyML trees, be sure to keep the "Gamma model of rate heterogeneity" box unchecked, as using the CAT model is quicker and we don't have much time!

3) Why are RAxML and other likelihood methods much slower than the clustering methods?

4) Which Group do you think will take the longest to compute a tree for, using RAxML? Which do you think will be the quickest?
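For reference when thinking about the clustering methods, here is a minimal sketch of UPGMA in Python. It is not the implementation used by the EBI tools. The function name upgma, the input layout (a dictionary of pairwise distances keyed by pairs of labels) and the toy distances are all assumptions made for illustration.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa with UPGMA and return the tree as a Newick string.

    names: list of taxon labels
    dist:  dict mapping (label_a, label_b) pairs to their distance
    """
    # Each active cluster: Newick label -> (number of leaves, node height).
    clusters = {name: (1, 0.0) for name in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    while len(clusters) > 1:
        # Pick the closest pair among the active clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, height_a = clusters.pop(a)
        size_b, height_b = clusters.pop(b)
        height = d[frozenset((a, b))] / 2
        merged = "({}:{:.3f},{}:{:.3f})".format(
            a, height - height_a, b, height - height_b
        )
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = (size_a + size_b, height)
    (newick,) = clusters
    return newick + ";"


# Toy distances between three placeholder taxa (not real data).
toy = {("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0}
print(upgma(["A", "B", "C"], toy))
```

The procedure performs one merge per step and never revisits an earlier decision, which is worth keeping in mind for question 3.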

Part 3: Evaluating our trees

Now that we have our trees, we would like to visualise and compare them. Unfortunately, the output format (Newick) isn't particularly conducive to interpreting trees. Thankfully, there are online viewers to help us: http://phylo.io
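To see why raw Newick is hard to read, the sketch below parses a small Newick string and prints it as an indented outline. It is a minimal parser that handles only plain labels and branch lengths, not quoted labels or comments. The function names and the toy tree (with made-up branch lengths) are assumptions for illustration.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip the opening parenthesis
            while True:
                children.append(parse_node())
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError("unexpected character at position %d" % pos)
        # Optional label, then optional ':length'.
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    return parse_node()


def outline(node, depth=0):
    """Return the tree as a list of indented text lines."""
    name, length, children = node
    label = name or "(internal)"
    if length is not None:
        label += " [%g]" % length
    lines = ["  " * depth + label]
    for child in children:
        lines.extend(outline(child, depth + 1))
    return lines


# Toy tree; the branch lengths are invented.
toy_tree = "((HUMAN:0.1,MOUSE:0.2):0.05,YEAST:0.6);"
print("\n".join(outline(parse_newick(toy_tree))))
```

Even for three leaves, the nesting is easier to follow in the outline than in the parenthesised string. For the trees in this practical, a graphical viewer is far more practical.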

1) View the trees that you have built using an online tree visualiser.

2) Reroot the trees, and swap branches, to make comparisons easier.

3) Which species in Group 188449 shares the most recent common ancestor with Trametes versicolor?

4) Which species is most closely related to Mucor circinelloides in Group 189801?

5) Can you see differences between trees estimated using distinct methods? For instance, which species are grouped differently when comparing the trees found by the UPGMA and NJ methods in Group 657678?
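Differences between two trees can also be listed programmatically. The sketch below collects the clades (sets of leaves below each internal node) of two trees and reports those found in only one of them. It assumes the trees are given as nested Python tuples of leaf names and that both have been rooted in the same way, for example after rerooting in the viewer. The labels A to D are placeholders, not results for any OMA group.

```python
def clades(tree):
    """Return the set of clades (frozensets of leaf names) in a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        # An internal node's clade is the union of its children's leaves.
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


# Two placeholder trees that disagree on where B and C attach.
tree_a = ((("A", "B"), "C"), "D")
tree_b = ((("A", "C"), "B"), "D")

only_a = clades(tree_a) - clades(tree_b)
only_b = clades(tree_b) - clades(tree_a)
print("only in first tree:", sorted(sorted(c) for c in only_a))
print("only in second tree:", sorted(sorted(c) for c in only_b))
```

Clades that appear in both sets are groupings on which the two methods agree. The remaining ones point to the species that are placed differently.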