Pyham: a Python package to analyse hierarchical orthologous groups in orthoXML

• Author: Clement Train •

Pyham is a Python library for handling orthoXML files containing Hierarchical Orthologous Groups (HOGs). It facilitates the extraction of evolutionary information contained in HOGs, either for specific gene families or in aggregate. Depending on the function, the output is provided as Python data structures, as interactive JavaScript visualisations, or as graphs.

This post is a brief primer to pyham, with an emphasis on what it can do for you.

How to get pyham?

Pyham is available as a Python package on the PyPI server and is compatible with both Python 2 and Python 3. You can install it via pip with the following bash command:

pip install pyham

You can check the official pyham website for further information on how to use pyham, documentation and other related resources.

What are Hierarchical Orthologous Groups (HOGs)?

If you don’t know what HOGs are and are eager to change this, we have an explanatory video about them just for you:

 

You can learn more about this in our previous blog post.

Where to find HOGs?

HOGs inferred on public genomes can be downloaded from the OMA orthology database. Other databases, such as EggNOG, OrthoDB or HieranoiDB, also infer HOGs, but not all of them offer HOGs in orthoXML format. If you want to infer HOGs on your own genomes, you can use the OMA standalone software.

How does pyham help you investigate HOGs?

As input, pyham takes an orthoXML file containing HOGs and the related species tree. Pyham creates gene and genome objects based on the information extracted from the input files and provides an API to work directly on those phylogenetic objects (easy queries based on name or phylogenetic relations). The input species tree serves as a guide to define evolutionary relationships between genes and genomes.
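
To give a feel for what such an input file contains, here is a minimal sketch that walks the nested groups of a tiny, hand-written orthoXML document using only the Python standard library. It does not use pyham and is not how pyham parses files. The data, the `ORTHOXML` constant and the `describe_groups` helper are made up for illustration; the element names (`orthologGroup`, `paralogGroup`, `geneRef`) and the `TaxRange` property follow the layout commonly found in orthoXML files that store HOGs.

```python
import xml.etree.ElementTree as ET

# Toy orthoXML document (illustrative data only, not taken from OMA).
ORTHOXML = """<orthoXML xmlns="http://orthoXML.org/2011/" version="0.3" origin="toy" originVersion="1">
  <species name="Human" NCBITaxId="9606">
    <database name="toy" version="1">
      <genes><gene id="1" protId="HUMAN1"/><gene id="2" protId="HUMAN2"/></genes>
    </database>
  </species>
  <species name="Mouse" NCBITaxId="10090">
    <database name="toy" version="1">
      <genes><gene id="3" protId="MOUSE1"/></genes>
    </database>
  </species>
  <species name="Frog" NCBITaxId="8364">
    <database name="toy" version="1">
      <genes><gene id="4" protId="FROG1"/></genes>
    </database>
  </species>
  <groups>
    <orthologGroup id="1">
      <property name="TaxRange" value="Tetrapods"/>
      <geneRef id="4"/>
      <orthologGroup>
        <property name="TaxRange" value="Mammals"/>
        <geneRef id="3"/>
        <paralogGroup>
          <geneRef id="1"/>
          <geneRef id="2"/>
        </paralogGroup>
      </orthologGroup>
    </orthologGroup>
  </groups>
</orthoXML>"""

NS = "{http://orthoXML.org/2011/}"  # orthoXML namespace prefix used by ElementTree


def describe_groups(xml_text):
    """Hypothetical helper: list (depth, group kind, TaxRange, direct member genes)."""
    root = ET.fromstring(xml_text)
    # Map the numeric gene ids used in <geneRef> to readable protein ids.
    prot_ids = {g.get("id"): g.get("protId") for g in root.iter(NS + "gene")}
    rows = []

    def walk(node, depth):
        kind = node.tag[len(NS):]
        tax_range = None
        for child in node:
            if child.tag == NS + "property" and child.get("name") == "TaxRange":
                tax_range = child.get("value")
        members = [prot_ids[c.get("id")] for c in node if c.tag == NS + "geneRef"]
        rows.append((depth, kind, tax_range, members))
        # Recurse into nested speciation (orthologGroup) and duplication (paralogGroup) nodes.
        for child in node:
            if child.tag in (NS + "orthologGroup", NS + "paralogGroup"):
                walk(child, depth + 1)

    for top_level_group in root.find(NS + "groups"):
        walk(top_level_group, 0)
    return rows


rows = describe_groups(ORTHOXML)
for depth, kind, tax_range, members in rows:
    print("  " * depth + kind, tax_range, members)
```

In this toy family, the top-level group is defined at the Tetrapods level, contains a nested group at the Mammals level, and the two human genes sit inside a `paralogGroup`, which marks a duplication.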

How can I figure out the evolutionary history of genes in a given genome?

Pyham provides a mapper object for HOGs/genes across multiple taxonomic ranges. Remember that each HOG at a given taxonomic level corresponds to one gene in that particular ancestral genome. The idea of pyham is to map the HOGs of an ancestral genome to the HOGs/genes of its descendant genomes. The vertical mapper object allows retrieval of all genes and their evolutionary history between the two taxonomic levels (i.e. which genes have been duplicated, which genes have been lost, etc.).

compare_human_mammals = pyham_analysis.compare_genomes_vertically("Human", "Mammals")
# Mammals HOGs with their single-copy human descendant gene
compare_human_mammals.get_identical()
# Mammals HOGs that have been lost between the two levels
compare_human_mammals.get_lost()
# Human genes that have been "gained" between the two levels
compare_human_mammals.get_gained()
# Mammals HOGs with their multiple-copy human descendant genes
compare_human_mammals.get_duplicated()
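
To make the four categories concrete, here is a toy sketch of the underlying idea, independent of pyham: an ancestral HOG is "lost" if it has no descendant gene in the extant genome, "identical" if it has exactly one, and "duplicated" if it has several, while extant genes without any ancestor at that level are "gained". The function name `classify_vertically`, the HOG ids and the gene names are invented for illustration; this is not pyham's implementation and the real mapper returns its own objects.

```python
def classify_vertically(ancestral_hogs, extant_genes):
    """Toy classification of ancestral HOGs against one extant genome.

    ancestral_hogs: dict mapping a HOG id to the list of its descendant
                    gene ids in the extant genome (possibly empty).
    extant_genes:   list of all gene ids in the extant genome.
    """
    result = {"identical": {}, "duplicated": {}, "lost": [], "gained": []}
    genes_with_ancestor = set()
    for hog_id, descendants in ancestral_hogs.items():
        genes_with_ancestor.update(descendants)
        if len(descendants) == 0:
            result["lost"].append(hog_id)
        elif len(descendants) == 1:
            result["identical"][hog_id] = descendants[0]
        else:
            result["duplicated"][hog_id] = list(descendants)
    # Extant genes that trace back to no HOG at the ancestral level.
    result["gained"] = [g for g in extant_genes if g not in genes_with_ancestor]
    return result


# Made-up example: three HOGs at the "Mammals" level, four human genes.
mammal_hogs = {
    "HOG_A": ["HUMAN1"],            # one descendant: identical
    "HOG_B": ["HUMAN2", "HUMAN3"],  # two descendants: duplicated
    "HOG_C": [],                    # no descendant: lost
}
human_genes = ["HUMAN1", "HUMAN2", "HUMAN3", "HUMAN4"]

summary = classify_vertically(mammal_hogs, human_genes)
print(summary)
```

Here `HUMAN4` ends up in the "gained" list because no Mammals-level HOG in the toy data claims it as a descendant.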

What are the genes in an extant genome that have been ancestrally duplicated?

We can use logic operations on the previously described mapper object. In this case we can compare the genome of interest with its ancestral parent and retrieve the duplicated genes that are specific to this branch. For example, we can find the genes in human that were duplicated sometime between the speciation of Tetrapoda and the speciation of mammals.

compare_mamm_tetra = pyham_analysis.compare_genomes_vertically("Mammals", "Tetrapods")
# HOGs that duplicated between the Tetrapods and Mammals levels
mammals_specific_dupl_hogs = compare_mamm_tetra.get_duplicated()
human_genes_duplicated_before_mammals_speciation = []
for hog in mammals_specific_dupl_hogs:
    for gene in hog.get_descendant_genes():
        # keep only the descendants that belong to the human genome
        if gene.genome.name == "Human":
            human_genes_duplicated_before_mammals_speciation.append(gene)

What is the number of genes in a particular ancestral genome?

Ancestral genome objects act as a proxy to fetch all HOGs at a specific taxon.

# return an ancestral genome object
mammals_genome = pyham_analysis.get_ancestral_genome_by_name("Mammals")
# count the HOGs (ancestral genes) in this ancestral genome
number_ancestral_genes_mammals = len(mammals_genome.genes)

How can I visualise the evolutionary history of a gene family (HOG)?

Pyham embeds HogVis, an interactive tool to visualise gene family evolutionary history. It provides a way to trace the evolution of genes in terms of duplications and losses, from ancient ancestors to modern day species.

# select a HOG
hog_of_interest = pyham_analysis.get_hog_by_id(2)
# create and export the HogVis as .html
output_filename = "hogvis_example.html"
pyham_analysis.create_hog_visualisation(hog=hog_of_interest, outfile=output_filename)

As you can see in the figure below, HogVis is composed of two panels: a species tree that allows you to select the taxonomic range of interest, and a gene panel where each grey square represents an extant gene and each row a species.

Figure: HogVis opened at the mammalian level

We can see, for example, in the figure above that at the level of mammals all genes of this gene family descend from a single common ancestral gene.

Figure: HogVis opened at the Euarchontoglires level

If we look at the level of Euarchontoglires, we observe that the genes are now split by a vertical line. This line separates two groups of genes, each descending from a single ancestral gene. This is the result of a duplication between the Mammals and Euarchontoglires levels.

With a quick look we can easily identify when a duplication occurred, which species have lost genes, or how large gene families evolved.

How can I visually represent the different evolutionary events that occurred in my genomic setup?

Pyham includes treeprofile, a tool to visualise a species tree annotated with evolutionary events (gene duplications, losses, gains) mapped to their related taxonomic range. The aim is to provide a minimalist and intuitive way to visualise the number of evolutionary events that occurred on each branch.

# create and export the treeprofile as .png (.svg and .pdf are also available)
treeprofile = pyham_analysis.create_tree_profile(outfile="example.png")

Figure: treeprofile

As you can see in the figure above, the treeprofile is drawn on the reference species tree used to perform the pyham analysis. Each internal node is displayed with a histogram of the phylogenetic events (number of genes duplicated, lost, gained, or unchanged) that occurred on the branch leading to it.

Can I have a one-page summary of this blog post for reference?

Of course you can! We have prepared a PDF version of this blog post that you can download here.


The Dessimoz Lab blog is licensed under a Creative Commons Attribution 4.0 International License.
Last modified on June 29th, 2017.