Pyham is a python library for handling orthoXML files containing Hierarchical Orthologous Groups (HOGs). It facilitates the extraction of evolutionary information contained in HOGs— either specific gene families or in aggregate. Pyham can infer ancestral genomes, and pinpoint duplication and loss events with a gene family.
In this notebook we will provide a brief example on how to use pyham. For in-depth documentation, please see the following links:
%%bash
pip install pyham
import pyham
In order to use pyHam you need:
1) A species tree of the phylogeny of interest in one of the two following formats:
2) HOGs encoded in an OrthoXML file. IMPORTANT, species names need to be matching between the species tree and the orthoXML.
We can download this data for a given HOG using https://omabrowser.org/. For this example, download the HOG (orthoxml) associated with OR2L5_HUMAN at the root level. Tip: Click on the Hierarchical Orthologous Groups tab, go to table viewer, then download orthoxml.
In order to get the species tree, we will demonstrate using the OMA REST API:
import requests
import json
#request the species tree in newick format
response = requests.get("https://omabrowser.org/api/taxonomy/Eutheria/?type=newick", )
species_nwk = response.json()
#remove underscores and replace with spaces
species_nwk = species_nwk['newick'].replace("_"," ")
species_nwk
Here we will run the pyham analysis in order to determine how many genes are in the HOG at certain levels, as well as which lineages duplications occured to expand the gene family.
First we have to give pyham the orthoxml and the newick species tree as input to createt the ham object.
orthoxml_file = "./myhog.orthoxml"
species_tree = species_nwk
ham_object = pyham.Ham(species_tree, orthoxml_file, use_internal_name=True)
We can use the OMA browser in order to find the OMA identifier for our gene of interest. Then, we identify the hog based on the OMA identifier.
mygene = ham_object.get_genes_by_external_id('HUMAN02760')[0]
root_hog = ham_object.get_hog_by_gene(mygene)
print(root_hog)
root_level = root_hog.genome
print(root_level.name)
# genes at root hog
all_genes_root = root_hog.get_all_descendant_genes()
print(len(all_genes_root))
# get all species, with the count of how many genes per species
all_genes_clustered = root_hog.get_all_descendant_genes_clustered_by_species()
for species, genes in all_genes_clustered.items():
print(species.name, len(genes))
We can see that Fukomys damarensis has the highest number of genes, 18.
human_genome = ham_object.get_extant_genome_by_name('Homo sapiens')
all_genes_clustered[human_genome]
There are 4 genes in human.
We can make a treeprofile to visualize the duplication events along the species tree.
# create the treeprofile object
output_filename = "tree_profile.html"
treeprofile_myhog = ham_object.create_tree_profile(hog=root_hog, outfile=output_filename)
from IPython.display import IFrame
IFrame(output_filename, width=860, height=640)
We can see that the duplications leading to the human species occurred on the branches leading to the Simiiformes (3 duplications), on the branch leading to Homininae (1 duplication), as well as on the branch leading to Homo sapiens (1 duplication). Why are there only 4 genes in human rather than 6? Because we can see in Homo sapiens there are 2 genes which have been lost.