Pyham is a python library for handling orthoXML files containing Hierarchical Orthologous Groups (HOGs). It facilitates the extraction of evolutionary information contained in HOGs— either specific gene families or in aggregate. Pyham can infer ancestral genomes, and pinpoint duplication and loss events with a gene family.

In this notebook we will provide a brief example on how to use pyham. For in-depth documentation, please see the following links:

Installing pyHam

In [ ]:
%%bash
pip install pyham
In [1]:
import pyham

Getting the necessary files for pyham

In order to use pyHam you need:

1) A species tree of the phylogeny of interest in one of the two following formats:

  • newick: unique leaf names, internal node names are optional to name ancestral genomes, polytomies are accepted.
  • phyloxml: genomes are named after clade name attributes or if any phylogeny scientific_name attributes (custom attribute tag can be provided), internal node names are optional to name ancestral genomes, polytomies are accepted.

2) HOGs encoded in an OrthoXML file. IMPORTANT, species names need to be matching between the species tree and the orthoXML.

We can download this data for a given HOG using https://omabrowser.org/. For this example, download the HOG (orthoxml) associated with OR2L5_HUMAN at the root level. Tip: Click on the Hierarchical Orthologous Groups tab, go to table viewer, then download orthoxml.

In order to get the species tree, we will demonstrate using the OMA REST API:

In [11]:
import requests
import json
In [12]:
#request the species tree in newick format
response = requests.get("https://omabrowser.org/api/taxonomy/Eutheria/?type=newick", )
species_nwk = response.json()

#remove underscores and replace with spaces
species_nwk = species_nwk['newick'].replace("_"," ")
species_nwk
Out[12]:
'((Choloepus hoffmanni,Dasypus novemcinctus)Xenarthra,(Echinops telfairi,Loxodonta africana,Procavia capensis)Afrotheria,(((Erinaceus europaeus,Sorex araneus)Insectivora,(Myotis lucifugus,Pteropus vampyrus)Chiroptera,Equus caballus,(Felis catus,(Canis lupus familiaris,Ailuropoda melanoleuca,Mustela putorius furo)Caniformia)Carnivora,(Tursiops truncatus,Sus scrofa,(Bos taurus,Ovis aries)Bovidae,Vicugna pacos)Cetartiodactyla)Laurasiatheria,(((Microcebus murinus,Otolemur garnettii)Strepsirrhini,((Callithrix jacchus,((Macaca mulatta,Papio anubis,Chlorocebus sabaeus)Cercopithecinae,((Pongo abelii,(Gorilla gorilla gorilla,(Pan paniscus,Pan troglodytes)Pan,Homo sapiens)Homininae)Hominidae,Nomascus leucogenys)Hominoidea)Catarrhini)Simiiformes,Tarsius syrichta)Haplorrhini)Primates,Tupaia belangeri,((Ochotona princeps,Oryctolagus cuniculus)Lagomorpha,(Dipodomys ordii,(Cavia porcellus,Octodon degus,Chinchilla lanigera,Fukomys damarensis)Hystricomorpha,Ictidomys tridecemlineatus,(Jaculus jaculus,((Mus musculus,Rattus norvegicus)Murinae,Nannospalax galili)Muroidea)Myomorpha)Rodentia)Glires)Euarchontoglires)Boreoeutheria)Eutheria;'

Analysis

Here we will run the pyham analysis in order to determine how many genes are in the HOG at certain levels, as well as which lineages duplications occured to expand the gene family.

Making the HAM object

First we have to give pyham the orthoxml and the newick species tree as input to createt the ham object.

In [13]:
orthoxml_file = "./myhog.orthoxml" 
species_tree = species_nwk

ham_object = pyham.Ham(species_tree, orthoxml_file, use_internal_name=True)

Select the hog of interest

We can use the OMA browser in order to find the OMA identifier for our gene of interest. Then, we identify the hog based on the OMA identifier.

In [14]:
mygene = ham_object.get_genes_by_external_id('HUMAN02760')[0]

root_hog = ham_object.get_hog_by_gene(mygene)

print(root_hog)
<HOG(HOG:0355161)>

What is the root taxonomic level of this HOG?

In [15]:
root_level = root_hog.genome

print(root_level.name)
Eutheria

How many genes are in the HOG at the root level?

In [16]:
# genes at root hog
all_genes_root = root_hog.get_all_descendant_genes()
print(len(all_genes_root))
108

Which extant genome has the most copies in this gene family?

In [17]:
# get all species, with the count of how many genes per species
all_genes_clustered = root_hog.get_all_descendant_genes_clustered_by_species()

for species, genes in all_genes_clustered.items():
    print(species.name, len(genes))
Mustela putorius furo 1
Canis lupus familiaris 5
Ailuropoda melanoleuca 5
Felis catus 2
Equus caballus 4
Sus scrofa 7
Bos taurus 4
Ovis aries 3
Vicugna pacos 2
Pteropus vampyrus 6
Erinaceus europaeus 1
Ictidomys tridecemlineatus 1
Fukomys damarensis 18
Cavia porcellus 5
Dipodomys ordii 2
Mus musculus 1
Rattus norvegicus 1
Oryctolagus cuniculus 1
Microcebus murinus 1
Otolemur garnettii 2
Macaca mulatta 4
Papio anubis 3
Pan paniscus 5
Pan troglodytes 6
Chlorocebus sabaeus 1
Homo sapiens 4
Callithrix jacchus 2
Gorilla gorilla gorilla 1
Dasypus novemcinctus 2
Loxodonta africana 7
Echinops telfairi 1

We can see that Fukomys damarensis has the highest number of genes, 18.

How many genes in this family are in Human?

In [18]:
human_genome = ham_object.get_extant_genome_by_name('Homo sapiens')
all_genes_clustered[human_genome]
Out[18]:
[Gene(8069218), Gene(8069222), Gene(8069220), Gene(8069221)]

There are 4 genes in human.

In which lineage did the duplications making the human genes takes place?

We can make a treeprofile to visualize the duplication events along the species tree.

In [20]:
# create the treeprofile object
output_filename = "tree_profile.html"
treeprofile_myhog = ham_object.create_tree_profile(hog=root_hog, outfile=output_filename)

from IPython.display import IFrame
IFrame(output_filename, width=860, height=640)
Out[20]:
  • Click on the settings botton
  • On Displayed info, choose "Events"
  • Zoom/rescale the view in order to comfortably read it. You can also collapse nodes on the tree by clicking on them and choosing collapse.

We can see that the duplications leading to the human species occurred on the branches leading to the Simiiformes (3 duplications), on the branch leading to Homininae (1 duplication), as well as on the branch leading to Homo sapiens (1 duplication). Why are there only 4 genes in human rather than 6? Because we can see in Homo sapiens there are 2 genes which have been lost.