Exclusive: European Tour of Antonis Rokas

• Author: Christophe Dessimoz •

We are delighted to host Prof. Antonis Rokas, Vanderbilt, for two special seminars at University College London and at the University of Lausanne!

Genomics and the making of biodiversity across the budding yeast subphylum

Prof. Antonis Rokas, Vanderbilt University

London: Tue 13 Nov 2018, 11am, UCL, Roberts Building 309
Lausanne: Wed 14 Nov 2018, 11am, UNIL, Genopode auditorium A


Yeasts are unicellular fungi that do not form fruiting bodies. Although the yeast lifestyle has evolved multiple times, most known species belong to the subphylum Saccharomycotina (hereafter yeasts). This diverse group includes the premier eukaryotic model system, Saccharomyces cerevisiae; the common human commensal and opportunistic pathogen, Candida albicans; and over 1,000 other known species (with more continuing to be discovered). Yeasts are found in every biome and continent and are more genetically diverse than either plants or bilaterian animals. Ease of culture, simple life cycles, and small genomes (10– 20 Mbp) have made yeasts exceptional models for molecular genetics, biotechnology, and evolutionary genomics. Since only a tiny fraction of yeast biodiversity and metabolic capabilities has been tapped by industry and science, expanding the taxonomic breadth of deep genomic investigations will further illuminate how genome function evolves to encode their diverse metabolisms and ecologies. As part of National Science Foundation’s Dimensions of Biodiversity program, we have undertaken a large-scale comparative genomic study to uncover the genetic basis of metabolic diversity in the entire Saccharomycotina subphylum. In my talk, I will discuss the team’s evolutionary analyses of 332 genomes spanning the diversity of the subphylum. These include establishing a robust genus-level phylogeny and timetree for the subphylum, quantification of the extent of horizontal gene transfer for the subphylum, and characterization of the evolution of approximately 50 metabolic traits (and, in some cases, their underlying genes and pathways). These analyses allow us, for the first time, to infer the key metabolic characteristics of the Last Yeast Common Ancestor (LYCA) and characterize the tempo and mode of genome evolution across an entire subphylum.

All welcome!

Share or comment:

To be informed of future posts, sign up to the low-volume blog mailing-list, subscribe to the blog's RSS feed, or follow us on Twitter. To read old posts, check out the index here.

Predicting QTL genes by integrating functional data across species

• Author: Christophe Dessimoz •

The problem in a nutshell

Quantitative Trait Loci (QTL) are regions of a genome for which genetic variants correlate with particular traits. To take a simple example in plants, one might observe that the average seed size (trait) is significantly larger when considering the subset of a population which has a C at a particular position in the genome than a subpopulation with a T.

The reason QTL identifies genomic regions and not precise positions is that neighbouring variants tend to be inherited together. These regions typically contain hundreds of genes, making it difficult to say which one(s) are causal to the trait variation—if any at all (the causal genetic variation(s) can be in non-coding regions too).

Thus, to prioritise candidate causal genes within a QTL region, researchers typically consider previous knowledge on these genes, to see whether a particular gene “makes sense”. In the case of seed size, it might be a gene previously implicated in growth or regulation, or a gene known to influence seed size in a different species. This process is however requires substantial manual interpretation, and is thus labour-intensive and haphazard.

Enter QTLsearch

We realised that our framework of hierarchical orthologous groups, which relates genes across many species, could be extended to integrate QTL results with previous gene function annotations.

Conceptual overview of QTLsearch

Conceptual overview of QTLsearch

If we go back to the seed size example, it might be that among the genes in the window, one has an ortholog in a different species previously annotated with the GO term “reproductive system development”. This could be a good candidate causal gene.

One risk however in integrating lots of previous knowledge across many species is that we might also find some spurious patterns. We therefore had to devise a way of controlling for random associations between QTL regions and evolutionarily propagated knowledge. Such “null distribution” depends on the specificity or the terms in question, the amount of annotations, the size of the QTL regions, and the species sampling. To cope with this complexity, we chose to implement a non-parametric permutation test.

We implemented the tool as an open source package called QTLsearch, available here.

QTLsearch infers more candidate causal genes than manual analyses

We used QTLsearch to reanalyse two previous studies. In both cases, we could call more candidate genes than the original studies. But more importantly, the evidence behind our calls is fully traceable and statistically supported.

Barchart of QTLsearch performance

QTLsearch could identify more candidate genes than the original study, but in an automated, reproducible, and statistically meaningful way.

Thus we think this will greatly facilitate future QTL analyses, particularly those that are done in non-model species for which the previous experimental knowledge is very limited.

Behind the paper

This is the third paper that resulted from our collaboration with Bayer CropScience (now BASF CropScience), after our work on homoeologs and on detecting split genes.

The project was conceived by Henning Redestig, collaborator at Bayer at the start of the project (now at DuPont). Henning had contributed to a QTL study and knew how labor intensive the search for putative causal genes is. He realised that HOGs could provide a natural way of integrating functional knowledge across multiple species, to combine the QTL information with previous functional data.

Alex Warwick Vesztrocy, PhD student on the project and first author, ran with the idea—promptly implementing and testing it. Early results looked promising, but Alex soon realised that the mapping between metabolites and GO terms could be improved. He also realised that some terms were quite common, so he devised the approach to compute the significance scores.

Our manuscript was accepted as proceedings paper at the European Conference on Computational Biology (ECCB). In our lab, we like proceedings paper. It’s nice to be able to present the work and publish the paper, particularly since the ECCB proceedings appear in a good journal. More importantly, conferences impose hard deadlines. Deadlines for submission of course, but also for peer-reviewing and for deciding acceptance or not!


Alex Warwick Vesztrocy, Christophe Dessimoz*, Henning Redestig*, Prioritising Candidate Genes Causing QTL using Hierarchical Orthologous Groups, Bioinformatics, 2018, 34:17, pp. i612–i619 (ECCB 2018 proceedings) [Open Access Full Text]

Share or comment:

To be informed of future posts, sign up to the low-volume blog mailing-list, subscribe to the blog's RSS feed, or follow us on Twitter. To read old posts, check out the index here.

Over 10 computational biology positions at PhD, postdoc and group leader level

• Author: Christophe Dessimoz •

(Post updated on 4 Oct 2018 and on 13 Nov 2018)

Our lab has an open position, and so do collaborators and colleagues across Switzerland and Europe.

Please help us spread the word by forwarding this post. If you have computational biology jobs to announce, let me know and I will gladly add a link.

Postdoc position in our lab

PhD openings with colleagues

Postdoc openings with colleagues

Group leader positions

Bonus position (not computational but what the heck)

Share or comment:

To be informed of future posts, sign up to the low-volume blog mailing-list, subscribe to the blog's RSS feed, or follow us on Twitter. To read old posts, check out the index here.

What genes do I have in common with a plant?

• Author: Natasha Glover •

All living organisms descended from a common ancestor, and therefore all living organisms have some genes in common. Therefore, we can take any two species on the tree of life and find out what set of genes were present in the common ancestor of the two species by looking for the genes that they still share to this day. When two species are closely related, i.e. close together on the tree of life, we can observe that much of their genome is the same. But what do I mean by “same”, and what do I mean by “shared”?

Genomes are complex, with many layers of control (DNA-level, RNA-level, protein-level, epigenetic level, alternative splicing-level, protein-protein interaction level, transposable element-level, etc). Because of this complexity, there could be many ways to compare the genomes of two different species to determine which portion of the genome is shared.

In this blog post, when I refer to “shared” parts of the genome, I’m referring to orthologs, which are genes in different species that started diverging due to a speciation event. Orthologs can be inferred from complete genome sequences, and there are many different methods to do this. In particular, I will focus on the Dessimoz lab’s method and database, Orthologous MAtrix, or OMA. OMA uses protein sequences from whole genome annotations to compute orthologs between over 2000 species. The OMA algorithm is graph-based, compares all protein sequences to all others to find the closest between two genomes, allows for duplicates, and uses information from related genomes to improve the calling. OMA is available at https://omabrowser.org/.

It’s quite well-publicized that humans and chimps are extremely similar in terms of gene content, which is why they could be considered our evolutionary cousins1. However, when two species are more distantly related, there is less of their genomes which are the same. But how distant can we go in the tree of life and still see the same genes between different species? For example, how many genes do I (a human) have in common with a plant? Can we even detect orthologs between two species so evolutionarily distant? And if so, what biological/functional role do these genes play?

In this blog post, I will show how to use OMA and some of its tools to find out how much of the human genome we share with plants. There are many different plant species with their genomes’ sequenced, so I will use the extensively-studied model species Arabidopsis thaliana as the representative plant to compare the human protein-coding genes to.


For the remainder of this blog post, I will show how to use the OMA database, accessed via the python REST API, to get ortholog pairs between two genomes.

In python, I use the requests library to send queries for information to the server housing the OMA database. Pandas is a well-known python library for working with dataframes, so I will use that for analysis. For more information, see:

Let’s start our python code by importing libraries:

import requests
import json
import pandas as pd

The OMA API is accessible from the following link, so we will save it for later use in our code:

api_url = "https://omabrowser.org/api"

Getting pairwise orthologs with the OMA API

Pairwise orthologs are computed between all pairs of species in OMA as part of the normal OMA pipeline. Accessing the data will result in a list of pairs of genes, one from one species and the other from the other species. Any one gene may be involved in several different pairs, depending of if its ortholog has undergone duplication or not.

It is important to note that here we are only using pairs of orthologs. In OMA, we report several types of orthologs, namely pairs of orthologs or groups of orthologs (Hierarchical Orthologous Groups (HOGs). We use the pairs as the basis for building HOGs, which aggregates orthologous pairs together into clusters at different taxonomic levels. Some ortholog pairs are removed, and some new pairs are inferred when creating HOGs. Therefore, the set of orthologs deduced from HOGs will be slightly different than the set of orthologs purely from the pairs.

Here we will use the API to request all the pairwise orthologs between Homo sapiens and Arabidopsis thaliana. OMA uses either the NCBI taxon identifier or the 5-letter UniProt species code to identify species. I will use the UniProt codes, which are “HUMAN” and “ARATH.” To request the ortholog pairs between HUMAN and ARATH, the request would simply look like:

requests.get(api_url + '/pairs/HUMAN/ARATH/')

However, when sending a request that returns long lists, i.e. lists of thousands of genes, the OMA API uses pagination. This means all the results won’t be returned at once, but sequentially page by page, with only 100 results per page. This is a safeguard to make sure the OMA servers don’t get overloaded with requests. Here I show a simple workaround to get all of the pairs.

Get all pairs

genome1 = "HUMAN"
genome2 = "ARATH"

#get first page
response = requests.get(api_url + '/pairs/{}/{}/'.format(genome1, genome2))

#use the header to calculate total number of pages
total_nb_pages = round(int(response.headers['X-Total-Count'])/100)
print("There are {} pages that need to be requested.".format(total_nb_pages))

> There are 128 pages that need to be requested.

Since we want all the pages, not just the first one, we have to use a loop to go through and request all the pages.

#get responses for all pages
responses = []
for page in range(1, total_nb_pages + 1):
    tmp_response = requests.get(api_url + '/pairs/{}/{}/'.format(genome1, genome2)+"?page="+str(page))

#some basic tidying up so we only have a list of pairs at the end
pairs = []
for response in responses:
    for pair in response:

#Example of a pair

> {'entry_1': {'entry_nr': 8066469,
>   'entry_url': 'https://omabrowser.org/api/protein/8066469/',
>   'omaid': 'HUMAN00009',
>   'canonicalid': 'NOC2L_HUMAN',
>   'sequence_md5': 'dc91b2521daf594037a6c318f3b04d5a',
>   'oma_group': 840151,
>   'oma_hog_id': 'HOG:0420566.4b.15b.5a.3b',
>   'chromosome': '1',
>   'locus': {'start': 944694, 'end': 959240, 'strand': -1},
>   'is_main_isoform': True},
>  'entry_2': {'entry_nr': 12384097,
>   'entry_url': 'https://omabrowser.org/api/protein/12384097/',
>   'omaid': 'ARATH12334',
>   'canonicalid': 'NOC2L_ARATH',
>   'sequence_md5': 'f19ac310a5e56443f2ce0e4e832addb3',
>   'oma_group': 840151,
>   'oma_hog_id': 'HOG:0420566.1c',
>   'chromosome': '2',
>   'locus': {'start': 7928254, 'end': 7931851, 'strand': 1},
>   'is_main_isoform': True},
>  'rel_type': '1:1',
>  'distance': 138.0,
>  'score': 1094.969970703125}

We can see that the data is in the form of a big list of pairs, with each pair being a dictionary. In this dictionary, one key is entry_1, which is the human gene, and the second key is entry_2, which is the arabidopsis gene. The values for each of these keys are dictionaries with information about the gene. Additionally, there are other key value pairs in the pair dictionary which gives information computed about the pair, such as rel_type, distance, and score (will be explained later).

Make dataframe with all pairs

I personally like to work with pandas because I find it easy to use, so I will import the information about the pairs into a dataframe.

#use pandas to make dataframe
df = pd.DataFrame.from_dict(pairs)

def make_columns_from_entry_dict(df, columns_to_make):
    '''Parses out the keys from the entry dictionary and adds them to the dataframe 
    with an appropriate column header.'''

    genome1 = df['entry_1'][0]['omaid'][:5]
    genome2 = df['entry_2'][0]['omaid'][:5]

    df[genome1+"_"+columns_to_make] = df.apply(lambda x: x['entry_1'][columns_to_make], axis=1)
    df[genome2+"_"+columns_to_make] = df.apply(lambda x: x['entry_2'][columns_to_make], axis=1)

    return df

#clean up the columns of the entry_1 and entry_2 dictionaries
df = make_columns_from_entry_dict(df, "omaid")    
df = df[['HUMAN_omaid','ARATH_omaid','rel_type','distance','score']]

#Here's a snippet of the dataframe
HUMAN_omaid ARATH_omaid rel_type distance score
0 HUMAN00009 ARATH12334 1:1 138.0 1094.969971
1 HUMAN00029 ARATH38493 1:1 168.0 246.330002
2 HUMAN00032 ARATH38034 1:1 69.0 773.869995
3 HUMAN00034 ARATH39850 m:n 140.0 635.900024
4 HUMAN00034 ARATH33319 m:n 144.0 655.409973
5 HUMAN00034 ARATH01678 m:n 137.0 723.580017
6 HUMAN00034 ARATH07547 m:n 134.0 761.450012
7 HUMAN00036 ARATH01479 1:1 103.0 554.590027
8 HUMAN00037 ARATH10852 1:1 60.0 2325.889893
9 HUMAN00052 ARATH02626 1:1 95.0 328.959991

Here you can see that each row is a pair of orthologs, with some information about each. The rel_type is the relationship cardinality, between the orthologs. 1:1 means only one ortholog in human found in arabidopsis, whereas 1:m would mean it duplicated in ARATH. m:n means there were lineage-specific duplications on both sides.

How many pairs between HUMAN and ARATH?

print("There are {} pairs of orthologs between {} and {}.".format(len(df), genome1, genome2))

> There are 12792 pairs of orthologs between HUMAN and ARATH.

What percentage is this of the human genome? What percentage of the Arabidopsis genome? To answer this we need to 1) Get the number of genes in each genome that this represents, because there can be more than 1 pair of orthologs per gene. 2) We also need to get the total number of genes per genome.

First we get the number of genes for each species that has at least one ortholog, using the dataframe from above. We don’t have to worry about alternative splic variants (ASVs) because OMA chooses a representative isoform when computing orthologs.

human_proteins_with_ortholog = set(df['HUMAN_omaid'].tolist())
print("There are {} genes in {} with at least 1 ortholog in {}.".\
      format(len(human_proteins_with_ortholog), "HUMAN","ARATH"))

arath_proteins_with_ortholog = set(df['ARATH_omaid'].tolist())
print("There are {} genes in {} with at least 1 ortholog in {}.".\
      format(len(arath_proteins_with_ortholog), "ARATH","HUMAN"))

> There are 3769 genes in HUMAN with at least 1 ortholog in ARATH.
> There are 5177 genes in ARATH with at least 1 ortholog in HUMAN.

Get the genomes of HUMAN and ARATH

Now we will again use the API to get all the proteins in the genome. (See https://omabrowser.org/api/docs#genome-proteins). Here I will write a function using the same procedure as above to deal with pagination.

#Use the API to get all the human genes
def get_all_proteins(genome):
    '''Gets all proteins using OMA REST API'''
    responses = []
    response = requests.get(api_url + '/genome/{}/proteins/'.format(genome))
    total_nb_pages = round(int(response.headers['X-Total-Count'])/100)
    for page in range(1, total_nb_pages + 1):
        tmp_response = requests.get(api_url + '/genome/{}/proteins/'.format(genome)+"?page="+str(page))

    proteins = []
    for response in responses:
        for entry in response:
    return proteins

human_proteins = get_all_proteins("HUMAN")

#Here is an example entry

> {'entry_nr': 8066461,
>  'entry_url': 'https://omabrowser.org/api/protein/8066461/',
>  'omaid': 'HUMAN00001',
>  'canonicalid': 'OR4F5_HUMAN',
>  'sequence_md5': 'df953df7a11ee7be5484e511551ce8a4',
>  'oma_group': 650473,
>  'oma_hog_id': 'HOG:0361626.2a.7a',
>  'chromosome': '1',
>  'locus': {'start': 69091, 'end': 70008, 'strand': 1},
>  'is_main_isoform': True}

It is important to note that this list of genes/proteins includes ASVs for many species. Therefore we need to take care to remove them to have one represenative isoform per gene. We can do this by filtering the list of entries based on the ‘is_main_isoform’ key. Here is a small function to do that.

#get canonical isoform
def get_canonical_proteins(list_of_proteins):
    proteins_no_ASVs = []
    for protein in list_of_proteins:
        if protein['is_main_isoform'] == True:
    return proteins_no_ASVs

print("The number of genes before removing alternative splice variants: {}".format(len(human_proteins)))

human_proteins_no_ASVs = get_canonical_proteins(human_proteins)

print("The number of genes after removing alternative splice variants: {}".format(len(human_proteins_no_ASVs)))

> -----
> The number of genes before removing alternative splice variants: 30700
> The number of genes after removing alternative splice variants: 20152

#Do the same for ARATH
arath_proteins = get_all_proteins("ARATH")

print("The number of genes before removing alternative splice variants: {}".format(len(arath_proteins)))
arath_proteins_no_ASVs = get_canonical_proteins(arath_proteins)
print("The number of genes after removing alternative splice variants: {}".format(len(arath_proteins_no_ASVs)))

> -----
> The number of genes before removing alternative splice variants: 40999
> The number of genes after removing alternative splice variants: 27627

Proportion of the genomes which have orthologs in the other

Now that we have all the necessary information we can see what percentage of the human and arabidopsis genomes are shared, in terms of proportion of genes with at least 1 ortholog in the other species.

print("The percentage of genes in {} with at least 1 ortholog in {}: {}%"\
      .format("HUMAN", "ARATH", \
              round((len(human_proteins_with_ortholog)/len(human_proteins_no_ASVs)*100), 2)))

print("The percentage of genes in {} with at least 1 ortholog in {}: {}%"\
      .format("ARATH", "HUMAN", \
              round((len(arath_proteins_with_ortholog)/len(arath_proteins_no_ASVs)*100), 2)))

> The percentage of genes in HUMAN with at least 1 ortholog in ARATH: 18.7%
> The percentage of genes in ARATH with at least 1 ortholog in HUMAN: 18.74%

So the answer to the original questions is that BOTH humans and arabidopsis have 18.7% of their genome shared with each other. A weird coincidence that it turns out to be the same proportion of both genomes!


Using the OMA database for orthology inference, we found that:

  • There are 12792 pairs of orthologs between HUMAN and ARATH.
  • There are 3769 genes in human that have at least 1 ortholog in arabidopsis.
  • There are 5177 genes in arabidopsis that have at least 1 ortholog in human.
  • These numbers represent about 19% of both genomes.

Therefore, we can conclude that over the course of evolution since the animal and plant lineages split (~1.5 billion years ago2), about 19% of the protein-coding genes remain and are able to be detected by OMA.

Stay tuned for the next blog post, where I will show what are the functions of these shared genes!


  1. Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87 (2005). DOI: https://doi.org/10.1038/nature04072
  2. Wang, D. Y., Kumar, S. & Hedges, S. B. Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi. Proc. Biol. Sci. 266, 163–171 (1999). DOI: 10.1098/rspb.1999.0617

Share or comment:

To be informed of future posts, sign up to the low-volume blog mailing-list, subscribe to the blog's RSS feed, or follow us on Twitter. To read old posts, check out the index here.

pyHam: a python package to visualize and process hierarchical orthologous groups (HOGs)

• Author: Clement Train •

(This entry was updated 19 Sep 2018 to reflect recent feature updates)

pyHam (‘python HOG analysis method’) makes it possible to extract useful information from HOGs encoded in standard OrthoXML format. It is available both as a python library and as a set of command-line scripts. Input HOGs in OrthoXML format are available from multiple bioinformatics resources, including OMA, Ensembl and HieranoidDB.

This post is a brief primer to pyham, with an emphasis on what it can do for you.

How to get pyHam?

pyHam is available as python package on the pypi server and is compatible python 2 and python 3. You can easily install via pip using the following bash command:

pip install pyham

You can check the official pyham website for further information about how to use pyham, documentation and the source code.

What are Hierarchical Orthologous Groups (HOGs)?

You don’t know what HOGs are and you are eager to change this, we have an explanatory video about them just for you:


You can learn more about this in our previous blog post.

Where to find HOGs?

HOGs inferred on public genomes can be downloaded from the OMA orthology database. Other databases, such as Eggnog, OrthoDb or HieranoiDB also infer HOGs, but not all of these databases offer them in OrthoXML format. You can check which database serves hogs as orthoxml here. If you want to use your custom genomes to infer HOGs you can use the OMA standalone software.

In order to facilitate the use of pyHam on single gene family, we provide the option to let pyham fetch required data directly from a comptabilble databases (for now only OMA is available for this feature). The user simply have to give the id of a gene inside the gene family (HOGs) of insterest along with the name of the compatibible database where to get the data and pyHam will do the rest.

For example, if you are interest by the P_53 gene in rat (P53 rat gene page in OMA) you simply have to run the following python code to set-up your pyHam session:

my_gene_query = 'P53_RAT'
database_to_query = 'oma'
pyham_analysis = pyham.Ham(query_database=my_gene_query, use_data_from=database_to_query)

How does pyham help you investigate on HOGs?

The main features of pyHam are: (i) given a clade of interest, extract all the relevant HOGs, each of which ideally corresponds to a distinct ancestral gene in the last common ancestor of the clade; (ii) given a branch on the species tree, report the HOGs that duplicated on the branch, got lost on the branch, first appeared on that branch, or were simply retained; (iii) repeat the previous point along the entire species tree, and plot an overview of the gene evolution dynamics along the tree; and (iv) given a set of nested HOGs for a specific gene family of interest, generate a local iHam web page to visualize its evolutionary history.

What is the number of genes in a particular ancestral genome? (i)

In pyHam, ancestral genomes are attached to one specific internal node in the inputted species tree and denoted by the name of this taxon. Ancestral genes are then infered by fetching all the HOGs at the same level.

# Get the ancestral genome by name
rodents_genome = ham_analysis.get_ancestral_genome_by_name("Rodents")

# Get the related ancestral genes (HOGs)
rodents_ancestral_genes = rodents_genome.genes

# Get the number of ancestral genes at level of Rodents

How can I figure out the evolutionary history of genes in a given genome? (ii)

pyHam provides a feature to trace for HOGs/genes along a branch that span across one or multiple taxonomic ranges and report the HOGs that duplicated on this branch, got lost on this branch, first appeared on that branch, or were simply retained. The ‘vertical map’ (see further information on map here) allows for retrieval of all genes and their evolutionary history between the two taxonomic levels (i.e. which genes have been duplicated, which genes have been lost, etc).

# Get the genome of interest
human = ham_analysis.get_extant_genome_by_name("HUMAN")
vertebrates = ham_analysis.get_ancestral_genome_by_name("Vertebrata")

# Instanciate the gene mapping !
vertical_human_vertebrates = ham_analysis.compare_genomes_vertically(human, vertebrates) # The order doesn't matter!

# The identical genes (that stay single copies) 
# one HOG at vertebrates -> one descendant gene in human

# The duplicated genes (that have duplicated) 
# one HOG at vertebrates -> list of its descendants gene in human

# The gained genes (that emerged in between)
# list of gene that appeared after vertebrates taxon

# The lost genes (that been lost in between) 
HOG at vertebrates that have been lost before human taxon

How can I get an overview of the gene evolution dynamics along the tree that occured in my genomic setup? (iii)

pyHam includes treeProfile (extension of the Phylo.io tool), a tool to visualise an annotated species tree with evolutionary events (genes duplications, losses, gains) mapped to their related taxonomic range. The aim is to provide a minimalist and intuitive way to visualise the number of evolutionary events that occurred on each branch or the numbers of ancestral genes along the species tree.

# create a local treeprofile web page
treeprofile = ham_analysis.create_tree_profile(outfile="treeprofile_example.html")

As you can see in the figure above, the treeprofile is composed of the reference species used to perform the pyham analysis. Each internal node is displayed with its related histogram of phylogenetic events (number of genes duplicated, lost, gained, or retained) that occurred on each branch. The tree profile either display the number of genes resulting from phylogenetics events or the number of phylogenetic events on themself; the switch can be made by opening the settings panel (histogram icon on top right) and selecting between ‘genes’ or ‘events’.

How can I visualise the evolutionary history of a gene family (HOG)? (iv)

pyHam embeds iHam, an interactive tool to visualise gene family evolutionary history. It provides a way to trace the evolution of genes in terms of duplications and losses, from ancient ancestors to modern day species.

# Select an HOG
hog_of_interest = pyham_analysis.get_hog_by_id(2)

# create and export the hog vis as .html
output_filename = "hogvis_example.html"

Then, you simply have to double click on the .html file to open it in your default internet browser. We provide you an example below of what you should see. A brief video tutorial on iHam is available at this URL.

iHam is composed of two panels: a species tree that allows you to select the taonomic range of interest, a genes panel where each grey square represents an extant gene and each row a species.

We can see for example that at the level of mammals (click on the related node and select ‘Freeze at this node’) all genes of this gene family are descendant from a single comon ancestral gene.

Now, if we look at the level of Euarchontoglires (redo the same procedure as for mammals to freeze the vis at this level) we observe that the genes are now split by a vertical line. This vertical line separates 2 group of genes that are each descendants from a same single ancestral gene. This is the result of a duplication in between Mammals and Euarchontoglires.

This small example demonstrate the simplicity of iHam usefulness to identify evolutionary events that occured in gene families (e.g. when a duplication occured, which species have lost genes or how big genes families evolved).

Share or comment:

To be informed of future posts, sign up to the low-volume blog mailing-list, subscribe to the blog's RSS feed, or follow us on Twitter. To read old posts, check out the index here.

Sex, alcohol, and structural variants in fission yeast

• Authors: Fritz Sedlazeck, Dan Jeffares & Christophe Dessimoz •

Our latest study just came out (Jeffares et al., Nature Comm 2017). In it, we carefully catalogued high-confidence structural variants among all known strains of the fission yeast population, and assessed their impact on spore viability, winemaking and other traits. This post gives a summary and the story behind the paper.

Structural variants (SVs) measure genetic variation beyond single nucleotide changes …

Next generation sequencing is enabling the study of genomic diversity on unprecedented levels. While most of this research has focused on single base pair differences (single nucleotide polymorphisms, SNPs), larger genomic differences (called structural variations, SVs) can also have an impact on the evolution of an organism, on traits and on diseases. SVs are usually loosely defined as events that are at least 50 base pair long. They are often classified in five subtypes: deletions, duplications, new sequence insertions, inversions and translocations.

Over the recent years the impact of SVs has been characterized in many organisms. For example, SVs play a role in cancer, when duplications often lead to multiple copies of important oncogenes. Furthermore, SVs are known to play a role in other human disorders such as autism, obesity, etc.

… but calling structural variants remains challenging

In principle, identifying SVs seems trivial: just map paired-end reads to a reference genome, look for any abnormally spaced pairs or split reads (i.e. reads with parts mapping to different regions), and—boom—structural variants!

In practice, things are much harder. This is partly due to the frustrating tendency for SVs occur in or near repetitive regions where short read sequencing struggles to disambiguate the reads. Or in highly variable regions of genome such as the chromosome ends, which tend to be the tinkering workshop of the genome.

As a result, a large proportion of SVs—typically at least 30-40%—remain undetected. As for false discovery rates (proportion of wrongly inferred SVs), they are mostly not well known because validating SVs on real data is very laborious.

Fission yeast: a compelling model to study structural variants

Studying structural variants in Schizosaccharomyces pombe is especially suited because:

  1. The genome is small, well-annotated and simple (few repeats, haploid).
  2. We had 40x or more coverage over 161 genomes covering the worldwide known population of S. pombe.
  3. We had more than 220 accurate trait measurements for these strains at hand. Since the traits are measured under strictly controlled conditions, they contain little (if any) environmental variance—in stark contrast to human traits.

SURVIVOR makes the most out of (imperfect) SV callers

To infer accurate SVs calls, we introduced SURVIVOR, a consensus method to reduce the false discovery rate, while maintaining high sensitivity. Using simulated data, we observed that consensus calls obtained from two to three different SV callers could recover most SV while keeping the false-discovery rate in check. For example, SURIVOR performed second best with a 70% sensitivity (best was Delly: 75%), while the false discovery rate was significantly reduced to 1% (Delly: 13%) (but remember these figures are based on simulation; performance on real data is likely worse.) Furthermore, we equipped SURVIVOR with different methods to simulate data sets and evaluate callers; merge data from different samples; compute bad map ability regions (BED file) over the different regions, etc. SURVIVOR is written in C++ so it’s fast enough to run on large genomes as well. Since then, we are running it on multiple human data sets, which takes only a few minutes on a laptop. SURVIVOR is available on GitHub.

SVs: now you see me, now you don’t

We applied SURVIVOR to our 161 genomic data sets, and then manually vetted all our calls to obtain a trustworthy set of SVs. We then discovered something suspicious. Some groups of strains that were very closely related (essentially clonal, differing by <150 SNPs) had different numbers of duplications, or different numbers of copies in duplications (1x, 2x, even 6x). This observation was also validated with lab experiments.

Interestingly we identified 15 duplications that were shared between the more diverse non-clonal strains (so these must have been shared during evolution) but could not be explained by the tree inferred from SNPs (Figure 1). To confirm this we compared the local phylogeny of SNPs in 20kb windows up and downstream of the duplications with the variance in copy numbers. Oddly the copy number variance was not highly correlated with the SNP tree. This lead to the conclusion that some SVs are transient and thus are gained or lost faster than SNPs.


Tree reconstructed from SNPs, with coloured dots indicating strains with identical SVs.

Duplications happen within near-clonal populations Phylogenetic tree of the strains reconstructed from SNPs data, with eight pairs of very close strains that nonetheless show structural variation. Click to enlarge.


Though this transience came as a surprise, there is actually supporting evidence from laboratory experiments carried out by Tony Carr back in 1989 that duplications can occur frequently in laboratory-reared S. pombe, and can revert. (Carr et al. 1989). The high turnover raises the possibility that SVs could be an important source for environmental adaptation.

SVs affect spore viability and are associated with several traits

We then investigated the phenotypic impact of these SVs. We used the 220 trait measurements from previous publications. We observed an inverse correlation between rearrangement distance and spore viability, confirming reports in other species that SVs can contribute to reproductive isolation. We also found a link between copy number variation and two traits relevant to wine making (malic acid accumulation and glucose+fructose ultilisation) (Benito et al. PLOS ONE 2016).


plots from SV paper showing the relationship between structural variants and spore viability, as well as the contribution of SVs to trait heritability

Structural variants, reproductive isolation, and wine. A) Making crosses between fission yeast strains often results in low offspring survival. The theory is that rearrangements (inversions and translocations) cause errors during meiosis, so we might expect them to affect offspring viability. If we compare offspring viability from crosses with the number of rearrangements that the parents differ by, there is a correlation, and a ‘forbidden triangle’ in the top right of the plot (it seem impossible to produce high viability spores when parents have many unshared rearrangements). B) SVs also affect traits. For > 200 traits (vertical bars) we used [LDAK](http://dougspeed.com/ldak/) to estimate the proportion of the narrow sense heritability that was caused by copy number variants (red), rearrangements (black) and SNPs (grey). Some traits are very strongly affected by copy number variants, such as the wine-making traits (wine-colored bars along the x-axis). C) Fission yeast wine tasting at UCL—how much of the taste is due structural variants? (Jürg Bähler at right).


We used the estimation of narrow sense heritability from Doug Speed’s LDAK program. Narrow sense heritability estimates how much of a difference in a trait between individuals can be explained by adding up all the tiny effects of the genomic differences (in our case SNPs; deletions and duplications; inversions and translocations and all combined). Overall, we found the heritability was better explained when combining the SNP data as well as the SVs data. In 45 traits SVs explained 25% or more of the trait variability. Five traits that were explained by over 90% heritability using SNPs and SVs came from different growth conditions in liquid medium. This may highlight again the influence of environmental conditions on the genomic structure. For 74 traits (~30% of those we analyzed) SVs explain more of the trait than the SNPs. These high SV-affected traits include malic acid, acetic acid and glucose/fructose contents of wine, key components of taste.

A collaborative effort

On a personal note, the paper concludes a wonderful team effort over two and a half years.

The project started as a summer project for Clemency Jolly, who had then just completed her 3rd undergraduate year at UCL, in the Dessimoz and Bähler labs. Dan Jeffares and the rest of the Bähler lab had just published their 161 fission yeast genomes, with an in-depth analysis of the association between SNPs and quantitative traits (Jeffares et al., Nature Genetics 2015). Studying SVs was the logical next step, but given the challenging nature of reliable SV calling, we also recruited to the team Fritz Sedlazeck, collaborator and expert in tool development for NGS data analysis then based in Mike Schatz’s lab at Cold Spring Harbor Laboratory.

At the end of the summer, it was clear that we were onto something, but there was still a lot be done. Clemency turned the work into her Master’s project, with Dan and Fritz redoubling their efforts until Clemency graduation in summer 2015. It took another year of intense work lead by Dan and Fritz to verify the calls, perform the GWAS and heritability analyses, and publish the work. Since then, Clemency has started her PhD at the Crick Institute, Fritz has moved to John Hopkins University, and Dan has started his own lab at the University of York.



Jeffares, D., Jolly, C., Hoti, M., Speed, D., Shaw, L., Rallis, C., Balloux, F., Dessimoz, C., Bähler, J., & Sedlazeck, F. (2017). Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast Nature Communications, 8 DOI: 10.1038/ncomms14061

Carr AM, MacNeill SA, Hayles J, & Nurse P (1989). Molecular cloning and sequence analysis of mutant alleles of the fission yeast cdc2 protein kinase gene: implications for cdc2+ protein structure and function. Molecular & general genetics : MGG, 218 (1), 41-9 PMID: 2674650

Jeffares, D., Rallis, C., Rieux, A., Speed, D., Převorovský, M., Mourier, T., Marsellach, F., Iqbal, Z., Lau, W., Cheng, T., Pracana, R., Mülleder, M., Lawson, J., Chessel, A., Bala, S., Hellenthal, G., O’Fallon, B., Keane, T., Simpson, J., Bischof, L., Tomiczek, B., Bitton, D., Sideri, T., Codlin, S., Hellberg, J., van Trigt, L., Jeffery, L., Li, J., Atkinson, S., Thodberg, M., Febrer, M., McLay, K., Drou, N., Brown, W., Hayles, J., Salas, R., Ralser, M., Maniatis, N., Balding, D., Balloux, F., Durbin, R., & Bähler, J. (2015). The genomic and phenotypic diversity of Schizosaccharomyces pombe Nature Genetics, 47 (3), 235-241 DOI: 10.1038/ng.3215

Benito, A., Jeffares, D., Palomero, F., Calderón, F., Bai, F., Bähler, J., & Benito, S. (2016). Selected Schizosaccharomyces pombe Strains Have Characteristics That Are Beneficial for Winemaking PLOS ONE, 11 (3) DOI: 10.1371/journal.pone.0151102

More info

Share or comment:

To be informed of future posts, sign up to the low-volume blog mailing-list, subscribe to the blog's RSS feed, or follow us on Twitter. To read old posts, check out the index here.

How to access scientific papers for free?

• Author: Christophe Dessimoz •

You have a reference to a research paper of interest.

Perhaps this one:

Iantorno S et al, Who watches the watchmen? an appraisal of benchmarks for multiple sequence alignment, Multiple Sequence Alignment Methods (D Russell, Editor), Methods in Molecular Biology, 2014, Springer Humana, Vol. 1079. doi:10.1007/978-1-62703-646-7_4

How can you retrieve the full article?

Gold open access

If the paper is published as “gold” open access, you can download the PDF from the publisher’s website. In such cases, it’s easiest to paste the title and author names in Google and look for the first hit in the journal.

Alternatively, if you know the Digital Object Identifer (DOI) of the article, prepending http://doi.org should directly lead you to the article. In this case, the DOI is included at the end of the citation. (If you don’t know its DOI, you can find it on the publisher’s website or in a database such as PubMed). We get:


Unfortunately, we see that this particular paper is behind a paywall at the publisher.

Green open access

There is, however, still a chance that it might be deposited in a preprint server or in an institutional repository. This is referred to as “green” open access. One way to find out is to look at the article record in Google Scholar and look for a link in the right margin:

screenshot of google scholar with link in the right margin circled

In this case, the paper is thus available on arXiv.org.

If you know the DOI, an even quicker way of looking for a deposited version is by using the http://oadoi.org redirection tool, which works analogously to doi.org but redirect to a green open access version whenever possible:


Here, a free version of the article deposited in the UCL institutional repository is found.

If you use Chrome or Firefox, you can also use the Unpaywall browser extension to automatically get a link to green open access alternatives as you land on paywalled articles.

On the author’s homepage

Sometimes, the paper is available on the homepage of one of the authors. In this case, a link to the preprint is provided on the homepage of the corresponding author (item #36).


Instead of an institutional homepage, some authors self-archive their articles on ResearchGate. In the case of our paper, the full-text version is indeed directly available.

And otherwise, if one of the authors is active on ResearchGate, it’s also possible to send a full-text request at the click of a button.

Pirated version off Sci-Hub

Sci-Hub serves bootleg copies of pay-walled articles. This is illegal, so I only mention it for educational purposes. This works most reliably using, again, DOIs:


If you, purely hypothetically of course, pasted that URL in your browser, you would or would not get a PDF of the entire book in which the referenced article appears.


It’s also possible to request full-text articles via Twitter. As described in Wikipedia, this works by tweeting the article title, its DOI, an email address (to indicate to whom the article should be sent), and the hashtag #icanhazpdf. Someone with access to the article might send a copy via email. Once the article is received, the tweet is deleted. Again, I mention this for educational purpose only—don’t break the law.

cat asking whether it can haz pdf because cat is purr

Image credit: Field of Science

Altmetrics.com wrote an interesting post on #icanhazpdf a few years ago.

Email the corresponding author

Finally, you can always ask the corresponding author by email for a copy of their article. They will happily oblige.


[Update (19 Mar 2017): added mention of unpaywall to seemlessly retrieve green open access]

Share or comment:

To be informed of future posts, sign up to the low-volume blog mailing-list, subscribe to the blog's RSS feed, or follow us on Twitter. To read old posts, check out the index here.

Life as an academic: my 2016 in numbers

• Author: Christophe Dessimoz •

Life as an academic is varied and busy. Students sometimes believe that all we do is teach. In fact, we do quite a few other things. Here’s my 2016 in numbers.

  • number of papers published: 10
  • number of paper rejections: 7
  • number of books edited: 1
  • number of grant proposals submitted: 8
  • number of research contracts negotiated with the industry: 2
  • number of blog posts: 5
  • number of tweets: 474 (66% were retweets)
  • number of YouTube videos: 1
  • number of papers reviewed: 24
  • number of papers edited: 3
  • number of grants reviewed: 3
  • number of PhD theses examined: 2
  • number of emails received (excluding spam and mailing-lists): 12,695
  • number of emails written: 4,377 (!)
  • number of minutes videoconferencing on GoToMeeting: 13,236 (!!)
  • number of Geneva-London-Geneva roundtrips: 12
  • number of meetings with >50 attendees co-organised: 6
  • number of seminars hosted: 4
  • number of conferences attended: 3
  • number of talks given: 11
  • number of semester-long courses organised: 2
  • number of hours lectured: 32
  • number of 2000-word student papers marked: 47
  • number of summer students supervised: 4
  • number of overnight retreats attended: 4
  • number of work Christmas dinners attended: 3
  • number of annual reports written: 3 (this does not count)
  • number of Tête de Moine eaten at lab celebrations: 4
  • number of times moved home: 0 (noteworthy since we moved 5 times in the preceding 5 years…)

I wish you, Dear Reader, all the best in 2017!

Share or comment:

To be informed of future posts, sign up to the low-volume blog mailing-list, subscribe to the blog's RSS feed, or follow us on Twitter. To read old posts, check out the index here.

What are Hierarchical Orthologous Groups (HOGs)?

• Author: Christophe Dessimoz •

One central concept in the OMA project and other work we do to infer relationships between genes is that of Hierarchical Orthologous Groups, or “HOGs” for the initiated.

We’ve written several papers on aspects pertaining to HOGs—how to infer them, how to evaluate them, them being increasingly adopted by orthology resources, etc.—but there is still a great deal of confusion as to what HOGs are and why they matter.

Natasha Glover, talented postdoc in the lab, has produced a brief video to introduce HOGs and convey why we are mad about them!




Altenhoff, A., Gil, M., Gonnet, G., & Dessimoz, C. (2013). Inferring Hierarchical Orthologous Groups from Orthologous Gene Pairs PLoS ONE, 8 (1) DOI: 10.1371/journal.pone.0053786

Boeckmann, B., Robinson-Rechavi, M., Xenarios, I., & Dessimoz, C. (2011). Conceptual framework and pilot study to benchmark phylogenomic databases based on reference gene trees Briefings in Bioinformatics, 12 (5), 423-435 DOI: 10.1093/bib/bbr034

Sonnhammer, E., Gabaldon, T., Sousa da Silva, A., Martin, M., Robinson-Rechavi, M., Boeckmann, B., Thomas, P., Dessimoz, C., & , . (2014). Big data and other challenges in the quest for orthologs Bioinformatics, 30 (21), 2993-2998 DOI: 10.1093/bioinformatics/btu492

Share or comment:

To be informed of future posts, sign up to the low-volume blog mailing-list, subscribe to the blog's RSS feed, or follow us on Twitter. To read old posts, check out the index here.

Interview with Tunca Doğan, OMA Visiting Fellow 2016

• Author: Tunca Doğan •

Note: the “Life in the Lab” series features interviews of interns and visitors. This post is by our second 2016 OMA Visiting Fellow Tunca Doğan, who spent a month with us earlier this year. You can follow Tunca on Twitter at @tuncadogan. —Christophe


Please introduce yourself in a few sentences.

My name is Tunca Doğan. I received my PhD in 2013 with a thesis study in the fields of bioinformatics and computational biology where we developed methods for the clustering of the protein sequences using unsupervised machine learning techniques (Dogan and Karacali, 2013). I’ve since been working as a post-doctoral fellow in the EMBL-EBI, UK under the Protein Function Development team (UniProt Database) leaded by Dr Maria Martin. Here I’m developing new tools and methods for the automated functional annotation of protein records in the UniProtKB using a variety of features including domain architectures (Dogan et al., 2016). I’m also conducting research in the field of computational drug discovery. As of 2016, I’m also affiliated to the Department of Health Informatics, METU, Turkey both as a senior research fellow and a faculty candidate.

Why did you choose to apply to the OMA visiting fellowship programme?

The team behind OMA is world-leading in the field of phylogenomics, and they authored many highly cited publications in this area. Moreover, OMA is considered to be one of the most reliable and comprehensive resources offering phylogenomic information on various species. I’ve applied to this programme in order to develop my knowledge in phylogenomic research, particularly about the OMA production. My specific research aim was to investigate if and how the information in OMA can be utilized in order to increase the coverage and the quality of the automated functional annotation of proteins in the UniProt database.

Discussions on UNIL campus with Leonardo de Oliveria Martins, Surag Nair, Clément Train, David Dylus and Tunca Doğan (from l. to r.)

What project did you work on during your visit?

The project I worked on had two sides: 1) investigating novel ways of quality checking of the data produced in the OMA pipeline (especially HOGs) using the Domain Architecture Alignment and Classification (DAAC) method I previously developed in UniProt; 2) investigating the use of OMA groups and HOGs to propagate the functional annotation between the (homologous) member proteins of the same clusters/classes.

Was there any highlight or low point you’d like to share?

It was a great experience for me both professionally and socially. I’ve learnt a great deal in just one month and we still keep our collaboration with the continuation of the abovementioned project. Everyone I met in the group: Christophe, Adrian, David, Leonardo and Clement were all knowledgeable, helpful and friendly that I had great time during my stay. It was a great pleasure to meet and to work with them all…

UNIL/EPFL campus is just beautiful, at the shores of lake Geneva. The campus is also well-equipped for all possible needs. This was also my first time in Switzerland and I was enchanted by the beauty of this country… The only downside for a foreign visitor could be the expensiveness of life in Switzerland, which was also manageable with a little prior investigation and planning.

Do you have any practical tip for future OMA visiting fellows?

I definitely recommend any researcher (at PhD or post-doc level) that has an interest in phylogenomics to apply to this programme. You’ll learn a great deal and have a good time at the same time. Also (for the foreigners) do not forget about travelling around this beautiful country in your spare time…


Editor’s note: If you are interested in the OMA visiting fellowship programme, consult this page.


Doğan T, & Karaçalı B (2013). Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences. PloS one, 8 (9) PMID: 24069417

Doğan T, MacDougall A, Saidi R, Poggioli D, Bateman A, O’Donovan C, & Martin MJ (2016). UniProt-DAAC: domain architecture alignment and classification, a new method for automatic functional annotation in UniProtKB. Bioinformatics (Oxford, England), 32 (15), 2264-71 PMID: 27153729

Share or comment:

To be informed of future posts, sign up to the low-volume blog mailing-list, subscribe to the blog's RSS feed, or follow us on Twitter. To read old posts, check out the index here.

Creative Commons 
            License The Dessimoz Lab blog is licensed under a Creative Commons Attribution 4.0 International License.
Last modified on November 13th, 2018.