# The Banana Conjecture

• Author: Natasha Glover •

I recently became aware of the memes and popular science articles going around the internet claiming that we share 50% of our DNA with bananas. For example:

I work in the Dessimoz lab at the University of Lausanne, and here we are in the business of comparing genes. In fact, I’ve had a similar question before一 what percentage of our protein-coding genes do we share with another plant, Arabidopsis thaliana. I computed the number as being closer to 17%.

I wanted to get to the bottom of this question once and for all: What percentage of a human’s “genetic material” is shared with a banana? There have been several other blog posts from scientists touching on this question (Neil Saunders: “50% bananas”, Stack Exchange skeptics: “Do humans share 50% of their DNA with bananas?”, Sanogenetics: “Are We Genetically Similar To Bananas And Why Is This Important For Research In Disease?”).

However, I wanted to go a little deeper into:

1. Where this number came from, and the extent of it being spread on the internet.
2. What exactly do we mean by “shared genetic material”?
3. Some results I computed in attempts to put this controversy to rest once and for all.

In this blog post I will attempt to address these questions.

## Where did the mythical 50% come from anyway?

After performing a quick google search, it seems that the relatedness between a human and a banana has been a popular question. With a cursory, non-exhaustive search, I show in the table below eight sources who report that 44-60% of the human genome is “shared” with banana.

Source quote
Irishnews.com “But we are also genetically related to bananas – with whom we share 50% of our DNA – and slugs – with whom we share 70% of our DNA.”
getscience.com “Banana: more than 60 percent identical. Many of the “housekeeping” genes that are necessary for basic cellular function, such as for replicating DNA, controlling the cell cycle, and helping cells divide are shared between many plants (including bananas) and animals.”
thenakedscientists.com “So where does this banana statistic come from? Is it just complete nonsense? Well, no. We do in fact share about 50% of our genes with plants – including bananas.”
PopSci.com “Bananas have 44.1% of genetic makeup in common with humans.”
MythBusters (tv show) facebook “#sciencefact: Humans share approximately 98% of their DNA with chimps, 70% with slugs, and 50% with bananas! http://bit.ly/qsWX8p”
mirror.co.uk “Humans share 50% of our DNA with a banana.”
Business Insider “The genetic similarity between a human and a banana is 60%.” Source: National Human Genome Research Institute (However, no link and when I tried to search)
Sundaypost.com “Yes, and we share 50% with bananas. It’s not surprising, if you look at the basic mechanism of biochemistry.”

What is disconcerting is that at least half of these sources come from popular science websites or science sections of newspapers, yet few have any sort of citation at all. The only exceptions were Popular Science, which gave DataScope as a source, and Business Insider, who cites the National Human Genome Research Institute. However, neither of these articles give a link or further information to follow up on.

Upon further digging, I found one recent article published on howstuffworks entitled “Do People and Bananas Really Share 50 Percent of the Same DNA?”, which contains an interview with one of the scientists from the Human Genome Research Institute, where he explains how they arrived at that number.

“Brody says the experiment was not published, as most scientific research is. Instead, it was generated to be included as part of an educational Smithsonian Museum of Natural History video called ‘The Animated Genome.’ That video noted that DNA between a human and a banana is ‘41 percent similar.’””

The article goes on to explain that this 41% figure comes from a blast search between protein sequences of human and banana. They found about 7,000 hits, and the average percent identity of these hits was 41%. He goes on to note:

“This is the average similarity between proteins (gene products), not genes… Of course, there are many, many genes in our genome that do not have a recognizable counterpart in the banana genome and vice versa.”

So when we get to the bottom of it, the 50% figure is actually 40% average amino acid percent identity between 7000 blast hits of human and banana.

## What do we mean when we say “we share 50% of our DNA with a banana”?

All living organisms descended from a common ancestor, and therefore all living organisms have some genes in common. What determines how many genes in common depends on how far back in time the two species shared a common ancestor. For example, humans and chimps share such a high percentage of genes, because we only diverged ~6 MYA1. However, human and banana (more specifically the common ancestors which led to human and banana) split around 1.5 BILLION years ago2. Talk about a banana split! Therefore we would expect a lot less to be conserved.

As brought up by Neil Saunders in his blog post, “What does ‘we share 50% of our DNA’ really mean?” A non-biologist perhaps might not see the nuance in this question. If I were going to play the devil’s advocate, I could say that a child shares 50% of its DNA with their parents. Or even that every organism shares 100% DNA, as it is all made up of Gs, Cs, As, and Ts. Thus, it is important to be specific on what we’re talking about.

This shared DNA could be referring to a number of things: protein-coding genes, non-coding genes, transposable elements, the percent that gets aligned in a whole genome alignment3, etc. Each of these specific features evolve at different rates, and thus will be more or less conserved between any given species.

### Well, how do we know if these genomic features are conserved?

Generally, sequences are compared by making an alignment, and then computing the percent identity or evolutionary distance between the two sequences. If the sequences are sufficiently similar, they can be declared as conserved. Thus, “conserved” can be seen as either categorical (i.e. conserved or not), and then specified as a quantitative value (conserved to a certain degree). For more information, see the Wiki page on conserved sequences.

### What are the genomic features the most likely to be conserved?

“Conservation indicates that a sequence has been maintained by natural selection” (wiki). Genes, or DNA sequences, encode for the proteins. Proteins are slower to evolve and change than the DNA, due to the redundancy of the genetic code. Thus proteins are the genomic feature most likely to be conserved between evolutionary distant species. While it is true that other genomic features such as non-coding regulatory sequences or non-coding RNA can be conserved over long evolutionary distances, they are far more likely to diverge in sequence than proteins4,5. Other genetic features such as transposable elements, or intergenic “junk DNA” are even less likely to be conserved, as their sequences are under less selection pressure and accumulate mutations at an even higher rate.

It is important to note that while we generally declare sequences to be conserved on the basis of sequence similarity, sequences may be still conserved and lack similarity. For example, two sequences might be conserved in the structure of the protein, indicating homology 6,7. Additionally, sequences might be in a syntenic position, indicating ancestral conservation, but may also lack sequence similarity 8. Thus, it is possible for some genes to be shared between evolutionary distant species, but they may fly under the radar of our current homology-inference tools. So, in order to investigate the 50% shared DNA claim, we can only focus on sequence conservation which we are able to detect.

To understand how much of the genome is conserved between banana and human, I will look at proteins because it’s the feature most likely to be conserved between human and banana. This is to be as permissive as possible in attempts to give the benefit of the doubt to the 50% meme.

Now the question is, how do we compare all the proteins in one species to all the proteins in another species and see which ones “match”, i.e. descended from a common ancestral gene? This is a fundamental problem important for studying evolution. Orthologs are the term we use for genes in different species that started diverging due to a speciation event, i.e. “corresponding” genes between species. This is where our lab’s expertise comes in: we maintain Orthologous Matrix, which is a method and database for finding orthologs between many species.

## Orthologs in common between human and banana

I wanted to see what percent of human’s genes are orthologous to banana genes一and vice versa一what percent of banana’s genes are orthologous to human’s. To compare several different methods, I tested three common methods for finding orthologs: OMA9, OrthoInspector10, and best-bidirectional hit (using BLASTP)11. For each method, I divided the number of orthologs found by the number genes in the genome to come up with a percentage of each genome that is shared. You can find all the details here jupyter notebook, but the results are summed up in the graph below:

Comparison of ortholog methods

As you can see, all the orthology-inference methods tested show a maximum of 25% of human genes to be orthologous to banana. Again, these results give the most leeway, as we used protein sequences, which are the genomic elements the most likely to be conserved.

Additionally, I investigated the percentage of a whole-genome alignment that would be shared between banana and human. Since this is computationally intensive, I used Ensembl Compara, which has precomputed pairwise whole-genome alignments between a number of species. A whole-genome alignment looks at the whole genome, not just genes, as well as compares DNA rather than proteins. They didn’t have results between human and banana, but here are the results between human and chimp, mouse, and zebrafish:

Data obtained from https://uswest.ensembl.org/info/genome/compara/analyses.html#

As we get progressively further in evolutionary distance, we get a smaller and smaller percentage of the genome which is able to be aligned. We can presume that plants would be even less than 1%, a far cry from the 50% as reported by internet memes.

So whichever way you slice it, humans share at most ¼, not ½ of its genetic material with banana (at least what we are able to detect)!

## What do these human-banana orthologs DO?

Now that we have found the human-banana orthologs, we can try to gain some insight into what these genes do. To do this, I performed a Gene Ontology (GO) enrichment analysis of the human genes. GO enrichment works by assigning functional annotations to all of the sequences, then looking for a statistical overrepresentation of certain functions in a subset of genes compared to the entire genome.

I used the PANTHER Overrepresentation Test web server for the GO enrichment, then used GO-Figure12 for summarizing and visualizing the most enriched Biological Processes. All the details are in the jupyter notebook.

The top 10 overrepresented GO terms, i.e. a summary of the most common functions of the human genes with orthologs, is shown below:

Top 10 overrepresented GO Biological Processes for human protein-coding genes with banana orthologs

We can see that the human-banana orthologs are highly enriched for basic, metabolic processes such as “cellular metabolic process,” “gene expression,” and “RNA processing.” These biological functions are likely genes which encode for cellular processes that are essential for eukaryotic life!

## Take home message

• “Humans share 50% of DNA with banana” is a statement that has very little meaning.
• We must be careful to be precise in our language. We have to clarify what we mean when we give a percentage of “shared genetic material/DNA/genome.” I argue that the percentage of protein-coding genes is currently the best way to compare evolutionarily distant species
• There’s no evidence that humans have 50% of detectable orthologs with a banana. In my analysis, I show between 17 and 24%, depending on which method was used. As scientists, we have to do a better job communicating science with each other and with the general public.

Even though we don’t have 50% genes in common with banana, we still have ~20% which is nothing to scoff at! The functions of these genes are most likely basic housekeeping proteins involved in metabolic processes that are necessary for most, if not all of eukaryotic life. It is amazing that these genes have been conserved over 1.5 billion years of evolution!

### References

1. Patterson, N., Richter, D. J., Gnerre, S., Lander, E. S. & Reich, D. Genetic evidence for complex speciation of humans and chimpanzees. Nature 441, 1103–1108 (2006).
2. Wang, D. Y., Kumar, S. & Hedges, S. B. Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi. Proc. Biol. Sci. 266, 163–171 (1999).
3. Armstrong, J., Fiddes, I. T., Diekhans, M. & Paten, B. Whole-Genome Alignment and Comparative Annotation. Annu Rev Anim Biosci 7, 41–64 (2019).
4. Ransohoff, J. D., Wei, Y. & Khavari, P. A. The functions and unique features of long intergenic non-coding RNA. Nat. Rev. Mol. Cell Biol. 19, 143–157 (2018).
5. Diederichs, S. The four dimensions of noncoding RNA conservation. Trends Genet. 30, 121–123 (2014).
6. Illergård, K., Ardell, D. H. & Elofsson, A. Structure is three to ten times more conserved than sequence—a study of structural response in protein cores. Proteins 77, 499–508 (2009).
7. Zheng, W. et al. Detecting distant-homology protein structures by aligning deep neural-network based contact maps. PLoS Comput. Biol. 15, e1007411 (2019).
8. Vakirlis, N., Carvunis, A.-R. & McLysaght, A. Synteny-based analyses indicate that sequence divergence is not the main source of orphan genes. Cold Spring Harbor Laboratory 735175 (2019) doi:10.1101/735175.
9. Altenhoff, A. M. et al. OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Res. doi:10.1093/nar/gkaa1007.
10. Nevers, Y. et al. OrthoInspector 3.0: open portal for comparative genomics. Nucleic Acids Res. 47, D411–D418 (2019).
11. Moreno-Hagelsieb, G. & Latimer, K. Choosing BLAST options for better detection of orthologs as reciprocal best hits. Bioinformatics 24, 319–324 (2008).
12. Reijnders, M. J. & Waterhouse, R. M. Summary Visualisations of Gene Ontology Terms with GO-Figure! Cold Spring Harbor Laboratory 2020.12.02.408534 (2020) doi:10.1101/2020.12.02.408534.

Share or comment:

# Progress in genomic checkers

• Author: Nastassia Gobet •

When I started using word processors, the spell checker was only looking at small and common typing errors and was often trying to correct acceptable words due to lack of vocabulary. A few years later, they not only are better at it and use more developed dictionaries, but they can also capture grammar mistakes and redundant phrases. A similar story is happening with the detection of genomic variants.

## The genome as a big text

The genome can be considered as a big text, written in a 4-letter alphabet (A, C, G, T). When comparing the genomic words from two individuals, we can look at single or few letter(s) differences (single nucleotide variants, SNVs) and longer patterns (structural variants, SVs) such as words, sentences, and paragraphs that are added (insertions) or missing (deletions), exchanged (translocations), repeated (duplications and copy number variations, CNVs), inverted (inversions) or combinations of these (complex SVs).

## Discovering the importance of SVs

About ten years ago, the focus was mainly on SNVs as these are numerous and many methods to detect them were developed. They were studied in deep and indexed in dictionaries (databases) that also document their frequencies. However, one letter differences do not necessarily have a significant effect on the meaning of the text (the phenotypes). On the other hand, although SVs were underestimated and consequently understudied, they were discovered to have a profound phenotypic impact on gene regulation, dosage, and function. Therefore, they are important in a wide variety of medical conditions: cancers, neurological diseases (Parkinson, Huntington), and mental disorders (autism, schizophrenia).

## Challenges in SV identification

Methods were recently developed and are currently being developed to detect SVs. A number of challenges need to be dealt with. First, short read sequencing greatly limits the detection of large events exceeding read length. Consequently, using longer read technologies (PacBio and ONT) is improving the range of detectable SVs, but this comes at the cost of decreased sequencing accuracy and higher price. Hybrid strategies combining short and long reads are therefore promising. Second, SVs are hard to classify as the variant type depends on variant sequence context: a sequence can be considered an insertion, duplication, or translocation depending on the source (Figure 1). In addition, the number of possible SVs is infinite, whereas for SNVs there are 3 variants per position in the worst case. SVs are thus hard to compare: which criteria should we use to determine if two slightly different calls correspond to the same event or not? This affects SV reporting and frequencies. Due to the relative youth of the field, standards and best practices have yet to be established. Different initiatives (eg. Genome in a Bottle and SEQC2) aim at better characterizing false positives and false negatives in SV calling. This should help implement more objective benchmarking and comparison between the various detection methods.

Figure 1: An SV was called for a sequence from a sample differing from the reference sequence. Three possible scenarios of formation could explain the SV observed: an insertion, a duplication or a translocation.

## Future of genomic spelling and grammar checkers

Standards and objective benchmarking for SV detection are still missing, so one must be careful with results obtained from current methods. However, SVs are increasingly recognized as being important and technologies to detect them are evolving rapidly. I think their use will become a more common practice in genomic variation studies in a few years, similar to spelling and grammar checkers in text processors. And you, which genome checker will you use?

## Reference

Mahmoud M, Gobet N, Cruz-Dávalos DI, Mounier N, Dessimoz C, Sedlazeck FJ. 2019. Structural variant calling: the long and the short of it. Genome Biol 20:246. doi:10.1186/s13059-019-1828-7.

If you want to get involved in improving SV variant detection, consider joining this Hackathon, to be hold remotely Oct. 11-14, 2020.

Share or comment:

# OMA Standalone made easy: a step-by-step guide

• Author: Natasha Glover •

Got newly sequenced genomes with protein annotations? Need to quickly and easily define the homologous relationships between the genes?

OMA Standalone is a software developed by our lab which can be used to infer homologs from whole genomes, including orthologs, paralogs, and Hierarchical Orthologous Groups (Altenhoff et al 2019).

The OMA Standalone algorithm works like this:

In short, it takes as input user-contributed custom genomes (with the option of combining them with reference genomes already in the OMA database), and proceeds through three main parts:

1. Quality and consistency checks of the genomes that will be used to run OMA Standalone;
2. All-against-all alignments of every protein sequence to all other protein sequences;
3. Orthology inference, in the form of: pairwise orthologs, OMA Groups, and Hierarchical Orthologous Groups (HOGs). For more information on these types of orthologs output by OMA, see OMA: A Primer (Zahn-Zabal et al. 2020).

Although the OMA Standalone is well-documented and straightforward, one of the challenges can be running it on an High Performance Cluster (HPC).

In order to understand the bare necessities needed to run OMA Standalone, we wrote an OMA Standalone Cheat Sheet, which you can download and follow the step-by-step instructions on running the software on an HPC. We use the cluster Wally as an example, as that is one of the HPCs here at the University of Lausanne. Wally uses SLURM as the scheduler for submitting jobs, so all the examples will be shown with that. We plan in the future to provide additional information on running with other schedulers, such as LSF or SGE. In the Cheat Sheet, you will find tips, hints, commands, and example scripts to run OMA Standalone on Wally.

Additionally, we prepared a video which walks the user through the process of running OMA Standalone from start to finish, including:

• Preparing your genomes for running
• Editing the necessary parameters file
• Creating the job scripts and

The video can be found on our lab’s YouTube channel, at OMA standalone: how to efficiently identify orthologs using a cluster, and is also embedded here for your convenience:

We hope these resources can be helpful if you need help getting started running OMA Standalone. But don’t forget, there is also plenty of information that can be found on the OMA Standalone webpage or in the OMA Standalone paper. If all else fails, don’t hesitate to contact us on Biostars.

## References

1. Altenhoff, A. M. et al. OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Res. 29, 1152–1163 (2019).
2. Zahn-Zabal, M., Dessimoz, C. & Glover, N. M. Identifying orthologs with OMA: A primer. F1000Res. 9, 27 (2020).

Share or comment:

# Creating a bibliography with links to PubMed and PubMedCentral

• Author: Christophe Dessimoz •

We just submitted a paper to Nucleic Acids Research Web Server issue. As it turns out, the editor requires a bibliography with links to DOI, PubMed, and PubMedCentral entries.

This is a brief tutorial to generate such a bibliography from a bibtex file which contains the relevant entries, loosely based on the explanations provided in this TeX StackExchange entry.

## Generating the bibtex file

In the lab, we mainly use Paperpile as bibliography management system, but most system allow to export records in bibtex format. If available, Paperpile includes DOI, PubMed IDs, and PubMedCentral IDs as follows:

@ARTICLE{Glover2019-xs,
title    = "{Assigning confidence scores to homoeologs using fuzzy logic}",
author   = "Glover, Natasha M and Altenhoff, Adrian and Dessimoz, Christophe",
journal  = "PeerJ",
volume   =  6,
pages    = "e6231",
year     =  2019,
doi      = "10.7717/peerj.6231",
pmid     = "30648004",
pmc      = "PMC6330999"
}

In this example, we store the bibliography in a file named ref.bib.

## Extending biblatex to include PMID and PMC links in the bibliography

DOI are already supported by most bibliography systems. To also include PMID and PMCIDs, the trick is to use the flexible BibLatex package.

In a separate definition file, which we named adn.dbx, add the additional definitions for PMID and PMCIDs.

\DeclareDatamodelFields[type=field,datatype=literal]{
pmid,
pmc,
}

We can now include this file in the library definition in the main LaTeX file (in the preamble, i.e. before \begin{document}, and define the links to PubMed and PubMedCentral entries:

\usepackage[backend=biber,style=authoryear,datamodel=adn]{biblatex}
\DeclareFieldFormat{pmc}{%
\ifhyperref
\DeclareFieldFormat{pmid}{%
\ifhyperref
\renewbibmacro*{finentry}{%
\printfield{pmid}%
\newunit
\printfield{pmc}%
\newunit
\finentry}
\addbibresource{ref.bib}

We can generate the bibliography by citing every paper using the \cite{} command, and printing the bibliography.

\cite{Glover2019-xs}
\newpage
\printbibliography

## Polishing: highlighting the links with colour

To make the links more visible, define the hyperref package accordingly:

\usepackage[colorlinks]{hyperref}


## OK, thanks but could I just have the files please?

Of course! Here they are.

Share or comment:

# How to get published: interview series

• Author: Natasha Glover •

You’ve wrapped up the research on project. You’ve gotten good results. It’s time to publish them—your PhD/postdoc/career depends on it.

But you’re drawing a blank. Staring at the screen for hours, not knowing how to get started, and the only thing you can produce is an empty document. One of the most daunting tasks for young scientists is writing and publishing a paper, especially the first one.

In the context of a tutorial of the UNIL Quantitative Biology PhD Program, I prepared a series of three short videos featuring interviews with professors on tips to successfully publish a scientific paper. These videos were aimed towards PhD students, but contain useful advice for anyone, at any stage of their career. Here’s what they had to say:

# Part 3: responding to reviewers

Share or comment:

# Exclusive: European Tour of Antonis Rokas

• Author: Christophe Dessimoz •

We are delighted to host Prof. Antonis Rokas, Vanderbilt, for two special seminars at University College London and at the University of Lausanne!

## Genomics and the making of biodiversity across the budding yeast subphylum

Prof. Antonis Rokas, Vanderbilt University

London: Tue 13 Nov 2018, 11am, UCL, Roberts Building 309
Lausanne: Wed 14 Nov 2018, 11am, UNIL, Genopode auditorium A

Abstract

Yeasts are unicellular fungi that do not form fruiting bodies. Although the yeast lifestyle has evolved multiple times, most known species belong to the subphylum Saccharomycotina (hereafter yeasts). This diverse group includes the premier eukaryotic model system, Saccharomyces cerevisiae; the common human commensal and opportunistic pathogen, Candida albicans; and over 1,000 other known species (with more continuing to be discovered). Yeasts are found in every biome and continent and are more genetically diverse than either plants or bilaterian animals. Ease of culture, simple life cycles, and small genomes (10– 20 Mbp) have made yeasts exceptional models for molecular genetics, biotechnology, and evolutionary genomics. Since only a tiny fraction of yeast biodiversity and metabolic capabilities has been tapped by industry and science, expanding the taxonomic breadth of deep genomic investigations will further illuminate how genome function evolves to encode their diverse metabolisms and ecologies. As part of National Science Foundation’s Dimensions of Biodiversity program, we have undertaken a large-scale comparative genomic study to uncover the genetic basis of metabolic diversity in the entire Saccharomycotina subphylum. In my talk, I will discuss the team’s evolutionary analyses of 332 genomes spanning the diversity of the subphylum. These include establishing a robust genus-level phylogeny and timetree for the subphylum, quantification of the extent of horizontal gene transfer for the subphylum, and characterization of the evolution of approximately 50 metabolic traits (and, in some cases, their underlying genes and pathways). These analyses allow us, for the first time, to infer the key metabolic characteristics of the Last Yeast Common Ancestor (LYCA) and characterize the tempo and mode of genome evolution across an entire subphylum.

All welcome!

Share or comment:

# Predicting QTL genes by integrating functional data across species

• Author: Christophe Dessimoz •

## The problem in a nutshell

Quantitative Trait Loci (QTL) are regions of a genome for which genetic variants correlate with particular traits. To take a simple example in plants, one might observe that the average seed size (trait) is significantly larger when considering the subset of a population which has a C at a particular position in the genome than a subpopulation with a T.

The reason QTL identifies genomic regions and not precise positions is that neighbouring variants tend to be inherited together. These regions typically contain hundreds of genes, making it difficult to say which one(s) are causal to the trait variation—if any at all (the causal genetic variation(s) can be in non-coding regions too).

Thus, to prioritise candidate causal genes within a QTL region, researchers typically consider previous knowledge on these genes, to see whether a particular gene “makes sense”. In the case of seed size, it might be a gene previously implicated in growth or regulation, or a gene known to influence seed size in a different species. This process is however requires substantial manual interpretation, and is thus labour-intensive and haphazard.

## Enter QTLsearch

We realised that our framework of hierarchical orthologous groups, which relates genes across many species, could be extended to integrate QTL results with previous gene function annotations.

Conceptual overview of QTLsearch

If we go back to the seed size example, it might be that among the genes in the window, one has an ortholog in a different species previously annotated with the GO term “reproductive system development”. This could be a good candidate causal gene.

One risk however in integrating lots of previous knowledge across many species is that we might also find some spurious patterns. We therefore had to devise a way of controlling for random associations between QTL regions and evolutionarily propagated knowledge. Such “null distribution” depends on the specificity or the terms in question, the amount of annotations, the size of the QTL regions, and the species sampling. To cope with this complexity, we chose to implement a non-parametric permutation test.

We implemented the tool as an open source package called QTLsearch, available here.

## QTLsearch infers more candidate causal genes than manual analyses

We used QTLsearch to reanalyse two previous studies. In both cases, we could call more candidate genes than the original studies. But more importantly, the evidence behind our calls is fully traceable and statistically supported.

QTLsearch could identify more candidate genes than the original study, but in an automated, reproducible, and statistically meaningful way.

Thus we think this will greatly facilitate future QTL analyses, particularly those that are done in non-model species for which the previous experimental knowledge is very limited.

## Behind the paper

This is the third paper that resulted from our collaboration with Bayer CropScience (now BASF CropScience), after our work on homoeologs and on detecting split genes.

The project was conceived by Henning Redestig, collaborator at Bayer at the start of the project (now at DuPont). Henning had contributed to a QTL study and knew how labor intensive the search for putative causal genes is. He realised that HOGs could provide a natural way of integrating functional knowledge across multiple species, to combine the QTL information with previous functional data.

Alex Warwick Vesztrocy, PhD student on the project and first author, ran with the idea—promptly implementing and testing it. Early results looked promising, but Alex soon realised that the mapping between metabolites and GO terms could be improved. He also realised that some terms were quite common, so he devised the approach to compute the significance scores.

Our manuscript was accepted as proceedings paper at the European Conference on Computational Biology (ECCB). In our lab, we like proceedings paper. It’s nice to be able to present the work and publish the paper, particularly since the ECCB proceedings appear in a good journal. More importantly, conferences impose hard deadlines. Deadlines for submission of course, but also for peer-reviewing and for deciding acceptance or not!

## Reference

Alex Warwick Vesztrocy, Christophe Dessimoz*, Henning Redestig*, Prioritising Candidate Genes Causing QTL using Hierarchical Orthologous Groups, Bioinformatics, 2018, 34:17, pp. i612–i619 (ECCB 2018 proceedings) [Open Access Full Text]

Share or comment:

# Over 10 computational biology positions at PhD, postdoc and group leader level

• Author: Christophe Dessimoz •

(Post updated on 4 Oct 2018 and on 13 Nov 2018)

Our lab has an open position, and so do collaborators and colleagues across Switzerland and Europe.

## Bonus position (not computational but what the heck)

Share or comment:

# What genes do I have in common with a plant?

• Author: Natasha Glover •

All living organisms descended from a common ancestor, and therefore all living organisms have some genes in common. Therefore, we can take any two species on the tree of life and find out what set of genes were present in the common ancestor of the two species by looking for the genes that they still share to this day. When two species are closely related, i.e. close together on the tree of life, we can observe that much of their genome is the same. But what do I mean by “same”, and what do I mean by “shared”?

Genomes are complex, with many layers of control (DNA-level, RNA-level, protein-level, epigenetic level, alternative splicing-level, protein-protein interaction level, transposable element-level, etc). Because of this complexity, there could be many ways to compare the genomes of two different species to determine which portion of the genome is shared.

In this blog post, when I refer to “shared” parts of the genome, I’m referring to orthologs, which are genes in different species that started diverging due to a speciation event. Orthologs can be inferred from complete genome sequences, and there are many different methods to do this. In particular, I will focus on the Dessimoz lab’s method and database, Orthologous MAtrix, or OMA. OMA uses protein sequences from whole genome annotations to compute orthologs between over 2000 species. The OMA algorithm is graph-based, compares all protein sequences to all others to find the closest between two genomes, allows for duplicates, and uses information from related genomes to improve the calling. OMA is available at https://omabrowser.org/.

It’s quite well-publicized that humans and chimps are extremely similar in terms of gene content, which is why they could be considered our evolutionary cousins1. However, when two species are more distantly related, there is less of their genomes which are the same. But how distant can we go in the tree of life and still see the same genes between different species? For example, how many genes do I (a human) have in common with a plant? Can we even detect orthologs between two species so evolutionarily distant? And if so, what biological/functional role do these genes play?

In this blog post, I will show how to use OMA and some of its tools to find out how much of the human genome we share with plants. There are many different plant species with their genomes’ sequenced, so I will use the extensively-studied model species Arabidopsis thaliana as the representative plant to compare the human protein-coding genes to.

# Setup

For the remainder of this blog post, I will show how to use the OMA database, accessed via the python REST API, to get ortholog pairs between two genomes.

In python, I use the requests library to send queries for information to the server housing the OMA database. Pandas is a well-known python library for working with dataframes, so I will use that for analysis. For more information, see:

Let’s start our python code by importing libraries:

import requests
import json
import pandas as pd


The OMA API is accessible from the following link, so we will save it for later use in our code:

api_url = "https://omabrowser.org/api"


# Getting pairwise orthologs with the OMA API

Pairwise orthologs are computed between all pairs of species in OMA as part of the normal OMA pipeline. Accessing the data will result in a list of pairs of genes, one from one species and the other from the other species. Any one gene may be involved in several different pairs, depending of if its ortholog has undergone duplication or not.

It is important to note that here we are only using pairs of orthologs. In OMA, we report several types of orthologs, namely pairs of orthologs or groups of orthologs (Hierarchical Orthologous Groups (HOGs). We use the pairs as the basis for building HOGs, which aggregates orthologous pairs together into clusters at different taxonomic levels. Some ortholog pairs are removed, and some new pairs are inferred when creating HOGs. Therefore, the set of orthologs deduced from HOGs will be slightly different than the set of orthologs purely from the pairs.

Here we will use the API to request all the pairwise orthologs between Homo sapiens and Arabidopsis thaliana. OMA uses either the NCBI taxon identifier or the 5-letter UniProt species code to identify species. I will use the UniProt codes, which are “HUMAN” and “ARATH.” To request the ortholog pairs between HUMAN and ARATH, the request would simply look like:

requests.get(api_url + '/pairs/HUMAN/ARATH/')


However, when sending a request that returns long lists, i.e. lists of thousands of genes, the OMA API uses pagination. This means all the results won’t be returned at once, but sequentially page by page, with only 100 results per page. This is a safeguard to make sure the OMA servers don’t get overloaded with requests. Here I show a simple workaround to get all of the pairs.

## Get all pairs

genome1 = "HUMAN"
genome2 = "ARATH"

#get first page
response = requests.get(api_url + '/pairs/{}/{}/'.format(genome1, genome2))

#use the header to calculate total number of pages
print("There are {} pages that need to be requested.".format(total_nb_pages))

> There are 128 pages that need to be requested.


Since we want all the pages, not just the first one, we have to use a loop to go through and request all the pages.

#get responses for all pages
responses = []
for page in range(1, total_nb_pages + 1):
tmp_response = requests.get(api_url + '/pairs/{}/{}/'.format(genome1, genome2)+"?page="+str(page))
responses.append(tmp_response.json())

#some basic tidying up so we only have a list of pairs at the end
pairs = []
for response in responses:
for pair in response:
pairs.append(pair)

#Example of a pair
pairs[0]

> {'entry_1': {'entry_nr': 8066469,
>   'entry_url': 'https://omabrowser.org/api/protein/8066469/',
>   'omaid': 'HUMAN00009',
>   'canonicalid': 'NOC2L_HUMAN',
>   'sequence_md5': 'dc91b2521daf594037a6c318f3b04d5a',
>   'oma_group': 840151,
>   'oma_hog_id': 'HOG:0420566.4b.15b.5a.3b',
>   'chromosome': '1',
>   'locus': {'start': 944694, 'end': 959240, 'strand': -1},
>   'is_main_isoform': True},
>  'entry_2': {'entry_nr': 12384097,
>   'entry_url': 'https://omabrowser.org/api/protein/12384097/',
>   'omaid': 'ARATH12334',
>   'canonicalid': 'NOC2L_ARATH',
>   'oma_group': 840151,
>   'oma_hog_id': 'HOG:0420566.1c',
>   'chromosome': '2',
>   'locus': {'start': 7928254, 'end': 7931851, 'strand': 1},
>   'is_main_isoform': True},
>  'rel_type': '1:1',
>  'distance': 138.0,
>  'score': 1094.969970703125}


We can see that the data is in the form of a big list of pairs, with each pair being a dictionary. In this dictionary, one key is entry_1, which is the human gene, and the second key is entry_2, which is the arabidopsis gene. The values for each of these keys are dictionaries with information about the gene. Additionally, there are other key value pairs in the pair dictionary which gives information computed about the pair, such as rel_type, distance, and score (will be explained later).

## Make dataframe with all pairs

I personally like to work with pandas because I find it easy to use, so I will import the information about the pairs into a dataframe.

#use pandas to make dataframe
df = pd.DataFrame.from_dict(pairs)

def make_columns_from_entry_dict(df, columns_to_make):
'''Parses out the keys from the entry dictionary and adds them to the dataframe

genome1 = df['entry_1'][0]['omaid'][:5]
genome2 = df['entry_2'][0]['omaid'][:5]

df[genome1+"_"+columns_to_make] = df.apply(lambda x: x['entry_1'][columns_to_make], axis=1)
df[genome2+"_"+columns_to_make] = df.apply(lambda x: x['entry_2'][columns_to_make], axis=1)

return df

#clean up the columns of the entry_1 and entry_2 dictionaries
df = make_columns_from_entry_dict(df, "omaid")
df = df[['HUMAN_omaid','ARATH_omaid','rel_type','distance','score']]

#Here's a snippet of the dataframe
df[:10]

HUMAN_omaid ARATH_omaid rel_type distance score
0 HUMAN00009 ARATH12334 1:1 138.0 1094.969971
1 HUMAN00029 ARATH38493 1:1 168.0 246.330002
2 HUMAN00032 ARATH38034 1:1 69.0 773.869995
3 HUMAN00034 ARATH39850 m:n 140.0 635.900024
4 HUMAN00034 ARATH33319 m:n 144.0 655.409973
5 HUMAN00034 ARATH01678 m:n 137.0 723.580017
6 HUMAN00034 ARATH07547 m:n 134.0 761.450012
7 HUMAN00036 ARATH01479 1:1 103.0 554.590027
8 HUMAN00037 ARATH10852 1:1 60.0 2325.889893
9 HUMAN00052 ARATH02626 1:1 95.0 328.959991

Here you can see that each row is a pair of orthologs, with some information about each. The rel_type is the relationship cardinality, between the orthologs. 1:1 means only one ortholog in human found in arabidopsis, whereas 1:m would mean it duplicated in ARATH. m:n means there were lineage-specific duplications on both sides.

# How many pairs between HUMAN and ARATH?

print("There are {} pairs of orthologs between {} and {}.".format(len(df), genome1, genome2))

> There are 12792 pairs of orthologs between HUMAN and ARATH.


What percentage is this of the human genome? What percentage of the Arabidopsis genome? To answer this we need to 1) Get the number of genes in each genome that this represents, because there can be more than 1 pair of orthologs per gene. 2) We also need to get the total number of genes per genome.

First we get the number of genes for each species that has at least one ortholog, using the dataframe from above. We don’t have to worry about alternative splic variants (ASVs) because OMA chooses a representative isoform when computing orthologs.

human_proteins_with_ortholog = set(df['HUMAN_omaid'].tolist())
print("There are {} genes in {} with at least 1 ortholog in {}.".\
format(len(human_proteins_with_ortholog), "HUMAN","ARATH"))

arath_proteins_with_ortholog = set(df['ARATH_omaid'].tolist())
print("There are {} genes in {} with at least 1 ortholog in {}.".\
format(len(arath_proteins_with_ortholog), "ARATH","HUMAN"))

> There are 3769 genes in HUMAN with at least 1 ortholog in ARATH.
> There are 5177 genes in ARATH with at least 1 ortholog in HUMAN.


# Get the genomes of HUMAN and ARATH

Now we will again use the API to get all the proteins in the genome. (See https://omabrowser.org/api/docs#genome-proteins). Here I will write a function using the same procedure as above to deal with pagination.

#Use the API to get all the human genes
def get_all_proteins(genome):
'''Gets all proteins using OMA REST API'''
responses = []
response = requests.get(api_url + '/genome/{}/proteins/'.format(genome))
for page in range(1, total_nb_pages + 1):
tmp_response = requests.get(api_url + '/genome/{}/proteins/'.format(genome)+"?page="+str(page))
responses.append(tmp_response.json())

proteins = []
for response in responses:
for entry in response:
proteins.append(entry)
return proteins

human_proteins = get_all_proteins("HUMAN")

#Here is an example entry
human_proteins[0]

> {'entry_nr': 8066461,
>  'entry_url': 'https://omabrowser.org/api/protein/8066461/',
>  'omaid': 'HUMAN00001',
>  'canonicalid': 'OR4F5_HUMAN',
>  'sequence_md5': 'df953df7a11ee7be5484e511551ce8a4',
>  'oma_group': 650473,
>  'oma_hog_id': 'HOG:0361626.2a.7a',
>  'chromosome': '1',
>  'locus': {'start': 69091, 'end': 70008, 'strand': 1},
>  'is_main_isoform': True}


It is important to note that this list of genes/proteins includes ASVs for many species. Therefore we need to take care to remove them to have one represenative isoform per gene. We can do this by filtering the list of entries based on the ‘is_main_isoform’ key. Here is a small function to do that.

#get canonical isoform
def get_canonical_proteins(list_of_proteins):
proteins_no_ASVs = []
for protein in list_of_proteins:
if protein['is_main_isoform'] == True:
proteins_no_ASVs.append(protein)
return proteins_no_ASVs

print("HUMAN\n-----")
print("The number of genes before removing alternative splice variants: {}".format(len(human_proteins)))

human_proteins_no_ASVs = get_canonical_proteins(human_proteins)

print("The number of genes after removing alternative splice variants: {}".format(len(human_proteins_no_ASVs)))

> HUMAN
> -----
> The number of genes before removing alternative splice variants: 30700
> The number of genes after removing alternative splice variants: 20152

#Do the same for ARATH
print("ARATH\n-----")
arath_proteins = get_all_proteins("ARATH")

print("The number of genes before removing alternative splice variants: {}".format(len(arath_proteins)))
arath_proteins_no_ASVs = get_canonical_proteins(arath_proteins)
print("The number of genes after removing alternative splice variants: {}".format(len(arath_proteins_no_ASVs)))

> ARATH
> -----
> The number of genes before removing alternative splice variants: 40999
> The number of genes after removing alternative splice variants: 27627


# Proportion of the genomes which have orthologs in the other

Now that we have all the necessary information we can see what percentage of the human and arabidopsis genomes are shared, in terms of proportion of genes with at least 1 ortholog in the other species.

print("The percentage of genes in {} with at least 1 ortholog in {}: {}%"\
.format("HUMAN", "ARATH", \
round((len(human_proteins_with_ortholog)/len(human_proteins_no_ASVs)*100), 2)))

print("The percentage of genes in {} with at least 1 ortholog in {}: {}%"\
.format("ARATH", "HUMAN", \
round((len(arath_proteins_with_ortholog)/len(arath_proteins_no_ASVs)*100), 2)))

> The percentage of genes in HUMAN with at least 1 ortholog in ARATH: 18.7%
> The percentage of genes in ARATH with at least 1 ortholog in HUMAN: 18.74%


So the answer to the original questions is that BOTH humans and arabidopsis have 18.7% of their genome shared with each other. A weird coincidence that it turns out to be the same proportion of both genomes!

# Conclusion

Using the OMA database for orthology inference, we found that:

• There are 12792 pairs of orthologs between HUMAN and ARATH.
• There are 3769 genes in human that have at least 1 ortholog in arabidopsis.
• There are 5177 genes in arabidopsis that have at least 1 ortholog in human.
• These numbers represent about 19% of both genomes.

Therefore, we can conclude that over the course of evolution since the animal and plant lineages split (~1.5 billion years ago2), about 19% of the protein-coding genes remain and are able to be detected by OMA.

Stay tuned for the next blog post, where I will show what are the functions of these shared genes!

## References

1. Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87 (2005). DOI: https://doi.org/10.1038/nature04072
2. Wang, D. Y., Kumar, S. & Hedges, S. B. Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi. Proc. Biol. Sci. 266, 163–171 (1999). DOI: 10.1098/rspb.1999.0617

Share or comment:

# pyHam: a python package to visualize and process hierarchical orthologous groups (HOGs)

• Author: Clement Train •

(This entry was updated 19 Sep 2018 to reflect recent feature updates)

pyHam (‘python HOG analysis method’) makes it possible to extract useful information from HOGs encoded in standard OrthoXML format. It is available both as a python library and as a set of command-line scripts. Input HOGs in OrthoXML format are available from multiple bioinformatics resources, including OMA, Ensembl and HieranoidDB.

This post is a brief primer to pyham, with an emphasis on what it can do for you.

# How to get pyHam?

pyHam is available as python package on the pypi server and is compatible python 2 and python 3. You can easily install via pip using the following bash command:

pip install pyham


You can check the official pyham website for further information about how to use pyham, documentation and the source code.

# What are Hierarchical Orthologous Groups (HOGs)?

You don’t know what HOGs are and you are eager to change this, we have an explanatory video about them just for you:

# Where to find HOGs?

HOGs inferred on public genomes can be downloaded from the OMA orthology database. Other databases, such as Eggnog, OrthoDb or HieranoiDB also infer HOGs, but not all of these databases offer them in OrthoXML format. You can check which database serves hogs as orthoxml here. If you want to use your custom genomes to infer HOGs you can use the OMA standalone software.

In order to facilitate the use of pyHam on single gene family, we provide the option to let pyham fetch required data directly from a comptabilble databases (for now only OMA is available for this feature). The user simply have to give the id of a gene inside the gene family (HOGs) of insterest along with the name of the compatibible database where to get the data and pyHam will do the rest.

For example, if you are interest by the P_53 gene in rat (P53 rat gene page in OMA) you simply have to run the following python code to set-up your pyHam session:

my_gene_query = 'P53_RAT'
database_to_query = 'oma'
pyham_analysis = pyham.Ham(query_database=my_gene_query, use_data_from=database_to_query)


The main features of pyHam are: (i) given a clade of interest, extract all the relevant HOGs, each of which ideally corresponds to a distinct ancestral gene in the last common ancestor of the clade; (ii) given a branch on the species tree, report the HOGs that duplicated on the branch, got lost on the branch, first appeared on that branch, or were simply retained; (iii) repeat the previous point along the entire species tree, and plot an overview of the gene evolution dynamics along the tree; and (iv) given a set of nested HOGs for a specific gene family of interest, generate a local iHam web page to visualize its evolutionary history.

## What is the number of genes in a particular ancestral genome? (i)

In pyHam, ancestral genomes are attached to one specific internal node in the inputted species tree and denoted by the name of this taxon. Ancestral genes are then infered by fetching all the HOGs at the same level.

# Get the ancestral genome by name
rodents_genome = ham_analysis.get_ancestral_genome_by_name("Rodents")

# Get the related ancestral genes (HOGs)
rodents_ancestral_genes = rodents_genome.genes

# Get the number of ancestral genes at level of Rodents
print(len(rodents_ancestral_genes)


## How can I figure out the evolutionary history of genes in a given genome? (ii)

pyHam provides a feature to trace for HOGs/genes along a branch that span across one or multiple taxonomic ranges and report the HOGs that duplicated on this branch, got lost on this branch, first appeared on that branch, or were simply retained. The ‘vertical map’ (see further information on map here) allows for retrieval of all genes and their evolutionary history between the two taxonomic levels (i.e. which genes have been duplicated, which genes have been lost, etc).

# Get the genome of interest
human = ham_analysis.get_extant_genome_by_name("HUMAN")
vertebrates = ham_analysis.get_ancestral_genome_by_name("Vertebrata")

# Instanciate the gene mapping !
vertical_human_vertebrates = ham_analysis.compare_genomes_vertically(human, vertebrates) # The order doesn't matter!

# The identical genes (that stay single copies)
# one HOG at vertebrates -> one descendant gene in human
vertical_human_vertebrates.get_retained())

# The duplicated genes (that have duplicated)
# one HOG at vertebrates -> list of its descendants gene in human
vertical_human_vertebrates.get_duplicated())

# The gained genes (that emerged in between)
# list of gene that appeared after vertebrates taxon
vertical_human_vertebrates.get_gained()

# The lost genes (that been lost in between)
HOG at vertebrates that have been lost before human taxon
vertical_human_vertebrates.get_lost()


## How can I get an overview of the gene evolution dynamics along the tree that occured in my genomic setup? (iii)

pyHam includes treeProfile (extension of the Phylo.io tool), a tool to visualise an annotated species tree with evolutionary events (genes duplications, losses, gains) mapped to their related taxonomic range. The aim is to provide a minimalist and intuitive way to visualise the number of evolutionary events that occurred on each branch or the numbers of ancestral genes along the species tree.

# create a local treeprofile web page
treeprofile = ham_analysis.create_tree_profile(outfile="treeprofile_example.html")


As you can see in the figure above, the treeprofile is composed of the reference species used to perform the pyham analysis. Each internal node is displayed with its related histogram of phylogenetic events (number of genes duplicated, lost, gained, or retained) that occurred on each branch. The tree profile either display the number of genes resulting from phylogenetics events or the number of phylogenetic events on themself; the switch can be made by opening the settings panel (histogram icon on top right) and selecting between ‘genes’ or ‘events’.

## How can I visualise the evolutionary history of a gene family (HOG)? (iv)

pyHam embeds iHam, an interactive tool to visualise gene family evolutionary history. It provides a way to trace the evolution of genes in terms of duplications and losses, from ancient ancestors to modern day species.

# Select an HOG
hog_of_interest = pyham_analysis.get_hog_by_id(2)

# create and export the hog vis as .html
output_filename = "hogvis_example.html"
pyham_analysis.create_hog_visualisation(hog=hog_of_interest,outfile=output_filename)


Then, you simply have to double click on the .html file to open it in your default internet browser. We provide you an example below of what you should see. A brief video tutorial on iHam is available at this URL.

iHam is composed of two panels: a species tree that allows you to select the taonomic range of interest, a genes panel where each grey square represents an extant gene and each row a species.

We can see for example that at the level of mammals (click on the related node and select ‘Freeze at this node’) all genes of this gene family are descendant from a single comon ancestral gene.

Now, if we look at the level of Euarchontoglires (redo the same procedure as for mammals to freeze the vis at this level) we observe that the genes are now split by a vertical line. This vertical line separates 2 group of genes that are each descendants from a same single ancestral gene. This is the result of a duplication in between Mammals and Euarchontoglires.

This small example demonstrate the simplicity of iHam usefulness to identify evolutionary events that occured in gene families (e.g. when a duplication occured, which species have lost genes or how big genes families evolved).

Share or comment: