OMArk: assessing proteome quality, quick and easy

• Author: Yannis Nevers •

I am excited to introduce our preprint for our new tool OMArk. We hope our software will help fill a gap in assessing the quality of gene annotation sets.

Many studies rely directly on the protein-coding gene repertoires (“proteomes”) predicted from genome assemblies to perform their comparisons. In doing so, they assume that the predicted gene content of all genomes is of homogeneous quality and an accurate reflection of reality. Yet in practice, this assumption is rarely met: protein-coding genes are often missing or fragmented in the reported proteomes, non-coding sequences are wrongly annotated as coding genes by gene predictors, and contamination from other species is wrongly included among the reported sequences.


Why a new proteome quality tool?

Our new method, OMArk, provides a way to easily and comprehensively measure different aspects of proteome quality: completeness of the gene repertoire, consistency of the included genes at the taxonomic level, whether they have doubtful gene structures, and the presence or absence of inter- or intra-domain contamination. Furthermore, unlike existing methods, OMArk does not rely on manual selection of a reference dataset; instead, it automatically identifies the most likely taxonomic classification of the test proteome. It can thus process any test proteome across the tree of life using a universal reference database.

Conceptual overview of OMArk (left) and how the innovative consistency assessment is computed (right).


OMArk is accurate and provides new insights

We performed extensive validation of the method by introducing controlled amounts of noise, fragmentation and contamination into reference proteomes and accurately estimating these amounts using OMArk. We also performed a large-scale analysis of 1805 eukaryotic UniProt Reference Proteomes with our software and were able to detect unambiguous cases of quality issues, caused by incompleteness, contamination, or inclusion of translated non-coding sequences. In the most extreme case, we found a plant proteome with contamination from eight different species (fungi and bacteria).

OMArk results on 1805 Eukaryotic proteomes from UniProt. Interactively check results on the OMArk webserver, e.g. for the current cowpea weevil reference proteome.

Why does the consistency metric matter? For example, when comparing the Ensembl gene sets for two assemblies of Bombus impatiens, we can detect a major improvement in consistency (including contamination removal) at a similar level of completeness.

OMArk can reveal improvements in genome assemblies/annotations even if completeness has not substantially changed.


OMArk is quick and easy to run

OMArk can easily be used as a command line tool or on our OMArk webserver. On the webserver, you can submit a FASTA file of your proteome and get results in about 30 minutes. Nothing more is required. You can visualize the results and directly compare them to precomputed results from closely related species (UniProt reference proteomes).

More details can be found in the preprint linked below. Please let us know how the tool works for you!



Yannis Nevers, Victor Rossier, Clément Marie Train, Adrian Altenhoff, Christophe Dessimoz, Natasha Glover. Multifaceted quality assessment of gene repertoire annotation with OMArk. bioRxiv 2022.11.25.517970

Share or comment:

To be informed of future posts, sign up to the low-volume blog mailing-list, subscribe to the blog's RSS feed, or follow us on Twitter. To read old posts, check out the index here.

This year, the MCEB international conference was held in Switzerland for the first time in its history.

For five days, from June 26th to 30th 2022, experts in mathematical and computational evolutionary biology from all over the world shared their passion in Chateau d’Oex, in the heart of the welcoming mountains of the regional natural park of Gruyeres.

Here is a short summary of this annual not-to-be-missed event for evolutionary biology aficionados.

The 2022 edition of the international MCEB conference brought together around a hundred scientists from diverse disciplines and at different stages of their careers, from PhD students to world-renowned senior scientists. For five consecutive days, Chateau d’Oex was thus the improbable scene of inspiring exchanges about evolution between mathematicians, computational biologists, evolutionary biologists, ecologists, epidemiologists and cancer biologists.

View from the conference venue (left) and cheese making social activity (right).

In total, 6 one-hour talks and 20 short talks were given to present epistemological perspectives, recent methodological advances and challenges yet to be addressed for reconstructing the evolutionary history of the genes, genomes, populations and species observed then and today on Earth.

Mirroring the truly interdisciplinary nature of the event, a wide range of phylogenetic structures were discussed during these five days. Networks of gene flow, phylogenetic trees, and genealogical trees predicted by coalescent theory were all on this year's menu.

Experts specialized in introgressive events such as horizontal gene transfers or endosymbioses provided insights into the methods and challenges of modelling reticulate evolution. In this respect, an inspirational talk was given on how ghost lineages, which went extinct in the past but nonetheless exchanged some DNA with ancestors of extant species, could mislead our interpretation of the directionality of gene flow within phylogenetic networks.

Lectures on the theoretical advances made over the last 50 years in the reconstruction of gene and species phylogenetic trees were then given by world leaders in the field of phylogenetics. In particular, they provided mathematical evidence that, contrary to what is practiced today to reconstruct species trees, neither the consensus tree of several gene trees nor the tree inferred from the concatenated alignment of these genes actually gives a good approximation of the phylogeny of the species carrying these genes. This highlights the urgent need to pursue methodological efforts to better model species evolution.

On shorter evolutionary timescales, numerous mathematical models to infer genealogical trees of human populations or cancer cell lineages were also presented during the conference.

Poster session (left) and group photo (right).

Finally, a strong focus was placed this year on methods for coupling phylogenetic inferences with phenotypic, ecological, archaeological, geographical, epidemiological and medical data in order to study how traits or diseases evolved across space and time. Striking examples of these integrative analyses were provided by methodologies that accurately retrace the evolution of the recent SARS-CoV-2 and MERS-CoV pandemics over time and space.

In addition to these talks of exceptional scientific quality, two poster sessions led by junior researchers and students took place during the conference and were truly appreciated by all participants for the scientific excellence of the posters and the convivial atmosphere.

Overall, five days of scientific presentations, poster sessions, dinners, parties and social activities such as hiking in the Alps or visiting a cheese factory fostered scientific exchanges and informal discussions, and created new bonds within this community of researchers.

The MCEB 2022 conference was a great success, as it enabled a diverse group of scientists from all over the world to meet, discuss their work and build new collaborations!

The Banana Conjecture

• Author: Natasha Glover •

I recently became aware of the memes and popular science articles going around the internet claiming that we share 50% of our DNA with bananas. For example:

Banana Ortholog Meme


I work in the Dessimoz lab at the University of Lausanne, and here we are in the business of comparing genes. In fact, I’ve looked into a similar question before: what percentage of our protein-coding genes do we share with another plant, Arabidopsis thaliana? I computed the number as being closer to 17%.

I wanted to get to the bottom of this question once and for all: What percentage of a human’s “genetic material” is shared with a banana? There have been several other blog posts from scientists touching on this question (Neil Saunders: “50% bananas”, Stack Exchange skeptics: “Do humans share 50% of their DNA with bananas?”, Sanogenetics: “Are We Genetically Similar To Bananas And Why Is This Important For Research In Disease?”).

However, I wanted to go a little deeper into:

  1. Where this number came from, and the extent of it being spread on the internet.
  2. What exactly do we mean by “shared genetic material”?
  3. Some results I computed in attempts to put this controversy to rest once and for all.

In this blog post I will attempt to address these questions.

Where did the mythical 50% come from anyway?

After performing a quick Google search, it seems that the relatedness between a human and a banana has been a popular question. Based on a cursory, non-exhaustive search, I show in the table below eight sources that report that 44-60% of the human genome is “shared” with banana.

Source and quote:

  • “But we are also genetically related to bananas – with whom we share 50% of our DNA – and slugs – with whom we share 70% of our DNA.”
  • “Banana: more than 60 percent identical. Many of the “housekeeping” genes that are necessary for basic cellular function, such as for replicating DNA, controlling the cell cycle, and helping cells divide are shared between many plants (including bananas) and animals.”
  • “So where does this banana statistic come from? Is it just complete nonsense? Well, no. We do in fact share about 50% of our genes with plants – including bananas.”
  • “Bananas have 44.1% of genetic makeup in common with humans.”
  • MythBusters (TV show) Facebook: “#sciencefact: Humans share approximately 98% of their DNA with chimps, 70% with slugs, and 50% with bananas!”
  • “Humans share 50% of our DNA with a banana.”
  • Business Insider: “The genetic similarity between a human and a banana is 60%.” Source: National Human Genome Research Institute (However, no link and when I tried to search)
  • “Yes, and we share 50% with bananas. It’s not surprising, if you look at the basic mechanism of biochemistry.”

What is disconcerting is that at least half of these sources are popular science websites or science sections of newspapers, yet few have any sort of citation at all. The only exceptions were Popular Science, which gave DataScope as a source, and Business Insider, which cites the National Human Genome Research Institute. However, neither of these articles gives a link or further information to follow up on.

Upon further digging, I found one recent article published on howstuffworks entitled “Do People and Bananas Really Share 50 Percent of the Same DNA?”, which contains an interview with one of the scientists from the Human Genome Research Institute, where he explains how they arrived at that number.

“Brody says the experiment was not published, as most scientific research is. Instead, it was generated to be included as part of an educational Smithsonian Museum of Natural History video called ‘The Animated Genome.’ That video noted that DNA between a human and a banana is ‘41 percent similar.’”

The article goes on to explain that this 41% figure comes from a BLAST search between the protein sequences of human and banana. The researchers found about 7,000 hits, and the average percent identity of these hits was 41%. Brody goes on to note:

“This is the average similarity between proteins (gene products), not genes… Of course, there are many, many genes in our genome that do not have a recognizable counterpart in the banana genome and vice versa.”

So when we get to the bottom of it, the 50% figure is actually a 41% average amino acid percent identity across about 7,000 BLAST hits between human and banana.
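To make the distinction concrete, here is a minimal Python sketch of how such an average could be computed from BLAST tabular output. It assumes the default 12-column tabular format (`-outfmt 6`), in which the third column is the percent identity; the two hits in the example are made up for illustration and are not the data behind the 41% figure.

```python
import csv
import io


def mean_percent_identity(blast_tabular: str) -> tuple[int, float]:
    """Return (number of hits, mean percent identity) from BLAST tabular text."""
    identities = []
    for row in csv.reader(io.StringIO(blast_tabular), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        identities.append(float(row[2]))  # column 3 holds percent identity
    if not identities:
        raise ValueError("no hits found")
    return len(identities), sum(identities) / len(identities)


# Two invented hits (hypothetical identifiers and values).
example = "\n".join([
    "humanA\tbananaX\t35.0\t200\t130\t0\t1\t200\t1\t200\t1e-30\t120",
    "humanB\tbananaY\t47.0\t150\t80\t0\t1\t150\t1\t150\t1e-40\t150",
])
n_hits, average = mean_percent_identity(example)
print(f"{n_hits} hits, mean identity {average:.1f}%")
```

Note that this number only describes how similar the matching proteins are. It says nothing about the fraction of genes that have a match at all, which is the quantity the meme claims to report.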

What do we mean when we say “we share 50% of our DNA with a banana”?

All living organisms descended from a common ancestor, and therefore all living organisms have some genes in common. How many genes two species have in common depends on how far back in time they shared a common ancestor. For example, humans and chimps share such a high percentage of genes because we only diverged ~6 million years ago (MYA) [1]. However, human and banana (more specifically, the ancestral lineages which led to human and banana) split around 1.5 BILLION years ago [2]. Talk about a banana split! Therefore we would expect a lot less to be conserved.

As brought up by Neil Saunders in his blog post, “What does ‘we share 50% of our DNA’ really mean?” A non-biologist perhaps might not see the nuance in this question. If I were going to play the devil’s advocate, I could say that a child shares 50% of its DNA with each of its parents. Or even that every organism shares 100% of its DNA, as it is all made up of Gs, Cs, As, and Ts. Thus, it is important to be specific about what we’re talking about.

This shared DNA could be referring to a number of things: protein-coding genes, non-coding genes, transposable elements, the percent that gets aligned in a whole genome alignment [3], etc. Each of these specific features evolves at a different rate, and thus will be more or less conserved between any given species.

Well, how do we know if these genomic features are conserved?

Generally, sequences are compared by making an alignment, and then computing the percent identity or evolutionary distance between the two sequences. If the sequences are sufficiently similar, they can be declared as conserved. Thus, “conserved” can be seen either as categorical (i.e. conserved or not) or as a quantitative value (conserved to a certain degree). For more information, see the Wikipedia page on conserved sequences.
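As a minimal illustration of the quantitative versus categorical view, the Python sketch below computes percent identity from two already-aligned sequences and then applies a cutoff. Two things are assumptions made for the example: percent identity is defined here over columns where neither sequence has a gap (other definitions use a different denominator), and the 30% threshold in `is_conserved` is a hypothetical value, not a standard.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over alignment columns without gaps."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # ignore columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def is_conserved(identity: float, threshold: float = 30.0) -> bool:
    """Categorical call from a quantitative value (threshold is hypothetical)."""
    return identity >= threshold


# Toy alignment: 7 ungapped columns, 5 of them identical.
identity = percent_identity("MKT-AYIAK", "MKSLAY-AR")
print(f"{identity:.1f}% identical, conserved: {is_conserved(identity)}")
```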

What are the genomic features the most likely to be conserved?

“Conservation indicates that a sequence has been maintained by natural selection” (wiki). Genes, or DNA sequences, encode proteins. Proteins are slower to evolve and change than the DNA, due to the redundancy of the genetic code. Thus proteins are the genomic feature most likely to be conserved between evolutionarily distant species. While it is true that other genomic features such as non-coding regulatory sequences or non-coding RNA can be conserved over long evolutionary distances, they are far more likely to diverge in sequence than proteins [4,5]. Other genetic features such as transposable elements or intergenic “junk DNA” are even less likely to be conserved, as their sequences are under less selection pressure and accumulate mutations at an even higher rate.

It is important to note that while we generally declare sequences to be conserved on the basis of sequence similarity, sequences may still be conserved and lack similarity. For example, two sequences might be conserved in the structure of the protein, indicating homology [6,7]. Additionally, sequences might be in a syntenic position, indicating ancestral conservation, but may also lack sequence similarity [8]. Thus, it is possible for some genes to be shared between evolutionarily distant species while flying under the radar of our current homology-inference tools. So, in order to investigate the 50% shared DNA claim, we can only focus on sequence conservation which we are able to detect.

To understand how much of the genome is conserved between banana and human, I will look at proteins because they are the feature most likely to be conserved between human and banana. This is to be as permissive as possible, in an attempt to give the benefit of the doubt to the 50% meme.

Now the question is, how do we compare all the proteins in one species to all the proteins in another species and see which ones “match”, i.e. descended from a common ancestral gene? This is a fundamental problem in the study of evolution. “Orthologs” is the term we use for genes in different species that started diverging due to a speciation event, i.e. “corresponding” genes between species. This is where our lab’s expertise comes in: we maintain Orthologous Matrix (OMA), which is a method and database for finding orthologs between many species.

Orthologs in common between human and banana

I wanted to see what percentage of human genes are orthologous to banana genes and, vice versa, what percentage of banana genes are orthologous to human genes. I tested three common methods for finding orthologs: OMA [9], OrthoInspector [10], and best bidirectional hit (using BLASTP) [11]. For each method, I divided the number of orthologs found by the number of genes in the genome to come up with a percentage of each genome that is shared. You can find all the details in the Jupyter notebook, but the results are summed up in the graph below:

Comparison of ortholog methods

As you can see, all the orthology-inference methods tested show a maximum of 25% of human genes to be orthologous to banana. Again, these results give the most leeway, as we used protein sequences, which are the genomic elements the most likely to be conserved.
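The percentage computation can be sketched in Python for the simplest of the three methods, best bidirectional hit. This is an illustrative sketch under stated assumptions, not the code from the notebook: hits are represented as hypothetical `(query, subject, bit score)` tuples, the best hit is the one with the highest bit score, and the gene counts used in the example are invented.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


def percent_with_ortholog(pairs, n_genes, side):
    """Percentage of a genome's genes that appear in at least one pair.

    side is 0 for the first species of each pair, 1 for the second.
    """
    genes = {pair[side] for pair in pairs}
    return 100.0 * len(genes) / n_genes


# Toy data: h* are human genes, b* are banana genes (all invented).
human_vs_banana = [("h1", "b1", 200), ("h1", "b2", 150),
                   ("h2", "b2", 90), ("h3", "b3", 50)]
banana_vs_human = [("b1", "h1", 210), ("b2", "h1", 140), ("b3", "h3", 55)]

pairs = reciprocal_best_hits(human_vs_banana, banana_vs_human)
human_pct = percent_with_ortholog(pairs, n_genes=8, side=0)
banana_pct = percent_with_ortholog(pairs, n_genes=10, side=1)
print(sorted(pairs), human_pct, banana_pct)
```

Because the denominator is each species' own gene count, the same set of ortholog pairs gives a different percentage for each genome, which is why the comparison is reported in both directions.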

Additionally, I investigated the percentage of a whole-genome alignment that would be shared between banana and human. Since this is computationally intensive, I used Ensembl Compara, which has precomputed pairwise whole-genome alignments between a number of species. A whole-genome alignment looks at the whole genome, not just genes, and compares DNA rather than proteins. Ensembl Compara didn’t have results between human and banana, but here are the results between human and chimp, mouse, and zebrafish:

Ensembl Whole Genome Comparisons

Data obtained from

As we get progressively further in evolutionary distance, we get a smaller and smaller percentage of the genome which is able to be aligned. We can presume that plants would be even less than 1%, a far cry from the 50% as reported by internet memes.

So whichever way you slice it, humans share at most ¼, not ½, of their genetic material with banana (at least as far as we are able to detect)!

What do these human-banana orthologs DO?

Now that we have found the human-banana orthologs, we can try to gain some insight into what these genes do. To do this, I performed a Gene Ontology (GO) enrichment analysis of the human genes. GO enrichment works by assigning functional annotations to all of the sequences, then looking for a statistical overrepresentation of certain functions in a subset of genes compared to the entire genome.
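The statistical idea can be sketched in a few lines of Python. This sketch uses a one-sided hypergeometric test, which is one common choice for overrepresentation analysis; it is an assumption for illustration and not necessarily the exact statistic or multiple-testing correction applied by the web server I used. The function name and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_genome: int, n_genome_with_term: int,
                      n_subset: int, n_subset_with_term: int) -> float:
    """Probability of seeing at least n_subset_with_term annotated genes
    when drawing n_subset genes at random from the genome."""
    max_k = min(n_genome_with_term, n_subset)
    total = comb(n_genome, n_subset)
    tail = sum(
        comb(n_genome_with_term, k)
        * comb(n_genome - n_genome_with_term, n_subset - k)
        for k in range(n_subset_with_term, max_k + 1)
    )
    return tail / total


# Toy example: 20 genes in the genome, 5 annotated with a GO term;
# a subset of 4 genes in which all 4 carry the term.
p_value = enrichment_pvalue(20, 5, 4, 4)
print(f"p = {p_value:.6f}")
```

In a real analysis this test is repeated for every GO term, so the resulting p-values need to be corrected for multiple testing before calling a term overrepresented.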

I used the PANTHER Overrepresentation Test web server for the GO enrichment, then used GO-Figure [12] for summarizing and visualizing the most enriched Biological Processes. All the details are in the Jupyter notebook.

The top 10 overrepresented GO terms, i.e. a summary of the most common functions of the human genes with orthologs, are shown below:

Top 10 overrepresented GO Biological Processes for human protein-coding genes with banana orthologs

We can see that the human-banana orthologs are highly enriched for basic metabolic processes such as “cellular metabolic process,” “gene expression,” and “RNA processing.” These genes likely encode cellular processes that are essential for eukaryotic life!

Take home message

  • “Humans share 50% of DNA with banana” is a statement that has very little meaning.
  • We must be careful to be precise in our language. We have to clarify what we mean when we give a percentage of “shared genetic material/DNA/genome.” I argue that the percentage of protein-coding genes is currently the best way to compare evolutionarily distant species.
  • There’s no evidence that humans have 50% of detectable orthologs with a banana. In my analysis, I show between 17 and 24%, depending on which method was used. As scientists, we have to do a better job communicating science with each other and with the general public.

Even though we don’t have 50% genes in common with banana, we still have ~20% which is nothing to scoff at! The functions of these genes are most likely basic housekeeping proteins involved in metabolic processes that are necessary for most, if not all of eukaryotic life. It is amazing that these genes have been conserved over 1.5 billion years of evolution!


  1. Patterson, N., Richter, D. J., Gnerre, S., Lander, E. S. & Reich, D. Genetic evidence for complex speciation of humans and chimpanzees. Nature 441, 1103–1108 (2006).
  2. Wang, D. Y., Kumar, S. & Hedges, S. B. Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi. Proc. Biol. Sci. 266, 163–171 (1999).
  3. Armstrong, J., Fiddes, I. T., Diekhans, M. & Paten, B. Whole-Genome Alignment and Comparative Annotation. Annu Rev Anim Biosci 7, 41–64 (2019).
  4. Ransohoff, J. D., Wei, Y. & Khavari, P. A. The functions and unique features of long intergenic non-coding RNA. Nat. Rev. Mol. Cell Biol. 19, 143–157 (2018).
  5. Diederichs, S. The four dimensions of noncoding RNA conservation. Trends Genet. 30, 121–123 (2014).
  6. Illergård, K., Ardell, D. H. & Elofsson, A. Structure is three to ten times more conserved than sequence—a study of structural response in protein cores. Proteins 77, 499–508 (2009).
  7. Zheng, W. et al. Detecting distant-homology protein structures by aligning deep neural-network based contact maps. PLoS Comput. Biol. 15, e1007411 (2019).
  8. Vakirlis, N., Carvunis, A.-R. & McLysaght, A. Synteny-based analyses indicate that sequence divergence is not the main source of orphan genes. Cold Spring Harbor Laboratory 735175 (2019) doi:10.1101/735175.
  9. Altenhoff, A. M. et al. OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Res. doi:10.1093/nar/gkaa1007.
  10. Nevers, Y. et al. OrthoInspector 3.0: open portal for comparative genomics. Nucleic Acids Res. 47, D411–D418 (2019).
  11. Moreno-Hagelsieb, G. & Latimer, K. Choosing BLAST options for better detection of orthologs as reciprocal best hits. Bioinformatics 24, 319–324 (2008).
  12. Reijnders, M. J. & Waterhouse, R. M. Summary Visualisations of Gene Ontology Terms with GO-Figure! Cold Spring Harbor Laboratory 2020.12.02.408534 (2020) doi:10.1101/2020.12.02.408534.

Progress in genomic checkers

• Author: Nastassia Gobet •

When I started using word processors, the spell checker was only looking at small and common typing errors and was often trying to correct acceptable words due to lack of vocabulary. A few years later, they not only are better at it and use more developed dictionaries, but they can also capture grammar mistakes and redundant phrases. A similar story is happening with the detection of genomic variants.

The genome as a big text

The genome can be considered as a big text, written in a 4-letter alphabet (A, C, G, T). When comparing the genomic words from two individuals, we can look at differences of one or a few letters (single nucleotide variants, SNVs) and at longer patterns (structural variants, SVs) such as words, sentences, and paragraphs that are added (insertions), missing (deletions), exchanged (translocations), repeated (duplications and copy number variations, CNVs), inverted (inversions), or combinations of these (complex SVs).
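To make the text analogy concrete, here is a small Python sketch that applies some of these variant types to a toy sequence. The function names and the 8-letter sequence are invented for illustration, positions are 0-based with an exclusive end, and the inversion is modelled as a reverse complement of the affected segment. Real SVs span at least tens of letters, so the scale here is purely didactic.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def substitution(seq: str, pos: int, base: str) -> str:
    """Single nucleotide variant: replace one letter."""
    return seq[:pos] + base + seq[pos + 1:]


def deletion(seq: str, start: int, end: int) -> str:
    """Remove the segment seq[start:end]."""
    return seq[:start] + seq[end:]


def insertion(seq: str, pos: int, inserted: str) -> str:
    """Add new letters before position pos."""
    return seq[:pos] + inserted + seq[pos:]


def tandem_duplication(seq: str, start: int, end: int) -> str:
    """Repeat the segment seq[start:end] right after itself."""
    return seq[:end] + seq[start:end] + seq[end:]


def inversion(seq: str, start: int, end: int) -> str:
    """Replace the segment seq[start:end] by its reverse complement."""
    segment = seq[start:end]
    return seq[:start] + segment.translate(COMPLEMENT)[::-1] + seq[end:]


reference = "ACGTAGGC"
variants = {
    "SNV": substitution(reference, 3, "A"),
    "deletion": deletion(reference, 2, 4),
    "insertion": insertion(reference, 2, "TTT"),
    "duplication": tandem_duplication(reference, 2, 4),
    "inversion": inversion(reference, 2, 6),
}
for name, sequence in variants.items():
    print(f"{name:12s} {sequence}")
```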

Discovering the importance of SVs

About ten years ago, the focus was mainly on SNVs, as these are numerous and many methods to detect them had been developed. They were studied in depth and indexed in dictionaries (databases) that also document their frequencies. However, one-letter differences do not necessarily have a significant effect on the meaning of the text (the phenotypes). On the other hand, although SVs were underestimated and consequently understudied, they were discovered to have a profound phenotypic impact on gene regulation, dosage, and function. Therefore, they are important in a wide variety of medical conditions: cancers, neurological diseases (Parkinson, Huntington), and mental disorders (autism, schizophrenia).

Challenges in SV identification

Methods to detect SVs have recently been developed, and more are under development. A number of challenges need to be dealt with. First, short read sequencing greatly limits the detection of large events exceeding read length. Consequently, using longer read technologies (PacBio and ONT) is improving the range of detectable SVs, but this comes at the cost of decreased sequencing accuracy and higher price. Hybrid strategies combining short and long reads are therefore promising. Second, SVs are hard to classify, as the variant type depends on the variant's sequence context: a sequence can be considered an insertion, duplication, or translocation depending on the source (Figure 1). In addition, the number of possible SVs is infinite, whereas for SNVs there are 3 variants per position in the worst case. SVs are thus hard to compare: which criteria should we use to determine if two slightly different calls correspond to the same event or not? This affects SV reporting and frequencies. Due to the relative youth of the field, standards and best practices have yet to be established. Different initiatives (e.g. Genome in a Bottle and SEQC2) aim at better characterizing false positives and false negatives in SV calling. This should help implement more objective benchmarking and comparison between the various detection methods.

Figure 1: An SV was called for a sequence from a sample differing from the reference sequence. Three possible scenarios of formation could explain the SV observed: an insertion, a duplication or a translocation.
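On the question of when two slightly different calls correspond to the same event, one possible criterion is reciprocal overlap: two calls of the same type are merged if each covers at least a given fraction of the other. The Python sketch below illustrates this idea only. The `SVCall` class, the half-open coordinates, and the 50% threshold are assumptions made for the example, not an established standard, and the approach does not handle insertions, which have no length on the reference.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    svtype: str  # e.g. "DEL", "DUP", "INV"


def reciprocal_overlap(a: SVCall, b: SVCall) -> float:
    """Smallest fraction of either call that is covered by the other."""
    if a.chrom != b.chrom:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def same_event(a: SVCall, b: SVCall, min_overlap: float = 0.5) -> bool:
    """Hypothetical rule: same type and sufficient reciprocal overlap."""
    return a.svtype == b.svtype and reciprocal_overlap(a, b) >= min_overlap


call_1 = SVCall("chr1", 1000, 2000, "DEL")
call_2 = SVCall("chr1", 1100, 2100, "DEL")  # shifted by 100 bp
call_3 = SVCall("chr1", 1900, 4000, "DEL")  # barely touches call_1

print(reciprocal_overlap(call_1, call_2), same_event(call_1, call_2))
print(reciprocal_overlap(call_1, call_3), same_event(call_1, call_3))
```

Changing `min_overlap` changes which calls are merged, and therefore the reported SV counts and frequencies, which is exactly why shared standards matter.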


Future of genomic spelling and grammar checkers

Standards and objective benchmarking for SV detection are still missing, so one must be careful with results obtained from current methods. However, SVs are increasingly recognized as being important and technologies to detect them are evolving rapidly. I think their use will become a more common practice in genomic variation studies in a few years, similar to spelling and grammar checkers in text processors. And you, which genome checker will you use?



Mahmoud M, Gobet N, Cruz-Dávalos DI, Mounier N, Dessimoz C, Sedlazeck FJ. 2019. Structural variant calling: the long and the short of it. Genome Biol 20:246. doi:10.1186/s13059-019-1828-7.


If you want to get involved in improving SV detection, consider joining this Hackathon, to be held remotely Oct. 11-14, 2020.

OMA Standalone made easy: a step-by-step guide

• Author: Natasha Glover •

Got newly sequenced genomes with protein annotations? Need to quickly and easily define the homologous relationships between the genes?

OMA Standalone is software developed by our lab which can be used to infer homologs from whole genomes, including orthologs, paralogs, and Hierarchical Orthologous Groups (Altenhoff et al. 2019).

The OMA Standalone algorithm works like this:

OMA Standalone pipeliner

In short, it takes as input user-contributed custom genomes (with the option of combining them with reference genomes already in the OMA database), and proceeds through three main parts:

  1. Quality and consistency checks of the genomes that will be used to run OMA Standalone;
  2. All-against-all alignments of every protein sequence to all other protein sequences;
  3. Orthology inference, in the form of: pairwise orthologs, OMA Groups, and Hierarchical Orthologous Groups (HOGs). For more information on these types of orthologs output by OMA, see OMA: A Primer (Zahn-Zabal et al. 2020).

Although OMA Standalone is well-documented and straightforward, one of the challenges can be running it on a High Performance Computing (HPC) cluster.

To lay out the bare necessities for running OMA Standalone, we wrote an OMA Standalone Cheat Sheet, which you can download to follow step-by-step instructions for running the software on an HPC. We use the cluster Wally as an example, as that is one of the HPCs here at the University of Lausanne. Wally uses SLURM as the scheduler for submitting jobs, so all the examples are shown with that. In the future, we plan to provide additional information on running with other schedulers, such as LSF or SGE. In the Cheat Sheet, you will find tips, hints, commands, and example scripts to run OMA Standalone on Wally.

Additionally, we prepared a video which walks the user through the process of running OMA Standalone from start to finish, including:

  • Downloading the software
  • Preparing your genomes for running
  • Editing the necessary parameters file
  • Creating the job scripts and
  • Submitting your jobs

The video can be found on our lab’s YouTube channel, at OMA standalone: how to efficiently identify orthologs using a cluster, and is also embedded here for your convenience:


We hope these resources can be helpful if you need help getting started running OMA Standalone. But don’t forget, there is also plenty of information that can be found on the OMA Standalone webpage or in the OMA Standalone paper. If all else fails, don’t hesitate to contact us on Biostars.


  1. Altenhoff, A. M. et al. OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Res. 29, 1152–1163 (2019).
  2. Zahn-Zabal, M., Dessimoz, C. & Glover, N. M. Identifying orthologs with OMA: A primer. F1000Res. 9, 27 (2020).

Creating a bibliography with links to PubMed and PubMedCentral

• Author: Christophe Dessimoz •

We just submitted a paper to the Nucleic Acids Research Web Server issue. As it turns out, the editor requires a bibliography with links to DOI, PubMed, and PubMedCentral entries.

This is a brief tutorial to generate such a bibliography from a bibtex file which contains the relevant entries, loosely based on the explanations provided in this TeX StackExchange entry.

Generating the bibtex file

In the lab, we mainly use Paperpile as our bibliography management system, but most systems can export records in bibtex format. When available, Paperpile includes DOIs, PubMed IDs, and PubMedCentral IDs as follows:

    title    = "{Assigning confidence scores to homoeologs using fuzzy logic}",  
    author   = "Glover, Natasha M and Altenhoff, Adrian and Dessimoz, Christophe",
    journal  = "PeerJ",
    volume   =  6,
    pages    = "e6231",
    year     =  2019,
    doi      = "10.7717/peerj.6231",
    pmid     = "30648004",
    pmc      = "PMC6330999"

In this example, we store the bibliography in a file named ref.bib.
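Independently of the LaTeX setup, it can help to see what these three identifiers are meant to resolve to. The short Python sketch below builds the corresponding links for the example entry. The function name is hypothetical, and the URL patterns are the commonly used ones for DOI, PubMed, and PubMedCentral at the time of writing, so treat them as assumptions and check them against what the journal expects.

```python
def identifier_links(entry: dict) -> dict:
    """Build web links for whichever identifiers an entry provides."""
    links = {}
    if entry.get("doi"):
        links["doi"] = f"https://doi.org/{entry['doi']}"
    if entry.get("pmid"):
        links["pmid"] = f"https://pubmed.ncbi.nlm.nih.gov/{entry['pmid']}/"
    if entry.get("pmc"):
        links["pmc"] = f"https://www.ncbi.nlm.nih.gov/pmc/articles/{entry['pmc']}/"
    return links


# Identifier fields taken from the bibtex example above.
example_entry = {
    "doi": "10.7717/peerj.6231",
    "pmid": "30648004",
    "pmc": "PMC6330999",
}
example_links = identifier_links(example_entry)
for field, url in example_links.items():
    print(f"{field}: {url}")
```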

Extending biblatex to include PMID and PMC links in the bibliography

DOIs are already supported by most bibliography systems. To also include PMIDs and PMCIDs, the trick is to use the flexible biblatex package.

In a separate definition file, which we named adn.dbx, add the additional definitions for PMID and PMCIDs.
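
A minimal sketch of what adn.dbx could contain is given here. It assumes the biber backend (custom data models are not available with the classic bibtex backend) and field names pmid and pmc that match the bibtex entry above; treat it as an illustration rather than the exact file we used.

    % adn.dbx: declare two extra literal fields for the biblatex data model
    \DeclareDatamodelFields[type=field, datatype=literal]{pmid, pmc}
    % Without an optional argument, the fields are allowed in every entry type
    \DeclareDatamodelEntryfields{pmid, pmc}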


We can now load this file when declaring the biblatex package in the main LaTeX file (in the preamble, i.e. before \begin{document}) and define the links to PubMed and PubMedCentral entries:

  PubMed ID:\addcolon\space
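
The line above belongs to the field format for PubMed IDs. A fuller preamble could look like the following sketch. The URL patterns, the label text, and the choice to print both fields at the end of each entry via the finentry macro are assumptions made for illustration, not necessarily what our original file contained.

    \documentclass{article}
    % datamodel=adn makes biblatex read the extra fields declared in adn.dbx
    \usepackage[backend=biber, datamodel=adn]{biblatex}
    \usepackage{hyperref}  % provides \href
    \addbibresource{ref.bib}

    % How each new field is printed (assumed URL patterns)
    \DeclareFieldFormat{pmid}{%
      PubMed ID\addcolon\space
      \href{https://pubmed.ncbi.nlm.nih.gov/#1}{#1}}
    \DeclareFieldFormat{pmc}{%
      PMCID\addcolon\space
      \href{https://www.ncbi.nlm.nih.gov/pmc/articles/#1}{#1}}

    % Print both fields just before the end of every bibliography entry;
    % empty fields are skipped silently
    \renewbibmacro*{finentry}{%
      \newunit
      \printfield{pmid}%
      \newunit
      \printfield{pmc}%
      \finentry}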

We can generate the bibliography by citing every paper using the \cite{} command, and printing the bibliography.
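
As a minimal sketch, the document body could look like this. The citation key Glover2019 is hypothetical, so replace it with a key from your own ref.bib.

    \begin{document}
    % Hypothetical citation key; use one that exists in ref.bib
    Confidence scores for homoeologs are described in \cite{Glover2019}.
    % Alternatively, \nocite{*} lists every entry of ref.bib without citing it in the text
    \printbibliography
    \end{document}

With the biber backend, the usual build sequence is to run pdflatex, then biber, then pdflatex again on the main file.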


Polishing: highlighting the links with colour

To make the links more visible, define the hyperref package accordingly:
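
One possible setup is sketched below. The colour choices are only an example, not necessarily the ones we used.

    \usepackage{hyperref}  % can be omitted if hyperref is already loaded
    % Colour the link text itself instead of drawing boxes around links
    \hypersetup{colorlinks=true, urlcolor=blue, citecolor=blue, linkcolor=blue}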


OK, thanks but could I just have the files please?

Of course! Here they are.

How to get published: interview series

• Author: Natasha Glover •

You’ve wrapped up the research on your project. You’ve gotten good results. It’s time to publish them: your PhD/postdoc/career depends on it.

But you’re drawing a blank. Staring at the screen for hours, not knowing how to get started, and the only thing you can produce is an empty document. One of the most daunting tasks for young scientists is writing and publishing a paper, especially the first one.

In the context of a tutorial of the UNIL Quantitative Biology PhD Program, I prepared a series of three short videos featuring interviews with professors on tips to successfully publish a scientific paper. These videos are aimed at PhD students, but contain useful advice for anyone, at any stage of their career. Here’s what they had to say:

Part 1: the writing process


Part 2: the journal selection


Part 3: responding to reviewers


Exclusive: European Tour of Antonis Rokas

• Author: Christophe Dessimoz •

We are delighted to host Prof. Antonis Rokas, Vanderbilt, for two special seminars at University College London and at the University of Lausanne!

Genomics and the making of biodiversity across the budding yeast subphylum

Prof. Antonis Rokas, Vanderbilt University

London: Tue 13 Nov 2018, 11am, UCL, Roberts Building 309
Lausanne: Wed 14 Nov 2018, 11am, UNIL, Genopode auditorium A


Yeasts are unicellular fungi that do not form fruiting bodies. Although the yeast lifestyle has evolved multiple times, most known species belong to the subphylum Saccharomycotina (hereafter yeasts). This diverse group includes the premier eukaryotic model system, Saccharomyces cerevisiae; the common human commensal and opportunistic pathogen, Candida albicans; and over 1,000 other known species (with more continuing to be discovered). Yeasts are found in every biome and continent and are more genetically diverse than either plants or bilaterian animals. Ease of culture, simple life cycles, and small genomes (10–20 Mbp) have made yeasts exceptional models for molecular genetics, biotechnology, and evolutionary genomics. Since only a tiny fraction of yeast biodiversity and metabolic capabilities has been tapped by industry and science, expanding the taxonomic breadth of deep genomic investigations will further illuminate how genome function evolves to encode their diverse metabolisms and ecologies. As part of the National Science Foundation’s Dimensions of Biodiversity program, we have undertaken a large-scale comparative genomic study to uncover the genetic basis of metabolic diversity in the entire Saccharomycotina subphylum. In my talk, I will discuss the team’s evolutionary analyses of 332 genomes spanning the diversity of the subphylum. These include establishing a robust genus-level phylogeny and timetree for the subphylum, quantification of the extent of horizontal gene transfer for the subphylum, and characterization of the evolution of approximately 50 metabolic traits (and, in some cases, their underlying genes and pathways). These analyses allow us, for the first time, to infer the key metabolic characteristics of the Last Yeast Common Ancestor (LYCA) and characterize the tempo and mode of genome evolution across an entire subphylum.

All welcome!

Predicting QTL genes by integrating functional data across species

• Author: Christophe Dessimoz •

The problem in a nutshell

Quantitative Trait Loci (QTL) are regions of a genome for which genetic variants correlate with particular traits. To take a simple example in plants, one might observe that the average seed size (trait) is significantly larger in the subset of a population which has a C at a particular position in the genome than in the subset with a T.

The reason QTL analysis identifies genomic regions and not precise positions is that neighbouring variants tend to be inherited together. These regions typically contain hundreds of genes, making it difficult to say which one(s) are causal to the trait variation, if any at all (the causal genetic variation(s) can be in non-coding regions too).

Thus, to prioritise candidate causal genes within a QTL region, researchers typically consider previous knowledge on these genes, to see whether a particular gene “makes sense”. In the case of seed size, it might be a gene previously implicated in growth or regulation, or a gene known to influence seed size in a different species. This process, however, requires substantial manual interpretation, and is thus labour-intensive and haphazard.

Enter QTLsearch

We realised that our framework of hierarchical orthologous groups, which relates genes across many species, could be extended to integrate QTL results with previous gene function annotations.

Conceptual overview of QTLsearch

If we go back to the seed size example, it might be that among the genes in the window, one has an ortholog in a different species previously annotated with the GO term “reproductive system development”. This could be a good candidate causal gene.

One risk, however, in integrating lots of previous knowledge across many species is that we might also find some spurious patterns. We therefore had to devise a way of controlling for random associations between QTL regions and evolutionarily propagated knowledge. Such a “null distribution” depends on the specificity of the terms in question, the number of annotations, the size of the QTL regions, and the species sampling. To cope with this complexity, we chose to implement a non-parametric permutation test.
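
For readers unfamiliar with permutation tests, the generic form of the empirical p-value can be written as follows. This is a textbook formulation under assumed notation, not necessarily the exact statistic computed by QTLsearch: s_obs denotes the association score observed for a gene, s_b the score obtained in the b-th of B random permutations, and the indicator equals 1 when its condition holds and 0 otherwise.

    \[
      \hat{p} = \frac{1 + \sum_{b=1}^{B} \mathbf{1}\left[ s_b \ge s_{\mathrm{obs}} \right]}{1 + B}
    \]

Adding 1 to the numerator and denominator keeps the estimate strictly positive, so a finite number of permutations never yields a p-value of exactly zero.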

We implemented the tool as an open source package called QTLsearch, available here.

QTLsearch infers more candidate causal genes than manual analyses

We used QTLsearch to reanalyse two previous studies. In both cases, we could call more candidate genes than the original studies. But more importantly, the evidence behind our calls is fully traceable and statistically supported.

Barchart of QTLsearch performance

QTLsearch could identify more candidate genes than the original study, but in an automated, reproducible, and statistically meaningful way.

Thus we think this will greatly facilitate future QTL analyses, particularly those that are done in non-model species for which the previous experimental knowledge is very limited.

Behind the paper

This is the third paper that resulted from our collaboration with Bayer CropScience (now BASF CropScience), after our work on homoeologs and on detecting split genes.

The project was conceived by Henning Redestig, collaborator at Bayer at the start of the project (now at DuPont). Henning had contributed to a QTL study and knew how labour-intensive the search for putative causal genes is. He realised that HOGs could provide a natural way of integrating functional knowledge across multiple species, to combine the QTL information with previous functional data.

Alex Warwick Vesztrocy, PhD student on the project and first author, ran with the idea—promptly implementing and testing it. Early results looked promising, but Alex soon realised that the mapping between metabolites and GO terms could be improved. He also realised that some terms were quite common, so he devised the approach to compute the significance scores.

Our manuscript was accepted as a proceedings paper at the European Conference on Computational Biology (ECCB). In our lab, we like proceedings papers. It’s nice to be able to present the work and publish the paper, particularly since the ECCB proceedings appear in a good journal. More importantly, conferences impose hard deadlines. Deadlines for submission of course, but also for peer-reviewing and for deciding acceptance or not!


Alex Warwick Vesztrocy, Christophe Dessimoz*, Henning Redestig*, Prioritising Candidate Genes Causing QTL using Hierarchical Orthologous Groups, Bioinformatics, 2018, 34:17, pp. i612–i619 (ECCB 2018 proceedings) [Open Access Full Text]

(Post updated on 4 Oct 2018 and on 13 Nov 2018)

Our lab has an open position, and so do collaborators and colleagues across Switzerland and Europe.

Please help us spread the word by forwarding this post. If you have computational biology jobs to announce, let me know and I will gladly add a link.

Postdoc position in our lab

PhD openings with colleagues

Postdoc openings with colleagues

Group leader positions

Bonus position (not computational but what the heck)

The Dessimoz Lab blog is licensed under a Creative Commons Attribution 4.0 International License.