Open Reading Frame

Phylogenetics from AI-predicted Protein Structures: it works!!

David Moi — Sun, 24 Sep 2023 14:04:55 +0100

Breakthroughs don’t come every day, but the consequences of AlphaFold largely solving the 3D structure prediction problem has reshaped biology in profound ways. The sudden availability of protein structures for billions of proteins opens up many new possibilities. Last week’s two papers on the sequencing universe provide a compelling glimpse of the possibilities (here and here).

As someone who has been interested in tracing back the evolutionary origins of selected proteins—such as the cell fusion-mediating proteins fsx1 in plants, viruses, and archaea, or odorant receptors in insects—I have attempted to reconstruct phylogenies from structure in the past.

But I have faced two major issues:

Until AlphaFold came along, there typically wasn’t sufficient high-quality structure predictions as “starting material” to perform structure-based phylogenetics.
Even when I could obtain reasonably high confidence structures, the trees inferred from them were often met with skepticism—how reliable are these trees?

So now that high quality structure predictions are widely available, we could finally ask: are structures any good as starting material to infer trees? Specifically, how accurate are the reconstructed trees compared to sequences?

Today, we are super excited to report that structural phylogenetics works! What’s more, we found an approach that doesn’t just outperform traditional sequence-based methods for distant relationships; it also excels in resolving phylogenetic trees for closely related proteins. This post gives the gist of what we found—the full study is released as a preprint (1).

What’s the big deal with structural phylogenetics?

Before presenting our results, let’s take a step back. Why is structural phylogenetics potentially a big deal? Traditional phylogenetics, the study of evolutionary relationships among species or genes, has long relied on comparing the sequences of DNA, RNA, or proteins. While this approach has been immensely valuable, it does have its limitations. The primary challenge lies in the fact that the sequences of these biomolecules can change rapidly over time due to mutations and other factors, making it difficult to trace back their evolutionary history accurately when the divergence is very high. By contrast, proteins have unique three-dimensional structures that are intricately linked to their functions; these structures tend to change more slowly over evolutionary timescales compared to the sequences of the amino acids that make up the proteins since they are closely tied to the function of the protein.

In this particular example close to my heart, we can see structural homology between functionally homologous proteins at wide evolutionary ranges. The examples shown span plants, metazoans, viruses and archaea. They share virtually no sequence homology. Ref: (2)

When we set out to do our work, however, we were not at all sure that it would work, let alone outperform sequence based methods. On the one hand, there have been decades of intensive tool and model refinements for sequence-based approaches, unlike its structure-based counterpart. But also, complications related to structure, such as allostery, flexible regions, and functional constraints could conceivably confound the evolutionary signal that can be extracted from structures.

Evidence that structure-based trees can outperform sequence-based trees

We tested a few structural approaches, and settled on an approach reconstructing distance trees using Foldseek’s “local structural alphabet” approach, which was developed in the lab of our collaborator Martin Steinegger to search for similar structures very rapidly—by encoding local structure motifs in a 20-letter alphabet and repurposing highly optimized alignment software originally developed to align amino acid sequences (3).

Testing and comparing the quality of phylogenetic trees empirically is tricky business. Most comparisons are based on simulated data, or by comparing the fit of data to different models. But how to compare trees that are reconstructed from entirely different kinds of input data? Luckily, our lab has accumulated quite some experience in these kinds of empirical observations, used previously to compare the accuracy of alignment (4 and 5) or orthology (6 and 7) methods. We used an approach which compares the propensity of inferred trees to recapitulate the known taxonomy of the species from which the proteins are sampled from.

When comparing the taxonomic plausibility of thousands of trees derived from homologous protein families, Foldtree outperforms sequence-based phylogenetics. (In the paper, we show that after filtering the input set to families with high quality structures, the structural phylogenies perform even better!)

Amazingly, the trees we inferred in this way were more in line with the known taxonomy than those defined by sequence similarity! The input data can either be experimental crystal structures or AI structural models. Using good quality structures positively impacts the quality of the trees produced which means that as structural prediction methods get better, so will our structural trees.

The RRNPPA family: a first unifying phylogeny for peptidic quorum sensing proteins

To put our method to the test, we focused on a particularly complex gene family - the RRNPPA quorum sensing receptors (8). These receptors play a pivotal role in enabling communication and coordination among gram-positive bacteria, plasmids, and bacteriophages for crucial behaviors like sporulation, virulence, antibiotic resistance, conjugation, and phage lysis/lysogeny decisions.

The complex evolutionary pattern of this family is revealed in its name. Before AI structures, new homologs were previously only detectable after having been crystallized and each subfamily was added piecemeal to the overall picture, resulting in their particularly long acronym. As the family expanded researchers also attempted to piece together its evolutionary history, using a diverse set of methods, some of which relied on structural analysis. Using Foldtree we decoded the evolutionary diversification of these genes, shedding new light on their intricate history.

Compared to the sequence-based phylogeny, the Foldtree reconstruction of the RRNPPA family’s history is remarkably parsimonious. Several events such as domain architecture changes or transfers to the viral world appear only once in the tree.

Foldtree: infer a structural phylogeny for your favorite protein family

To make it easy to try this approach, as well as facilitate methodological improvements, we are releasing this new approach as an open source tool we call Foldtree. It’s available for download on GitHub (https://github.com/DessimozLab/fold_tree). Try it on your favorite protein family and let us know how it performs!

Exciting new research directions

High-accuracy structural phylogenetics has the potential to uncover deeper evolutionary relationships, elucidate unknown protein functions, and even refine the design of bioengineered molecules. The evolutionary histories of protein families in the viral domain, the start of eukaryotic life and the role of asgard archaea as well as the evolution of the prokaryotic mobilome are just a few cases where the fast pace of evolution has confounded sequence-based analyses and could be revisited. We believe this work represents an important step in investigating how structures are polished by the processes of evolution and how we can use this signal to peer further into the past than ever before.

References

Moi D, Bernard C, Steinegger M, Nevers Y, Langleib M, Dessimoz C. Structural phylogenetics unravels the evolutionary diversification of communication systems in gram-positive bacteria and their viruses. bioRxiv 2023.09.19.558401; doi: https://doi.org/10.1101/2023.09.19.558401
Moi D, Nishio S, Li X, Valansi C, Langleib M, Brukman NG, et al. Discovery of archaeal fusexins homologous to eukaryotic HAP2/GCS1 gamete fusion proteins. Nat Commun. 2022;13: 3880. doi:10.1038/s41467-022-31564-1
van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol. 2023. doi:10.1038/s41587-023-01773-0
Tan G, Gil M, Löytynoja AP, Goldman N, Dessimoz C. Simple chained guide trees give poorer multiple sequence alignments than inferred trees in simulation and phylogenetic benchmarks. Proceedings of the National Academy of Sciences of the United States of America. 2015. pp. E99–100. doi:10.1073/pnas.1417526112
Dessimoz C, Gil M. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biol. 2010;11: R37. doi:10.1186/gb-2010-11-4-r37
Altenhoff AM, Dessimoz C. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput Biol. 2009;5: e1000262. doi:10.1371/journal.pcbi.1000262
Altenhoff AM, Boeckmann B, Capella-Gutierrez S, Dalquen DA, DeLuca T, Forslund K, et al. Standardized benchmarking in the quest for orthologs. Nat Methods. 2016;13: 425–430. doi:10.1038/nmeth.3830
Bernard C, Li Y, Lopez P, Bapteste E. Large-scale identification of known and novel RRNPP quorum sensing systems by RRNPP_detector captures novel features of bacterial, plasmidic and viral co-evolution. Mol Biol Evol. 2023. doi:10.1093/molbev/msad062

The Surprising Uniformity of Protein Length Distribution Across the Tree of Life

Yannis Nevers & Christophe Dessimoz — Tue, 13 Jun 2023 09:55:06 +0100

Proteins are fundamental to all life forms, dictating the complex biochemical interactions that maintain and drive the existence of every species. The functionality of a protein hinges on its structural domain organization, and the protein’s length is a direct manifestation of this. Given that every species has evolved under varying evolutionary pressures, one would intuitively expect protein length distribution to differ significantly across species.

Well, we report in a paper just published in Genome Biology that this is not the case.

Unexpected Homogeneity in Protein Length Distribution

In our study, we examined the protein length distribution across 2,326 species encompassing 1,688 bacteria, 153 archaea, and 485 eukaryotes. Counter to expectations, we observed a striking consistency in protein length distribution across these species. Though eukaryotic proteins were somewhat longer, the variation in protein length distribution was notably low compared to other genomic features such as genome size, gene length, number of proteins, GC content, and isoelectric points of proteins.

Features directly related to protein length are much more conserved than other features.

Exceptions: Errors or Biological Peculiarities?

We did note a few atypical cases of protein length distribution, but these were typically due to inaccuracies in gene annotation: no well-annotated model species displayed enrichment in small proteins, and those with a high number of small proteins were more likely to have incomplete or fragmented genome annotations.

Indeed, the outliers tended to include many more genomes scoring low in BUSCO quality score. The only exception we observed was the prevalence of longer proteins in the Ustilago fungal genus and the Apicomplexa phylum, known for their intracellular parasitic lifestyles.

This suggests that the actual variation in protein length distribution might be even smaller than what we reported. Hopefully, resequencing and reannotation efforts will help solve this issue in the future: we already noticed a few species getting updated proteomes where the length distributions gets more similar to the typical one!

A Universal Selection Force at Play

The startling uniformity of protein length distribution across diverse species suggests a strong, universal selective pressure, maintaining a high proportion of the coding sequence within a specific length range. In the discussion part of the paper, we articulate a number of potential explanations, but these remain highly speculative.

More positively put, the evolutionary forces behind the uniformity of protein distribution and their potential impact on fitness remain exciting areas of exploration!

Protein Length Distribution: A New Criterion for Gene Quality?

This observation led us to propose the use of protein length distribution as a new criterion of protein-coding gene quality upon publication. Considering that the overabundance of spurious proteins could potentially bias downstream analyses, this quality measure could aid in identifying and rectifying annotation errors. We also encourage everyone to take a look at this simple criterion when selecting proteomes for comparative genomics analysis.

Story behind the paper

The basic premise of the paper, exploring protein length distribution across the tree of life, may seem straightforward at first glance. Not quite. It started as part of Yannis’s PhD in Odile Lecompte’s lab in Strasbourg—and a few questions: what are the characteristics of the thousands of publicly available proteomes? How to decide which to include in large scale analyses? It took another three years of Yannis’s postdoc, with about half of that time spent in the peer-review process.

Perhaps the most revealing testament to the depth of this work is the supplementary PDF, a 68-page document filled with detailed data and analyses. Moreover, anyone interested in the peer-review history of our paper can delve into the 18-page record available here.

The journey is the reward, they say; well in this instance, we are quite happy to have reached our destination!

Reference:

Nevers, Y., Glover, N.M., Dessimoz, C, Lecompte, O. Protein length distribution is remarkably uniform across the tree of life. Genome Biol 24, 135 (2023). https://doi.org/10.1186/s13059-023-02973-2

Read2Tree infers phylogenetic trees from raw sequencing reads quick and easy

Christophe Dessimoz & Fritz Sedlazeck — Sun, 23 Apr 2023 20:56:37 +0100

We just published a method to build phylogenetic trees directly from raw reads, bypassing time-consuming steps such as genome assembly. This post gives the short story and the backstory. In particular, find out below what Read2Tree has in common with “Smoke on the Water” from the band Deep Purple.

In biology, phylogenetic trees are everywhere. They help us understand the relationships between species, genes, or cells—how they evolved, and how they’re related.

The sequencing revolution provides the raw material to infer phylogenetic trees, but building state-of-the-art phylogenetic trees requires tedious steps from read curation, de novo assembly, gene annotation, ortholog identification to tree inference, which can take many months to run—millions of CPU hours invested in this process are not uncommon—and specialised knowledge to oversee this process.

That’s where Read2Tree comes in. Our new approach to tree inference bypasses the usual steps of genome assembly, annotation, and orthology inference. Instead, it uses existing knowledge of the protein sequence universe to directly reconstruct comprehensive sequence alignments from raw sequencing reads.

The approach is vastly faster than traditional methods and in many cases more accurate—the exception being when sequencing coverage is high and reference species very distant. Read2Tree is also flexible, working with genome and transcriptome, short and long reads, and sequencing coverage as low as 0.1x.

We were encouraged by the buzz the Read2Tree manuscript elicited on bioRxiv last year, and are delighted it has now been published in Nature Biotechnology.

What is Read2Tree good for?

A nice illustration of Read2Tree’s potential was the reconstruction of a phylogeny of coronaviruses, which processed on the same tree diverse Coronaviridae sequences as well as 10,000 raw SARS-CoV-2 datasets from the Short Read Archive. The reconstructed tree was consistent with the lineage classification obtained from the UniProt reference proteomes, accurately recovering the main coronavirus genera and all subgenera (Figure 1). At the same time, the same phylogeny accurately clustered the sequences according to CDC variants of concerns classification. These results demonstrate the versatility and scalability of Read2Tree, making it suitable for both zoonotic surveillance and human epidemiology.

Figure 1—Zoomed-in display of a tree inferred using Read2Tree on 10,283 samples whole genome SARS-CoV-2 samples. Classification in colour was obtained from [https://harvestvariants.info](https://harvestvariants.info), where grey leaves are unclassified according to the CDC label. The colour clustering shows that the Read2Tree-based tree recovers consistent classification. Click on the tree to see it full screen.

The ability to reconstruct phylogenetic trees from raw reads has additional advantages. Some genomes are deposited with poor or even entirely absent protein annotation sets. Processing genomes directly from raw reads can avoid this limitation and also decrease biases that arise from relying too heavily on specific reference genomes. Although some efforts have been made to “dehumanize” non-human great ape genomes, other clades still face similar biases that can be significantly reduced by processing raw reads.

Who might find it useful?

We think Read2Tree will be especially useful for small labs with limited bioinformatics expertise and computational resources, allowing them to perform state-of-the-art phylogenomics on particular species or environments of interest.

But it’s not just small labs that can benefit from Read2Tree. Large consortia can also use it to regularly update their trees as new genomes are sequenced. This is especially important as more and more projects around comparative genomics are underway, such as the Earth BioGenome, the Darwin Tree of Life, or the European Reference Genome Atlas projects.

In addition, Read2Tree’s ability to infer trees from much lower coverage than traditional methods means it can also be useful for quality control early in the process. This makes it a valuable tool for environmental and metagenomic applications, especially when combined with genome binning techniques.

Overall, Read2Tree is a powerful method for inferring phylogenetic trees directly from raw sequencing reads. We hope it will help make phylogenetic tree inference faster, more accurate, and more accessible to a wider range of researchers.

What’s next?

Now that the introductory Read2Tree paper is published, we are excited to explore new potential applications that we haven’t been able to tackle so far. For instance, we have already received inquiries from researchers interested in using Read2Tree for ancient DNA applications or for monitoring systems that require fast turnaround time and low coverage.

Moving forward, we have two main goals. First, we aim to expand Read2Tree’s capabilities to handle multi-species samples, which will enable an even broader range of applications in the metagenomics field. While long-read applications may offer the most benefit, we are confident that Read2Tree’s ability to perform well with short-reads will also prove valuable in detangling multiple species.

Secondly, we plan to explore the use of Read2Tree in single-cell sequencing. This rapidly growing field involves sequencing individual cells, including cancer cells, and analysing their genetic information. Given Read2Tree’s ability to operate with low coverage levels (down to 0.2x), we believe it could facilitate fast and accurate characterization of tumour or cell evolution.

We hope that Read2Tree will help streamline and democratise comparative genomics analyses. We are excited to see how researchers will apply this tool to further advance our understanding of genetics and evolution.

What’s the backstory?

Both of our labs (Fritz Sedlazeck’s and Christophe Dessimoz’s) have been collaborating for many years, and we’ve always enjoyed exchanging ideas even though our research interests are quite diverse. One of our interests over the years is how to combine our expertise in sequence analysis and ortholog comparison to develop new methodologies and gain new insights into biology.

It was during one of Fritz’s visits to Christophe’s lab in Lausanne, Switzerland, that we started brainstorming ideas for a project that led to Read2Tree. Our goal was to overcome the limitations and bottlenecks of comparative genomics. We had some amazing cheese risotto, and the beautiful scenery fueled our discussions further (Figure 2).

Figure 2 — Fritz alleges that the epiphany of Read2Tree took place with this view from his hotel room in Montreux, Switzerland, during a collaborative visit to Christophe’s group. It’s not entirely implausible, considering this very view [inspired the song “Smoke on the Water” by Deep Purple](https://en.wikipedia.org/wiki/Smoke_on_the_Water#History).

David Dylus, the first author, was convinced that it was possible to bring our ideas to life, although he did not anticipate how much time and effort it would take (Figure 3). Even after he moved on to a new role in the pharmaceutical industry, he continued to work on Read2Tree after regular work hours. And when the COVID-19 pandemic hit, we had to face additional challenges, such as maintaining regular meetings and pushing the manuscript forward while not compromising on quality. We also faced technical issues, such as hard disk crashes and cluster updates that led to data loss, but David hang on.

Completing the paper was not an easy task, and one of the biggest challenges was organising and identifying all of the SRA data sets, including those related to yeast and COVID-19. Despite these challenges, we were able to bring the project to completion. It was a special joy to present the work at ISMB 2022, where Fritz and Christophe had the wonderful opportunity to meet in person, and we continued to discuss our work while enjoying good food and drinks by beautiful Mendota lake in Madison, Wisconsin.

In summary, nice food and lakeside views were instrumental in the making of Read2Tree.

Figure 3 — First author David Dylus performing on stage (centre, crouching) on the occasion of SIB Swiss Institute of Bioinformatics’s 20th anniversary—a period of rapid progress in the development of Read2Tree. Though no-one is entirely certain, rumour has it that David is miming “sipping a cup of tea while looking into the distance”, in line with our theme of sustenance, inspiring landscapes, and scientific progress.

Note this blog post was first published on the Nature Communities blog here.

Reference

Dylus, D., Altenhoff, A., Majidian, S. et al. Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree. Nat Biotechnol (2023). doi:10.1038/s41587-023-01753-4.

OMArk: assessing proteome quality, quick and easy

Yannis Nevers — Mon, 12 Dec 2022 09:39:58 +0000

I am excited to introduce our preprint for our new tool OMArk. We hope our software will help fill a gap in assessing the quality of gene annotation sets.

Many studies directly rely on the protein-coding gene repertoires (“proteomes”) predicted from genome assemblies to perform their comparisons. Doing so, they rely on the assumption that the predicted gene content of all genomes are of homogeneous quality and an accurate reflection of reality. Yet in practice, this assumption is rarely met, with protein-coding genes often missing or fragmented in the reported proteomes, non-coding sequences wrongly annotated as coding genes by gene predictors, or contamination from other species wrongly included among the reported sequences.

Why a new proteome quality tool?

Our new method, OMArk, provides a way to easily and comprehensively measure different aspects of proteome quality: completeness of the gene repertoire, consistency of the included genes at the taxonomic level, whether they have doubtful gene structures, and presence or not of inter or intra-domain contamination. Furthermore, contrary to existing methods, OMArk does not rely on a manual selection of reference dataset; instead, it automatically identifies the most likely taxonomic classification of the test proteomes. It can thus process any test proteome across the tree of life using a universal reference database.

Conceptual overview of OMArk (left) and how the innovative consistency assessment is computed (right).

OMArk is accurate and provides new insights

We performed extensive validation of the method by introducing controlled amounts of noise, fragmentation and contamination to reference proteomes and accurately estimating these amounts using OMArk. We also performed a large-scale analysis of 1805 eukaryotic UniProt Reference Proteomes with our software and were able to detect unambiguous cases of quality issues, either caused from incompleteness, contamination, or inclusion of translated non-coding sequences. In the most extreme case, we found a plant proteome with contamination from eight different species—fungi and bacteria.

OMArk results on 1805 Eukaryotic proteomes from UniProt. Interactively check results on the OMArk webserver, e.g. for the current cowpea weevil reference proteome.

Why does the consistency metric matter? For example, comparing the Ensembl gene set for two assemblies of Bombus impatiens. We can detect a major improvement in consistency (including contamination removal) for a similar completeness.

OMArk can reveal improvements in genome assemblies/annotations even if completeness has not substantially changed.

OMArk is quick and easy to run

OMArk can be easily used as a command line tool or on our OMArk webserver. On the webserver, you can submit a FASTA file of your proteome and get results in about 30 minutes. Nothing more required. You can visualize the results and directly compare it to precomputed results from closely related species (UniProt reference proteomes).

More details can be found in the preprint linked below. Please let us know how the tool works for you!

Reference

Yannis Nevers, Alex Warwick Vesztrocy, Victor Rossier, Clément-Marie Train, Adrian Altenhoff, Christophe Dessimoz, Natasha M Glover Quality assessment of gene repertoire annotations with OMArk Nature Biotechnology 2024 doi:10.1038/s41587-024-02147-w

“Mathematical and Computational Evolutionary Biology” annual conference, edition 2022

Charles Bernard — Sun, 06 Nov 2022 14:51:04 +0000

This year, the MCEB international conference was held in Switzerland for the first time in its history.

During five days, from June 26th to 30th 2022, experts in mathematical and computational evolutionary biology from all over the world exchanged about their passion in Chateau d’Oex, in the heart of the welcoming mountains of the regional natural park of Gruyeres.

Here is a little summary on this annual not-to-be-missed event for evolutionary biology aficinionados.

The edition 2022 of the international MCEB conference brought together a hundreds of scientists from diverse disciplines and at different stages of their career, from PhD students to world-renown senior scientists. During five consecutive days, the Chateau d’Oex has hence been the improbable scene of inspiring exchanges about evolution between mathematicians, computational biologists, evolutionary biologists, ecologists, epidemiologists and cancer biologists.

View from the conference venue (left) and cheese making social activity (right).

In total, 6 one-hour talks and 20 short talks were given to present epistemological perspectives, recent methodological advances and challenges yet to be addressed for reconstructing the evolutionary history of the genes, genomes, populations and species observed then and today on Earth.

As a mirror of the truly interdisciplinary nature of the event, a wide range of phylogenetic structures have been discussed during these five days. Networks of gene flows, phylogenetic trees, or genealogical trees predicted by coalescent theory were on the menu of this year.

Experts specialized in introgressive events such as horizontal gene transfers or endosymbioses provided insights on the methods and challenges to model reticulate evolution. With this respect, an inspirational talk was given on how ghost lineages that went extinct in the past but nonetheless exchanged some DNA with ancestors of extant species could mislead our interpretation of the directionality of gene flows within phylogenetic networks.

Lectures on the theoretical advances that were made throughout the last 50 years in the reconstruction of gene and species phylogenetic trees were then given by world leaders in the field of phylogeny. In particular, they provided mathematical evidence that contrary to what is practiced today to reconstruct species trees, neither the consensus tree of several gene trees nor the tree inferred from the concatenated alignment of these genes actually give a good approximation of the phylogeny of the different species encoding these genes, which highlights the urgent need to pursue methodological efforts to better model species evolution.

On shorter evolutionary timescales, numerous mathematical models to infer geneological trees of human populations or cancer cell lineages were also presented during the conference.

Poster session (left) and group photo (right).

Finally, a strong focus was placed this year on methods for coupling phylogenetic inferences with phenotypical, ecological, archaeological, geographical, epidemiological and medical data in order to study how traits or diseases evolved across space and time. Striking examples of these integrative analyses were provided by methodologies to retrace with accuracy the evolution of the recent Sars-Cov2 and MERS-Cov pandemics over time and space.

In addition to these talks of exceptional scientific quality, two poster sessions animated by junior researchers and students took place during the conference and were truly appreciated by every participants for the scientific excellence of the posters and the conviviality of the moments.

Overall, through five days of scientific presentations, poster sessions, dinners, parties and social activities such as hiking in the Alpes or visiting a cheese factory, scientific exchanges and informal talks were fostered and allowed to create news bonds within this community of researchers.

The MCEB 2022 conference was a great success as it enabled a diversity of scientists from all over the world to meet, exchange on their work and build new collaborations!

The Banana Conjecture

Natasha Glover — Tue, 08 Dec 2020 16:24:06 +0000

I recently became aware of the memes and popular science articles going around the internet claiming that we share 50% of our DNA with bananas. For example:

Source: http://www.quickmeme.com/meme/36gnaz

I work in the Dessimoz lab at the University of Lausanne, and here we are in the business of comparing genes. In fact, I’ve had a similar question before一 what percentage of our protein-coding genes do we share with another plant, Arabidopsis thaliana. I computed the number as being closer to 17%.

I wanted to get to the bottom of this question once and for all: What percentage of a human’s “genetic material” is shared with a banana? There have been several other blog posts from scientists touching on this question (Neil Saunders: “50% bananas”, Stack Exchange skeptics: “Do humans share 50% of their DNA with bananas?”, Sanogenetics: “Are We Genetically Similar To Bananas And Why Is This Important For Research In Disease?”).

However, I wanted to go a little deeper into:

Where this number came from, and the extent of it being spread on the internet.
What exactly do we mean by “shared genetic material”?
Some results I computed in attempts to put this controversy to rest once and for all.

In this blog post I will attempt to address these questions.

Where did the mythical 50% come from anyway?

After performing a quick google search, it seems that the relatedness between a human and a banana has been a popular question. With a cursory, non-exhaustive search, I show in the table below eight sources who report that 44-60% of the human genome is “shared” with banana.

Source	quote
Irishnews.com	“But we are also genetically related to bananas – with whom we share 50% of our DNA – and slugs – with whom we share 70% of our DNA.”
getscience.com	“Banana: more than 60 percent identical. Many of the “housekeeping” genes that are necessary for basic cellular function, such as for replicating DNA, controlling the cell cycle, and helping cells divide are shared between many plants (including bananas) and animals.”
thenakedscientists.com	“So where does this banana statistic come from? Is it just complete nonsense? Well, no. We do in fact share about 50% of our genes with plants – including bananas.”
PopSci.com	“Bananas have 44.1% of genetic makeup in common with humans.”
MythBusters (tv show) facebook	“#sciencefact: Humans share approximately 98% of their DNA with chimps, 70% with slugs, and 50% with bananas! http://bit.ly/qsWX8p”
mirror.co.uk	“Humans share 50% of our DNA with a banana.”
Business Insider	“The genetic similarity between a human and a banana is 60%.” Source: National Human Genome Research Institute (However, no link and when I tried to search)
Sundaypost.com	“Yes, and we share 50% with bananas. It’s not surprising, if you look at the basic mechanism of biochemistry.”

What is disconcerting is that at least half of these sources come from popular science websites or science sections of newspapers, yet few have any sort of citation at all. The only exceptions were Popular Science, which gave DataScope as a source, and Business Insider, who cites the National Human Genome Research Institute. However, neither of these articles give a link or further information to follow up on.

Upon further digging, I found one recent article published on howstuffworks entitled “Do People and Bananas Really Share 50 Percent of the Same DNA?”, which contains an interview with one of the scientists from the Human Genome Research Institute, where he explains how they arrived at that number.

“Brody says the experiment was not published, as most scientific research is. Instead, it was generated to be included as part of an educational Smithsonian Museum of Natural History video called ‘The Animated Genome.’ That video noted that DNA between a human and a banana is ‘41 percent similar.’””

The article goes on to explain that this 41% figure comes from a blast search between protein sequences of human and banana. They found about 7,000 hits, and the average percent identity of these hits was 41%. He goes on to note:

“This is the average similarity between proteins (gene products), not genes… Of course, there are many, many genes in our genome that do not have a recognizable counterpart in the banana genome and vice versa.”

So when we get to the bottom of it, the 50% figure is actually 40% average amino acid percent identity between 7000 blast hits of human and banana.

What do we mean when we say “we share 50% of our DNA with a banana”?

All living organisms descended from a common ancestor, and therefore all living organisms have some genes in common. What determines how many genes in common depends on how far back in time the two species shared a common ancestor. For example, humans and chimps share such a high percentage of genes, because we only diverged ~6 MYA¹. However, human and banana (more specifically the common ancestors which led to human and banana) split around 1.5 BILLION years ago². Talk about a banana split! Therefore we would expect a lot less to be conserved.

As brought up by Neil Saunders in his blog post, “What does ‘we share 50% of our DNA’ really mean?” A non-biologist perhaps might not see the nuance in this question. If I were going to play the devil’s advocate, I could say that a child shares 50% of its DNA with their parents. Or even that every organism shares 100% DNA, as it is all made up of Gs, Cs, As, and Ts. Thus, it is important to be specific on what we’re talking about.

This shared DNA could be referring to a number of things: protein-coding genes, non-coding genes, transposable elements, the percent that gets aligned in a whole genome alignment³, etc. Each of these specific features evolve at different rates, and thus will be more or less conserved between any given species.

Well, how do we know if these genomic features are conserved?

Generally, sequences are compared by making an alignment, and then computing the percent identity or evolutionary distance between the two sequences. If the sequences are sufficiently similar, they can be declared as conserved. Thus, “conserved” can be seen as either categorical (i.e. conserved or not), and then specified as a quantitative value (conserved to a certain degree). For more information, see the Wiki page on conserved sequences.

What are the genomic features the most likely to be conserved?

“Conservation indicates that a sequence has been maintained by natural selection” (wiki). Genes, or DNA sequences, encode for the proteins. Proteins are slower to evolve and change than the DNA, due to the redundancy of the genetic code. Thus proteins are the genomic feature most likely to be conserved between evolutionary distant species. While it is true that other genomic features such as non-coding regulatory sequences or non-coding RNA can be conserved over long evolutionary distances, they are far more likely to diverge in sequence than proteins^4,5. Other genetic features such as transposable elements, or intergenic “junk DNA” are even less likely to be conserved, as their sequences are under less selection pressure and accumulate mutations at an even higher rate.

It is important to note that while we generally declare sequences to be conserved on the basis of sequence similarity, sequences may be still conserved and lack similarity. For example, two sequences might be conserved in the structure of the protein, indicating homology ^6,7. Additionally, sequences might be in a syntenic position, indicating ancestral conservation, but may also lack sequence similarity ⁸. Thus, it is possible for some genes to be shared between evolutionary distant species, but they may fly under the radar of our current homology-inference tools. So, in order to investigate the 50% shared DNA claim, we can only focus on sequence conservation which we are able to detect.

To understand how much of the genome is conserved between banana and human, I will look at proteins because it’s the feature most likely to be conserved between human and banana. This is to be as permissive as possible in attempts to give the benefit of the doubt to the 50% meme.

Now the question is, how do we compare all the proteins in one species to all the proteins in another species and see which ones “match”, i.e. descended from a common ancestral gene? This is a fundamental problem important for studying evolution. Orthologs are the term we use for genes in different species that started diverging due to a speciation event, i.e. “corresponding” genes between species. This is where our lab’s expertise comes in: we maintain Orthologous Matrix, which is a method and database for finding orthologs between many species.

Orthologs in common between human and banana

I wanted to see what percent of human’s genes are orthologous to banana genes一and vice versa一what percent of banana’s genes are orthologous to human’s. To compare several different methods, I tested three common methods for finding orthologs: OMA⁹, OrthoInspector¹⁰, and best-bidirectional hit (using BLASTP)¹¹. For each method, I divided the number of orthologs found by the number genes in the genome to come up with a percentage of each genome that is shared. You can find all the details here jupyter notebook, but the results are summed up in the graph below:

Comparison of ortholog methods

As you can see, all the orthology-inference methods tested show a maximum of 25% of human genes to be orthologous to banana. Again, these results give the most leeway, as we used protein sequences, which are the genomic elements the most likely to be conserved.

Additionally, I investigated the percentage of a whole-genome alignment that would be shared between banana and human. Since this is computationally intensive, I used Ensembl Compara, which has precomputed pairwise whole-genome alignments between a number of species. A whole-genome alignment looks at the whole genome, not just genes, as well as compares DNA rather than proteins. They didn’t have results between human and banana, but here are the results between human and chimp, mouse, and zebrafish:

Data obtained from https://uswest.ensembl.org/info/genome/compara/analyses.html#

As we get progressively further in evolutionary distance, we get a smaller and smaller percentage of the genome which is able to be aligned. We can presume that plants would be even less than 1%, a far cry from the 50% as reported by internet memes.

So whichever way you slice it, humans share at most ¼, not ½ of its genetic material with banana (at least what we are able to detect)!

What do these human-banana orthologs DO?

Now that we have found the human-banana orthologs, we can try to gain some insight into what these genes do. To do this, I performed a Gene Ontology (GO) enrichment analysis of the human genes. GO enrichment works by assigning functional annotations to all of the sequences, then looking for a statistical overrepresentation of certain functions in a subset of genes compared to the entire genome.

I used the PANTHER Overrepresentation Test web server for the GO enrichment, then used GO-Figure¹² for summarizing and visualizing the most enriched Biological Processes. All the details are in the jupyter notebook.

The top 10 overrepresented GO terms, i.e. a summary of the most common functions of the human genes with orthologs, is shown below:

Top 10 overrepresented GO Biological Processes for human protein-coding genes with banana orthologs

We can see that the human-banana orthologs are highly enriched for basic, metabolic processes such as “cellular metabolic process,” “gene expression,” and “RNA processing.” These biological functions are likely genes which encode for cellular processes that are essential for eukaryotic life!

Take home message

“Humans share 50% of DNA with banana” is a statement that has very little meaning.
We must be careful to be precise in our language. We have to clarify what we mean when we give a percentage of “shared genetic material/DNA/genome.” I argue that the percentage of protein-coding genes is currently the best way to compare evolutionarily distant species
There’s no evidence that humans have 50% of detectable orthologs with a banana. In my analysis, I show between 17 and 24%, depending on which method was used. As scientists, we have to do a better job communicating science with each other and with the general public.

Even though we don’t have 50% genes in common with banana, we still have ~20% which is nothing to scoff at! The functions of these genes are most likely basic housekeeping proteins involved in metabolic processes that are necessary for most, if not all of eukaryotic life. It is amazing that these genes have been conserved over 1.5 billion years of evolution!

References

Patterson, N., Richter, D. J., Gnerre, S., Lander, E. S. & Reich, D. Genetic evidence for complex speciation of humans and chimpanzees. Nature 441, 1103–1108 (2006).
Wang, D. Y., Kumar, S. & Hedges, S. B. Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi. Proc. Biol. Sci. 266, 163–171 (1999).
Armstrong, J., Fiddes, I. T., Diekhans, M. & Paten, B. Whole-Genome Alignment and Comparative Annotation. Annu Rev Anim Biosci 7, 41–64 (2019).
Ransohoff, J. D., Wei, Y. & Khavari, P. A. The functions and unique features of long intergenic non-coding RNA. Nat. Rev. Mol. Cell Biol. 19, 143–157 (2018).
Diederichs, S. The four dimensions of noncoding RNA conservation. Trends Genet. 30, 121–123 (2014).
Illergård, K., Ardell, D. H. & Elofsson, A. Structure is three to ten times more conserved than sequence—a study of structural response in protein cores. Proteins 77, 499–508 (2009).
Zheng, W. et al. Detecting distant-homology protein structures by aligning deep neural-network based contact maps. PLoS Comput. Biol. 15, e1007411 (2019).
Vakirlis, N., Carvunis, A.-R. & McLysaght, A. Synteny-based analyses indicate that sequence divergence is not the main source of orphan genes. Cold Spring Harbor Laboratory 735175 (2019) doi:10.1101/735175.
Altenhoff, A. M. et al. OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Res. doi:10.1093/nar/gkaa1007.
Nevers, Y. et al. OrthoInspector 3.0: open portal for comparative genomics. Nucleic Acids Res. 47, D411–D418 (2019).
Moreno-Hagelsieb, G. & Latimer, K. Choosing BLAST options for better detection of orthologs as reciprocal best hits. Bioinformatics 24, 319–324 (2008).
Reijnders, M. J. & Waterhouse, R. M. Summary Visualisations of Gene Ontology Terms with GO-Figure! Cold Spring Harbor Laboratory 2020.12.02.408534 (2020) doi:10.1101/2020.12.02.408534.

Progress in genomic checkers

Nastassia Gobet — Wed, 23 Sep 2020 16:53:19 +0100

When I started using word processors, the spell checker was only looking at small and common typing errors and was often trying to correct acceptable words due to lack of vocabulary. A few years later, they not only are better at it and use more developed dictionaries, but they can also capture grammar mistakes and redundant phrases. A similar story is happening with the detection of genomic variants.

The genome as a big text

The genome can be considered as a big text, written in a 4-letter alphabet (A, C, G, T). When comparing the genomic words from two individuals, we can look at single or few letter(s) differences (single nucleotide variants, SNVs) and longer patterns (structural variants, SVs) such as words, sentences, and paragraphs that are added (insertions) or missing (deletions), exchanged (translocations), repeated (duplications and copy number variations, CNVs), inverted (inversions) or combinations of these (complex SVs).

Discovering the importance of SVs

About ten years ago, the focus was mainly on SNVs as these are numerous and many methods to detect them were developed. They were studied in deep and indexed in dictionaries (databases) that also document their frequencies. However, one letter differences do not necessarily have a significant effect on the meaning of the text (the phenotypes). On the other hand, although SVs were underestimated and consequently understudied, they were discovered to have a profound phenotypic impact on gene regulation, dosage, and function. Therefore, they are important in a wide variety of medical conditions: cancers, neurological diseases (Parkinson, Huntington), and mental disorders (autism, schizophrenia).

Challenges in SV identification

Methods were recently developed and are currently being developed to detect SVs. A number of challenges need to be dealt with. First, short read sequencing greatly limits the detection of large events exceeding read length. Consequently, using longer read technologies (PacBio and ONT) is improving the range of detectable SVs, but this comes at the cost of decreased sequencing accuracy and higher price. Hybrid strategies combining short and long reads are therefore promising. Second, SVs are hard to classify as the variant type depends on variant sequence context: a sequence can be considered an insertion, duplication, or translocation depending on the source (Figure 1). In addition, the number of possible SVs is infinite, whereas for SNVs there are 3 variants per position in the worst case. SVs are thus hard to compare: which criteria should we use to determine if two slightly different calls correspond to the same event or not? This affects SV reporting and frequencies. Due to the relative youth of the field, standards and best practices have yet to be established. Different initiatives (eg. Genome in a Bottle and SEQC2) aim at better characterizing false positives and false negatives in SV calling. This should help implement more objective benchmarking and comparison between the various detection methods.

Figure 1: An SV was called for a sequence from a sample differing from the reference sequence. Three possible scenarios of formation could explain the SV observed: an insertion, a duplication or a translocation.

Future of genomic spelling and grammar checkers

Standards and objective benchmarking for SV detection are still missing, so one must be careful with results obtained from current methods. However, SVs are increasingly recognized as being important and technologies to detect them are evolving rapidly. I think their use will become a more common practice in genomic variation studies in a few years, similar to spelling and grammar checkers in text processors. And you, which genome checker will you use?

Reference

Mahmoud M, Gobet N, Cruz-Dávalos DI, Mounier N, Dessimoz C, Sedlazeck FJ. 2019. Structural variant calling: the long and the short of it. Genome Biol 20:246. doi:10.1186/s13059-019-1828-7.

If you want to get involved in improving SV variant detection, consider joining this Hackathon, to be hold remotely Oct. 11-14, 2020.

OMA Standalone made easy: a step-by-step guide

Natasha Glover — Mon, 06 Apr 2020 10:34:17 +0100

Got newly sequenced genomes with protein annotations? Need to quickly and easily define the homologous relationships between the genes?

OMA Standalone is a software developed by our lab which can be used to infer homologs from whole genomes, including orthologs, paralogs, and Hierarchical Orthologous Groups (Altenhoff et al 2019).

The OMA Standalone algorithm works like this:

In short, it takes as input user-contributed custom genomes (with the option of combining them with reference genomes already in the OMA database), and proceeds through three main parts:

Quality and consistency checks of the genomes that will be used to run OMA Standalone;
All-against-all alignments of every protein sequence to all other protein sequences;
Orthology inference, in the form of: pairwise orthologs, OMA Groups, and Hierarchical Orthologous Groups (HOGs). For more information on these types of orthologs output by OMA, see OMA: A Primer (Zahn-Zabal et al. 2020).

Although the OMA Standalone is well-documented and straightforward, one of the challenges can be running it on an High Performance Cluster (HPC).

In order to understand the bare necessities needed to run OMA Standalone, we wrote an OMA Standalone Cheat Sheet, which you can download and follow the step-by-step instructions on running the software on an HPC. We use the cluster Wally as an example, as that is one of the HPCs here at the University of Lausanne. Wally uses SLURM as the scheduler for submitting jobs, so all the examples will be shown with that. We plan in the future to provide additional information on running with other schedulers, such as LSF or SGE. In the Cheat Sheet, you will find tips, hints, commands, and example scripts to run OMA Standalone on Wally.

Additionally, we prepared a video which walks the user through the process of running OMA Standalone from start to finish, including:

Downloading the software
Preparing your genomes for running
Editing the necessary parameters file
Creating the job scripts and
Submitting your jobs

The video can be found on our lab’s YouTube channel, at OMA standalone: how to efficiently identify orthologs using a cluster, and is also embedded here for your convenience:

We hope these resources can be helpful if you need help getting started running OMA Standalone. But don’t forget, there is also plenty of information that can be found on the OMA Standalone webpage or in the OMA Standalone paper. If all else fails, don’t hesitate to contact us on Biostars.

References

Altenhoff, A. M. et al. OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Res. 29, 1152–1163 (2019).
Zahn-Zabal, M., Dessimoz, C. & Glover, N. M. Identifying orthologs with OMA: A primer. F1000Res. 9, 27 (2020).

Creating a bibliography with links to PubMed and PubMedCentral

Christophe Dessimoz — Mon, 24 Feb 2020 12:57:07 +0000

We just submitted a paper to Nucleic Acids Research Web Server issue. As it turns out, the editor requires a bibliography with links to DOI, PubMed, and PubMedCentral entries.

This is a brief tutorial to generate such a bibliography from a bibtex file which contains the relevant entries, loosely based on the explanations provided in this TeX StackExchange entry.

Generating the bibtex file

In the lab, we mainly use Paperpile as bibliography management system, but most system allow to export records in bibtex format. If available, Paperpile includes DOI, PubMed IDs, and PubMedCentral IDs as follows:

@ARTICLE{Glover2019-xs, 
    title    = "{Assigning confidence scores to homoeologs using fuzzy logic}",  
    author   = "Glover, Natasha M and Altenhoff, Adrian and Dessimoz, Christophe",
    journal  = "PeerJ",
    volume   =  6,
    pages    = "e6231",
    year     =  2019,
    doi      = "10.7717/peerj.6231",
    pmid     = "30648004",
    pmc      = "PMC6330999"
}

In this example, we store the bibliography in a file named ref.bib.

Extending biblatex to include PMID and PMC links in the bibliography

DOI are already supported by most bibliography systems. To also include PMID and PMCIDs, the trick is to use the flexible BibLatex package.

In a separate definition file, which we named adn.dbx, add the additional definitions for PMID and PMCIDs.

\DeclareDatamodelFields[type=field,datatype=literal]{
    pmid,
    pmc,
}

We can now include this file in the library definition in the main LaTeX file (in the preamble, i.e. before \begin{document}, and define the links to PubMed and PubMedCentral entries:

\usepackage[backend=biber,style=authoryear,datamodel=adn]{biblatex}
\DeclareFieldFormat{pmc}{%
  PubMedCentral\addcolon\space
  \ifhyperref
    {\href{http://www.ncbi.nlm.nih.gov/pmc/articles/#1}{\nolinkurl{#1}}}
    {\nolinkurl{#1}}}
\DeclareFieldFormat{pmid}{%
  PubMed ID:\addcolon\space
  \ifhyperref
    {\href{https://www.ncbi.nlm.nih.gov/pubmed/#1}{\nolinkurl{#1}}}
    {\nolinkurl{#1}}}
\renewbibmacro*{finentry}{%
  \printfield{pmid}%
  \newunit
  \printfield{pmc}%
  \newunit
  \finentry}
\addbibresource{ref.bib}

We can generate the bibliography by citing every paper using the \cite{} command, and printing the bibliography.

\cite{Glover2019-xs}
\newpage
\printbibliography

Polishing: highlighting the links with colour

To make the links more visible, define the hyperref package accordingly:

\usepackage[colorlinks]{hyperref}

OK, thanks but could I just have the files please?

Of course! Here they are.

How to get published: interview series

Natasha Glover — Wed, 20 Feb 2019 08:34:59 +0000

You’ve wrapped up the research on project. You’ve gotten good results. It’s time to publish them—your PhD/postdoc/career depends on it.

But you’re drawing a blank. Staring at the screen for hours, not knowing how to get started, and the only thing you can produce is an empty document. One of the most daunting tasks for young scientists is writing and publishing a paper, especially the first one.

In the context of a tutorial of the UNIL Quantitative Biology PhD Program, I prepared a series of three short videos featuring interviews with professors on tips to successfully publish a scientific paper. These videos were aimed towards PhD students, but contain useful advice for anyone, at any stage of their career. Here’s what they had to say:

Part 1: the writing process

Part 2: the journal selection

Part 3: responding to reviewers

Exclusive: European Tour of Antonis Rokas

Christophe Dessimoz — Tue, 23 Oct 2018 08:54:22 +0100

We are delighted to host Prof. Antonis Rokas, Vanderbilt, for two special seminars at University College London and at the University of Lausanne!

Genomics and the making of biodiversity across the budding yeast subphylum

Prof. Antonis Rokas, Vanderbilt University

London: Tue 13 Nov 2018, 11am, UCL, Roberts Building 309
Lausanne: Wed 14 Nov 2018, 11am, UNIL, Genopode auditorium A

Abstract

Yeasts are unicellular fungi that do not form fruiting bodies. Although the yeast lifestyle has evolved multiple times, most known species belong to the subphylum Saccharomycotina (hereafter yeasts). This diverse group includes the premier eukaryotic model system, Saccharomyces cerevisiae; the common human commensal and opportunistic pathogen, Candida albicans; and over 1,000 other known species (with more continuing to be discovered). Yeasts are found in every biome and continent and are more genetically diverse than either plants or bilaterian animals. Ease of culture, simple life cycles, and small genomes (10– 20 Mbp) have made yeasts exceptional models for molecular genetics, biotechnology, and evolutionary genomics. Since only a tiny fraction of yeast biodiversity and metabolic capabilities has been tapped by industry and science, expanding the taxonomic breadth of deep genomic investigations will further illuminate how genome function evolves to encode their diverse metabolisms and ecologies. As part of National Science Foundation’s Dimensions of Biodiversity program, we have undertaken a large-scale comparative genomic study to uncover the genetic basis of metabolic diversity in the entire Saccharomycotina subphylum. In my talk, I will discuss the team’s evolutionary analyses of 332 genomes spanning the diversity of the subphylum. These include establishing a robust genus-level phylogeny and timetree for the subphylum, quantification of the extent of horizontal gene transfer for the subphylum, and characterization of the evolution of approximately 50 metabolic traits (and, in some cases, their underlying genes and pathways). These analyses allow us, for the first time, to infer the key metabolic characteristics of the Last Yeast Common Ancestor (LYCA) and characterize the tempo and mode of genome evolution across an entire subphylum.

All welcome!

Predicting QTL genes by integrating functional data across species

Christophe Dessimoz — Fri, 05 Oct 2018 07:51:14 +0100

The problem in a nutshell

Quantitative Trait Loci (QTL) are regions of a genome for which genetic variants correlate with particular traits. To take a simple example in plants, one might observe that the average seed size (trait) is significantly larger when considering the subset of a population which has a C at a particular position in the genome than a subpopulation with a T.

The reason QTL identifies genomic regions and not precise positions is that neighbouring variants tend to be inherited together. These regions typically contain hundreds of genes, making it difficult to say which one(s) are causal to the trait variation—if any at all (the causal genetic variation(s) can be in non-coding regions too).

Thus, to prioritise candidate causal genes within a QTL region, researchers typically consider previous knowledge on these genes, to see whether a particular gene “makes sense”. In the case of seed size, it might be a gene previously implicated in growth or regulation, or a gene known to influence seed size in a different species. This process is however requires substantial manual interpretation, and is thus labour-intensive and haphazard.

Enter QTLsearch

We realised that our framework of hierarchical orthologous groups, which relates genes across many species, could be extended to integrate QTL results with previous gene function annotations.

Conceptual overview of QTLsearch

If we go back to the seed size example, it might be that among the genes in the window, one has an ortholog in a different species previously annotated with the GO term “reproductive system development”. This could be a good candidate causal gene.

One risk however in integrating lots of previous knowledge across many species is that we might also find some spurious patterns. We therefore had to devise a way of controlling for random associations between QTL regions and evolutionarily propagated knowledge. Such “null distribution” depends on the specificity or the terms in question, the amount of annotations, the size of the QTL regions, and the species sampling. To cope with this complexity, we chose to implement a non-parametric permutation test.

We implemented the tool as an open source package called QTLsearch, available here.

QTLsearch infers more candidate causal genes than manual analyses

We used QTLsearch to reanalyse two previous studies. In both cases, we could call more candidate genes than the original studies. But more importantly, the evidence behind our calls is fully traceable and statistically supported.

QTLsearch could identify more candidate genes than the original study, but in an automated, reproducible, and statistically meaningful way.

Thus we think this will greatly facilitate future QTL analyses, particularly those that are done in non-model species for which the previous experimental knowledge is very limited.

Behind the paper

This is the third paper that resulted from our collaboration with Bayer CropScience (now BASF CropScience), after our work on homoeologs and on detecting split genes.

The project was conceived by Henning Redestig, collaborator at Bayer at the start of the project (now at DuPont). Henning had contributed to a QTL study and knew how labor intensive the search for putative causal genes is. He realised that HOGs could provide a natural way of integrating functional knowledge across multiple species, to combine the QTL information with previous functional data.

Alex Warwick Vesztrocy, PhD student on the project and first author, ran with the idea—promptly implementing and testing it. Early results looked promising, but Alex soon realised that the mapping between metabolites and GO terms could be improved. He also realised that some terms were quite common, so he devised the approach to compute the significance scores.

Our manuscript was accepted as proceedings paper at the European Conference on Computational Biology (ECCB). In our lab, we like proceedings paper. It’s nice to be able to present the work and publish the paper, particularly since the ECCB proceedings appear in a good journal. More importantly, conferences impose hard deadlines. Deadlines for submission of course, but also for peer-reviewing and for deciding acceptance or not!

Reference

Alex Warwick Vesztrocy, Christophe Dessimoz*, Henning Redestig*, Prioritising Candidate Genes Causing QTL using Hierarchical Orthologous Groups, Bioinformatics, 2018, 34:17, pp. i612–i619 (ECCB 2018 proceedings) [Open Access Full Text]

Over 10 computational biology positions at PhD, postdoc and group leader level

Christophe Dessimoz — Wed, 03 Oct 2018 14:16:24 +0100

(Post updated on 4 Oct 2018 and on 13 Nov 2018)

Our lab has an open position, and so do collaborators and colleagues across Switzerland and Europe.

Please help us spread the word by forwarding this post. If you have computational biology jobs to announce, let me know and I will gladly add a link.

Postdoc position in our lab

Postdoc position in evolutionary bioinformatics (closing soon!)

PhD openings with colleagues

Postdoc openings with colleagues

Group leader positions

Group leader position at EMBL-EBI (closing soon!)
Head of Tree of Life initiative at Sanger

Bonus position (not computational but what the heck)

Postdoc position on comparative single cell sequencing with Max Telford at UCL

What genes do I have in common with a plant?

Natasha Glover — Mon, 01 Oct 2018 08:38:37 +0100

All living organisms descended from a common ancestor, and therefore all living organisms have some genes in common. Therefore, we can take any two species on the tree of life and find out what set of genes were present in the common ancestor of the two species by looking for the genes that they still share to this day. When two species are closely related, i.e. close together on the tree of life, we can observe that much of their genome is the same. But what do I mean by “same”, and what do I mean by “shared”?

Genomes are complex, with many layers of control (DNA-level, RNA-level, protein-level, epigenetic level, alternative splicing-level, protein-protein interaction level, transposable element-level, etc). Because of this complexity, there could be many ways to compare the genomes of two different species to determine which portion of the genome is shared.

In this blog post, when I refer to “shared” parts of the genome, I’m referring to orthologs, which are genes in different species that started diverging due to a speciation event. Orthologs can be inferred from complete genome sequences, and there are many different methods to do this. In particular, I will focus on the Dessimoz lab’s method and database, Orthologous MAtrix, or OMA. OMA uses protein sequences from whole genome annotations to compute orthologs between over 2000 species. The OMA algorithm is graph-based, compares all protein sequences to all others to find the closest between two genomes, allows for duplicates, and uses information from related genomes to improve the calling. OMA is available at https://omabrowser.org/.

It’s quite well-publicized that humans and chimps are extremely similar in terms of gene content, which is why they could be considered our evolutionary cousins¹. However, when two species are more distantly related, there is less of their genomes which are the same. But how distant can we go in the tree of life and still see the same genes between different species? For example, how many genes do I (a human) have in common with a plant? Can we even detect orthologs between two species so evolutionarily distant? And if so, what biological/functional role do these genes play?

In this blog post, I will show how to use OMA and some of its tools to find out how much of the human genome we share with plants. There are many different plant species with their genomes’ sequenced, so I will use the extensively-studied model species Arabidopsis thaliana as the representative plant to compare the human protein-coding genes to.

Setup

For the remainder of this blog post, I will show how to use the OMA database, accessed via the python REST API, to get ortholog pairs between two genomes.

In python, I use the requests library to send queries for information to the server housing the OMA database. Pandas is a well-known python library for working with dataframes, so I will use that for analysis. For more information, see:

Let’s start our python code by importing libraries:

import requests
import json
import pandas as pd

The OMA API is accessible from the following link, so we will save it for later use in our code:

api_url = "https://omabrowser.org/api"

Getting pairwise orthologs with the OMA API

Pairwise orthologs are computed between all pairs of species in OMA as part of the normal OMA pipeline. Accessing the data will result in a list of pairs of genes, one from one species and the other from the other species. Any one gene may be involved in several different pairs, depending of if its ortholog has undergone duplication or not.

It is important to note that here we are only using pairs of orthologs. In OMA, we report several types of orthologs, namely pairs of orthologs or groups of orthologs (Hierarchical Orthologous Groups (HOGs). We use the pairs as the basis for building HOGs, which aggregates orthologous pairs together into clusters at different taxonomic levels. Some ortholog pairs are removed, and some new pairs are inferred when creating HOGs. Therefore, the set of orthologs deduced from HOGs will be slightly different than the set of orthologs purely from the pairs.

Here we will use the API to request all the pairwise orthologs between Homo sapiens and Arabidopsis thaliana. OMA uses either the NCBI taxon identifier or the 5-letter UniProt species code to identify species. I will use the UniProt codes, which are “HUMAN” and “ARATH.” To request the ortholog pairs between HUMAN and ARATH, the request would simply look like:

requests.get(api_url + '/pairs/HUMAN/ARATH/')

However, when sending a request that returns long lists, i.e. lists of thousands of genes, the OMA API uses pagination. This means all the results won’t be returned at once, but sequentially page by page, with only 100 results per page. This is a safeguard to make sure the OMA servers don’t get overloaded with requests. Here I show a simple workaround to get all of the pairs.

Get all pairs

genome1 = "HUMAN"
genome2 = "ARATH"

#get first page
response = requests.get(api_url + '/pairs/{}/{}/'.format(genome1, genome2))

#use the header to calculate total number of pages
total_nb_pages = round(int(response.headers['X-Total-Count'])/100)
print("There are {} pages that need to be requested.".format(total_nb_pages))

> There are 128 pages that need to be requested.

Since we want all the pages, not just the first one, we have to use a loop to go through and request all the pages.

#get responses for all pages
responses = []
for page in range(1, total_nb_pages + 1):
    tmp_response = requests.get(api_url + '/pairs/{}/{}/'.format(genome1, genome2)+"?page="+str(page))
    responses.append(tmp_response.json())

#some basic tidying up so we only have a list of pairs at the end
pairs = []
for response in responses:
    for pair in response:
        pairs.append(pair)

#Example of a pair
pairs[0]

> {'entry_1': {'entry_nr': 8066469,
>   'entry_url': 'https://omabrowser.org/api/protein/8066469/',
>   'omaid': 'HUMAN00009',
>   'canonicalid': 'NOC2L_HUMAN',
>   'sequence_md5': 'dc91b2521daf594037a6c318f3b04d5a',
>   'oma_group': 840151,
>   'oma_hog_id': 'HOG:0420566.4b.15b.5a.3b',
>   'chromosome': '1',
>   'locus': {'start': 944694, 'end': 959240, 'strand': -1},
>   'is_main_isoform': True},
>  'entry_2': {'entry_nr': 12384097,
>   'entry_url': 'https://omabrowser.org/api/protein/12384097/',
>   'omaid': 'ARATH12334',
>   'canonicalid': 'NOC2L_ARATH',
>   'sequence_md5': 'f19ac310a5e56443f2ce0e4e832addb3',
>   'oma_group': 840151,
>   'oma_hog_id': 'HOG:0420566.1c',
>   'chromosome': '2',
>   'locus': {'start': 7928254, 'end': 7931851, 'strand': 1},
>   'is_main_isoform': True},
>  'rel_type': '1:1',
>  'distance': 138.0,
>  'score': 1094.969970703125}

We can see that the data is in the form of a big list of pairs, with each pair being a dictionary. In this dictionary, one key is entry_1, which is the human gene, and the second key is entry_2, which is the arabidopsis gene. The values for each of these keys are dictionaries with information about the gene. Additionally, there are other key value pairs in the pair dictionary which gives information computed about the pair, such as rel_type, distance, and score (will be explained later).

Make dataframe with all pairs

I personally like to work with pandas because I find it easy to use, so I will import the information about the pairs into a dataframe.

#use pandas to make dataframe
df = pd.DataFrame.from_dict(pairs)

def make_columns_from_entry_dict(df, columns_to_make):
    '''Parses out the keys from the entry dictionary and adds them to the dataframe 
    with an appropriate column header.'''

    genome1 = df['entry_1'][0]['omaid'][:5]
    genome2 = df['entry_2'][0]['omaid'][:5]

    df[genome1+"_"+columns_to_make] = df.apply(lambda x: x['entry_1'][columns_to_make], axis=1)
    df[genome2+"_"+columns_to_make] = df.apply(lambda x: x['entry_2'][columns_to_make], axis=1)

    return df

#clean up the columns of the entry_1 and entry_2 dictionaries
df = make_columns_from_entry_dict(df, "omaid")    
df = df[['HUMAN_omaid','ARATH_omaid','rel_type','distance','score']]

#Here's a snippet of the dataframe
df[:10]

	HUMAN_omaid	ARATH_omaid	rel_type	distance	score
0	HUMAN00009	ARATH12334	1:1	138.0	1094.969971
1	HUMAN00029	ARATH38493	1:1	168.0	246.330002
2	HUMAN00032	ARATH38034	1:1	69.0	773.869995
3	HUMAN00034	ARATH39850	m:n	140.0	635.900024
4	HUMAN00034	ARATH33319	m:n	144.0	655.409973
5	HUMAN00034	ARATH01678	m:n	137.0	723.580017
6	HUMAN00034	ARATH07547	m:n	134.0	761.450012
7	HUMAN00036	ARATH01479	1:1	103.0	554.590027
8	HUMAN00037	ARATH10852	1:1	60.0	2325.889893
9	HUMAN00052	ARATH02626	1:1	95.0	328.959991

Here you can see that each row is a pair of orthologs, with some information about each. The rel_type is the relationship cardinality, between the orthologs. 1:1 means only one ortholog in human found in arabidopsis, whereas 1:m would mean it duplicated in ARATH. m:n means there were lineage-specific duplications on both sides.

How many pairs between HUMAN and ARATH?

print("There are {} pairs of orthologs between {} and {}.".format(len(df), genome1, genome2))

> There are 12792 pairs of orthologs between HUMAN and ARATH.

What percentage is this of the human genome? What percentage of the Arabidopsis genome? To answer this we need to 1) Get the number of genes in each genome that this represents, because there can be more than 1 pair of orthologs per gene. 2) We also need to get the total number of genes per genome.

First we get the number of genes for each species that has at least one ortholog, using the dataframe from above. We don’t have to worry about alternative splic variants (ASVs) because OMA chooses a representative isoform when computing orthologs.

human_proteins_with_ortholog = set(df['HUMAN_omaid'].tolist())
print("There are {} genes in {} with at least 1 ortholog in {}.".\
      format(len(human_proteins_with_ortholog), "HUMAN","ARATH"))

arath_proteins_with_ortholog = set(df['ARATH_omaid'].tolist())
print("There are {} genes in {} with at least 1 ortholog in {}.".\
      format(len(arath_proteins_with_ortholog), "ARATH","HUMAN"))

> There are 3769 genes in HUMAN with at least 1 ortholog in ARATH.
> There are 5177 genes in ARATH with at least 1 ortholog in HUMAN.

Get the genomes of HUMAN and ARATH

Now we will again use the API to get all the proteins in the genome. (See https://omabrowser.org/api/docs#genome-proteins). Here I will write a function using the same procedure as above to deal with pagination.

#Use the API to get all the human genes
def get_all_proteins(genome):
    '''Gets all proteins using OMA REST API'''
    responses = []
    response = requests.get(api_url + '/genome/{}/proteins/'.format(genome))
    total_nb_pages = round(int(response.headers['X-Total-Count'])/100)
    for page in range(1, total_nb_pages + 1):
        tmp_response = requests.get(api_url + '/genome/{}/proteins/'.format(genome)+"?page="+str(page))
        responses.append(tmp_response.json())

    proteins = []
    for response in responses:
        for entry in response:
            proteins.append(entry)
    return proteins


human_proteins = get_all_proteins("HUMAN")

#Here is an example entry
human_proteins[0]

> {'entry_nr': 8066461,
>  'entry_url': 'https://omabrowser.org/api/protein/8066461/',
>  'omaid': 'HUMAN00001',
>  'canonicalid': 'OR4F5_HUMAN',
>  'sequence_md5': 'df953df7a11ee7be5484e511551ce8a4',
>  'oma_group': 650473,
>  'oma_hog_id': 'HOG:0361626.2a.7a',
>  'chromosome': '1',
>  'locus': {'start': 69091, 'end': 70008, 'strand': 1},
>  'is_main_isoform': True}

It is important to note that this list of genes/proteins includes ASVs for many species. Therefore we need to take care to remove them to have one represenative isoform per gene. We can do this by filtering the list of entries based on the ‘is_main_isoform’ key. Here is a small function to do that.

#get canonical isoform
def get_canonical_proteins(list_of_proteins):
    proteins_no_ASVs = []
    for protein in list_of_proteins:
        if protein['is_main_isoform'] == True:
            proteins_no_ASVs.append(protein)
    return proteins_no_ASVs

print("HUMAN\n-----")
print("The number of genes before removing alternative splice variants: {}".format(len(human_proteins)))

human_proteins_no_ASVs = get_canonical_proteins(human_proteins)

print("The number of genes after removing alternative splice variants: {}".format(len(human_proteins_no_ASVs)))


> HUMAN
> -----
> The number of genes before removing alternative splice variants: 30700
> The number of genes after removing alternative splice variants: 20152



#Do the same for ARATH
print("ARATH\n-----")
arath_proteins = get_all_proteins("ARATH")

print("The number of genes before removing alternative splice variants: {}".format(len(arath_proteins)))
arath_proteins_no_ASVs = get_canonical_proteins(arath_proteins)
print("The number of genes after removing alternative splice variants: {}".format(len(arath_proteins_no_ASVs)))

> ARATH
> -----
> The number of genes before removing alternative splice variants: 40999
> The number of genes after removing alternative splice variants: 27627

Proportion of the genomes which have orthologs in the other

Now that we have all the necessary information we can see what percentage of the human and arabidopsis genomes are shared, in terms of proportion of genes with at least 1 ortholog in the other species.

print("The percentage of genes in {} with at least 1 ortholog in {}: {}%"\
      .format("HUMAN", "ARATH", \
              round((len(human_proteins_with_ortholog)/len(human_proteins_no_ASVs)*100), 2)))

print("The percentage of genes in {} with at least 1 ortholog in {}: {}%"\
      .format("ARATH", "HUMAN", \
              round((len(arath_proteins_with_ortholog)/len(arath_proteins_no_ASVs)*100), 2)))

> The percentage of genes in HUMAN with at least 1 ortholog in ARATH: 18.7%
> The percentage of genes in ARATH with at least 1 ortholog in HUMAN: 18.74%

So the answer to the original questions is that BOTH humans and arabidopsis have 18.7% of their genome shared with each other. A weird coincidence that it turns out to be the same proportion of both genomes!

Conclusion

Using the OMA database for orthology inference, we found that:

There are 12792 pairs of orthologs between HUMAN and ARATH.
There are 3769 genes in human that have at least 1 ortholog in arabidopsis.
There are 5177 genes in arabidopsis that have at least 1 ortholog in human.
These numbers represent about 19% of both genomes.

Therefore, we can conclude that over the course of evolution since the animal and plant lineages split (~1.5 billion years ago²), about 19% of the protein-coding genes remain and are able to be detected by OMA.

Stay tuned for the next blog post, where I will show what are the functions of these shared genes!

References

Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69–87 (2005). DOI: https://doi.org/10.1038/nature04072
Wang, D. Y., Kumar, S. & Hedges, S. B. Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi. Proc. Biol. Sci. 266, 163–171 (1999). DOI: 10.1098/rspb.1999.0617

pyHam: a python package to visualize and process hierarchical orthologous groups (HOGs)

Clement Train — Thu, 29 Jun 2017 17:13:50 +0100

(This entry was updated 19 Sep 2018 to reflect recent feature updates)

pyHam (‘python HOG analysis method’) makes it possible to extract useful information from HOGs encoded in standard OrthoXML format. It is available both as a python library and as a set of command-line scripts. Input HOGs in OrthoXML format are available from multiple bioinformatics resources, including OMA, Ensembl and HieranoidDB.

This post is a brief primer to pyham, with an emphasis on what it can do for you.

How to get pyHam?

pyHam is available as python package on the pypi server and is compatible python 2 and python 3. You can easily install via pip using the following bash command:

pip install pyham

You can check the official pyham website for further information about how to use pyham, documentation and the source code.

What are Hierarchical Orthologous Groups (HOGs)?

You don’t know what HOGs are and you are eager to change this, we have an explanatory video about them just for you:

You can learn more about this in our previous blog post.

Where to find HOGs?

HOGs inferred on public genomes can be downloaded from the OMA orthology database. Other databases, such as Eggnog, OrthoDb or HieranoiDB also infer HOGs, but not all of these databases offer them in OrthoXML format. You can check which database serves hogs as orthoxml here. If you want to use your custom genomes to infer HOGs you can use the OMA standalone software.

In order to facilitate the use of pyHam on single gene family, we provide the option to let pyham fetch required data directly from a comptabilble databases (for now only OMA is available for this feature). The user simply have to give the id of a gene inside the gene family (HOGs) of insterest along with the name of the compatibible database where to get the data and pyHam will do the rest.

For example, if you are interest by the P_53 gene in rat (P53 rat gene page in OMA) you simply have to run the following python code to set-up your pyHam session:

my_gene_query = 'P53_RAT'
database_to_query = 'oma'
pyham_analysis = pyham.Ham(query_database=my_gene_query, use_data_from=database_to_query)

How does pyham help you investigate on HOGs?

The main features of pyHam are: (i) given a clade of interest, extract all the relevant HOGs, each of which ideally corresponds to a distinct ancestral gene in the last common ancestor of the clade; (ii) given a branch on the species tree, report the HOGs that duplicated on the branch, got lost on the branch, first appeared on that branch, or were simply retained; (iii) repeat the previous point along the entire species tree, and plot an overview of the gene evolution dynamics along the tree; and (iv) given a set of nested HOGs for a specific gene family of interest, generate a local iHam web page to visualize its evolutionary history.

What is the number of genes in a particular ancestral genome? (i)

In pyHam, ancestral genomes are attached to one specific internal node in the inputted species tree and denoted by the name of this taxon. Ancestral genes are then infered by fetching all the HOGs at the same level.

# Get the ancestral genome by name
rodents_genome = ham_analysis.get_ancestral_genome_by_name("Rodents")

# Get the related ancestral genes (HOGs)
rodents_ancestral_genes = rodents_genome.genes

# Get the number of ancestral genes at level of Rodents
print(len(rodents_ancestral_genes)

How can I figure out the evolutionary history of genes in a given genome? (ii)

pyHam provides a feature to trace for HOGs/genes along a branch that span across one or multiple taxonomic ranges and report the HOGs that duplicated on this branch, got lost on this branch, first appeared on that branch, or were simply retained. The ‘vertical map’ (see further information on map here) allows for retrieval of all genes and their evolutionary history between the two taxonomic levels (i.e. which genes have been duplicated, which genes have been lost, etc).

# Get the genome of interest
human = ham_analysis.get_extant_genome_by_name("HUMAN")
vertebrates = ham_analysis.get_ancestral_genome_by_name("Vertebrata")

# Instanciate the gene mapping !
vertical_human_vertebrates = ham_analysis.compare_genomes_vertically(human, vertebrates) # The order doesn't matter!

# The identical genes (that stay single copies) 
# one HOG at vertebrates -> one descendant gene in human
vertical_human_vertebrates.get_retained())

# The duplicated genes (that have duplicated) 
# one HOG at vertebrates -> list of its descendants gene in human
vertical_human_vertebrates.get_duplicated())

# The gained genes (that emerged in between)
# list of gene that appeared after vertebrates taxon
vertical_human_vertebrates.get_gained()

# The lost genes (that been lost in between) 
HOG at vertebrates that have been lost before human taxon
vertical_human_vertebrates.get_lost()

How can I get an overview of the gene evolution dynamics along the tree that occured in my genomic setup? (iii)

pyHam includes treeProfile (extension of the Phylo.io tool), a tool to visualise an annotated species tree with evolutionary events (genes duplications, losses, gains) mapped to their related taxonomic range. The aim is to provide a minimalist and intuitive way to visualise the number of evolutionary events that occurred on each branch or the numbers of ancestral genes along the species tree.

# create a local treeprofile web page
treeprofile = ham_analysis.create_tree_profile(outfile="treeprofile_example.html")

As you can see in the figure above, the treeprofile is composed of the reference species used to perform the pyham analysis. Each internal node is displayed with its related histogram of phylogenetic events (number of genes duplicated, lost, gained, or retained) that occurred on each branch. The tree profile either display the number of genes resulting from phylogenetics events or the number of phylogenetic events on themself; the switch can be made by opening the settings panel (histogram icon on top right) and selecting between ‘genes’ or ‘events’.

How can I visualise the evolutionary history of a gene family (HOG)? (iv)

pyHam embeds iHam, an interactive tool to visualise gene family evolutionary history. It provides a way to trace the evolution of genes in terms of duplications and losses, from ancient ancestors to modern day species.

# Select an HOG
hog_of_interest = pyham_analysis.get_hog_by_id(2)

# create and export the hog vis as .html
output_filename = "hogvis_example.html"
pyham_analysis.create_hog_visualisation(hog=hog_of_interest,outfile=output_filename)

Then, you simply have to double click on the .html file to open it in your default internet browser. We provide you an example below of what you should see. A brief video tutorial on iHam is available at this URL.

iHam is composed of two panels: a species tree that allows you to select the taonomic range of interest, a genes panel where each grey square represents an extant gene and each row a species.

We can see for example that at the level of mammals (click on the related node and select ‘Freeze at this node’) all genes of this gene family are descendant from a single comon ancestral gene.

Now, if we look at the level of Euarchontoglires (redo the same procedure as for mammals to freeze the vis at this level) we observe that the genes are now split by a vertical line. This vertical line separates 2 group of genes that are each descendants from a same single ancestral gene. This is the result of a duplication in between Mammals and Euarchontoglires.

This small example demonstrate the simplicity of iHam usefulness to identify evolutionary events that occured in gene families (e.g. when a duplication occured, which species have lost genes or how big genes families evolved).

Sex, alcohol, and structural variants in fission yeast

Fritz Sedlazeck, Dan Jeffares & Christophe Dessimoz — Wed, 08 Feb 2017 10:36:55 +0000

Our latest study just came out (Jeffares et al., Nature Comm 2017). In it, we carefully catalogued high-confidence structural variants among all known strains of the fission yeast population, and assessed their impact on spore viability, winemaking and other traits. This post gives a summary and the story behind the paper.

Structural variants (SVs) measure genetic variation beyond single nucleotide changes …

Next generation sequencing is enabling the study of genomic diversity on unprecedented levels. While most of this research has focused on single base pair differences (single nucleotide polymorphisms, SNPs), larger genomic differences (called structural variations, SVs) can also have an impact on the evolution of an organism, on traits and on diseases. SVs are usually loosely defined as events that are at least 50 base pair long. They are often classified in five subtypes: deletions, duplications, new sequence insertions, inversions and translocations.

Over the recent years the impact of SVs has been characterized in many organisms. For example, SVs play a role in cancer, when duplications often lead to multiple copies of important oncogenes. Furthermore, SVs are known to play a role in other human disorders such as autism, obesity, etc.

… but calling structural variants remains challenging

In principle, identifying SVs seems trivial: just map paired-end reads to a reference genome, look for any abnormally spaced pairs or split reads (i.e. reads with parts mapping to different regions), and—boom—structural variants!

In practice, things are much harder. This is partly due to the frustrating tendency for SVs occur in or near repetitive regions where short read sequencing struggles to disambiguate the reads. Or in highly variable regions of genome such as the chromosome ends, which tend to be the tinkering workshop of the genome.

As a result, a large proportion of SVs—typically at least 30-40%—remain undetected. As for false discovery rates (proportion of wrongly inferred SVs), they are mostly not well known because validating SVs on real data is very laborious.

Fission yeast: a compelling model to study structural variants

Studying structural variants in Schizosaccharomyces pombe is especially suited because:

The genome is small, well-annotated and simple (few repeats, haploid).
We had 40x or more coverage over 161 genomes covering the worldwide known population of S. pombe.
We had more than 220 accurate trait measurements for these strains at hand. Since the traits are measured under strictly controlled conditions, they contain little (if any) environmental variance—in stark contrast to human traits.

SURVIVOR makes the most out of (imperfect) SV callers

To infer accurate SVs calls, we introduced SURVIVOR, a consensus method to reduce the false discovery rate, while maintaining high sensitivity. Using simulated data, we observed that consensus calls obtained from two to three different SV callers could recover most SV while keeping the false-discovery rate in check. For example, SURIVOR performed second best with a 70% sensitivity (best was Delly: 75%), while the false discovery rate was significantly reduced to 1% (Delly: 13%) (but remember these figures are based on simulation; performance on real data is likely worse.) Furthermore, we equipped SURVIVOR with different methods to simulate data sets and evaluate callers; merge data from different samples; compute bad map ability regions (BED file) over the different regions, etc. SURVIVOR is written in C++ so it’s fast enough to run on large genomes as well. Since then, we are running it on multiple human data sets, which takes only a few minutes on a laptop. SURVIVOR is available on GitHub.

SVs: now you see me, now you don’t

We applied SURVIVOR to our 161 genomic data sets, and then manually vetted all our calls to obtain a trustworthy set of SVs. We then discovered something suspicious. Some groups of strains that were very closely related (essentially clonal, differing by <150 SNPs) had different numbers of duplications, or different numbers of copies in duplications (1x, 2x, even 6x). This observation was also validated with lab experiments.

Interestingly we identified 15 duplications that were shared between the more diverse non-clonal strains (so these must have been shared during evolution) but could not be explained by the tree inferred from SNPs (Figure 1). To confirm this we compared the local phylogeny of SNPs in 20kb windows up and downstream of the duplications with the variance in copy numbers. Oddly the copy number variance was not highly correlated with the SNP tree. This lead to the conclusion that some SVs are transient and thus are gained or lost faster than SNPs.

Duplications happen within near-clonal populations Phylogenetic tree of the strains reconstructed from SNPs data, with eight pairs of very close strains that nonetheless show structural variation. Click to enlarge.

Though this transience came as a surprise, there is actually supporting evidence from laboratory experiments carried out by Tony Carr back in 1989 that duplications can occur frequently in laboratory-reared S. pombe, and can revert. (Carr et al. 1989). The high turnover raises the possibility that SVs could be an important source for environmental adaptation.

SVs affect spore viability and are associated with several traits

We then investigated the phenotypic impact of these SVs. We used the 220 trait measurements from previous publications. We observed an inverse correlation between rearrangement distance and spore viability, confirming reports in other species that SVs can contribute to reproductive isolation. We also found a link between copy number variation and two traits relevant to wine making (malic acid accumulation and glucose+fructose ultilisation) (Benito et al. PLOS ONE 2016).

Structural variants, reproductive isolation, and wine. A) Making crosses between fission yeast strains often results in low offspring survival. The theory is that rearrangements (inversions and translocations) cause errors during meiosis, so we might expect them to affect offspring viability. If we compare offspring viability from crosses with the number of rearrangements that the parents differ by, there is a correlation, and a ‘forbidden triangle’ in the top right of the plot (it seem impossible to produce high viability spores when parents have many unshared rearrangements). B) SVs also affect traits. For > 200 traits (vertical bars) we used [LDAK](http://dougspeed.com/ldak/) to estimate the proportion of the narrow sense heritability that was caused by copy number variants (red), rearrangements (black) and SNPs (grey). Some traits are very strongly affected by copy number variants, such as the wine-making traits (wine-colored bars along the x-axis). C) Fission yeast wine tasting at UCL—how much of the taste is due structural variants? (Jürg Bähler at right).

We used the estimation of narrow sense heritability from Doug Speed’s LDAK program. Narrow sense heritability estimates how much of a difference in a trait between individuals can be explained by adding up all the tiny effects of the genomic differences (in our case SNPs; deletions and duplications; inversions and translocations and all combined). Overall, we found the heritability was better explained when combining the SNP data as well as the SVs data. In 45 traits SVs explained 25% or more of the trait variability. Five traits that were explained by over 90% heritability using SNPs and SVs came from different growth conditions in liquid medium. This may highlight again the influence of environmental conditions on the genomic structure. For 74 traits (~30% of those we analyzed) SVs explain more of the trait than the SNPs. These high SV-affected traits include malic acid, acetic acid and glucose/fructose contents of wine, key components of taste.

A collaborative effort

On a personal note, the paper concludes a wonderful team effort over two and a half years.

The project started as a summer project for Clemency Jolly, who had then just completed her 3rd undergraduate year at UCL, in the Dessimoz and Bähler labs. Dan Jeffares and the rest of the Bähler lab had just published their 161 fission yeast genomes, with an in-depth analysis of the association between SNPs and quantitative traits (Jeffares et al., Nature Genetics 2015). Studying SVs was the logical next step, but given the challenging nature of reliable SV calling, we also recruited to the team Fritz Sedlazeck, collaborator and expert in tool development for NGS data analysis then based in Mike Schatz’s lab at Cold Spring Harbor Laboratory.

At the end of the summer, it was clear that we were onto something, but there was still a lot be done. Clemency turned the work into her Master’s project, with Dan and Fritz redoubling their efforts until Clemency graduation in summer 2015. It took another year of intense work lead by Dan and Fritz to verify the calls, perform the GWAS and heritability analyses, and publish the work. Since then, Clemency has started her PhD at the Crick Institute, Fritz has moved to John Hopkins University, and Dan has started his own lab at the University of York.

References:

Jeffares, D., Jolly, C., Hoti, M., Speed, D., Shaw, L., Rallis, C., Balloux, F., Dessimoz, C., Bähler, J., & Sedlazeck, F. (2017). Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast Nature Communications, 8 DOI: 10.1038/ncomms14061

Carr AM, MacNeill SA, Hayles J, & Nurse P (1989). Molecular cloning and sequence analysis of mutant alleles of the fission yeast cdc2 protein kinase gene: implications for cdc2+ protein structure and function. Molecular & general genetics : MGG, 218 (1), 41-9 PMID: 2674650

Jeffares, D., Rallis, C., Rieux, A., Speed, D., Převorovský, M., Mourier, T., Marsellach, F., Iqbal, Z., Lau, W., Cheng, T., Pracana, R., Mülleder, M., Lawson, J., Chessel, A., Bala, S., Hellenthal, G., O’Fallon, B., Keane, T., Simpson, J., Bischof, L., Tomiczek, B., Bitton, D., Sideri, T., Codlin, S., Hellberg, J., van Trigt, L., Jeffery, L., Li, J., Atkinson, S., Thodberg, M., Febrer, M., McLay, K., Drou, N., Brown, W., Hayles, J., Salas, R., Ralser, M., Maniatis, N., Balding, D., Balloux, F., Durbin, R., & Bähler, J. (2015). The genomic and phenotypic diversity of Schizosaccharomyces pombe Nature Genetics, 47 (3), 235-241 DOI: 10.1038/ng.3215

Benito, A., Jeffares, D., Palomero, F., Calderón, F., Bai, F., Bähler, J., & Benito, S. (2016). Selected Schizosaccharomyces pombe Strains Have Characteristics That Are Beneficial for Winemaking PLOS ONE, 11 (3) DOI: 10.1371/journal.pone.0151102

More info

How to access scientific papers for free?

Christophe Dessimoz — Thu, 12 Jan 2017 14:55:39 +0000

You have a reference to a research paper of interest.

Perhaps this one:

Iantorno S et al, Who watches the watchmen? an appraisal of benchmarks for multiple sequence alignment, Multiple Sequence Alignment Methods (D Russell, Editor), Methods in Molecular Biology, 2014, Springer Humana, Vol. 1079. doi:10.1007/978-1-62703-646-7_4

How can you retrieve the full article?

Gold open access

If the paper is published as “gold” open access, you can download the PDF from the publisher’s website. In such cases, it’s easiest to paste the title and author names in Google and look for the first hit in the journal.

Alternatively, if you know the Digital Object Identifer (DOI) of the article, prepending http://doi.org should directly lead you to the article. In this case, the DOI is included at the end of the citation. (If you don’t know its DOI, you can find it on the publisher’s website or in a database such as PubMed). We get:

http://doi.org/10.1007/978-1-62703-646-7_4

Unfortunately, we see that this particular paper is behind a paywall at the publisher.

Green open access

There is, however, still a chance that it might be deposited in a preprint server or in an institutional repository. This is referred to as “green” open access. One way to find out is to look at the article record in Google Scholar and look for a link in the right margin:

In this case, the paper is thus available on arXiv.org.

If you know the DOI, an even quicker way of looking for a deposited version is by using the http://oadoi.org redirection tool, which works analogously to doi.org but redirect to a green open access version whenever possible:

http://oadoi.org/10.1007/978-1-62703-646-7_4

Here, a free version of the article deposited in the UCL institutional repository is found.

If you use Chrome or Firefox, you can also use the Unpaywall browser extension to automatically get a link to green open access alternatives as you land on paywalled articles.

On the author’s homepage

Sometimes, the paper is available on the homepage of one of the authors. In this case, a link to the preprint is provided on the homepage of the corresponding author (item #36).

ResearchGate

Instead of an institutional homepage, some authors self-archive their articles on ResearchGate. In the case of our paper, the full-text version is indeed directly available.

And otherwise, if one of the authors is active on ResearchGate, it’s also possible to send a full-text request at the click of a button.

Pirated version off Sci-Hub

Sci-Hub serves bootleg copies of pay-walled articles. This is illegal, so I only mention it for educational purposes. This works most reliably using, again, DOIs:

http://sci-hub.tw/10.1007/978-1-62703-646-7_4

If you, purely hypothetically of course, pasted that URL in your browser, you would or would not get a PDF of the entire book in which the referenced article appears.

#icanhazpdf

It’s also possible to request full-text articles via Twitter. As described in Wikipedia, this works by tweeting the article title, its DOI, an email address (to indicate to whom the article should be sent), and the hashtag #icanhazpdf. Someone with access to the article might send a copy via email. Once the article is received, the tweet is deleted. Again, I mention this for educational purpose only—don’t break the law.

Image credit: Field of Science

Altmetrics.com wrote an interesting post on #icanhazpdf a few years ago.

Email the corresponding author

Finally, you can always ask the corresponding author by email for a copy of their article. They will happily oblige.

[Update (19 Mar 2017): added mention of unpaywall to seemlessly retrieve green open access]

Life as an academic: my 2016 in numbers

Christophe Dessimoz — Thu, 29 Dec 2016 22:25:52 +0000

Life as an academic is varied and busy. Students sometimes believe that all we do is teach. In fact, we do quite a few other things. Here’s my 2016 in numbers.

number of papers published: 10
number of paper rejections: 7
number of books edited: 1
number of grant proposals submitted: 8
number of research contracts negotiated with the industry: 2
number of blog posts: 5
number of tweets: 474 (66% were retweets)
number of YouTube videos: 1
number of papers reviewed: 24
number of papers edited: 3
number of grants reviewed: 3
number of PhD theses examined: 2
number of emails received (excluding spam and mailing-lists): 12,695
number of emails written: 4,377 (!)
number of minutes videoconferencing on GoToMeeting: 13,236 (!!)
number of Geneva-London-Geneva roundtrips: 12
number of meetings with >50 attendees co-organised: 6
number of seminars hosted: 4
number of conferences attended: 3
number of talks given: 11
number of semester-long courses organised: 2
number of hours lectured: 32
number of 2000-word student papers marked: 47
number of summer students supervised: 4
number of overnight retreats attended: 4
number of work Christmas dinners attended: 3
number of annual reports written: 3 (this does not count)
number of Tête de Moine eaten at lab celebrations: 4
number of times moved home: 0 (noteworthy since we moved 5 times in the preceding 5 years…)

I wish you, Dear Reader, all the best in 2017!

What are Hierarchical Orthologous Groups (HOGs)?

Christophe Dessimoz — Thu, 08 Dec 2016 14:14:15 +0000

One central concept in the OMA project and other work we do to infer relationships between genes is that of Hierarchical Orthologous Groups, or “HOGs” for the initiated.

We’ve written several papers on aspects pertaining to HOGs—how to infer them, how to evaluate them, them being increasingly adopted by orthology resources, etc.—but there is still a great deal of confusion as to what HOGs are and why they matter.

Natasha Glover, talented postdoc in the lab, has produced a brief video to introduce HOGs and convey why we are mad about them!

References

Altenhoff, A., Gil, M., Gonnet, G., & Dessimoz, C. (2013). Inferring Hierarchical Orthologous Groups from Orthologous Gene Pairs PLoS ONE, 8 (1) DOI: 10.1371/journal.pone.0053786

Boeckmann, B., Robinson-Rechavi, M., Xenarios, I., & Dessimoz, C. (2011). Conceptual framework and pilot study to benchmark phylogenomic databases based on reference gene trees Briefings in Bioinformatics, 12 (5), 423-435 DOI: 10.1093/bib/bbr034

Sonnhammer, E., Gabaldon, T., Sousa da Silva, A., Martin, M., Robinson-Rechavi, M., Boeckmann, B., Thomas, P., Dessimoz, C., & , . (2014). Big data and other challenges in the quest for orthologs Bioinformatics, 30 (21), 2993-2998 DOI: 10.1093/bioinformatics/btu492

Interview with Tunca Doğan, OMA Visiting Fellow 2016

Tunca Doğan — Sun, 27 Nov 2016 21:10:03 +0000

Note: the “Life in the Lab” series features interviews of interns and visitors. This post is by our second 2016 OMA Visiting Fellow Tunca Doğan, who spent a month with us earlier this year. You can follow Tunca on Twitter at @tuncadogan. —Christophe

Please introduce yourself in a few sentences.

My name is Tunca Doğan. I received my PhD in 2013 with a thesis study in the fields of bioinformatics and computational biology where we developed methods for the clustering of the protein sequences using unsupervised machine learning techniques (Dogan and Karacali, 2013). I’ve since been working as a post-doctoral fellow in the EMBL-EBI, UK under the Protein Function Development team (UniProt Database) leaded by Dr Maria Martin. Here I’m developing new tools and methods for the automated functional annotation of protein records in the UniProtKB using a variety of features including domain architectures (Dogan et al., 2016). I’m also conducting research in the field of computational drug discovery. As of 2016, I’m also affiliated to the Department of Health Informatics, METU, Turkey both as a senior research fellow and a faculty candidate.

Why did you choose to apply to the OMA visiting fellowship programme?

The team behind OMA is world-leading in the field of phylogenomics, and they authored many highly cited publications in this area. Moreover, OMA is considered to be one of the most reliable and comprehensive resources offering phylogenomic information on various species. I’ve applied to this programme in order to develop my knowledge in phylogenomic research, particularly about the OMA production. My specific research aim was to investigate if and how the information in OMA can be utilized in order to increase the coverage and the quality of the automated functional annotation of proteins in the UniProt database.

Discussions on UNIL campus with Leonardo de Oliveria Martins, Surag Nair, Clément Train, David Dylus and Tunca Doğan (from l. to r.)

What project did you work on during your visit?

The project I worked on had two sides: 1) investigating novel ways of quality checking of the data produced in the OMA pipeline (especially HOGs) using the Domain Architecture Alignment and Classification (DAAC) method I previously developed in UniProt; 2) investigating the use of OMA groups and HOGs to propagate the functional annotation between the (homologous) member proteins of the same clusters/classes.

Was there any highlight or low point you’d like to share?

It was a great experience for me both professionally and socially. I’ve learnt a great deal in just one month and we still keep our collaboration with the continuation of the abovementioned project. Everyone I met in the group: Christophe, Adrian, David, Leonardo and Clement were all knowledgeable, helpful and friendly that I had great time during my stay. It was a great pleasure to meet and to work with them all…

UNIL/EPFL campus is just beautiful, at the shores of lake Geneva. The campus is also well-equipped for all possible needs. This was also my first time in Switzerland and I was enchanted by the beauty of this country… The only downside for a foreign visitor could be the expensiveness of life in Switzerland, which was also manageable with a little prior investigation and planning.

Do you have any practical tip for future OMA visiting fellows?

I definitely recommend any researcher (at PhD or post-doc level) that has an interest in phylogenomics to apply to this programme. You’ll learn a great deal and have a good time at the same time. Also (for the foreigners) do not forget about travelling around this beautiful country in your spare time…

Editor’s note: If you are interested in the OMA visiting fellowship programme, consult this page.

References:

Doğan T, & Karaçalı B (2013). Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences. PloS one, 8 (9) PMID: 24069417

Doğan T, MacDougall A, Saidi R, Poggioli D, Bateman A, O’Donovan C, & Martin MJ (2016). UniProt-DAAC: domain architecture alignment and classification, a new method for automatic functional annotation in UniProtKB. Bioinformatics (Oxford, England), 32 (15), 2264-71 PMID: 27153729