Authors: Fritz Sedlazeck, Dan Jeffares & Christophe Dessimoz
Our latest study just came out (Jeffares et al., Nature Comm 2017). In it, we carefully catalogued high-confidence structural variants among all known strains of the fission yeast population, and assessed their impact on spore viability, winemaking and other traits. This post gives a summary and the story behind the paper.
Next generation sequencing is enabling the study of genomic diversity at unprecedented levels. While most of this research has focused on single base pair differences (single nucleotide polymorphisms, SNPs), larger genomic differences (called structural variations, SVs) can also have an impact on the evolution of an organism, on traits and on diseases. SVs are usually loosely defined as events that are at least 50 base pairs long. They are often classified into five subtypes: deletions, duplications, new sequence insertions, inversions and translocations.
Over recent years the impact of SVs has been characterized in many organisms. For example, SVs play a role in cancer, where duplications often lead to multiple copies of important oncogenes. Furthermore, SVs are known to play a role in other human disorders such as autism and obesity.
… but calling structural variants remains challenging
In principle, identifying SVs seems trivial: just map paired-end reads to a reference genome, look for any abnormally spaced pairs or split reads (i.e. reads with parts mapping to different regions), and—boom—structural variants!
In practice, things are much harder. This is partly due to the frustrating tendency for SVs to occur in or near repetitive regions, where short read sequencing struggles to disambiguate the reads, or in highly variable regions of the genome such as the chromosome ends, which tend to be the tinkering workshop of the genome.
As a result, a large proportion of SVs—typically at least 30-40%—remain undetected. As for false discovery rates (proportion of wrongly inferred SVs), they are mostly not well known because validating SVs on real data is very laborious.
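To make the "in principle" recipe above concrete, here is a minimal sketch of the first step: flagging abnormally spaced read pairs. It assumes the insert sizes have already been extracted from the alignments; the function name `flag_discordant_pairs` and the cutoff of five median absolute deviations are illustrative choices, not what any particular SV caller does.

```python
from statistics import median


def flag_discordant_pairs(insert_sizes, n_mads=5.0):
    """Return the indices of read pairs whose insert size lies unusually
    far from the library median (a crude hint of a deletion or insertion)."""
    med = median(insert_sizes)
    # Median absolute deviation: a robust estimate of the normal spread.
    mad = median(abs(size - med) for size in insert_sizes)
    mad = mad or 1.0  # avoid a zero threshold when all sizes are identical
    return [i for i, size in enumerate(insert_sizes)
            if abs(size - med) > n_mads * mad]


# Hypothetical insert sizes (bp) for eight read pairs; one spans far too much.
insert_sizes = [300, 310, 295, 305, 290, 2500, 302, 298]
print(flag_discordant_pairs(insert_sizes))  # [5]
```

A real caller would also use split reads and read orientation, and this is exactly where repeats cause trouble: a read pair that maps ambiguously can look discordant without any SV being present.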
Fission yeast: a compelling model to study structural variants
Schizosaccharomyces pombe is especially well suited to studying structural variants because:
The genome is small, well-annotated and simple (few repeats, haploid).
We had 40x or more coverage over 161 genomes covering the worldwide known population of S. pombe.
We had more than 220 accurate trait measurements for these strains at hand. Since the traits are measured under strictly controlled conditions, they contain little (if any) environmental variance—in stark contrast to human traits.
SURVIVOR makes the most out of (imperfect) SV callers
To infer accurate SV calls, we introduced SURVIVOR, a consensus method to reduce the false discovery rate while maintaining high sensitivity. Using simulated data, we observed that consensus calls obtained from two to three different SV callers could recover most SVs while keeping the false discovery rate in check. For example, SURVIVOR performed second best with a 70% sensitivity (best was Delly: 75%), while the false discovery rate was significantly reduced to 1% (Delly: 13%). But remember that these figures are based on simulation; performance on real data is likely worse. Furthermore, we equipped SURVIVOR with different methods to simulate data sets and evaluate callers, merge data from different samples, compute regions of poor mappability (as a BED file), etc. SURVIVOR is written in C++ so it’s fast enough to run on large genomes as well. Since then, we have been running it on multiple human data sets, which takes only a few minutes on a laptop. SURVIVOR is available on GitHub.
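The consensus idea can be illustrated with a toy sketch. This is not SURVIVOR's actual code: the `SVCall` record, the 1000 bp breakpoint tolerance and the greedy grouping are all assumptions made for illustration. The point is only that a call is kept when at least two different callers report an SV of the same type at roughly the same place.

```python
from collections import namedtuple

# Hypothetical record for one SV call; the field names are illustrative.
SVCall = namedtuple("SVCall", "caller chrom start end svtype")


def consensus_calls(calls, max_dist=1000, min_callers=2):
    """Group calls of the same type on the same chromosome whose breakpoints
    lie within max_dist of a group's first call, then keep the groups that
    are supported by at least min_callers distinct callers."""
    groups = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start)):
        for group in groups:
            ref = group[0]
            if (ref.chrom == call.chrom and ref.svtype == call.svtype
                    and abs(ref.start - call.start) <= max_dist
                    and abs(ref.end - call.end) <= max_dist):
                group.append(call)
                break
        else:
            groups.append([call])  # no matching group: start a new one
    return [g for g in groups if len({c.caller for c in g}) >= min_callers]


calls = [
    SVCall("callerA", "I", 10000, 15000, "DEL"),
    SVCall("callerB", "I", 10200, 15100, "DEL"),
    SVCall("callerC", "I", 50000, 52000, "DUP"),   # seen by one caller only
    SVCall("callerA", "II", 7000, 9000, "INV"),
    SVCall("callerC", "II", 7050, 9020, "INV"),
]
for group in consensus_calls(calls):
    print(group[0].chrom, group[0].svtype, sorted(c.caller for c in group))
```

Requiring agreement trades a little sensitivity for a much lower false discovery rate, which is the trade-off described above.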
SVs: now you see me, now you don’t
We applied SURVIVOR to our 161 genomic data sets, and then manually vetted all our calls to obtain a trustworthy set of SVs. We then discovered something suspicious. Some groups of strains that were very closely related (essentially clonal, differing by <150 SNPs) had different numbers of duplications, or different numbers of copies in duplications (1x, 2x, even 6x). This observation was also validated with lab experiments.
Interestingly, we identified 15 duplications that were shared between the more diverse non-clonal strains (so these must have been shared during evolution) but could not be explained by the tree inferred from SNPs (Figure 1). To confirm this, we compared the local phylogeny of SNPs in 20kb windows upstream and downstream of the duplications with the variance in copy numbers. Oddly, the copy number variance was not highly correlated with the SNP tree. This led to the conclusion that some SVs are transient and thus are gained or lost faster than SNPs.
Duplications happen within near-clonal populations. Phylogenetic tree of the strains reconstructed from SNP data, with eight pairs of very close strains that nonetheless show structural variation.
Though this transience came as a surprise, there is actually supporting evidence from laboratory experiments carried out by Tony Carr back in 1989 that duplications can occur frequently in laboratory-reared S. pombe, and can revert (Carr et al. 1989). The high turnover raises the possibility that SVs could be an important source for environmental adaptation.
SVs affect spore viability and are associated with several traits
We then investigated the phenotypic impact of these SVs. We used the 220 trait measurements from previous publications. We observed an inverse correlation between rearrangement distance and spore viability, confirming reports in other species that SVs can contribute to reproductive isolation. We also found a link between copy number variation and two traits relevant to winemaking (malic acid accumulation and glucose+fructose utilisation) (Benito et al. PLOS ONE 2016).
Structural variants, reproductive isolation, and wine. A) Making crosses between fission yeast strains often results in low offspring survival. The theory is that rearrangements (inversions and translocations) cause errors during meiosis, so we might expect them to affect offspring viability. If we compare offspring viability from crosses with the number of rearrangements that the parents differ by, there is a correlation, and a ‘forbidden triangle’ in the top right of the plot (it seems impossible to produce high viability spores when parents have many unshared rearrangements). B) SVs also affect traits. For > 200 traits (vertical bars) we used [LDAK](http://dougspeed.com/ldak/) to estimate the proportion of the narrow sense heritability that was caused by copy number variants (red), rearrangements (black) and SNPs (grey). Some traits are very strongly affected by copy number variants, such as the winemaking traits (wine-colored bars along the x-axis). C) Fission yeast wine tasting at UCL: how much of the taste is due to structural variants? (Jürg Bähler at right).
We estimated narrow sense heritability with Doug Speed’s LDAK program. Narrow sense heritability estimates how much of a difference in a trait between individuals can be explained by adding up all the tiny effects of the genomic differences (in our case SNPs; deletions and duplications; inversions and translocations; and all combined). Overall, we found that heritability was better explained when combining the SNP data with the SV data. In 45 traits SVs explained 25% or more of the trait variability. Five traits with a combined SNP and SV heritability above 90% came from different growth conditions in liquid medium. This may highlight again the influence of environmental conditions on the genomic structure. For 74 traits (~30% of those we analyzed) SVs explain more of the trait than the SNPs. These high SV-affected traits include malic acid, acetic acid and glucose/fructose contents of wine, key components of taste.
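As a rough guide to what these percentages mean, narrow sense heritability can be written as a ratio of variances. The partition below is a simplified sketch: it assumes the three classes of variant contribute independently and lumps everything else into a residual term, which is not necessarily the exact model fitted by LDAK. The symbols are introduced here for illustration.

```latex
% Phenotypic variance split into additive genetic and residual parts
\sigma^2_P = \sigma^2_A + \sigma^2_e,
\qquad
h^2 = \frac{\sigma^2_A}{\sigma^2_P}

% Additive variance further split by variant class (assumed independent)
\sigma^2_A = \sigma^2_{\mathrm{SNP}} + \sigma^2_{\mathrm{CNV}} + \sigma^2_{\mathrm{rearr}},
\qquad
h^2_{\mathrm{SV}} = \frac{\sigma^2_{\mathrm{CNV}} + \sigma^2_{\mathrm{rearr}}}{\sigma^2_P}
```

Under this reading, "SVs explained 25% or more of the trait variability" corresponds to $h^2_{\mathrm{SV}} \geq 0.25$, and "SVs explain more of the trait than the SNPs" corresponds to $\sigma^2_{\mathrm{CNV}} + \sigma^2_{\mathrm{rearr}} > \sigma^2_{\mathrm{SNP}}$.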
A collaborative effort
On a personal note, the paper concludes a wonderful team effort over two and a half years.
The project started as a summer project for Clemency Jolly, who had then just completed her 3rd undergraduate year at UCL, in the Dessimoz and Bähler labs. Dan Jeffares and the rest of the Bähler lab had just published their 161 fission yeast genomes, with an in-depth analysis of the association between SNPs and quantitative traits (Jeffares et al., Nature Genetics 2015). Studying SVs was the logical next step, but given the challenging nature of reliable SV calling, we also recruited to the team Fritz Sedlazeck, collaborator and expert in tool development for NGS data analysis then based in Mike Schatz’s lab at Cold Spring Harbor Laboratory.
At the end of the summer, it was clear that we were onto something, but there was still a lot to be done. Clemency turned the work into her Master’s project, with Dan and Fritz redoubling their efforts until Clemency’s graduation in summer 2015. It took another year of intense work led by Dan and Fritz to verify the calls, perform the GWAS and heritability analyses, and publish the work. Since then, Clemency has started her PhD at the Crick Institute, Fritz has moved to Johns Hopkins University, and Dan has started his own lab at the University of York.
Jeffares, D., Jolly, C., Hoti, M., Speed, D., Shaw, L., Rallis, C., Balloux, F., Dessimoz, C., Bähler, J., & Sedlazeck, F. (2017). Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast Nature Communications, 8 DOI: 10.1038/ncomms14061
Carr AM, MacNeill SA, Hayles J, & Nurse P (1989). Molecular cloning and sequence analysis of mutant alleles of the fission yeast cdc2 protein kinase gene: implications for cdc2+ protein structure and function. Molecular & general genetics : MGG, 218 (1), 41-9 PMID: 2674650
Jeffares, D., Rallis, C., Rieux, A., Speed, D., Převorovský, M., Mourier, T., Marsellach, F., Iqbal, Z., Lau, W., Cheng, T., Pracana, R., Mülleder, M., Lawson, J., Chessel, A., Bala, S., Hellenthal, G., O’Fallon, B., Keane, T., Simpson, J., Bischof, L., Tomiczek, B., Bitton, D., Sideri, T., Codlin, S., Hellberg, J., van Trigt, L., Jeffery, L., Li, J., Atkinson, S., Thodberg, M., Febrer, M., McLay, K., Drou, N., Brown, W., Hayles, J., Salas, R., Ralser, M., Maniatis, N., Balding, D., Balloux, F., Durbin, R., & Bähler, J. (2015). The genomic and phenotypic diversity of Schizosaccharomyces pombe Nature Genetics, 47 (3), 235-241 DOI: 10.1038/ng.3215
Benito, A., Jeffares, D., Palomero, F., Calderón, F., Bai, F., Bähler, J., & Benito, S. (2016). Selected Schizosaccharomyces pombe Strains Have Characteristics That Are Beneficial for Winemaking PLOS ONE, 11 (3) DOI: 10.1371/journal.pone.0151102
For my second postdoc, I was the fortunate recipient of a PLANT FELLOWS scholarship. PLANT FELLOWS is an international program that provides research grants to postdocs in the field of plant science. The fellows are based at many different host institutions throughout Europe. I myself am working at Bayer Crop Science in Gent, Belgium, in collaboration with the Dessimoz lab in London and Lausanne. Part of the PLANT FELLOWS mission is to provide training, mentoring, and networking to the postdocs, all essential for career advancement.
Last year, the annual PF meeting was held in Männedorf, Switzerland from September 28 to October 1 2015. Training workshops took place at the Boldern Hotel, surrounded by meadows and with a nice view of Lake Zürich.
Group picture from the 3rd annual PLANT FELLOWS meeting
The meeting consisted of several days of training and workshops. For one of the days, I chose to participate in the workshop “Advanced Strategies for Dealing with the Publication Process.” I was especially keen on learning more about this particular subject. As a postdoc still trying to navigate the publication waters, I was looking for all the advice I could get. We’ve all heard the saying before: publish or perish. Publishing papers in your postdoc years is so important for an academic career.
There were about 15 postdocs in this day-long workshop. The facilitator, Philipp Mayer, came with a bunch of photocopied book chapters, articles, and USB keys full of pdfs for each of us to use on our laptops. The objective of the workshop was to, as a group, write a small paper about advanced publication strategies using the literature we were provided with. Our plan of attack was to pool our collective postdoc experience and come up with a list of our most useful recommendations on how to get a scientific paper published.
After feverishly reading websites, book chapters and papers, at the end of the day we came up with a draft: an introduction, our recommendations broken into 3 main sections, and a conclusion. We had a respectable number of references. But what would be the fate of our paper? About a third of the class was apathetic, a third thought we should aim for a blog post, and another third thought we should try for a “real” scientific journal. I had really enjoyed the workshop so I lobbied for publishing it in a real journal. I liked the experience of learning about a topic, working collaboratively with my peers, and then passing on the information for others to benefit.
I volunteered to take charge of the paper, edit it, and submit it to journals in hopes of getting it published. At the end of the day I left with a draft of the paper, many references, the contact information of all the attendees, and the full support of the facilitator (Philipp) for any future help that I might need. I looked at it as an opportunity to take a leadership role in publishing a paper, from start to finish. And more importantly, it was a chance to put our own advice into practice.
Upon returning to Belgium, I quickly found out that one of the sentences we had written in the paper rang true: It is a common misconception among early career researchers that the presentation of the work in a manuscript is the last stage of a project. There is a long and complicated process associated with submission, review, and revision that must be taken into account. During the next month, I reread the paper, finished writing short sections, added references, edited, and got feedback from the coauthors. We agreed on the author order, and shared the document using Authorea. Philipp and I went back and forth with several rounds of editing.
We decided to submit our manuscript to eLife, which is a prestigious peer reviewed open access journal with favorable policy toward early career researchers. I wrote a cover letter to the editor describing our paper and asking if the topic was suitable to be considered for eLife.
Within a few days, the editor read the manuscript but informed me that he was unable to send it out for review because it wasn’t “fresh” enough, meaning most of what we said had already been discussed many times in the scientific community. Despite the sting of having a paper rejected directly by the editor, I decided to take the advice we had written in the paper: Remove your personal feelings from the peer review process. Time to find the next journal.
During the following month and a half, the manuscript was pushed to the bottom of my To Do list, as other projects and tasks got my attention. Christmas holidays came and went, and admittedly this paper was the last thing on my mind.
In January, I sent a presubmission inquiry to PLOS Biology. The PLOS Biology editor wrote back within a few days to inform me that although they appreciated the attention to an important problem, they could not encourage us to submit because it didn’t present “novel strategies for increasing access to research, improving the quality of research results, or fixing flawed measures of impact.” Since this was the second time I had heard this same exact criticism, I realized it was time to take more advice from the paper: It is critical to highlight the novelty and importance in the article and cover letter. We were going to have to add something to the paper to make it more novel.
Shortly after, I contacted the Frontiers in Plant Science (FiPS) Editorial Office with a new and improved cover letter. FiPS is an open access online journal publishing many different peer reviewed articles: research, reviews, commentaries, and perspectives, among others. The editor and I discussed morphing the paper into something that would be more plant related, given the plant science background of all the coauthors. Over the next month, it was back to editing the paper. I proposed edits that would make our tips more plant-specific. We added advice about industry-academia collaborations, and more information about plant science journals. Philipp, the coauthors, and I went back and forth several times with rounds of edits, adding more references and polishing more details. I submitted the final version of the paper to Frontiers in Plant Science on March 15.
The experience of the collaborative peer review by FiPS was a pleasant and efficient one. Their website says “Frontiers reviews are standardized, rigorous, fair, constructive, efficient and transparent.” I enthusiastically agree. Within two weeks, we had received comments from the reviewers. There were some major points that needed to be addressed before Frontiers could offer publication. However, the points were all very relevant and only helped to make the paper stronger. During the process of the interactive review, I took more guidance from the paper: Go point by point through the reviewer comments and either make the suggested change or politely explain and clarify the misunderstanding.
April 21st: Acceptance achieved! Approximately 5 weeks after submitting the article, it was accepted and the provisional version of the manuscript was published online. This is an extremely fast turnaround time, in part due to the responsiveness of the editor, quick but in-depth peer review, and the interactive, transparent review discussion.
What I learned
This collaboration with the PLANT FELLOWS postdocs resulted in a paper I can say I’m proud of. I learned many things about the publication process—not only through a literature review, but by actually experiencing the process first hand. Here are some of the main things that stuck with me:
There is a certain creative power in bringing people together in a beautiful location to brainstorm and produce an outcome within a short period of time. However, it is necessary for someone to take the reins and commit to the follow-through in order to get to a finished product. I think things like hackathons or other collaborative group efforts could lead to fruitful outcomes.
I learned how to coordinate a small project. This was a great collaborative effort, which gave me an opportunity to practice the recommendations we wrote about in the paper.
I discovered firsthand the importance of the initial contact with the editor. As soon as we reworked the paper to approach the topic from a plant-specific standpoint, this added novelty to the paper. We were able to highlight this novelty in the cover letter.
Don’t give up. Many times I got distracted or discouraged and thought to publish the manuscript on our blog, but I’m glad in the end we found a home for it at FiPS. Perseverance is key.
Glover, N., Antoniadi, I., George, G., Götzenberger, L., Gutzat, R., Koorem, K., Liancourt, P., Rutowicz, K., Saharan, K., You, W., & Mayer, P. (2016). A Pragmatic Approach to Getting Published: 35 Tips for Early Career Researchers Frontiers in Plant Science, 7 DOI: 10.3389/fpls.2016.00610
The paper introducing our new tree visualisation tool Phylo.io was just published in MBE.
Yet another tool to display trees, you might say, and indeed, so it is. But for all the tools that have been developed over the years, there are very few that scale to large trees, make it easy to compare trees side-by-side, and simply run in a browser on any computer or mobile device.
The project started as a student summer internship project, with the aim of producing a tree visualiser that facilitates comparison of trees built on the same set of leaves. After reading the project description, Oscar Robinson, a brilliant student from the Computer Science department at UCL, decided to work on this project during a three month internship. He saw a chance to apply his experience in the development of web tools and to develop his knowledge in the field of data visualisation, one of his major interests.
What is phylo.io and what can it do?
Phylo.io is a web tool that works in any modern browser. All computations are performed client-side and the only restriction on performance is the machine it is running on. Trees can be input in Newick and Extended Newick format. Phylo.io offers many features that other tree viewers have. Branches can be swapped, the rooting can be changed, and the thickness, font and other parameters are adaptable. Many of these operations can be performed directly by clicking on a branch or a node in the tree. Importantly, it features an automatic subtree collapsing function: this facilitates the visualisation of large trees and hence the analysis of splits that are deep in the tree.
In addition to basic tree visualisation and manipulation, it features a compare mode. This mode allows you to compare two trees computed using different tools or different models. Similarities and differences are highlighted using a colour scheme directly on the individual branches, making it clear where the differences between the two topologies actually are. Additionally, since different tools output trees with very different rootings and leaf orders, Phylo.io has a function to root one of the trees according to the other one and adapt the order of the leaves according to a fixed tree.
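To give a flavour of what comparing two topologies involves, here is a minimal sketch in Python. It is not Phylo.io's code (which runs in the browser), and the scoring it uses is an assumption: it reads two rooted trees in plain Newick format and reports which clades (sets of leaves below a branch) are shared and which are unique to each tree. The tiny parser ignores branch lengths and internal labels, and does not handle quoted names or Extended Newick.

```python
def clades(newick):
    """Return the set of clades (frozensets of leaf names) of a rooted tree
    given as a plain Newick string."""
    text = newick.strip().rstrip(";")
    found = set()
    pos = 0

    def read_until(stops):
        """Advance to the next character in `stops`; return the text skipped."""
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stops:
            pos += 1
        return text[start:pos]

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            leaves = set(parse_node())
            while text[pos] == ",":
                pos += 1                  # consume ","
                leaves |= parse_node()
            pos += 1                      # consume ")"
            read_until(",()")             # skip internal label / branch length
            found.add(frozenset(leaves))
            return leaves
        name = read_until(",():")
        read_until(",()")                 # skip the leaf's branch length
        return {name}

    parse_node()
    return found


def compare(newick_a, newick_b):
    """Split the clades of two trees into shared and tree-specific sets."""
    a, b = clades(newick_a), clades(newick_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


result = compare("((A:0.1,B:0.2),(C,D));", "((A,C),(B,D));")
print(sorted(sorted(c) for c in result["only_a"]))  # [['A', 'B'], ['C', 'D']]
```

Because clades are stored as sets, two trees that differ only in the order of their leaves compare as identical, which is why leaf order can be adapted freely for display.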
How do you use phylo.io?
To save you time, here is a one minute screencast highlighting some of the key features of Phylo.io:
Back in 2009, Adrian Altenhoff and I published a paper on ortholog benchmarking in PLOS Computational Biology. At the time, this was the first benchmark study with phylogeny-based tests. It also investigated an unprecedented number of methods. One of the most challenging aspects of this work (and by far the most tedious) was to compare inferences performed by different methods on only partly overlapping sets of genomes, often with inconsistent identifiers and releases, lending credence to the cynic’s view that “bioinformatics is ninety percent identifier mapping…”
Enter the Quest for Orthologs consortium
Around that time, Eric Sonnhammer and Albert Vilella organised the first Quest for Orthologs (QfO) meeting at the beautiful Genome Campus in Hinxton, UK—the first of a series of collaborative meetings. We have published detailed reports on these meetings (2009, 2011, 2013; stay tuned for the 2015 meeting report…).
Out of these interactions, the Quest for Orthologs consortium was born, with the mission to benchmark, improve and standardise orthology predictions through collaboration, the use of shared reference datasets, and evaluation of emerging new methods.
The orthology benchmark service and other contributions of the paper
The consortium is organised in working groups. One of them is the benchmarking working group, in which Adrian and I have been very involved. This new paper presents several key outcomes of the benchmarking working group.
First and foremost, we present a publicly-available, automated, web-based benchmark service. Accessible at http://orthology.benchmarkservice.org, the service lets method developers evaluate predictions performed on the 2011 QfO reference proteome set of 66 species. Within a few hours after submitting their predictions, they obtain detailed feedback on the performance of their method on various benchmarks compared with other methods. Optionally, they can make the results publicly available.
Conceptual overview of the benchmark service (Fig 1 of the paper)
Second, we discuss the performance of 14 orthology methods on a battery of 20 different tests on a common dataset across all of life.
Third, one of the benchmarks, the generalised species discordance test, is new and provides a way of testing pairwise orthology based on trusted species trees of arbitrary size and shape.
For developers of orthology prediction methods, this work sets minimum standards in orthology benchmarking. Methodological innovations should be reflected in competitive performance in at least a subset of the benchmarks (we recognise that different applications entail different trade-offs). Publication of new or updated methods in journals should ideally be accompanied by publication of the associated results in the orthology benchmark service.
For end-users of orthology predictions, the benchmark service provides the most comprehensive survey of methods to date. And because it can process new submissions automatically and continuously, it holds the promise of remaining current and relevant over time. The benchmark service thus enables users to gauge the quality of the orthology calls upon which they depend, and to identify the methods most appropriate to the problem at hand.
Altenhoff, A., Boeckmann, B., Capella-Gutierrez, S., Dalquen, D., DeLuca, T., Forslund, K., Huerta-Cepas, J., Linard, B., Pereira, C., Pryszcz, L., Schreiber, F., da Silva, A., Szklarczyk, D., Train, C., Bork, P., Lecompte, O., von Mering, C., Xenarios, I., Sjölander, K., Jensen, L., Martin, M., Muffato, M., Quest for Orthologs consortium, Gabaldón, T., Lewis, S., Thomas, P., Sonnhammer, E., & Dessimoz, C. (2016). Standardized benchmarking in the quest for orthologs Nature Methods DOI: 10.1038/nmeth.3830
Altenhoff, A., & Dessimoz, C. (2009). Phylogenetic and Functional Assessment of Orthologs Inference Projects and Methods PLoS Computational Biology, 5 (1) DOI: 10.1371/journal.pcbi.1000262
Gabaldón, T., Dessimoz, C., Huxley-Jones, J., Vilella, A., Sonnhammer, E., & Lewis, S. (2009). Joining forces in the quest for orthologs Genome Biology, 10 (9) DOI: 10.1186/gb-2009-10-9-403
Dessimoz, C., Gabaldon, T., Roos, D., Sonnhammer, E., Herrero, J., & Quest for Orthologs Consortium (2012). Toward community standards in the quest for orthologs Bioinformatics, 28 (6), 900-904 DOI: 10.1093/bioinformatics/bts050
Sonnhammer, E., Gabaldon, T., Sousa da Silva, A., Martin, M., Robinson-Rechavi, M., Boeckmann, B., Thomas, P., Dessimoz, C., & Quest for Orthologs Consortium. (2014). Big data and other challenges in the quest for orthologs Bioinformatics, 30 (21), 2993-2998 DOI: 10.1093/bioinformatics/btu492
We know homologs are genes related by common ancestry. But throw complex evolutionary events into the mix and things can get a little dicey. Under the umbrella of homologs exist many different categories: orthologs, paralogs, ohnologs, xenologs, co-orthologs, in-paralogs, out-paralogs, paleologs, among others. All of these -log terms have a specific meaning (see my previous blog post on orthology and paralogy), but now we will focus on one in particular: homoeologs.
But before we get into the definition, let’s start at the beginning. When I started as a postdoc at Bayer CropScience working with Henning Redestig in collaboration with Christophe Dessimoz at University College London, I was tasked with evaluating homoeolog predictions using the OMA algorithm.
What are homoeologs?
From my previous experience, I knew homoeologs as roughly “corresponding” genes between subgenomes of a polyploid organism. For example, wheat is an allohexaploid, with 3 diploid subgenomes named A, B, and D. Given a gene on chromosome 3B, you will most likely find a nearly identical copy on chromosomes 3A and 3D, in roughly the same position. These corresponding copies across subgenomes are known as homoeologs. But this definition left something to be desired: it didn’t tell me anything about the evolutionary relationship between the homoeologs. Worse, it was ambiguous in that it required discretionary similarity thresholds in terms of sequence and positional conservation. How could we test for performance if there was no unambiguous definition of the target?
Time to hit the books
Like many researchers starting a new project, I went to the scientific literature to get more information. After many hours spent on Google Scholar, I found myself with more questions than answers. Firstly, what were the evolutionary events that give rise to homoeologs? How do they fit in with the other -log terms? Can they be found only in a certain type of polyploid, but not another? How do things like gene duplication and movement affect our understanding of what a homoeolog is? And finally, after seeing the word written as homoeolog, homeolog, and homoeologue, how do you even spell it?
There are some excellent review papers out there on polyploidy which shed light on the biological consequences of homoeology. This, this, or this for example. However, when searching the whole of the literature, I found many inconsistent, vague, or even incorrect usages of the term homoeolog. Sometimes people defined homoeologs on the basis of their chromosome pairing patterns. Other times homoeologs were used to describe corresponding genes from different, although closely related species. Many papers said homoeologs were necessarily syntenic. Others didn’t define the term at all.
Getting on the same page
These imprecise or incorrect definitions can lead to confusion. In recent years, advances in technology have afforded us the opportunity to sequence many new genomes, including polyploids. All these new techniques have exploded the amount of data and brought about collaborations between geneticists, molecular biologists, plant breeders, bioinformaticians, phylogeneticists, and statisticians. Therefore we think it’s important to have a precise and evolutionarily meaningful definition of homoeology as a reference point.
What we learned
Thus we went back to the earliest usage of the term we could find and synthesized the literature to date. We define homoeologs as “pairs of genes or chromosomes in the same species that originated by speciation and were brought back together in the same genome by allopolyploidization”. For recent hybrids, as long as there was no rearrangement across subgenomes, homoeologs can be thought of as orthologs between these subgenomes. Here’s how they fit in with other common homologs:
We realized that homoeologs are not necessarily one-to-one or syntenic. Depending on the particular patterns of gene duplication and rearrangement in a given species, we may see homoeologs in a 1:many relationship or across non-corresponding chromosomes.
We also reviewed homoeolog inference techniques, starting from low-throughput lab techniques to evolution-based computational methods. Orthology prediction is a booming area of active research, so many orthology inference methods can be applied to homoeology prediction.
Last but not least, we learned that even though homoeolog has alternatively been spelled “homeolog” (no extra o), homoeolog is the clear winner in terms of popularity. The “homoeo-” spelling has been used more than twice as often in the literature. Fortunately, however, both are pronounced the same (“ho-mee-o-log”).
Check out the review paper in Trends in Plant Science (open access!). We hope this paper can serve as a jumping-off point for those interested in tackling homoeology, especially for those new to the field.
Glover, N., Redestig, H., & Dessimoz, C. (2016). Homoeologs: What Are They and How Do We Infer Them? Trends in Plant Science DOI: 10.1016/j.tplants.2016.02.005
This is how molecular systematics has worked since the sixties: you take some identifiable feature (e.g. a gene or a protein) common to a group of species and take some measurements of it (e.g. sequencing the DNA). By comparing the results of these measurements you can estimate the evolutionary tree that links the species. Shortly after people started doing this they realised there was a problem: when analyses are based on different genes they often estimate different (incongruent) evolutionary trees. As technology has become more capable, researchers have begun using more and more genes, so this problem of incongruent trees has moved to the foreground.
There have been lots of good ideas about what to do about this problem, and this paper is our contribution. We tried to tackle incongruence by designing a method that groups genes together based on how similar their estimated trees are, without any assumption as to how any incongruence came about.
If all the genes more or less agree on the evolutionary tree, then you get one large group; if some disagree, then they are placed in their own groups. The most interesting case is if several genes disagree in the same way, because then you have an effect to try to explain, and you may have discovered something.
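To make the grouping idea concrete, here is a toy sketch in Python. It is not how treeCl works internally: it assumes rooted trees written as nested tuples, uses a simple clade-based symmetric-difference distance (in the spirit of Robinson-Foulds), and groups genes by single linkage with a hypothetical `max_distance` threshold. All function names are made up for illustration.

```python
from itertools import combinations


def clades(tree):
    """Return the set of clades (frozensets of leaf names) of a rooted tree
    given as nested tuples, e.g. ((("A", "B"), "C"), "D")."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def tree_distance(t1, t2):
    """Number of clades present in one tree but not in the other."""
    return len(clades(t1) ^ clades(t2))


def group_genes(gene_trees, max_distance=0):
    """Single-linkage grouping: genes whose trees are within max_distance
    of each other end up in the same group. Largest group first."""
    names = list(gene_trees)
    group_of = {name: i for i, name in enumerate(names)}
    for a, b in combinations(names, 2):
        if tree_distance(gene_trees[a], gene_trees[b]) <= max_distance:
            old, new = group_of[b], group_of[a]
            for name in names:
                if group_of[name] == old:
                    group_of[name] = new
    groups = {}
    for name in names:
        groups.setdefault(group_of[name], []).append(name)
    return sorted(groups.values(), key=len, reverse=True)


# Made-up gene trees: three genes support ((A,B),C), two support ((A,C),B).
gene_trees = {
    "gene1": ((("A", "B"), "C"), "D"),
    "gene2": ((("B", "A"), "C"), "D"),
    "gene3": ((("A", "C"), "B"), "D"),
    "gene4": ((("C", "A"), "B"), "D"),
    "gene5": ((("A", "B"), "C"), "D"),
}
groups = group_genes(gene_trees)
print(groups)
```

In this made-up example the majority group plays the role of the "widely accepted tree", and the second group is the kind of shared disagreement that deserves a closer look.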
We ran many simulations to test and refine our method, both in its ability to recognise different incongruent groups, and to estimate how many groups are present. Then, armed with a method that works well in simulation, we tested it on some real data, from yeasts, and from flies.
Our findings were that for the yeast data our method worked really well, and identified 3 distinct groups of genes. The majority of genes were a good fit to the widely accepted tree for the species we looked at. The other two groups showed some major differences, mostly involving two of the species. We had a close look at the data, and concluded that there were some wrong annotations in the data that had introduced sequences that didn’t belong there. This was not the biological result we were looking for, but nonetheless useful.
The fly data were trickier, as they come from a genus where we aren’t sure how many separate species there are. We produced trees that show better species-level resolution than the most recent molecular studies. We also showed high levels of incongruence in the order in which the species appear, which can often be the case when species have diverged rapidly, due to a process called incomplete lineage sorting.
So, whether the goal is to identify artifacts or genuine incongruence among your loci, we think that process-agnostic topology partitioning should become a routine step in phylogenetic analyses. To facilitate this process, we’ve released our code as a new open-source software package called “treeCl”, available at https://git.io/treeCl.
Gori K, Suchan T, Alvarez N, Goldman N, & Dessimoz C (2016). Clustering genes of common evolutionary history. Molecular biology and evolution PMID: 26893301
Does automatic alignment filtering lead to better trees?
One major use of multiple sequence alignments is for tree inference. Because aligners make mistakes, many practitioners like to mask the uncertain parts of the alignment. This is done by hand or using automated tools such as Gblocks, TrimAl, or Guidance.
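For readers who have never filtered an alignment, here is a deliberately naive sketch in Python of what "masking columns" means. It is not the algorithm of Gblocks, TrimAl, or Guidance, which use more elaborate criteria; the function name and the `max_gap_fraction` threshold are assumptions made up for this illustration.

```python
def filter_columns(alignment, max_gap_fraction=0.5):
    """Drop alignment columns whose fraction of gap characters ("-")
    exceeds max_gap_fraction. `alignment` is a list of equal-length strings."""
    if not alignment:
        return []
    n_seqs = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        col for col in range(length)
        if sum(seq[col] == "-" for seq in alignment) / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]


# Toy alignment: the fourth column is 75% gaps and gets removed.
alignment = [
    "MKT-AYI",
    "MKTGAYI",
    "MK--AYL",
    "MR--AYI",
]
filtered = filter_columns(alignment)
print("\n".join(filtered))
```

The threshold is exactly the kind of parameter that determines how much of the alignment is thrown away.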
The aim of our study was to compare different automated filtering methods and assess under which conditions filtering might be beneficial. We compared ten different approaches on several datasets covering hundreds of genomes across different domains of life (including nearly all of the Ensembl database), as well as on simulated data. We used several criteria to assess the impact of filtering on tree inference (comparing the congruence of resulting trees with undisputed species trees and counting the number of gene duplications implied). We sliced the data in many different ways (sequence length, divergence, “gappyness”).
The more we filter alignments, the worse trees become.
In all datasets, tests, and conditions we tried, we could hardly find any situation in which filtering methods led to better trees; in many instances, the trees got worse:
Overall, the more alignments get filtered (x-axis in figure), the worse the trees become! This holds across different datasets and filtering methods. Furthermore, under default parameters, most methods filter far too many columns.
The results were rather unexpected, and potentially controversial, so we went to great lengths to ensure that they were not spurious. This included many control analyses and replication of the results on different datasets, and using different criteria of tree quality. We also used simulated data, for which the correct tree is known with certainty.
What could explain this surprising result?
It appears that tree inference is more robust to alignment errors than we
might think. One hypothesis for this might be that while alignment
errors introduce mostly random (unbiased) noise, correct columns (or partly
correct ones) contain crucial phylogenetic signal that can help discriminate
between the true and alternative topologies.
But why could this be the case? We are not sure, but here is an idea: aligners
tend to have most difficulty with highly distant sequences, because there are
many evolutionary scenarios that could have resulted in the same sequences. At
the limit, if the distance is very large (e.g. sites have undergone multiple
substitutions on average), all alignments become equally likely, and it becomes
impossible to align the sequences. Also, the variance of the distance estimate
explodes. But relative to this enormous variance,
the bias introduced by alignment errors becomes negligible.
I stress that we don’t prove this in the paper and this is merely a conjecture
(some might call this post hoc rationalisation).
So is filtering an inherently bad idea?
Although alignment filtering does not improve tree accuracy, we can’t say that
it is inherently a bad idea. Moderate amounts of filtering did not seem to have much
impact—positive or negative—but can save some computation time.
Also, if we consider the accuracy of the alignments themselves, which we did in
simulations (such that we know the true alignment), filtering does decrease
the proportion of erroneous sites in the alignments (though, of course, these
alignments get shorter!). Thus for
applications more sensitive to alignment errors than tree inference, such as detection of
sites under positive selection, it is conceivable that filtering might, in
some circumstances, help. However, the literature on the topic is rather
ambivalent (see here, here, here, and here).
Why it took us so long: a brief chronology of the project
The project started in summer 2010 as a 3-week rotation project by Ge Tan, who was a talented MSc student at ETH Zurich at the time (he is now a PhD student at Imperial College London, in Boris Lenhard’s group). The project took a few months longer than originally foreseen to complete, but the results were already apparent early on. In his report, Ge concluded:
“In summary, the filtering methods do not help much in most cases.”
After a few follow-up analyses to complete the study, we submitted a first manuscript to MBE in Autumn 2011. This first submission was rejected after peer-review due to insufficient controls (e.g. lack of DNA alignments, no control for sequence length, proportion of gaps, etc.). The editor stated:
“Because the work is premature to reach the conclusion, I cannot help rejecting the paper at this stage”.
Meanwhile, having just moved to EMBL-EBI near Cambridge UK, I gave a seminar on the work. Puzzled by my conclusions, Matthieu Muffato and Javier Herrero from the Ensembl Compara team set out to replicate our results on the Ensembl Compara pipeline. They saw the same systematic worsening of their trees after alignment filtering.
We joined forces and combined our results in a revised manuscript, alongside additional controls requested by the reviewers from our original submission. The additional controls necessitated several additional months of computations but all confirmed our initial observations. We resubmitted the manuscript to MBE in late 2012 alongside a 10-page cover letter detailing the improvements.
Once again, the paper was rejected. Basically, the editor and one referee did not believe the conclusions, and no amount of controls was going to convince them otherwise. We appealed. The editor-in-chief turned down the appeal, but now the reason was rather different:
“[Members of the Board of Editors] were not convinced that the finding that the automated filtering of multiple alignment does not improve the phylogenetic inference on average for a single-gene data set was sufficiently high impact for MBE.”
We moved on and submitted our work to Systematic Biology. Things worked out better there, but it nevertheless took another two years and three resubmissions—addressing a total of 147 major and minor points (total length of rebuttal letters: 43 pages)—before the work got accepted. Two of the four peer-reviewers went so far as to reanalyse our data as part of their report—one conceding that our results were correct and the other one holding out until the bitter end.
Why no preprint?
Some of the problems with this slow publication process could have been
mitigated if we had submitted the paper as a preprint. In hindsight, it’s
obvious that we should have done so. Initially, however, I did not anticipate
that it would take so long. And with each resubmission, the paper was
getting stronger, so I thought the whole time that it was just about to be accepted… I
surely also fell for the Sunk Cost fallacy.
Other perils of long-term projects
I’ll finish with a few amusing anecdotes highlighting the perils of papers requiring many cycles of resubmissions:
More than once, we had to redo analyses with new filtering methods that got published after we started the project.
At some point, one referee asked why we were using such an outdated version of TCoffee (it went from version 5 to version 10 during the project!).
The editor-in-chief of MBE changed, and with that change came new editorial policies and a new manuscript format (the paper had to be restructured with the method section at the end).
Tan, G., Muffato, M., Ledergerber, C., Herrero, J., Goldman, N., Gil, M., & Dessimoz, C. (2015). Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference Systematic Biology, 64 (5), 778-791 DOI: 10.1093/sysbio/syv033
We have published our latest review, on methods to infer horizontal gene transfer, in Wikipedia. It was peer-reviewed and it simultaneously appeared in PLOS Computational Biology. After our review on approximate Bayesian computations published two years ago, this is our second contribution using an exciting new format called “Topic Page”. In this post, I reflect on our motivation and experience as Topic Page authors.
The difficult relationship between academia and Wikipedia
Academia has mixed feelings about Wikipedia. Although many academics—and certainly many students—consult Wikipedia frequently, I’d venture to say that most remain reluctant to cite Wikipedia or admit relying on it otherwise.
As for academics contributing to Wikipedia, things are even worse. As a result, the quality of Wikipedia articles on scientific topics is often quite poor.
I think there are two main reasons for this reluctance to contribute. First, the lack of clear authorship and therefore credit makes it difficult for scientists to get recognition for contributing to Wikipedia. Given the intense competitiveness of contemporary science, this is more than a vanity issue; recognition is tightly coupled with funding and job success—i.e. survival in the profession. But perhaps just as importantly, many scientists are unfamiliar with Wikipedia’s conventions and practices and are thus (rightly!) concerned that their contributions might be “diluted” by further edits by others or even flat out turned down. I know of several disgruntled people who have given up on editing Wikipedia because of such bad experiences.
Simson Garfinkel provides an illuminating account of this sort of tension in this article:
“I have attempted to retire from directing films in the alternative universe
that is the Wikipedia a number of times, but somebody always overrules me,”
Lanier wrote. “Every time my Wikipedia entry is corrected, within a day I’m
turned into a film director again.”
Since Lanier’s attempted edits to his own Wikipedia entry were based on
firsthand knowledge of his own career, he was in direct violation of
Wikipedia’s three core policies. He has a point of view; he was writing on
the basis of his own original research; and what he wrote couldn’t be
verified by following a link to some kind of legitimate, authoritative, and
For the tertiary source Wikipedia aims to be, these core policies are entirely reasonable, but it’s easy to imagine situations where they might frustrate some contributors.
Wikipedia’s tremendous impact
It’s worth considering, however, the upsides of contributing to Wikipedia. For all the obsessions many of us have about publishing articles in generalist journals with broad readership, the lack of interest in Wikipedia feels like a missed opportunity. Consider the Wikipedia page on
Phylogenetics. It was consulted
over 50,000 times in the last 3
This is over twice as much as the median number of views of papers published
in, say, the 17 Oct 2013 issue of
Nature in a quarter of the time (as it
happens, this particular issue has “impact metrics” as its cover story…).
PLOS Topic Pages
Fortunately, the good folks at PLOS Computational Biology have worked out a great solution to this conundrum: the “Topic Page”. In short, authors contribute Wikipedia-style articles on topics not or only poorly covered in Wikipedia. These get peer-reviewed and published in the journal with attribution, a DOI, and all the bells and whistles that come with journal articles. But in addition, the page gets incorporated into Wikipedia, where it starts a new life.
This format solves the problem discussed above. Authors get credit for their work in a way that fits well to existing structures. There is a permanent record of the contribution, indexed in scholarly databases such as PubMed, Google Scholar, etc. The contribution benefits from additional feedback from the peer-review and editorial process. And, perhaps most importantly, the authors can relinquish control over their work—for better or worse—with the reassurance that an unadulterated version of their work will remain available no matter what.
We’ve been pleasantly surprised by the excellent reception of the ABC article. It has been viewed over 26,000 times on the PLOS site alone. It’s also consulted a few thousand times every month on Wikipedia.
Remarkably, since the article was publicly accessible while we were drafting it on the PLOS Topic Page wiki, it had already accumulated over 10,000 views even before publication (see counter at bottom of this page). It was also picked up by a prominent statistician’s blog.
But just as importantly, the editorial process itself was great. Editing the manuscript on the PLOS Topic Page wiki provided a natural environment for collaborative writing. The wiki-based, open peer-review process yielded constructive and timely reviews (we could start addressing referee reports as they rolled in!). Our editor Daniel Mietchen was helpfully hands-on and did a substantial number of edits directly on the manuscript itself.
The only caveat I can think of is that the neutral, factual, impersonal, timeless style of Wikipedia articles is quite different from the type of review articles I am otherwise used to. This is definitely not the right outlet for opinion-type pieces!
On the other hand, this format is great for student work. In fact, both of our Topic Pages started as student assignments in my course Reviews in Computational Biology. That being said, although the course gave the initial impetus, in both cases extensive additional work (and the involvement of additional co-authors) was required to get them published.
What happened since?
As it’s been two years since we published the ABC review, we can start to discern some outcomes.
The Wikipedia version underwent 46 changes, all minor modifications or additions (typo corrections, additional links and “Wikifications”, additional entries in the list of relevant software packages, attempts to sneak in one’s own contributions, … the usual stuff).
It is also gratifying to see our work appearing as the first hit in Google. Since its publication in early 2013, it has already been cited over 50 times.
Way of the future
In conclusion, the Topic Page is a great format. I am surprised that only eight Topic Pages have been published thus far, but perhaps there is still a lack of awareness about the format. I hope that this blog post will inspire some readers to improve Wikipedia by contributing a Topic Page. We are certainly thinking of our Topic Page number three…
One fundamental step in sequence analysis is the identification of homologous
sequences, sequences related through common ancestry. There are many different
ways of identifying homologs but they broadly fall into two categories:
all-against-all comparisons and clustering.
The all-against-all approach aligns every sequence with every other one. This
is straightforward to implement, relatively sensitive, and robust to variations
in sequence lengths. The main downside of all-against-all comparisons is the
quadratic computational cost with respect to the number of sequences.
In contrast, clustering works by using one representative sequence or profile
per homologous family of genes (clusters), thus limiting the number of
required comparisons to one per cluster. Assuming a fixed (or nearly fixed)
number of clusters, the computational cost is (nearly) linear in the number of
input sequences. Clustering methods however tend to miss more homologous
relationships than the all-against-all.
Can the sensitivity of the all-against-all be achieved at the speed of clustering?
The OMA database—developed in our lab—currently
relies upon an all-against-all. With 8,798,758 protein sequences from 1706
genomes in the latest release, this represents 38.7 trillion alignments. We
could probably cope with a few thousand more genomes, but will struggle to
get to the next order of magnitude with the current pipeline.
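As a quick back-of-the-envelope check of these numbers, the following Python sketch counts unordered sequence pairs, n(n-1)/2, and contrasts that with a clustering scheme that compares each sequence to one representative per cluster. The cluster count used below is a made-up placeholder, not a figure from OMA.

```python
def all_against_all_pairs(n_sequences):
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2


def clustering_comparisons(n_sequences, n_clusters):
    """Comparisons needed if every sequence is compared to one
    representative per cluster (assumes a fixed number of clusters)."""
    return n_sequences * n_clusters


n = 8_798_758  # protein sequences in the release mentioned above
pairs = all_against_all_pairs(n)
print(f"all-against-all: about {pairs / 1e12:.1f} trillion alignments")

# Doubling the input roughly quadruples the all-against-all cost...
ratio_all = all_against_all_pairs(2 * n) / pairs
# ...but only doubles the clustering cost (hypothetical 100,000 clusters).
hypothetical_clusters = 100_000
ratio_clustering = (
    clustering_comparisons(2 * n, hypothetical_clusters)
    / clustering_comparisons(n, hypothetical_clusters)
)
print(f"cost ratio after doubling: {ratio_all:.2f}x vs {ratio_clustering:.2f}x")
```

The quadratic term is what makes each additional genome more expensive than the previous one under the all-against-all approach.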
Furthermore, it is difficult to accept that as we increasingly sample the protein
sequence universe, even though we know more and more about its diversity, the
marginal computational cost of adding sequences goes higher, not lower.
In this project, we thus set out to try to achieve the sensitivity of the
all-against-all at the speed of clustering.
Transitivity of homology
In principle, homology is a transitive relationship: if gene A is homologous
to gene B, and gene B is homologous to gene C, this implies that gene A is
homologous to gene C. Transitive relationships are typically a good fit for
In practice, however, things are more complicated. Homology can be difficult
to ascertain for very divergent sequences. Furthermore, homology is not always
transitive due to insertions, deletions, fusion, fissions, and other events
that may cause inconsistencies in terms of matching residues across multiple
homologs. This figure illustrates these problems and outlines the ideas
we implemented to address them:
Putting together the ideas outlined in the figure above, we were pleasantly
surprised to see that clustering can indeed be both sensitive and fast. We
obtained 4-5x speed-ups across various datasets while recovering ~99.9% of all
homologous relationships identified through all-against-all.
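To illustrate why homology is not always transitive at the level of residues, here is a small Python sketch. It is not the algorithm from the paper: it simply assumes that each sequence aligns to some range of a cluster representative, and that two members are only worth comparing to each other if their ranges on the representative overlap. The names and the `min_overlap` threshold are hypothetical.

```python
def overlap(range_a, range_b):
    """Length of the overlap between two half-open residue ranges (start, end)."""
    start = max(range_a[0], range_b[0])
    end = min(range_a[1], range_b[1])
    return max(0, end - start)


def candidates_via_representative(hits, min_overlap=30):
    """`hits` maps a sequence name to the (start, end) range of the
    representative that it aligns to. Return the pairs whose matches cover
    a shared part of the representative, i.e. plausible homologs of each other."""
    names = sorted(hits)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if overlap(hits[a], hits[b]) >= min_overlap:
                pairs.append((a, b))
    return pairs


# A and C both match representative B, but on different regions
# (e.g. two different domains of a fused protein); D overlaps with A.
hits_on_B = {
    "A": (0, 120),
    "C": (200, 330),
    "D": (40, 150),
}
pairs = candidates_via_representative(hits_on_B)
print(pairs)
```

Here A and C are each homologous to part of B without necessarily being homologous to each other, which is exactly the situation that breaks naive transitivity.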
The results of our proof-of-concept implementation are thus very
encouraging. We have plans to follow up with a long list of refinement ideas,
many of which we discuss in the manuscript. One essential
refinement will be to parallelise the new approach. This is not as
straightforward as with all-against-all comparisons, but we think it can be done.
Meanwhile, the serial variant is available as part of the OMA
Wittwer, L., Piližota, I., Altenhoff, A., & Dessimoz, C. (2014). Speeding up all-against-all protein comparisons while maintaining sensitivity by considering subsequence-level homology PeerJ, 2:e607 DOI: 10.7717/peerj.607
The dilemma about computationally inferred function annotation
The Gene Ontology initiative is the standard for protein function annotation. For 2011 alone, Google Scholar finds almost 10,000 scientific articles with the keyword “Gene Ontology”.
The trouble is that we know little about the quality of these annotations, especially the >98% inferred computationally. The community perceives them as unreliable—at best suited for relatively coarse exploratory analyses, such as term enrichment analyses (and even those are not without risks).
At the same time, virtually everything we know about the function of genes in non-model organisms is based on computational function inference.
Our approach: verify old, computationally-inferred annotations using new experimentally established annotations
Nives Škunca, first author of our study, came up with the fundamental idea: to use experimentally-backed annotations, considered the gold standard, to verify computational (“electronic”) annotations. And to avoid circularity, we made sure to only use experimental annotations added to the GO database (UniProt-GOA, to be precise) after the computational annotations under evaluation.
Based on this idea, we defined the average reliability of a GO term as the proportion of electronic annotations in an older database release confirmed by new experimental annotations in a subsequent release (see figure below). Hence, every time a new experimental annotation confirms an electronic prediction, the reliability of the corresponding term increases. Conversely, every time a new experimental annotation contradicts an electronic annotation or, more subtly, every time an electronic annotation is subsequently removed from the database, the reliability of that term decreases. Our reliability measure attempts to capture the machine learning notion of precision.
To capture the machine learning notion of recall, we defined the coverage measure, the fraction of new experimental annotations computationally predicted (see figure above). For instance, a high coverage means that most new experimental annotations had previously been predicted as electronic annotations.
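The two measures can be sketched in a few lines of Python for a single GO term. This is a simplified illustration rather than the code used in the study: it assumes annotations are plain sets of gene identifiers, ignores the structure of the ontology, and assumes that the reliability denominator contains only electronic annotations with an outcome (confirmed, contradicted, or removed), so that unchanged annotations count neither for nor against a term. All names and data are made up.

```python
def reliability(old_electronic, new_electronic, new_experimental, new_negative):
    """Share of old electronic annotations that were confirmed, among those
    that were confirmed, contradicted (NOT qualifier), or removed."""
    confirmed = old_electronic & new_experimental
    contradicted = old_electronic & new_negative
    removed = old_electronic - new_electronic - confirmed - contradicted
    decided = len(confirmed) + len(contradicted) + len(removed)
    return len(confirmed) / decided if decided else None


def coverage(old_electronic, new_experimental):
    """Share of new experimental annotations that had been predicted
    electronically in the older release."""
    if not new_experimental:
        return None
    return len(new_experimental & old_electronic) / len(new_experimental)


# Genes annotated with one GO term (made-up identifiers).
old_electronic = {"g1", "g2", "g3", "g4", "g5"}   # release n
new_electronic = {"g1", "g2", "g3", "g4"}         # release n+1: g5 was removed
new_experimental = {"g1", "g2", "g6"}             # added after release n
new_negative = {"g3"}                             # experimental NOT annotation

rel = reliability(old_electronic, new_electronic, new_experimental, new_negative)
cov = coverage(old_electronic, new_experimental)
print(f"reliability = {rel:.2f}, coverage = {cov:.2f}")
```

In this toy case two predictions are confirmed, one is contradicted, one is removed, and one (g4) is left unchanged and therefore does not enter the reliability ratio.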
Reliability measure: not as straightforward as it might seem
At first sight, these definitions might seem quite mundane. But let’s have a closer look at the reliability measure, which proved to be much trickier (even contentious, see next section) to devise than we had anticipated.
The open world assumption makes it difficult to falsify predictions. Consider a computational prediction assigning function x to a certain gene. If an experiment later demonstrates that this gene has function y, this does not imply that the original prediction was wrong. What we need to falsify the prediction is an experiment that demonstrates that the gene does not have function x.
Such “negative results” can be captured in GO annotations using the NOT qualifier. But a search on EBI QuickGO reveals that <1% of current experimental annotations are negative annotations. In part, this state of affairs is a consequence of the general bias against negative results in the literature. Also, it is harder to make definitive statements about absence of function than about presence of function, as absence must be ascertained under all relevant conditions.
If you recall our definitions above, we include in the reliability measure a penalty for electronic annotations that are subsequently removed from the database (an annotation present in release n has disappeared in release n+1).
However, electronic annotations can disappear for reasons other than being wrong. As Emily Dimmer and colleagues from UniProt-GOA pointed out to us, removals can reflect tightening standards (e.g. by setting more conservative inference thresholds), responses to changes in the GO structure, or temporary omissions due to technical problems (e.g. integration failure from external resources).
Nevertheless, we reasoned that from the standpoint of a user, removed annotations do not inspire confidence and whatever the reason may be, removed annotations can hardly be considered “reliable”.
This discussion also highlights the importance of finding an appropriate name. Because a removal does not necessarily imply an error, calling our measure “correctness” or “accuracy” would have been too strong. Conversely, calling our measure “stability” would also not have been appropriate, as it goes beyond mere stability: electronic annotations that are left unchanged do not increase the reliability ratio of a term; only experimental confirmation does.
What we found
One main finding of our study is that electronic annotations have significantly improved in recent years. A way of seeing this is to look at the following interactive motion plot (click on image to load the flash app):
Better yet, we also observed that the reliability of electronic annotations is even higher than that of annotations inferred by curators (i.e. when they use evidence other than experiments from primary literature):
Comparison of reliability and coverage for electronic annotation on the left and curated annotations on the right (figure 8 of the paper)
Looking forward, we view this work as an essential step toward our long-term aim of improving computational function inference. Indeed, one thing that often seems to hold in computational biology is that there is no point coming up with a faster or more clever algorithm as long as one has not identified a dependable objective function (or assessment strategy), such as the quality measures introduced here. As the late management guru Peter Drucker said, “there is nothing so useless as doing efficiently that which should not be done at all”.
Electronic annotations are not as unreliable as often assumed.