Pyham is a Python library for handling orthoXML files containing Hierarchical Orthologous Groups (HOGs). It facilitates the extraction
of evolutionary information contained in HOGs, either for specific gene families or in aggregate. Depending on the functions, the
This post is a brief primer to pyham, with an emphasis on what it can do for you.
How to get pyham?
Pyham is available as a Python package on PyPI and
is compatible with Python 2 and Python 3. You can easily install it via
pip using the following bash command:
pip install pyham
You can check the official pyham website for
further information about how to use pyham, documentation
and other related resources.
What are Hierarchical Orthologous Groups (HOGs)?
If you don’t know what HOGs are and are eager to change this, we have an explanatory video about them just for you:
As input, pyham takes an orthoXML file containing HOGs and the related species tree.
Pyham creates gene and genome objects based on the information extracted from the input files and provides
an API to work directly on those phylogenetic objects (easy queries based on name or
phylogenetic relations). The input species tree serves as a guide to define evolutionary relationships between genes and genomes.
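To make the input format more concrete, here is a minimal sketch, using only the Python standard library, of how HOGs nest inside an orthoXML file. The tiny embedded document and the helper name count_genes_per_top_level_hog are made up for illustration. Pyham does this parsing for you and exposes much richer objects, so treat this as a look under the hood rather than as the way pyham works internally.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written orthoXML document (hypothetical data, for illustration only).
ORTHOXML = """<orthoXML xmlns="http://orthoXML.org/2011/" version="0.3" origin="toy" originVersion="1">
  <species name="Human" NCBITaxId="9606">
    <database name="toy" version="1">
      <genes>
        <gene id="1" protId="HUMAN1"/>
        <gene id="2" protId="HUMAN2"/>
      </genes>
    </database>
  </species>
  <species name="Mouse" NCBITaxId="10090">
    <database name="toy" version="1">
      <genes>
        <gene id="3" protId="MOUSE1"/>
      </genes>
    </database>
  </species>
  <groups>
    <orthologGroup id="1">
      <property name="TaxRange" value="Mammals"/>
      <geneRef id="3"/>
      <paralogGroup>
        <geneRef id="1"/>
        <geneRef id="2"/>
      </paralogGroup>
    </orthologGroup>
  </groups>
</orthoXML>
"""

# orthoXML elements live in this XML namespace.
NS = {"ox": "http://orthoXML.org/2011/"}


def count_genes_per_top_level_hog(xml_text):
    """Return {HOG id: number of extant genes} for each top-level group."""
    root = ET.fromstring(xml_text)
    counts = {}
    for hog in root.findall("ox:groups/ox:orthologGroup", NS):
        # geneRef elements can be nested inside paralogGroup/orthologGroup nodes.
        gene_refs = hog.findall(".//ox:geneRef", NS)
        counts[hog.get("id")] = len(gene_refs)
    return counts


hog_sizes = count_genes_per_top_level_hog(ORTHOXML)
print(hog_sizes)
```

In this toy file the single mammalian HOG contains one mouse gene and two human genes; the paralogGroup element records that the two human copies arose by duplication.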
How can I figure out the evolutionary history of genes in a given genome?
Pyham provides a mapper object for HOGs/genes across multiple taxonomic ranges. Remember that each HOG at a given taxonomic level corresponds to one gene in that particular ancestral genome. The idea of pyham is to map the HOGs of an ancestral
genome to the HOGs/genes of its descendant genomes. The vertical mapper object allows for retrieval of all genes and their evolutionary history between the two taxonomic levels (i.e. which genes have been duplicated, which genes have been lost, etc.).
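compare_human_mammals = pyham_analysis.compare_genomes_vertically("Human","Mammals")
# Mammals HOGs with their single copy human descendant genes
retained = compare_human_mammals.get_retained()
# Mammals HOGs that have been lost in between the two levels
lost = compare_human_mammals.get_lost()
# Human genes that have been "gained" in between the two levels
gained = compare_human_mammals.get_gained()
# Mammals HOGs with their multiple copy human descendant genes
duplicated = compare_human_mammals.get_duplicated()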
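To illustrate the bookkeeping behind such a vertical comparison, here is a small, self-contained sketch that does not use pyham at all. It assumes we already know, for each ancestral HOG, which genes of the descendant genome derive from it. The function name classify_vertically and the toy identifiers are hypothetical, and counting descendants is a simplification: real HOGs record duplication events explicitly, so a duplication followed by a loss would not look like a single-copy gene.

```python
def classify_vertically(ancestral_to_descendants, all_descendant_genes):
    """Sort ancestral HOGs into retained / duplicated / lost, and find gained genes.

    ancestral_to_descendants: dict mapping an ancestral HOG id to the list of
        genes in the descendant genome that derive from it.
    all_descendant_genes: every gene present in the descendant genome.
    """
    retained, duplicated, lost = {}, {}, []
    genes_with_ancestor = set()
    for hog_id, genes in ancestral_to_descendants.items():
        genes_with_ancestor.update(genes)
        if len(genes) == 0:
            lost.append(hog_id)               # no descendant left: lost
        elif len(genes) == 1:
            retained[hog_id] = genes[0]       # single copy: retained
        else:
            duplicated[hog_id] = list(genes)  # several copies: duplicated
    # Genes without any ancestral HOG appeared on the branch: gained.
    gained = [g for g in all_descendant_genes if g not in genes_with_ancestor]
    return {"retained": retained, "duplicated": duplicated,
            "lost": lost, "gained": gained}


# Hypothetical toy data: three mammalian HOGs and four human genes.
mammal_hogs = {"HOG_A": ["HUMAN1"], "HOG_B": ["HUMAN2", "HUMAN3"], "HOG_C": []}
human_genes = ["HUMAN1", "HUMAN2", "HUMAN3", "HUMAN4"]

events = classify_vertically(mammal_hogs, human_genes)
for category, members in events.items():
    print(category, members)
```

The four categories correspond to the four queries of the vertical mapper shown above.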
What are the genes in an extant genome that have been ancestrally duplicated?
We can use logic operations in the previously described mapper object. In this case we can compare the genome
of interest with its ancestral parent and retrieve the duplicated genes that will be specific to this branch. For example, we can find the genes in human which were duplicated sometime between the speciation of tetrapods and the speciation of mammals.
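compare_mamm_tetra = pyham_analysis.compare_genomes_vertically("Mammals","Tetrapods")
# HOGs that duplicated between the Tetrapods and Mammals levels
mammals_specific_dupl_hogs = compare_mamm_tetra.get_duplicated()
human_genes_duplicated_before_mammals_speciation = []
for hog in mammals_specific_dupl_hogs:
    for gene in hog.get_descendant_genes():
        # keep only the extant genes that belong to human
        if gene.genome.name == "Human":
            human_genes_duplicated_before_mammals_speciation.append(gene)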
What is the number of genes in a particular ancestral genome?
Ancestral genome objects act as a proxy to fetch all HOGs at a specific taxon.
# return an ancestral genome object
mammals_genome = pyham_analysis.get_ancestral_genome_by_name("Mammals")
# count the HOGs (i.e. ancestral genes) in this ancestral genome
number_ancestral_genes_mammals = len(mammals_genome.genes)
How can I visualise the evolutionary history of a gene family (HOG)?
Pyham embeds HogVis, an interactive tool to visualise gene family
evolutionary history. It provides a way to trace the evolution of genes in terms of duplications and losses, from ancient ancestors to modern day species.
# Select an HOG
hog_of_interest = pyham_analysis.get_hog_by_id(2)
# create and export the hog vis as .html
output_filename = "hogvis_example.html"
As you can see in the figure below, HogVis is composed
of two panels: a species tree that allows you to select the
taxonomic range of interest, and a genes panel where each
grey square represents an extant gene and each row a
We can see for example in the figure above that at the level
of mammals all genes of this gene family are descended
from a single common ancestral gene.
If we look at the level of Euarchontoglires, we
observe that the genes are now split by a vertical line.
This vertical line separates two groups of genes that
each descend from the same single ancestral gene.
This is the result of a duplication in between Mammals
With a quick look we can easily identify when a duplication occurred, which species have lost genes
or how big gene families evolved.
How can I visually represent the different evolutionary events that occurred in my genomic setup?
Pyham includes treeprofile, a tool to visualise an annotated species tree with evolutionary events
(gene duplications, losses, gains) mapped to their related taxonomic range. The aim is to provide a
minimalist and intuitive way to visualise the number of evolutionary events that occurred on each
# create and export the treeprofile as .png (.svg, .pdf also available)
As you can see in the figure above, the treeprofile is composed of the reference species used
to perform the pyham analysis. Each internal node is displayed with its related histogram of
phylogenetic events (number of genes duplicated, lost, gained, or not changed) that occurred on each branch.
Can I have a one-page summary of this blog post for reference?
Of course you can, we have prepared a PDF version of this blog post that you can download here!
Authors: Fritz Sedlazeck, Dan Jeffares & Christophe Dessimoz
Our latest study just came out (Jeffares et al., Nature Comm 2017). In it, we carefully catalogued high-confidence structural variants among all known strains of the fission yeast population, and assessed their impact on spore viability, winemaking and other traits. This post gives a summary and the story behind the paper.
Next generation sequencing is enabling the study of genomic diversity on unprecedented levels. While most of this research has focused on single base pair differences (single nucleotide polymorphisms, SNPs), larger genomic differences (called structural variations, SVs) can also have an impact on the evolution of an organism, on traits and on diseases. SVs are usually loosely defined as events that are at least 50 base pairs long. They are often classified into five subtypes: deletions, duplications, new sequence insertions, inversions and translocations.
In recent years, the impact of SVs has been characterized in many organisms. For example, SVs play a role in cancer, where duplications often lead to multiple copies of important oncogenes. Furthermore, SVs are known to play a role in other human disorders such as autism and obesity.
… but calling structural variants remains challenging
In principle, identifying SVs seems trivial: just map paired-end reads to a reference genome, look for any abnormally spaced pairs or split reads (i.e. reads with parts mapping to different regions), and—boom—structural variants!
In practice, things are much harder. This is partly due to the frustrating tendency for SVs to occur in or near repetitive regions, where short read sequencing struggles to disambiguate the reads, or in highly variable regions of the genome such as the chromosome ends, which tend to be the tinkering workshop of the genome.
As a result, a large proportion of SVs—typically at least 30-40%—remain undetected. As for false discovery rates (proportion of wrongly inferred SVs), they are mostly not well known because validating SVs on real data is very laborious.
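To pin down the two quantities used here, the following minimal sketch computes sensitivity (the fraction of real SVs that are detected) and false discovery rate (the fraction of reported SVs that are wrong). The counts are invented purely for illustration and do not come from the study.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of real SVs that the caller detected."""
    return true_positives / (true_positives + false_negatives)


def false_discovery_rate(true_positives, false_positives):
    """Fraction of reported SVs that are not real."""
    return false_positives / (true_positives + false_positives)


# Hypothetical example: 100 simulated SVs; a caller reports 80 calls,
# of which 70 match a simulated SV and 10 do not.
tp, fp, fn = 70, 10, 30

sens = sensitivity(tp, fn)
fdr = false_discovery_rate(tp, fp)
print("sensitivity: %.3f, false discovery rate: %.3f" % (sens, fdr))
```

Note that the two measures have different denominators: sensitivity is relative to the true SVs, whereas the false discovery rate is relative to the calls made, which is why the latter cannot be computed without a validated truth set.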
Fission yeast: a compelling model to study structural variants
Schizosaccharomyces pombe is especially well suited to studying structural variants because:
The genome is small, well-annotated and simple (few repeats, haploid).
We had 40x or more coverage over 161 genomes covering the worldwide known population of S. pombe.
We had more than 220 accurate trait measurements for these strains at hand. Since the traits are measured under strictly controlled conditions, they contain little (if any) environmental variance—in stark contrast to human traits.
SURVIVOR makes the most out of (imperfect) SV callers
To infer accurate SV calls, we introduced SURVIVOR, a consensus method to reduce the false discovery rate while maintaining high sensitivity. Using simulated data, we observed that consensus calls obtained from two to three different SV callers could recover most SVs while keeping the false discovery rate in check. For example, SURVIVOR performed second best with a 70% sensitivity (best was Delly: 75%), while the false discovery rate was significantly reduced to 1% (Delly: 13%). But remember that these figures are based on simulation; performance on real data is likely worse. Furthermore, we equipped SURVIVOR with different methods to simulate data sets and evaluate callers, merge data from different samples, compute regions of poor mappability (BED file), etc. SURVIVOR is written in C++, so it’s fast enough to run on large genomes as well. Since then, we have been running it on multiple human data sets, which takes only a few minutes on a laptop. SURVIVOR is available on GitHub.
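The consensus idea can be sketched in a few lines of Python. This is a deliberately simplified illustration and not SURVIVOR's actual algorithm: it assumes that two calls describe the same event when they share chromosome and SV type and when both breakpoints lie within a distance threshold, and it groups calls greedily. The names SV, same_event and consensus_calls, the thresholds and the toy calls are all hypothetical.

```python
from collections import namedtuple

# Minimal representation of a structural variant call.
SV = namedtuple("SV", ["chrom", "start", "end", "svtype"])


def same_event(a, b, max_dist):
    """Two calls match if type and chromosome agree and breakpoints are close."""
    return (a.chrom == b.chrom and a.svtype == b.svtype
            and abs(a.start - b.start) <= max_dist
            and abs(a.end - b.end) <= max_dist)


def consensus_calls(callsets, min_support=2, max_dist=1000):
    """Keep SVs reported by at least min_support different callers.

    callsets: dict mapping a caller name to its list of SV calls.
    Returns one representative call (the first one seen) per supported event.
    """
    clusters = []  # each cluster: representative call + set of supporting callers
    for caller, calls in callsets.items():
        for sv in calls:
            for cluster in clusters:
                if same_event(cluster["rep"], sv, max_dist):
                    cluster["callers"].add(caller)
                    break
            else:
                # No existing cluster matched: start a new one.
                clusters.append({"rep": sv, "callers": {caller}})
    return [c["rep"] for c in clusters if len(c["callers"]) >= min_support]


# Hypothetical calls from three callers.
callsets = {
    "caller_a": [SV("I", 1000, 5000, "DEL"), SV("II", 20000, 30000, "DUP")],
    "caller_b": [SV("I", 1040, 4980, "DEL"), SV("III", 500, 900, "INV")],
    "caller_c": [SV("I", 990, 5010, "DEL"), SV("II", 20100, 30050, "DUP")],
}

consensus = consensus_calls(callsets, min_support=2, max_dist=200)
for sv in consensus:
    print(sv)
```

Here the deletion on chromosome I is supported by all three callers and the duplication on chromosome II by two, so both are kept, while the inversion reported by a single caller is dropped. Requiring agreement trades a little sensitivity for a much lower false discovery rate, which is the trade-off described above.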
SVs: now you see me, now you don’t
We applied SURVIVOR to our 161 genomic data sets, and then manually vetted all our calls to obtain a trustworthy set of SVs. We then discovered something suspicious. Some groups of strains that were very closely related (essentially clonal, differing by <150 SNPs) had different numbers of duplications, or different numbers of copies in duplications (1x, 2x, even 6x). This observation was also validated with lab experiments.
Interestingly, we identified 15 duplications that were shared between the more diverse non-clonal strains (so these must have been shared during evolution) but could not be explained by the tree inferred from SNPs (Figure 1). To confirm this, we compared the local phylogeny of SNPs in 20kb windows up and downstream of the duplications with the variance in copy numbers. Oddly, the copy number variance was not highly correlated with the SNP tree. This led to the conclusion that some SVs are transient and thus are gained or lost faster than SNPs.
Duplications happen within near-clonal populations. Phylogenetic tree of the strains reconstructed from SNP data, with eight pairs of very close strains that nonetheless show structural variation.
Though this transience came as a surprise, there is actually supporting evidence from laboratory experiments carried out by Tony Carr back in 1989 that duplications can occur frequently in laboratory-reared S. pombe, and can revert (Carr et al. 1989). The high turnover raises the possibility that SVs could be an important source for environmental adaptation.
SVs affect spore viability and are associated with several traits
We then investigated the phenotypic impact of these SVs. We used the 220 trait measurements from previous publications. We observed an inverse correlation between rearrangement distance and spore viability, confirming reports in other species that SVs can contribute to reproductive isolation. We also found a link between copy number variation and two traits relevant to winemaking (malic acid accumulation and glucose+fructose utilisation) (Benito et al. PLOS ONE 2016).
Structural variants, reproductive isolation, and wine. A) Making crosses between fission yeast strains often results in low offspring survival. The theory is that rearrangements (inversions and translocations) cause errors during meiosis, so we might expect them to affect offspring viability. If we compare offspring viability from crosses with the number of rearrangements that the parents differ by, there is a correlation, and a ‘forbidden triangle’ in the top right of the plot (it seems impossible to produce high viability spores when parents have many unshared rearrangements). B) SVs also affect traits. For > 200 traits (vertical bars) we used [LDAK](http://dougspeed.com/ldak/) to estimate the proportion of the narrow sense heritability that was caused by copy number variants (red), rearrangements (black) and SNPs (grey). Some traits are very strongly affected by copy number variants, such as the wine-making traits (wine-colored bars along the x-axis). C) Fission yeast wine tasting at UCL: how much of the taste is due to structural variants? (Jürg Bähler at right).
We used the estimation of narrow sense heritability from Doug Speed’s LDAK program. Narrow sense heritability estimates how much of a difference in a trait between individuals can be explained by adding up all the tiny effects of the genomic differences (in our case SNPs; deletions and duplications; inversions and translocations; and all combined). Overall, we found that heritability was better explained when combining the SNP data with the SV data. In 45 traits, SVs explained 25% or more of the trait variability. Five traits whose estimated heritability exceeded 90% using SNPs and SVs came from different growth conditions in liquid medium. This may highlight again the influence of environmental conditions on the genomic structure. For 74 traits (~30% of those we analyzed), SVs explain more of the trait than the SNPs. These high SV-affected traits include malic acid, acetic acid and glucose/fructose contents of wine, key components of taste.
A collaborative effort
On a personal note, the paper concludes a wonderful team effort over two and a half years.
The project started as a summer project for Clemency Jolly, who had then just completed her 3rd undergraduate year at UCL, in the Dessimoz and Bähler labs. Dan Jeffares and the rest of the Bähler lab had just published their 161 fission yeast genomes, with an in-depth analysis of the association between SNPs and quantitative traits (Jeffares et al., Nature Genetics 2015). Studying SVs was the logical next step, but given the challenging nature of reliable SV calling, we also recruited to the team Fritz Sedlazeck, collaborator and expert in tool development for NGS data analysis then based in Mike Schatz’s lab at Cold Spring Harbor Laboratory.
At the end of the summer, it was clear that we were onto something, but there was still a lot to be done. Clemency turned the work into her Master’s project, with Dan and Fritz redoubling their efforts until Clemency’s graduation in summer 2015. It took another year of intense work led by Dan and Fritz to verify the calls, perform the GWAS and heritability analyses, and publish the work. Since then, Clemency has started her PhD at the Crick Institute, Fritz has moved to Johns Hopkins University, and Dan has started his own lab at the University of York.
Jeffares, D., Jolly, C., Hoti, M., Speed, D., Shaw, L., Rallis, C., Balloux, F., Dessimoz, C., Bähler, J., & Sedlazeck, F. (2017). Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nature Communications, 8. DOI: 10.1038/ncomms14061
Carr AM, MacNeill SA, Hayles J, & Nurse P (1989). Molecular cloning and sequence analysis of mutant alleles of the fission yeast cdc2 protein kinase gene: implications for cdc2+ protein structure and function. Molecular & general genetics : MGG, 218 (1), 41-9 PMID: 2674650
Jeffares, D., Rallis, C., Rieux, A., Speed, D., Převorovský, M., Mourier, T., Marsellach, F., Iqbal, Z., Lau, W., Cheng, T., Pracana, R., Mülleder, M., Lawson, J., Chessel, A., Bala, S., Hellenthal, G., O’Fallon, B., Keane, T., Simpson, J., Bischof, L., Tomiczek, B., Bitton, D., Sideri, T., Codlin, S., Hellberg, J., van Trigt, L., Jeffery, L., Li, J., Atkinson, S., Thodberg, M., Febrer, M., McLay, K., Drou, N., Brown, W., Hayles, J., Salas, R., Ralser, M., Maniatis, N., Balding, D., Balloux, F., Durbin, R., & Bähler, J. (2015). The genomic and phenotypic diversity of Schizosaccharomyces pombe. Nature Genetics, 47 (3), 235-241. DOI: 10.1038/ng.3215
Benito, A., Jeffares, D., Palomero, F., Calderón, F., Bai, F., Bähler, J., & Benito, S. (2016). Selected Schizosaccharomyces pombe Strains Have Characteristics That Are Beneficial for Winemaking. PLOS ONE, 11 (3). DOI: 10.1371/journal.pone.0151102
You have a reference to a scientific paper of interest.
Perhaps this one:
Iantorno S et al, Who watches the watchmen? an appraisal of benchmarks for multiple sequence alignment, Multiple Sequence Alignment Methods (D Russell, Editor), Methods in Molecular Biology, 2014, Springer Humana, Vol. 1079. doi:10.1007/978-1-62703-646-7_4
Alternatively, if you know the Digital Object Identifier (DOI) of the article, prepending http://doi.org/ to it should directly lead you to the article. In this case, the
DOI is included at the end of the citation. (If you don’t know its DOI, you can find it on the publisher’s website or in a database such as PubMed.) We get:
Unfortunately, we see that this particular paper is behind a paywall at the publisher.
Green open access
There is, however, still a chance that it might be deposited in a preprint server or in an institutional repository. This is referred to as “green” open access. One way to find out is to look at the article record in Google Scholar and look for a link in the right margin:
If you know the DOI, an even quicker way of looking for a deposited version is
by using the http://oadoi.org redirection tool, which works analogously to
doi.org but redirects to a green open access version whenever possible:
Here, a free version of the article deposited in the UCL institutional repository is found.
If you use Chrome or Firefox, you can also use the Unpaywall browser extension to automatically get a link to green open access alternatives as you land on paywalled articles.
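Both DOI-based lookups described above amount to simple string concatenation. Here is a minimal sketch that builds the two URLs for the example citation. The function name resolver_urls is made up for illustration, and the sketch only constructs the links without contacting any server, so it says nothing about what the resolvers will actually return.

```python
def resolver_urls(doi):
    """Build the doi.org and oadoi.org links for a DOI.

    Accepts a bare DOI or one written with a leading "doi:" label,
    as found at the end of many citations.
    """
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[len("doi:"):].strip()
    return {
        "publisher": "http://doi.org/" + doi,
        "green_open_access": "http://oadoi.org/" + doi,
    }


# DOI taken from the end of the example citation.
urls = resolver_urls("doi:10.1007/978-1-62703-646-7_4")
print(urls["publisher"])
print(urls["green_open_access"])
```

DOIs containing unusual characters may need to be percent-encoded before being pasted into a URL; the example DOI does not.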
On the author’s homepage
Sometimes, the paper is available on the homepage of one of the authors. In
this case, the paper is listed on the homepage of the corresponding
author (item #36), together with a link
to the preprint.
And otherwise, if one of the authors is active on ResearchGate, it’s also possible to send a full-text request at the click of a button.
Pirated version off Sci-Hub
Sci-Hub serves bootleg copies of pay-walled articles. This is illegal, so I only mention it for educational purposes. This works most reliably using, again, DOIs:
If you, purely hypothetically of course, pasted that URL in your browser, you would or would not get a PDF of the entire book in which the referenced article appears.
It’s also possible to request full-text articles via Twitter. As described in Wikipedia, this works by tweeting the article title, its DOI, an email address (to indicate to whom the article should be sent), and the hashtag #icanhazpdf. Someone with access to the article might send a copy via email. Once the article is received, the tweet is deleted. Again, I mention this for educational purposes only; don’t break the law.
Natasha Glover, talented postdoc in the lab, has produced a brief video to
introduce HOGs and convey why we are mad about them!
Note: the “Life in the Lab” series features interviews of interns and visitors. This post is by our second 2016 OMA Visiting Fellow Tunca Doğan, who spent a month with us earlier this year. You can follow Tunca on Twitter at @tuncadogan. —Christophe
Please introduce yourself in a few sentences.
My name is Tunca Doğan. I received my PhD in 2013 with a thesis in the fields of bioinformatics and computational biology, in which we developed methods for clustering protein sequences using unsupervised machine learning techniques (Dogan and Karacali, 2013). I’ve since been working as a post-doctoral fellow at EMBL-EBI, UK, in the Protein Function Development team (UniProt Database) led by Dr Maria Martin. Here I’m developing new tools and methods for the automated functional annotation of protein records in the UniProtKB using a variety of features including domain architectures (Dogan et al., 2016). I’m also conducting research in the field of computational drug discovery. As of 2016, I’m also affiliated with the Department of Health Informatics, METU, Turkey, both as a senior research fellow and a faculty candidate.
Why did you choose to apply to the OMA visiting fellowship programme?
The team behind OMA is world-leading in the field of phylogenomics, and they have authored many highly cited publications in this area. Moreover, OMA is considered to be one of the most reliable and comprehensive resources offering phylogenomic information on various species. I applied to this programme in order to develop my knowledge of phylogenomic research, particularly about the OMA production. My specific research aim was to investigate if and how the information in OMA can be utilized in order to increase the coverage and the quality of the automated functional annotation of proteins in the UniProt database.
Discussions on UNIL campus with Leonardo de Oliveira Martins, Surag Nair, Clément Train, David Dylus and Tunca Doğan (from l. to r.)
What project did you work on during your visit?
The project I worked on had two sides: 1) investigating novel ways of quality checking of the data produced in the OMA pipeline (especially HOGs) using the Domain Architecture Alignment and Classification (DAAC) method I previously developed in UniProt; 2) investigating the use of OMA groups and HOGs to propagate the functional annotation between the (homologous) member proteins of the same clusters/classes.
Was there any highlight or low point you’d like to share?
It was a great experience for me both professionally and socially. I learnt a great deal in just one month, and we are continuing our collaboration on the above-mentioned project. Everyone I met in the group (Christophe, Adrian, David, Leonardo and Clément) was so knowledgeable, helpful and friendly that I had a great time during my stay. It was a great pleasure to meet and to work with them all…
The UNIL/EPFL campus is just beautiful, on the shores of Lake Geneva. The campus is also well-equipped for all possible needs. This was also my first time in Switzerland and I was enchanted by the beauty of this country… The only downside for a foreign visitor could be the high cost of living in Switzerland, which was manageable with a little prior investigation and planning.
Do you have any practical tip for future OMA visiting fellows?
I definitely recommend that any researcher (at PhD or post-doc level) with an interest in phylogenomics apply to this programme. You’ll learn a great deal and have a good time at the same time. Also (for the foreigners) do not forget to travel around this beautiful country in your spare time…
Editor’s note: If you are interested in the OMA visiting fellowship programme, consult this page.
Doğan T, & Karaçalı B (2013). Automatic identification of highly conserved family regions and relationships in genome wide datasets including remote protein sequences. PloS one, 8 (9) PMID: 24069417
Doğan T, MacDougall A, Saidi R, Poggioli D, Bateman A, O’Donovan C, & Martin MJ (2016). UniProt-DAAC: domain architecture alignment and classification, a new method for automatic functional annotation in UniProtKB. Bioinformatics (Oxford, England), 32 (15), 2264-71 PMID: 27153729
Note: We are rebooting our “Life in the Lab” series, which features interviews of interns and visitors. This post is by our inaugural OMA Visiting Fellow Rosa Fernández García, who spent a month with us earlier this year. You can follow Rosa on Twitter at @Rosamygale. —Christophe
Please introduce yourself and your research interests.
I received my bachelor’s degree in Biology (major in Zoology) at Complutense University in Madrid, Spain. I got my master’s and PhD at the same university with a thesis about phylogeny and phylogeography of cosmopolitan earthworms. After that, I moved to the lab of Prof. Gonzalo Giribet at Harvard University, where I was a postdoc for 3 years and a Research Associate for another year. In January 2017, I’ll move to Barcelona to work as a Research Fellow in the lab of Dr. Toni Gabaldón at the Center for Genomic Regulation.
My research addresses fundamental questions about evolution in invertebrates: in other words, I am fascinated by how, when and where biodiversity took its form, and why it is maintained. My main two animal groups of interest are terrestrial annelids (oligochaetes) and (pan)arthropods, particularly the earliest branching lineages and most scientifically neglected groups (chelicerates and myriapods).
How did biodiversity take its shape? Resolving the tree of life. Macroevolutionary patterns are generally what we see when we look at the large-scale history of life. They encompass the grandest trends and transformations in evolution, such as the origin of bilateral animals or the radiation of arthropods. In order to understand how lineages are related to each other, I study macroevolutionary patterns in several groups of invertebrates through phylogenetics and phylogenomics. I currently lead a fruitful line of research dealing with phylogenomics of myriapods and chelicerates, having optimized protocols to successfully sequence single individuals of the rarest and smallest arthropods. We are getting closer to resolving the Arthropod Tree of Life!
Artist’s rendition of the Arthropod tree of life
When and where? I tried to understand the mode and tempo of animal diversification patterns through the integration of phylogeography, biogeography and paleogeography.
Why? Comparative transcriptomics and genomics is a very powerful tool to shed light on very interesting evolutionary questions, such as arthropod terrestrialization - one of my favorite new lines of research.
Why did you choose to apply to the OMA visiting fellowship programme?
Orthology inference is one of the key steps in phylogenomics. I had been using OMA for a few years and I wanted to learn how I could use it more efficiently in my ongoing projects.
What project did you work on during your visit?
My project focused on optimizing OMA runs for some big and challenging data sets that I was having problems with. Also, I was interested in learning how I could exploit hierarchical orthogroups for comparative genomics studies in arthropods.
Rosa and David Dylus at coffee break (photo by Arthur Dessimoz)
Was there any highlight or low point you’d like to share?
It was a great experience to be in the Dessimoz lab for a month. As a systematist with a relatively limited bioinformatics background, it was absolutely great to exchange ideas with computer scientists interested in the same scientific problems but with a completely different perspective than mine. It was a very enriching experience.
Do you have any practical tip for future OMA visiting fellows?
One month was not enough for me, so try to stay longer if your project is ambitious. And ask Christophe to bring a Tête de Moine cheese on your last day, it’s delicious!
Editor’s note: If you are interested in the OMA visiting fellowship programme, consult this page.
For my second postdoc, I was the fortunate recipient of a PLANT FELLOWS scholarship. PLANT FELLOWS is an international program that provides research grants to postdocs in the field of plant science. The fellows are based at many different host institutions throughout Europe. I myself am working at Bayer Crop Science in Gent, Belgium, in collaboration with the Dessimoz lab in London and Lausanne. Part of the PLANT FELLOWS mission is to provide training, mentoring, and networking to the postdocs, all skills essential for career advancement.
Last year, the annual PF meeting was held in Männedorf, Switzerland from September 28 to October 1 2015. Training workshops took place at the Boldern Hotel, surrounded by meadows and with a nice view of Lake Zürich.
Group picture from the 3rd annual PLANT FELLOWS meeting
The meeting consisted of several days of training and workshops. For one of the days, I chose to participate in the workshop “Advanced Strategies for Dealing with the Publication Process.” I was especially keen on learning more about this particular subject. As a postdoc still trying to navigate the publication waters, I was looking for all the advice I could get. We’ve all heard the saying before: publish or perish. Publishing papers in your postdoc years is so important for an academic career.
There were about 15 postdocs in this day-long workshop. The facilitator, Philipp Mayer, came with a bunch of photocopied book chapters, articles, and USB keys full of pdfs for each of us to use on our laptops. The objective of the workshop was to, as a group, write a small paper about advanced publication strategies using the literature we were provided with. Our plan of attack was to pool our collective postdoc experience and come up with a list of our most useful recommendations on how to get a scientific paper published.
After feverishly reading websites, book chapters and papers, at the end of the day we came up with a draft: an introduction, our recommendations broken into 3 main sections, and a conclusion. We had a respectable number of references. But what would be the fate of our paper? About a third of the class was apathetic, a third thought we should aim for a blog post, and another third thought we should try for a “real” scientific journal. I had really enjoyed the workshop so I lobbied for publishing it in a real journal. I liked the experience of learning about a topic, working collaboratively with my peers, and then passing on the information for others to benefit.
I volunteered to take charge of the paper, edit it, and submit it to journals in hopes of getting it published. At the end of the day I left with a draft of the paper, many references, the contact information of all the attendees, and the full support of the facilitator (Philipp) for any future help that I might need. I looked at it as an opportunity to take a leadership role in publishing a paper, from start to finish. And more importantly, it was a chance to put our own advice into practice.
Upon returning to Belgium, I quickly found out that one of the sentences we had written in the paper rang true: It is a common misconception among early career researchers that the presentation of the work in a manuscript is the last stage of a project. There is a long and complicated process associated with submission, review, and revision that must be taken into account. During the next month, I reread the paper, finished writing short sections, added references, edited, and got feedback from the coauthors. We agreed on the author order, and shared the document using Authorea. Philipp and I went back and forth with several rounds of editing.
We decided to submit our manuscript to eLife, which is a prestigious peer reviewed open access journal with favorable policy toward early career researchers. I wrote a cover letter to the editor describing our paper and asking if the topic was suitable to be considered for eLife.
Within a few days, the editor read the manuscript but informed me that he was unable to send it out for review because it wasn’t “fresh” enough, meaning most of what we said had already been discussed many times in the scientific community. Despite the sting of having a paper rejected directly by the editor, I decided to take the advice we had written in the paper: Remove your personal feelings from the peer review process. Time to find the next journal.
During the following month and a half, the manuscript was pushed to the bottom of my To Do list, as other projects and tasks got my attention. Christmas holidays came and went, and admittedly this paper was the last thing on my mind.
In January, I sent a presubmission inquiry to PLOS Biology. The PLOS Biology editor wrote back within a few days to inform me that although they appreciated the attention to an important problem, they could not encourage us to submit because it didn’t present “novel strategies for increasing access to research, improving the quality of research results, or fixing flawed measures of impact.” Since this was the second time I had heard this same exact criticism, I realized it was time to take more advice from the paper: It is critical to highlight the novelty and importance in the article and cover letter. We were going to have to add something to the paper to make it more novel.
Shortly after, I contacted the Frontiers in Plant Science (FiPS) Editorial Office with a new and improved cover letter. FiPS is an open access online journal publishing many different peer reviewed articles: research, reviews, commentaries, and perspectives, among others. The editor and I discussed morphing the paper into something that would be more plant related, given the plant science background of all the coauthors. Over the next month, it was back to editing the paper. I proposed edits that would make our tips more plant-specific. We added advice about industry-academia collaborations, and more information about plant science journals. Philipp, the coauthors, and I went back and forth several times with rounds of edits, adding more references and polishing more details. I submitted the final version of the paper to Frontiers in Plant Science on March 15.
The experience of the collaborative peer review by FiPS was a pleasant and efficient one. Their website says “Frontiers reviews are standardized, rigorous, fair, constructive, efficient and transparent.” I enthusiastically agree. Within two weeks, we had received comments from the reviewers. There were some major points that needed to be addressed before Frontiers could offer publication. However, the points were all very relevant and only helped to make the paper stronger. During the process of the interactive review, I took more guidance from the paper: Go point by point through the reviewer comments and either make the suggested change or politely explain and clarify the misunderstanding.
April 21st: Acceptance achieved! Approximately 5 weeks after submitting the article, it was accepted and the provisional version of the manuscript was published online. This is an extremely fast turnaround time, in part due to the responsiveness of the editor, quick but in-depth peer review, and the interactive, transparent review discussion.
What I learned
This collaboration with the PLANT FELLOWS postdocs resulted in a paper I can say I’m proud of. I learned many things about the publication process—not only through a literature review, but by actually experiencing the process first hand. Here are some of the main things that stuck with me:
There is a certain creative power in bringing people together in a beautiful location to brainstorm and produce an outcome within a short period of time. However, it is necessary for someone to take the reins and commit to the follow-through in order to get to a finished product. I think things like hackathons or other collaborative group efforts could lead to fruitful outcomes.
I learned how to coordinate a small project. This was a great collaborative effort, which gave me an opportunity to practice the recommendations we wrote about in the paper.
I discovered firsthand the importance of the initial contact with the editor. As soon as we reworked the paper to approach the topic from a plant-specific standpoint, this added novelty to the paper. We were able to highlight this novelty in the cover letter.
Don’t give up. Many times I got distracted or discouraged and thought about simply publishing the manuscript on our blog, but I’m glad in the end we found a home for it at FiPS. Perseverance is key.
Glover, N., Antoniadi, I., George, G., Götzenberger, L., Gutzat, R., Koorem, K., Liancourt, P., Rutowicz, K., Saharan, K., You, W., & Mayer, P. (2016). A Pragmatic Approach to Getting Published: 35 Tips for Early Career Researchers Frontiers in Plant Science, 7 DOI: 10.3389/fpls.2016.00610
The paper introducing our new tree visualisation tool Phylo.io was just published in MBE.
Yet another tool to display trees, you might say, and indeed, so it is. But for all the tools that have been developed over the years, there are very few that scale to large trees, make it easy to compare trees side-by-side, and simply run in a browser on any computer or mobile device.
The project started as a student summer internship project, with the aim of producing a tree visualiser that facilitates comparison of trees built on the same set of leaves. After reading the project description, Oscar Robinson, a brilliant student from the Computer Science department at UCL, decided to work on this project during a three month internship. He saw a chance to apply his experience in the development of web tools and to develop his knowledge in the field of data visualisation, one of his major interests.
What is phylo.io and what can it do?
Phylo.io is a web tool that works in any modern browser. All computations are performed client-side and the only restriction on performance is the machine it is running on. Trees can be input in Newick and Extended Newick format. Phylo.io offers many features that other tree viewers have. Branches can be swapped, the rooting can be changed, and the thickness, font and other parameters are adaptable. Many of these operations can be performed directly by clicking on a branch or a node in the tree. Importantly, it features an automatic subtree collapsing function: this facilitates the visualisation of large trees and hence the analysis of splits that are deep in the tree.
Beyond basic tree visualisation and manipulation, Phylo.io features a compare mode. This mode lets you compare two trees computed using different tools or different models. Similarities and differences are highlighted using a colour scheme directly on the individual branches, making it clear where the differences between the two topologies actually are. Additionally, since different tools output trees with very different rootings and leaf orders, Phylo.io has a function to root one of the trees according to the other one and adapt the order of the leaves according to a fixed tree.
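To give a feel for what a per-branch comparison can look like, here is a minimal Python sketch. It is not taken from the Phylo.io source: the nested-tuple tree representation, the function names, and the choice of Jaccard similarity between leaf sets are all assumptions made for illustration. Each internal branch of the first tree is scored by how well its set of descendant leaves matches the best corresponding branch in the second tree; a score of 1.0 means the branch exists in both trees.

```python
def clades(tree):
    """Collect the leaf set below every internal node of a rooted tree.

    Trees are nested tuples with string leaves (an illustrative
    representation, not a Phylo.io data structure).
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def jaccard(a, b):
    """Jaccard similarity between two leaf sets."""
    return len(a & b) / len(a | b)


def best_match_scores(tree_a, tree_b):
    """Score each clade of tree_a by its best-matching clade in tree_b."""
    clades_b = clades(tree_b)
    return {
        clade: max(jaccard(clade, other) for other in clades_b)
        for clade in clades(tree_a)
    }


# Two hypothetical trees on the same leaves that disagree on one branch.
tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))

scores = best_match_scores(tree_a, tree_b)
for clade, score in sorted(scores.items(), key=lambda item: sorted(item[0])):
    print(sorted(clade), round(score, 2))
```

In a viewer, scores like these could then be mapped onto a colour gradient so that conflicting branches stand out.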
How do you use phylo.io?
To save you time, here is a one minute screencast highlighting some of the key features of Phylo.io:
Back in 2009, Adrian Altenhoff and I published a paper on ortholog benchmarking in PLOS Computational Biology. At the time, this was the first benchmark study with phylogeny-based tests. It also investigated an unprecedented number of methods. One of the most challenging aspects of this work (and by far the most tedious) was to compare inferences performed by different methods on only partly overlapping sets of genomes, often with inconsistent identifiers and releases, lending credence to the cynics’ view that “bioinformatics is ninety percent identifier mapping…”
Enter the Quest for Orthologs consortium
Around that time, Eric Sonnhammer and Albert Vilella organised the first Quest for Orthologs (QfO) meeting at the beautiful Genome Campus in Hinxton, UK—the first of a series of collaborative meetings. We have published detailed reports on these meetings (2009, 2011, 2013; stay tuned for the 2015 meeting report…).
Out of these interactions, the Quest for Orthologs consortium was born, with the mission to benchmark, improve and standardise orthology predictions through collaboration, the use of shared reference datasets, and evaluation of emerging new methods.
The orthology benchmark service and other contributions of the paper
The consortium is organised in working groups. One of them is the benchmarking working group, in which Adrian and I have been very involved. This new paper presents several key outcomes of the benchmarking working group.
First and foremost, we present a publicly-available, automated, web-based benchmark service. Accessible at http://orthology.benchmarkservice.org, the service lets method developers evaluate predictions performed on the 2011 QfO reference proteome set of 66 species. Within a few hours after submitting their predictions, they obtain detailed feedback on the performance of their method on various benchmarks compared with other methods. Optionally, they can make the results publicly available.
Conceptual overview of the benchmark service (Fig 1 of the paper)
Second, we discuss the performance of 14 orthology methods on a battery of 20 different tests on a common dataset across all of life.
Third, one of the benchmarks, the generalised species discordance test, is new and provides a way of testing pairwise orthology based on trusted species trees of arbitrary size and shape.
For developers of orthology prediction methods, this work sets minimum standards in orthology benchmarking. Methodological innovations should be reflected in competitive performance in at least a subset of the benchmarks (we recognise that different applications entail different trade-offs). Publication of new or updated methods in journals should ideally be accompanied by publication of the associated results in the orthology benchmark service.
For end-users of orthology predictions, the benchmark service provides the most comprehensive survey of methods to date. And because it can process new submissions automatically and continuously, it holds the promise of remaining current and relevant over time. The benchmark service thus enables users to gauge the quality of the orthology calls upon which they depend, and to identify the methods most appropriate to the problem at hand.
Altenhoff, A., Boeckmann, B., Capella-Gutierrez, S., Dalquen, D., DeLuca, T., Forslund, K., Huerta-Cepas, J., Linard, B., Pereira, C., Pryszcz, L., Schreiber, F., da Silva, A., Szklarczyk, D., Train, C., Bork, P., Lecompte, O., von Mering, C., Xenarios, I., Sjölander, K., Jensen, L., Martin, M., Muffato, M., Quest for Orthologs consortium, Gabaldón, T., Lewis, S., Thomas, P., Sonnhammer, E., & Dessimoz, C. (2016). Standardized benchmarking in the quest for orthologs Nature Methods DOI: 10.1038/nmeth.3830
Altenhoff, A., & Dessimoz, C. (2009). Phylogenetic and Functional Assessment of Orthologs Inference Projects and Methods PLoS Computational Biology, 5 (1) DOI: 10.1371/journal.pcbi.1000262
Gabaldón, T., Dessimoz, C., Huxley-Jones, J., Vilella, A., Sonnhammer, E., & Lewis, S. (2009). Joining forces in the quest for orthologs Genome Biology, 10 (9) DOI: 10.1186/gb-2009-10-9-403
Dessimoz, C., Gabaldon, T., Roos, D., Sonnhammer, E., Herrero, J., & Quest for Orthologs Consortium (2012). Toward community standards in the quest for orthologs Bioinformatics, 28 (6), 900-904 DOI: 10.1093/bioinformatics/bts050
Sonnhammer, E., Gabaldon, T., Sousa da Silva, A., Martin, M., Robinson-Rechavi, M., Boeckmann, B., Thomas, P., Dessimoz, C., & Quest for Orthologs Consortium. (2014). Big data and other challenges in the quest for orthologs Bioinformatics, 30 (21), 2993-2998 DOI: 10.1093/bioinformatics/btu492
A few months ago, we published a paper that spent four years in peer-review (story behind the paper). Because of this, I feel entitled to an opinion on the pre- vs post-publication review debate.
Background on preprints and their effect on peer-review
If you have been living under a rock, or if you are not on Twitter, you may not have noticed that preprints are becoming more widely accepted in biology—supported by initiatives such as Haldane’s Sieve and bioRxiv. This is particularly true in population genetics, evolutionary biology, bioinformatics, and genomics. Typically, a manuscript is made available as preprint just as it is submitted to a scientific journal, and therefore prior to peer-review. I am saying “made available” instead of “published” because although preprints can be read by anybody, the general view is that the canonical publication event lies with the journal, post peer-review. Because of this, many traditional journals tolerate this practice: peer-review technically remains “pre-publication” and the journals get to keep their gatekeeping function.
The key benefit of preprints is that they accelerate scientific communication. Indeed, peer-review can be long and frustrating for authors. Reviewers sometimes misjudge the importance of papers or request unreasonable amounts of additional work. The ability to bypass peer-review can thus be liberating for authors. If we instead recognised preprints as the canonical publication event, so goes the idea, peer-review would be relegated to a secondary role and journals would lose their gatekeeping function. This is the “post-publication” peer-review model.
For more background info, here are a few pointers:
Advantages of pre- and post-publication peer-review
What did our recent experience teach us? Spending four years in various stages of peer-review is a huge strain on the authors, reviewers, and editors. On the positive side, the final paper was more complete (some of the methods tested were published after our first submission!). Undoubtedly, it became a clearer and more solid paper. However, as I pointed out in my post on the paper, our main conclusions did not change. They could have been brought to everyone’s attention four years earlier.
So should pre-publication peer-review be abolished? In this particular case, it’s debatable. If we had known what awaited us, we would have released the manuscript as a preprint (e.g. on arXiv, bioRxiv, or PeerJ PrePrint), something we have done with subsequent pieces of work.
However in general, I still think that pre-publication peer-review has many merits. First, thankfully this experience was extreme; on average things are much faster: 2-4 months including one revision cycle is quite typical in my experience. With some journals, this can be even faster (Bioinformatics, PeerJ or MBE spring to mind). Second, pre-publication peer-review can identify flaws or interesting points overlooked by the authors, to the point that in some cases (large multi-author studies!), peer-reviewers almost certainly wind up contributing more to the paper than some of the co-authors. Furthermore, while reviewers do not always agree in their comments, when they do, the authors had better pay attention.
That being said, unorthodox or controversial results can be extremely difficult to publish under the pre-publication peer-review model, particularly if some of the reviewers have vested interests in the status quo.
Best practices in the age of pre- and post-publication peer-review
So what model am I arguing for? I think the emerging combination of preprints and journals can give us the best of both worlds. Preprints ensure that advances can be disseminated quickly, broadly, and without impediment. Journals add a layer of quality control and differentiation, and even glamour if they so choose. Importantly, this new paradigm shifts power from the publishers back to us, the researchers. And you all remember what comes with great power, right?
As peer-reviewers, it is our job to identify specific issues with the work, and bring them to the authors and the editor, but ultimately, we should remember that the work we review is not our work. If the authors choose to ignore points we consider important, it may be more constructive and rewarding to write a rebuttal paper anyway. Post-publication peer-review as it were!
As editors, we should pay attention to potential conflicts of interest, focus on a limited set of key points that need addressing, and remember that every additional round of revisions costs precious time and resources. The additional delay could result in wasteful duplication of the work by others, or missed opportunities to build upon the findings. Thus we have a moral obligation to balance pre- and post-publication peer-review. Too often, editors, out of laziness or cowardice, repeatedly forward all reviewer comments back and forth without taking a stance, with little consideration of the burden this imposes on the authors and the rest of the community.
As authors, one simple but powerful thing we can do is to more openly acknowledge the shortcomings of our work and candidly disclose unresolved issues. In case of fundamental disagreement with a peer-reviewer, the impasse may be overcome by including an account of the disagreement as part of the paper. In fact, this is precisely what we ended up doing in our paper. In the discussion section, we wrote:
And sixth, we disclose that in spite of the several lines of evidence and
numerous controls provided in this study, one anonymous referee remained
skeptical of our conclusions. His/her arguments were: (i) instead of using
default parameters or globally optimized ones, filtering parameters should
be adjusted for each data set; (ii) the observations that, in some cases,
phylogenies reconstructed using a least-squares distance method were more
accurate than phylogenies reconstructed using a maximum likelihood method
(Supplementary Figs. 7–10 available on Dryad at
http://dx.doi.org/10.5061/dryad.pc5j0), and that ClustalW performed
“surprisingly well” compared with other aligners, are indicative that the
data sets used for the species discordance test are flawed; (iii) the
parsimony criterion underlying the minimum duplication test and the Ensembl
analyses is questionable.
Indeed, not every issue can be resolved during peer-review. At some point, the debate should happen in the open. Any one single paper is rarely the “last word” on a question anyway. And as our editor admitted, a bit of controversy is good for the journal.
Tan G, Muffato M, Ledergerber C, Herrero J, Goldman N, Gil M, & Dessimoz C (2015). Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference. Systematic biology, 64 (5), 778-91 PMID: 26031838
We know homologs are genes related by common ancestry. But throw complex evolutionary events into the mix and things can get a little dicey. Under the umbrella of homologs exist many different categories: orthologs, paralogs, ohnologs, xenologs, co-orthologs, in-paralogs, out-paralogs, paleologs, among others. All of these -log terms have a specific meaning (see my previous blog post on orthology and paralogy), but now we will focus on one in particular: homoeologs.
But before we get into the definition, let’s start at the beginning. When I started as a postdoc at Bayer CropScience working with Henning Redestig in collaboration with Christophe Dessimoz at University College London, I was tasked with evaluating homoeolog predictions using the OMA algorithm.
What are homoeologs?
From my previous experience, I knew homoeologs as roughly “corresponding” genes between subgenomes of a polyploid organism. For example, the wheat genome is allohexaploid, with 3 diploid subgenomes named A, B, and D. Given a gene on chromosome 3B, you will most likely find a nearly identical copy on chromosomes 3A and 3D, in roughly the same position. These corresponding copies across subgenomes are known as homoeologs. But this definition left something to be desired: it didn’t tell me anything about the evolutionary relationship between the homoeologs. Worse, it was ambiguous in that it required discretionary similarity thresholds in terms of sequence and positional conservation. How could we test for performance if there was no unambiguous definition of the target?
Time to hit the books
Like many researchers starting a new project, I went to the scientific literature to get more information. After many hours spent on Google Scholar, I found myself with more questions than answers. Firstly, what were the evolutionary events that give rise to homoeologs? How do they fit in with the other -log terms? Can they be found only in a certain type of polyploid, but not another? How do things like gene duplication and movement affect our understanding of what a homoeolog is? And finally, after seeing the word written as homoeolog, homeolog, and homoeologue, how do you even spell it?
There are some excellent review papers out there on polyploidy which shed light on the biological consequences of homoeology. This, this, or this for example. However, when searching the whole of the literature, I found many inconsistent, vague, or even incorrect usages of the term homoeolog. Sometimes people defined homoeologs on the basis of their chromosome pairing patterns. Other times homoeologs were used to describe corresponding genes from different, although closely related, species. Many papers said homoeologs were necessarily syntenic. Others didn’t define the term at all.
Getting on the same page
These imprecise or incorrect definitions can lead to confusion. In recent years, advances in technology have afforded us the opportunity to sequence many new genomes, including polyploids. These new techniques have exploded the amount of data and brought about collaborations between geneticists, molecular biologists, plant breeders, bioinformaticians, phylogeneticists, and statisticians. Therefore we think it’s important to have a precise and evolutionarily meaningful definition of homoeology as a reference point.
What we learned
Thus we went back to the earliest usage of the term we could find and synthesized the literature to date. We define homoeologs as “pairs of genes or chromosomes in the same species that originated by speciation and were brought back together in the same genome by allopolyploidization”. For recent hybrids, as long as there was no rearrangement across subgenomes, homoeologs can be thought of as orthologs between these subgenomes. Here’s how they fit in with other common homologs:
We realized that homoeologs are not necessarily one-to-one or syntenic. Depending on the particular patterns of gene duplication and rearrangement in a given species, we may see homoeologs in a 1:many relationship or across non-corresponding chromosomes.
We also reviewed homoeolog inference techniques, starting from low-throughput lab techniques to evolution-based computational methods. Orthology prediction is a booming area of active research, so many orthology inference methods can be applied to homoeology prediction.
Last but not least, we learned that even though homoeolog has alternatively been spelled “homeolog” (no extra o), homoeolog is the clear winner in terms of popularity. The “homoeo-” spelling has been used more than twice as often in the literature. Fortunately, however, both are pronounced the same (“ho-mee-o-log”).
Check out the review paper in Trends in Plant Science (open access!). We hope this paper can serve as a jumping-off point for those interested in tackling homoeology, especially for those new to the field.
Glover, N., Redestig, H., & Dessimoz, C. (2016). Homoeologs: What Are They and How Do We Infer Them? Trends in Plant Science DOI: 10.1016/j.tplants.2016.02.005
This is how molecular systematics has worked since the sixties: you take some identifiable feature (e.g. a gene or a protein) common to a group of species and take some measurements of it (e.g. sequencing the DNA). By comparing the results of these measurements you can estimate the evolutionary tree that links the species. Shortly after people started doing this they realised there was a problem: when analyses are based on different genes they often estimate different (incongruent) evolutionary trees. As technology has become more capable, researchers have begun using more and more genes, so this problem of incongruent trees has moved to the foreground.
There have been lots of good ideas of what to do about this problem, and this paper is our contribution. We tried to tackle incongruence by designing a method that groups genes together based on how similar their estimated trees are, without any assumption as to how any incongruence came about.
If all the genes more or less agree on the evolutionary tree, then you get one large group; if some disagree, then they are placed in their own groups. The most interesting case is if several genes disagree in the same way, because then you have an effect to try to explain, and you may have discovered something.
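The idea of grouping genes by tree similarity can be illustrated with a toy Python sketch. To be clear, this is not the treeCl implementation or its API: the nested-tuple trees, the clade-based distance (a rooted analogue of the Robinson-Foulds distance), the single-linkage grouping, and the `max_distance` parameter are all simplifying assumptions chosen to keep the example short.

```python
from itertools import combinations


def clades(tree):
    """Leaf sets below each internal node, excluding the root.

    Trees are nested tuples with string leaves (illustrative only).
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    root = leaves(tree)
    found.discard(root)  # the root clade is shared by all trees
    return found


def tree_distance(tree_a, tree_b):
    """Number of clades present in one tree but not in the other."""
    return len(clades(tree_a) ^ clades(tree_b))


def group_genes(gene_trees, max_distance=0):
    """Single-linkage grouping of genes whose trees are close enough."""
    names = list(gene_trees)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if tree_distance(gene_trees[a], gene_trees[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical gene trees on four species A-D.
gene_trees = {
    "gene1": ((("A", "B"), "C"), "D"),
    "gene2": (("C", ("A", "B")), "D"),  # same topology, children reordered
    "gene3": ((("A", "C"), "B"), "D"),  # incongruent with gene1 and gene2
    "gene4": ((("A", "C"), "B"), "D"),  # incongruent in the same way
}

groups = group_genes(gene_trees)
print(groups)
```

Here gene3 and gene4 disagree with the majority in the same way, so they form their own group, which is exactly the kind of pattern that calls for an explanation.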
We did lots of simulation to test and refine our method, both in its ability to recognise different incongruent groups, and to estimate how many groups are present. Then, armed with a method that works well on simulation, we tested it on some real data, from yeasts, and from flies.
Our findings were that for the yeast data our method worked really well, and identified 3 distinct groups of genes. The majority of genes were a good fit to the widely accepted tree for the species we looked at. The other two groups showed some major differences, mostly involving two of the species. We had a close look at the data, and concluded that there were some wrong annotations in the data that had introduced sequences that didn’t belong there. This was not the biological result we were looking for, but nonetheless useful.
The fly data were trickier, as they come from a genus where we aren’t sure how many separate species there are. We produced trees that show better species-level resolution than the most recent molecular studies. We also showed high levels of incongruence in the order in which the species appear, which can often be the case when species have diverged rapidly, due to a process called incomplete lineage sorting.
So be it to identify artifacts or genuine incongruence among your loci, we think that process-agnostic topology partitioning should become a routine step in phylogenetic analyses. To facilitate this process, we’ve released our code in a new open source software called “treeCl”, available at https://git.io/treeCl.
Gori K, Suchan T, Alvarez N, Goldman N, & Dessimoz C (2016). Clustering genes of common evolutionary history. Molecular biology and evolution PMID: 26893301
However, I keep a joint appointment at UCL, where part
of the lab remains. I’ll be flying back regularly and keep some of my
activities. UCL is a very special place—one which would be too hard for me to leave
entirely. For all the cynicism we hear about universities-as-businesses, the
overriding priority at UCL clearly remains on outstanding scholarship. My
departments (Genetics, Evolution, Environment and
Computer Science) are both highly collegial and
supportive. Compared to the previous institutions I have worked for, the
organisational culture at UCL is very much bottom-up. The pervasive chaos is
perceived as a shortcoming by some, but it’s actually a huge competitive
advantage—one that leaves ample room for initiative and flexibility. One
colleague once told me that I could build a nuclear reactor in my lab and no
one would ask a question—provided I secure the funding for it of course…
So how are we going to manage working in two different sites? Well, the
situation is not new. We have had a distributed lab for several years and have
developed a system for remote collaboration. Currently, we have lab members
primarily based in London, Zurich, Ghent, and Cambridge. Our weekly lab
meeting and monthly journal club are done via videoconference (with
GoToMeeting). I try to have at least fortnightly 1:1
meetings with all remote members. During the day, the lab stays in touch via
instant messaging (using HipChat). We have shared code
(git) and data (sshfs) repositories. We tend to write collaborative papers
using Google Docs (with Paperpile as reference
manager). Importantly, we have a lab retreat every four months where we meet
in person, reflect on our work, and
supplement this with collaborative visits as needed. The system is not
perfect—please share your experience if you’ve found other good ways of
collaborating remotely—but overall it’s working quite well.
Does automatic alignment filtering lead to better trees?
One major use of multiple sequence alignments is for tree inference. Because aligners make mistakes, many practitioners like to mask the uncertain parts of the alignment. This is done by hand or using automated tools such as Gblocks, TrimAl, or Guidance.
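For readers who have never filtered an alignment, here is a deliberately simple Python sketch of the general idea: drop columns that look unreliable before building a tree. The gap-fraction criterion, the function name, and the 0.5 threshold are assumptions made for illustration; Gblocks, TrimAl, and Guidance each use their own, more elaborate criteria.

```python
def filter_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove alignment columns whose gap fraction exceeds a threshold.

    `alignment` maps sequence names to equal-length aligned strings,
    with "-" marking gaps. Returns the filtered alignment and the
    indices of the columns that were kept.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all aligned sequences must have the same length")

    kept = [
        i
        for i in range(length)
        if sum(row[i] == "-" for row in rows) / len(rows) <= max_gap_fraction
    ]
    filtered = {
        name: "".join(alignment[name][i] for i in kept) for name in names
    }
    return filtered, kept


# A tiny made-up protein alignment.
alignment = {
    "seq1": "MK-LV--A",
    "seq2": "MKTLV--A",
    "seq3": "MK-LIG-A",
}

filtered, kept_columns = filter_gappy_columns(alignment)
for name, sequence in filtered.items():
    print(name, sequence)
```

Note that every removed column shortens the alignment, so any signal those columns carried is lost along with the noise.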
The aim of our study was to compare different automated filtering methods and assess under which conditions filtering might be beneficial. We compared ten different approaches on several datasets covering hundreds of genomes across different domains of life (including nearly all of the Ensembl database) as well as simulated data. We used several criteria to assess the impact of filtering on tree inference (comparing the congruence of resulting trees with undisputed species trees and counting the number of gene duplications implied). We sliced the data in many different ways (sequence length, divergence, “gappyness”).
The more we filter alignments, the worse trees become.
In all datasets, tests, and conditions we tried, we could hardly find any situation in which filtering methods lead to better trees; in many instances, the trees got worse:
Overall, the more alignments get filtered (x-axis in figure), the worse the trees become! This holds across different datasets and filtering methods. Furthermore, under default parameters, most methods filter far too many columns.
The results were rather unexpected, and potentially controversial, so we went to great lengths to ensure that they were not spurious. This included many control analyses and replication of the results on different datasets, and using different criteria of tree quality. We also used simulated data, for which the correct tree is known with certainty.
What could explain this surprising result?
It appears that tree inference is more robust to alignment errors than we
might think. One hypothesis for this might be that while alignment
errors introduce mostly random (unbiased) noise, correct columns (or partly
correct ones) contain crucial phylogenetic signal that can help discriminate
between the true and alternative topologies.
But why could this be the case? We are not sure, but here is an idea: aligners
tend to have most difficulty with highly distant sequences, because there are
many evolutionary scenarios that could have resulted in the same sequences. At
the limit, if the distance is very large (e.g. sites have undergone multiple
substitutions on average), all alignments become equally likely, and it becomes
impossible to align the sequences. Also, the variance of the distance estimate
explodes. But relative to this enormous variance,
the bias introduced by alignment errors becomes negligible.
I stress that we don’t prove this in the paper and this is merely a conjecture
(some might call this posthoc rationalisation).
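To give a feel for how quickly the variance of a distance estimate grows, here
is a minimal sketch. It assumes the Jukes-Cantor (JC69) substitution model and
the standard delta-method approximation of the variance; neither the model
choice nor the function names come from the paper, they are only picked for
illustration.

```python
import math


def jc_expected_p(d):
    """Expected proportion of differing sites at JC69 distance d
    (d is in substitutions per site)."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc_distance(p):
    """JC69 distance estimate from an observed proportion p of differing sites."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc_variance(d, n_sites):
    """Approximate (delta-method) variance of the JC69 distance estimate
    for a true distance d and an alignment of n_sites columns."""
    p = jc_expected_p(d)
    return p * (1.0 - p) / (n_sites * (1.0 - 4.0 * p / 3.0) ** 2)


if __name__ == "__main__":
    # Illustrative alignment length of 300 columns (an arbitrary choice).
    for d in (0.1, 0.5, 1.0, 2.0, 3.0):
        se = math.sqrt(jc_variance(d, 300))
        print(f"d = {d:.1f}  approx. standard error = {se:.3f}")
```

Under these assumptions the variance grows roughly like exp(8d/3) for large
distances, so even a sizeable alignment-induced bias can end up small in
comparison. This only illustrates the intuition; it does not test the
conjecture.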
So is filtering an inherently bad idea?
Although alignment filtering does not improve tree accuracy, we can’t say that
it is inherently a bad idea. Moderate amounts of filtering did not seem to have much
impact—positive or negative—but can save some computation time.
Also, if we consider the accuracy of the alignments themselves, which we did in
simulations (such that we know the true alignment), filtering does decrease
the proportion of erroneous sites in the alignments (though, of course, these
alignments get shorter!). Thus for
applications more sensitive to alignment errors than tree inference, such as detection of
sites under positive selection, it is conceivable that filtering might, in
some circumstances, help. However, the literature on the topic is rather
ambivalent (see here, here, here, and here).
Why it took us so long: a brief chronology of the project
The project started in summer 2010 as a 3-week rotation project by Ge Tan, who was a talented MSc student at ETH Zurich at the time (he is now a PhD student at Imperial College London, in Boris Lenhard’s group). The project took a few months more than originally foreseen to complete, but early on the results were already apparent. In his report, Ge concluded:
“In summary, the filtering methods do not help much in most cases.”
After a few follow-up analyses to complete the study, we submitted a first manuscript to MBE in Autumn 2011. This first submission was rejected after peer-review due to insufficient controls (e.g. lack of DNA alignments, no control for sequence length, proportion of gaps, etc.). The editor stated:
“Because the work is premature to reach the conclusion, I cannot help rejecting the paper at this stage”.
Meanwhile, having just moved to EMBL-EBI near Cambridge, UK, I gave a seminar on the work. Puzzled by my conclusions, Matthieu Muffato and Javier Herrero from the Ensembl Compara team set out to replicate our results on the Ensembl Compara pipeline. They saw the same systematic worsening of their trees after alignment filtering.
We joined forces and combined our results in a revised manuscript, alongside additional controls requested by the reviewers from our original submission. The additional controls necessitated several additional months of computations but all confirmed our initial observations. We resubmitted the manuscript to MBE in late 2012 alongside a 10-page cover letter detailing the improvements.
Once again, the paper was rejected. Basically, the editor and one referee did not believe the conclusions, and no amount of controls was going to convince them otherwise. We appealed. The editor-in-chief rejected the appeal, but now the reason was rather different:
“[Members of the Board of Editors] were not convinced that the finding that the automated filtering of multiple alignment does not improve the phylogenetic inference on average for a single-gene data set was sufficiently high impact for MBE.”
We moved on and submitted our work to Systematic Biology. Things worked out better there, but it nevertheless took another two years and three resubmissions—addressing a total of 147 major and minor points (total length of rebuttal letters: 43 pages)—before the work got accepted. Two of the four peer-reviewers went so far as to reanalyse our data as part of their report—one conceding that our results were correct and the other one holding out until the bitter end.
Why no preprint?
Some of the problems with this slow publication process could have been
mitigated if we had submitted the paper as a preprint. In hindsight, it’s
obvious that we should have done so. Initially, however, I did not anticipate
that it would take so long. And with each resubmission, the paper was getting
stronger, so I thought the whole time that it was just about to be accepted…
I surely also fell for the Sunk Cost fallacy.
Other perils of long-term projects
I’ll finish with a few amusing anecdotes highlighting the perils of papers requiring many cycles of resubmissions:
More than once, we had to redo analyses with new filtering methods that got published after we started the project.
At some point, one referee asked why we were using such an outdated version of TCoffee (it went from version 5 to version 10 during the project!).
The editor-in-chief of MBE changed, and with that some of the editorial policies and the manuscript format (the paper had to be restructured with the method section at the end).
Tan, G., Muffato, M., Ledergerber, C., Herrero, J., Goldman, N., Gil, M., & Dessimoz, C. (2015). Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference. Systematic Biology, 64(5), 778-791. DOI: 10.1093/sysbio/syv033
Authors: Ed Chalstrey, Jan Koch, Clement Train & Lucas Wittwer
On 25-27 May 2015, the lab attended the 4th international ‘Quest for Orthologs’ conference, held at the Center for Genomic Regulation (CRG) in Barcelona, Spain. The following blog entry is a summary of the experiences at the conference of Ed Chalstrey, Jan Koch, Clement Train, and Lucas Wittwer, who are interns and master’s students in the Dessimoz lab.
Quest for Orthologs (QfO) is a meeting of groups working on orthology detection and phylogenomic databases, with an aim to improve and standardise orthology predictions. This meeting was part of a series of conferences beginning in 2009, which have successfully brought together a community of researchers with shared goals. These goals included collaboration on benchmarking and the sharing of reference datasets.
As short project students in the group, QfO gave an excellent opportunity for those of us based at UCL to meet some of our colleagues from ETH (in Zurich) and Bayer CropScience (in Ghent) in person for the first time and to make contact with other scientists working in the field of ortholog prediction.
As young scientists, some of the most important questions we face are: Will I be able to explain my project to established scientists and discuss it with them? Will I be able to understand the work of other scientists, even if their research topic falls outside my area of expertise? How can I have new ideas and be inspired to contribute to an area of research I’m new to? For us, most of whom had not attended a conference before, QfO was the perfect place to begin answering these questions.
The conference involved talks from each of the research groups and a poster session for students to display their contributions. Each of the postdocs and PhD students in the Dessimoz lab gave a short talk to introduce their posters, as well as one of us (Clement).
Clement: “The talk and the poster were the great practice for us to increase our communication skills by presenting to an audience composed of experts in related topics. This enabled us to adapt our talks depending of the kind of people we had in front of us and exchange ideas with other people during a constructive conversation. Also, attending talks on the many fields related to our work (orthology) was an amazing experience as interns, both in discovering new things and helping us in our own project with new ideas and other ways of thinking.”
QfO was a great opportunity for us to meet scientists that have worked in the field for many years and from all over the world. We were able to benefit from their experience and the advice they gave us after talking with them about our own research projects, gaining a different perspective to that of our usual supervisors and colleagues. One of the discussions had by Jan with two researchers from Switzerland may even lead to a potential future collaboration; they were interested in DLIGHT, a program that was developed by our group.
As well as discussing our current work, the conference also gave us the chance to think about future work opportunities and network with established scientists. One of the highlights for us was meeting Eugene Koonin and Sergei Mekhedov from the NCBI at the conference dinner. We had an amusing chat (about topics not necessarily related to orthology!) and an enjoyable evening. They even invited us to visit them at the NCBI!
All in all, we greatly benefited from our participation in the QfO conference.
We have published our latest review, on methods to infer horizontal gene transfer, in Wikipedia. It was peer-reviewed and it simultaneously appeared in PLOS Computational Biology. After our review on approximate Bayesian computations published two years ago, this is our second contribution using an exciting new format called “Topic Page”. In this post, I reflect on our motivation and experience as Topic Page authors.
The difficult relationship between academia and Wikipedia
Academia has mixed feelings about Wikipedia. Although many academics—and certainly many students—consult Wikipedia frequently, I’d venture to say that most remain reluctant to cite Wikipedia or admit relying on it otherwise.
As for academics contributing to Wikipedia, things are even worse. As a result, the quality of Wikipedia articles on scientific topics is often quite poor.
I think there are two main reasons for this reluctance to contribute. First, the lack of clear authorship and therefore credit makes it difficult for scientists to get recognition for contributing to Wikipedia. Given the intense competitiveness of contemporary science, this is more than a vanity issue; recognition is tightly coupled with funding and job success—i.e. survival in the profession. But perhaps just as importantly, many scientists are unfamiliar with Wikipedia’s conventions and practices and are thus (rightly!) concerned that their contributions might be “diluted” by further edits by others or even flat out turned down. I know of several disgruntled people who have given up on editing Wikipedia because of such bad experiences.
Simson Garfinkel provides an illuminating account of this sort of tension in this article:
“I have attempted to retire from directing films in the alternative universe
that is the Wikipedia a number of times, but somebody always overrules me,”
Lanier wrote. “Every time my Wikipedia entry is corrected, within a day I’m
turned into a film director again.”
Since Lanier’s attempted edits to his own Wikipedia entry were based on
firsthand knowledge of his own career, he was in direct violation of
Wikipedia’s three core policies. He has a point of view; he was writing on
the basis of his own original research; and what he wrote couldn’t be
verified by following a link to some kind of legitimate, authoritative, and
For the tertiary source Wikipedia aims to be, these core policies are entirely reasonable, but it’s easy to imagine situations in which they might frustrate some contributors.
Wikipedia’s tremendous impact
It is, however, worth considering the upsides of contributing to Wikipedia. For all the obsession many of us have with publishing articles in generalist journals with broad readership, the lack of interest in Wikipedia feels like a missed opportunity. Consider the Wikipedia page on
Phylogenetics. It was consulted
over 50,000 times in the last 3
This is over twice as much as the median number of views of papers published
in, say, the 17 Oct 2013 issue of
Nature in a quarter of the time (as it
happens, this particular issue has “impact metrics” as its cover story…).
PLOS Topic Pages
Fortunately, the good folks at PLOS Computational Biology have worked out a great solution to this conundrum: the “Topic Page”. In short, authors contribute Wikipedia-style articles on topics not or only poorly covered in Wikipedia. These get peer-reviewed and published in the journal with attribution, a DOI, and all the bells and whistles that come with journal articles. But in addition, the page gets incorporated into Wikipedia, where it starts a new life.
This format solves the problem discussed above. Authors get credit for their work in a way that fits well to existing structures. There is a permanent record of the contribution, indexed in scholarly databases such as PubMed, Google Scholar, etc. The contribution benefits from additional feedback from the peer-review and editorial process. And, perhaps most importantly, the authors can relinquish control over their work—for better or worse—with the reassurance that an unadulterated version of their work will remain available no matter what.
We’ve been pleasantly surprised by the excellent reception of the ABC article. It has been viewed over 26,000 times on the PLOS site alone. It’s also consulted a few thousand times every month on Wikipedia.
Remarkably, since the article was publicly accessible while we were drafting it on the PLOS Topic Page wiki, it had already accumulated over 10,000 views even before publication (see counter at bottom of this page). It was also picked up by a prominent statistician’s blog.
But just as importantly, the editorial process itself was great. Editing the manuscript on the PLOS Topic Page wiki provided a natural environment for collaborative writing. The wiki-based, open peer-review process yielded constructive and timely reviews (we could start addressing referee reports as they rolled in!). Our editor Daniel Mietchen was helpfully hands-on and did a substantial number of edits directly on the manuscript itself.
The only caveat I can think of is that the neutral, factual, impersonal, timeless style of Wikipedia articles is quite different from the type of review articles I am otherwise used to. This is definitely not the right outlet for opinion-type pieces!
On the other hand, this format is great for student work. In fact, both of our Topic Pages started as student assignments in my course Reviews in Computational Biology. That being said, although the course gave the initial impetus, in both cases extensive additional work (and the involvement of additional co-authors) was required to get them published.
What happened since?
As it’s been two years since we published the ABC review, we can start to discern some outcomes.
The Wikipedia version underwent 46 changes, all minor modifications or additions (typo corrections, additional links and “Wikifications”, additional entries in the list of relevant software packages, attempts to sneak in one’s own contributions, … the usual stuff).
It is also gratifying to see our work appearing as the first hit on Google. Since its publication in early 2013, it has already been cited over 50 times.
Way of the future
In conclusion, the Topic Page is a great format. I am surprised that only eight Topic Pages have been published thus far, but perhaps there is still a lack of awareness about the format. I hope that this blog post will inspire some readers to improve Wikipedia by contributing a Topic Page. We are certainly thinking of our Topic Page number three…
It’s hiring season and there are many interesting opportunities for
prospective PhD students. Here is a list of some of the opportunities that
were brought to my attention. Please get in touch if you have any other