Author: Christophe Dessimoz •
This post accompanies a recent publication and is part of our
series story behind the paper,
inspired by Jonathan Eisen’s series of the same
One fundamental step in sequence analysis is the identification of homologous
sequences, sequences related through common ancestry. There are many different
ways of identifying homolog but they broadly fall into two categories:
all-against-all comparisons and clustering.
The all-against-all approach aligns every sequence with every other one. This
is straighforward to implement, relatively sensitive, and robust to variations
in sequence lengths. The main downside of all-against-all comparisons is the
quadratic computational cost with respect to the number of sequences.
In contrast, clustering works by using one representative sequence or profile
per homologous family of genes (clusters), thus limiting the number of
required comparisons to one per cluster. Assuming a fixed (or nearly fixed)
number of clusters, the computational cost is (nearly) linear in the number of
input sequences. Clustering methods however tend to miss more homologous
relationships than the all-against-all.
Can the sensitivity of the all-against-all be achieved at the speed of clustering?
The OMA database—developed in our lab—currently
relies upon an all-against-all. With 8,798,758 protein sequences from 1706
genomes in the latest release, this represents 38.7 trillion alignments. We
could probably cope with a few thousands genomes more, but will struggle to
get to the next order of magnitude with the current pipeline.
Furthermore, it is difficult to accept that as we increasingly sample the protein
sequence universe, even though we know more and more about its diversity, the
marginal computational cost of adding sequences goes higher, not lower.
In this project, we thus set out to try to achieve the sensitivity of the
all-against-all at the speed of clustering.
Transitivity of homology
In principle, homology is a transitive relationship: if gene A is homologous
to gene B, and gene B is homologous to gene C, this implies that gene A is
homologous to gene C. Transitive relationships are typically a good fit for
In practice, however, things are more complicated. Homology can be difficult
to ascertain for very divergent sequences. Furthermore, homology is not always
transitive due to insertions, deletions, fusion, fissions, and other events
that may cause inconsistencies in terms of matching residues across multiple
homologs. This figure illustrates these problems and outlines the ideas
we implemented to address them:
Putting together the ideas outlined in the figure above, we were pleasantly
surprised to see that clustering can indeed be both sensitive and fast. We
obtained 4-5x speed-ups across various datasets while recovering ~99.9% of all
homologous relationships identified through all-against-all.
In comparison, general purpose clustering approaches such as
UCLUST—which admittedly have not been
designed to identify distant homologs
~10% of all homologous relationships. They are, however, several orders of
Only the beginning
The results of our proof-of-concept implementation are thus very
encouraging. We have plans to follow up with a long list of refinement ideas,
many of which we discuss in the manuscript. One essential
refinement will be to parallelise the new approach. This is not as
straightforward as with all-against-all compraisons, but we think it can be
Meanwhile, the serial variant is available as part of the OMA
Wittwer, L., Piližota, I., Altenhoff, A., & Dessimoz, C. (2014). Speeding up all-against-all protein comparisons while maintaining sensitivity by considering subsequence-level homology PeerJ, 2:e607 DOI: 10.7717/peerj.607
Author: Anna Sueki •
Note: this is another interview in our series “Life in the Lab”, which gives unedited accounts of students who have spent time with us. —Christophe
Please introduce yourself in a few sentences.
I am Anna Sueki, 3rd year Biochemistry student at UCL. I did my summer internship at Dr. Dessimoz’s lab during summer 2014. I’m originally from Japan, but grew up in Singapore and Germany.
Why did you choose to join the lab?
During my second year, I took a module called “Computational Biology” and I really enjoyed learning programming and other computational aspects in biology. Dr. Dessimoz was one of the lecturer for that module, and since I found his lectures interesting, I applied to his lab for this summer internship.
What project did you work on (and for how long)?
My project was “Synteny visualisation in the OMA browser”. This was for the new release of the OMA browser, and I created the new synteny viewer function within the browser to show neighbouring genes of the entry gene and its ortholog relationships. I worked for 3 months (from June to September 2014) during my summer vacation.
What came out of it?
With the help of other members in the lab, I could produce the synteny viewer function, and it will be implemented in the new OMA browser. Also, explanation of this new function and an example was included in the paper for new OMA release.
Other than this actual project outcome, I learned about how computers and their systems work, programming of Python, and how to use Python’s web framework Django.
New synteny viewer in the OMA Browser
Was there any highlight or low point you’d like to share?
Personally, I enjoyed the lab retreat in Zurich a lot. It was nice to see the members in Zurich who I saw only through video meeting every week. Also I got many helps from Zurich members through the chatting system, especially Adrian and Clement.
What is your overall impression and would you do it again?
I really enjoyed my time in Dr. Dessimoz’s lab, and I learned a lot during these three months. I started without barely any knowledge in computer science but through this summer, I found out that I like learning coding and other computational skills. If I can achieve another chance, I would love to work with Dr. Dessimoz and other members again!