<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>Open Reading Frame</title><link>https://lab.dessimoz.org/blog</link><atom:link href="https://lab.dessimoz.org/blog/rss.xml" rel="self" type="application/rss+xml"/><description>Blog of the Dessimoz Lab at UCL</description><item><title>FastOMA: a fast and accurate orthology inference tool</title><link>https://lab.dessimoz.org/blog/2025/01/02/fastoma</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2025/01/02/fastoma</guid><dc:creator>Sina Majidian</dc:creator><pubDate>Thu, 02 Jan 2025 22:14:18 +0000</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;Genomic data is expanding at a rapid pace, driven by ambitious efforts to sequence the DNA of millions of species worldwide. Comparative genomics, essentially the science of comparing genomes across species, helps us understand the evolutionary relationships between species. A key part of this is to find homologous regions, which are regions of DNA that are shared across species due to having a common ancestor.&lt;/p&gt;

&lt;p&gt;When it comes to homologous genes, there are two main types to know about: orthologs and paralogs. Orthologs are genes that started diverging because of speciation (evolutionary branching into new species), while paralogs diverged because of gene duplication. Orthologs often have similar functions across species, which makes them extremely useful for transferring knowledge from well-studied organisms to newly sequenced ones (&lt;a href="#Nicheperovich"&gt;Nicheperovich 2022&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2025/01/fastoma1.png"&gt;&lt;img alt="conceptual overview of orthology and paralogy" width="66%" src="https://lab.dessimoz.org/blog/media/2025/01/fastoma1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;Figure 1. Two genes that share a common ancestor are called homologous, from the &lt;a href="https://www.etymonline.com/word/homologous#etymonline_v_12130"&gt;Greek word&lt;/a&gt; homologos: homos (meaning &amp;ldquo;same&amp;rdquo;) + logos (meaning &amp;ldquo;relation&amp;rdquo;). Orthologs are gene pairs that diverged due to a speciation event, while paralogs are gene pairs that diverged due to a duplication event. This distinction is important because orthologs tend to have similar functions, whereas paralogs more often diverge in function.&lt;/p&gt;

&lt;h2&gt;A bit of history!&lt;/h2&gt;

&lt;p&gt;The idea of distinguishing orthologs from paralogs goes back to Walter Fitch&amp;rsquo;s seminal work at the University of Wisconsin in 1970 (&lt;a href="#Fitch"&gt;Fitch 1970&lt;/a&gt;). Since then, several research groups have worked on algorithms to infer orthology accurately. One of the first contributions was the Clusters of Orthologous Groups of proteins (COGs) database, launched by NCBI in 2000, covering 21 genomes of bacteria, archaea, and eukaryotes (&lt;a href="#Tatusov"&gt;Tatusov 2000&lt;/a&gt;). More recently, the OrthoFinder tool made it possible to find orthologs for a set of genomes of interest with high accuracy (&lt;a href="#Emms"&gt;Emms and Kelly 2019&lt;/a&gt;). This well-known software uses fast all-against-all gene comparisons with DIAMOND to group genes into orthogroups, which it then refines with gene trees. In 2024, SonicParanoid presented its second version, which uses machine learning to skip unnecessary all-against-all alignments and is therefore even faster. All these exciting advancements highlight the thriving community working on orthology and comparative genomics.&lt;/p&gt;

&lt;p&gt;The OMA (Orthologous MAtrix) project came along in 2004 as a method and database for identifying orthologs across genomes (&lt;a href="#Dessimoz"&gt;Dessimoz et al. 2005&lt;/a&gt;). The original OMA algorithm uses all-against-all gene comparisons with Smith-Waterman to find homologous sequences and then infers orthology relationships from there. Since 2010, Adrian Altenhoff has been the OMA project manager, and OMA is hosted at the Comparative Genomics lab, led by Christophe Dessimoz and Natasha Glover. In 2017, Cl&amp;eacute;ment Train, a talented PhD student in the lab, took things to the next level with OMA algorithm 2.0, which delivered high precision in orthology inference (&lt;a href="#Train"&gt;Train et al. 2017&lt;/a&gt;). Fast forward to today: the OMA Browser has seen 24 major updates and now presents orthology data for around 3000 genomes, with visualization innovations for phylostratigraphy, synteny and gene information (&lt;a href="#Altenhoff"&gt;Altenhoff et al. 2024&lt;/a&gt;). Along the way, OMA also became a core resource supported by the SIB Swiss Institute of Bioinformatics.&lt;/p&gt;

&lt;p&gt;In 2021, I joined the Comparative Genomics lab in Lausanne as a postdoc, took a leap of faith and started working on a new algorithm for orthology inference. The goal was to make it work for several thousands of species, basically scaling to the tree of life, something that&amp;rsquo;s really needed these days. At first, it felt quite overwhelming, as there were several efficient ortholog inference tools such as PANTHER, OrthoMCL, OrthoFinder, SonicParanoid, Ensembl Compara, Domainoid, MetaPhOrs, TOGA and GETHOGs (to name only a few) that are maintained rigorously and regularly. The developers of these tools made great contributions to the field, and the huge number of comparative genomics studies over the years wouldn&amp;rsquo;t have been possible without this software. Their intricate designs and comprehensive algorithms are accurate and efficient, making it hard to imagine advancing the field even further.&lt;/p&gt;

&lt;p&gt;On top of that, I was new to the field: my PhD was on &lt;a href="https://academic.oup.com/gigascience/article/9/7/giaa078/5875849"&gt;diploid&lt;/a&gt; and polyploid haplotype phasing using DNA sequencing reads (&lt;a href="#Majidian2"&gt;Majidian et al. 2020&lt;/a&gt;) and my background is in engineering and &lt;a href="https://ieeexplore.ieee.org/abstract/document/8686170"&gt;signal processing&lt;/a&gt;. But I embarked on this journey and started learning the concepts and methods of comparative genomics. I was lucky to have great mentors and lab mates who were always open to answering my questions, over Zoom and in person.&lt;/p&gt;

&lt;h2&gt;OMA turns young!&lt;/h2&gt;

&lt;p&gt;Let&amp;rsquo;s talk about FastOMA. With contributions from several lab members (Stefano, Yannis, Ali, Alex, David) and guidance from Christophe, Adrian and Natasha, we developed and implemented the FastOMA method. FastOMA builds on the current knowledge of orthology available in the OMA browser. It first maps the input genes (at the amino-acid level) to reference gene families (the Hierarchical Orthologous Groups, HOGs) using OMAmer, a fast k-mer-based mapper. To learn about HOGs, see this &lt;a href="https://www.youtube.com/watch?v=5p5x5gxzhZA"&gt;YouTube video&lt;/a&gt; by Natasha. Next, FastOMA works on each family separately. In other words, it does not compare genes from different families, since such genes are not expected to be homologous. This is an important step that saves a huge amount of computation. Then, FastOMA infers gene trees on (a subsample of) the genes at each taxonomic level to distinguish orthologs from paralogs within each family. This phylogeny-guided subsampling is also key to maintaining speed and accuracy at the same time.&lt;/p&gt;
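
&lt;p&gt;To make the map-then-split idea concrete, here is a minimal Python sketch. It is a toy illustration only, not OMAmer&amp;rsquo;s actual algorithm or API: the function names, the k-mer size of 3, the simple vote counting and the sequences are all assumptions made up for this example. Each query protein is assigned to the reference family with which it shares the most k-mers, and queries are then grouped so that any later comparison stays within one family.&lt;/p&gt;

```python
from collections import Counter, defaultdict


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a protein sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(families, k=3):
    """Map each k-mer to the ids of the reference families that contain it."""
    index = defaultdict(set)
    for fam_id, members in families.items():
        for seq in members:
            for km in kmers(seq, k):
                index[km].add(fam_id)
    return index


def assign_family(seq, index, k=3):
    """Return the family sharing the most k-mers with seq, or None if no hit."""
    votes = Counter()
    for km in kmers(seq, k):
        for fam_id in index.get(km, ()):
            votes[fam_id] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def group_by_family(queries, index, k=3):
    """Group query names by assigned family; later steps run per group."""
    groups = defaultdict(list)
    for name, seq in queries.items():
        groups[assign_family(seq, index, k)].append(name)
    return dict(groups)


# Hypothetical reference families and query proteins (made-up sequences).
reference = {
    "HOG_A": ["MKTAYIAKQR", "MKTAYLAKQR"],
    "HOG_B": ["GSHMLEDPVW", "GSHMLEEPVW"],
}
queries = {"q1": "MKTAYIAKQK", "q2": "GSHMLEDPVA", "q3": "CCCCCCCC"}

index = build_index(reference)
print(group_by_family(queries, index))
```

&lt;p&gt;With this toy data, q1 and q2 land in different families and are never compared to each other, while q3 shares no k-mer with any reference family and stays unplaced.&lt;/p&gt;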

&lt;p&gt;FastOMA&amp;rsquo;s speed makes it possible to handle genomic datasets with thousands of species. FastOMA draws on &amp;ldquo;OMA&amp;rsquo;s knowledge&amp;rdquo;, and is now swift as OMA turns young. FastOMA achieves high accuracy and resolution, as shown by the Quest for Orthologs benchmarks (&lt;a href="#Majidian"&gt;Majidian et al. 2025&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2025/01/fastoma2.png"&gt;&lt;img alt="overview of the FastOMA orthology inference method" width="100%" src="https://lab.dessimoz.org/blog/media/2025/01/fastoma2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;Figure 2. Overview of how FastOMA infers orthologs.&lt;/p&gt;

&lt;h2&gt;To the future!&lt;/h2&gt;

&lt;p&gt;As a community, we work collaboratively to advance the field. The lab has been contributing to the benchmarking datasets, which make it possible to compare the performance of different tools. In July 2024, the Quest for Orthologs meeting (QFO8) was held at the University of Montreal, where recent advancements in orthology inference were discussed and FastOMA was presented. QFO9 will be in Switzerland in 2026!&lt;/p&gt;

&lt;p&gt;There are several directions for further improving FastOMA&amp;rsquo;s accuracy and speed. One exciting direction is to take advantage of recent advancements in protein structure prediction to reconstruct structural trees (&lt;a href="#Moi"&gt;Moi et al. 2023&lt;/a&gt;) in the context of orthology inference. This could really help boost resolution at deeper evolutionary levels. It would also be very interesting to use gene order conservation, a.k.a. synteny information (&lt;a href="#Bernard"&gt;Bernard et al. 2024&lt;/a&gt;), which could serve as an additional layer of information to refine orthology predictions. We hope that our hierarchical approach, together with these ideas, will stimulate further developments.&lt;/p&gt;

&lt;p&gt;So far, FastOMA has caught the attention of several labs around the world, which have incorporated it into their studies. We are excited to hear how you plan to use FastOMA in your own research. Feel free to create a GitHub issue (&lt;a href="https://github.com/DessimozLab/FastOMA"&gt;https://github.com/DessimozLab/FastOMA&lt;/a&gt;) or send us an email if you need any help!&lt;/p&gt;

&lt;p&gt;To learn more, see the FastOMA academy: &lt;a href="https://omabrowser.org/oma/academy/module/fastOMA"&gt;https://omabrowser.org/oma/academy/module/fastOMA&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a name="Altenhoff"&gt;&lt;/a&gt;Altenhoff, Adrian M., et al. &amp;ldquo;OMA orthology in 2024: improved prokaryote coverage, ancestral and extant GO enrichment, a revamped synteny viewer and more in the OMA Ecosystem.&amp;rdquo; Nucleic Acids Research 52.D1 (2024): D513-D521. &lt;a href="https://doi.org/10.1093/nar/gkad1020"&gt;doi:10.1093/nar/gkad1020&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a name="Bernard"&gt;&lt;/a&gt;Bernard, Charles, et al. &amp;ldquo;EdgeHOG: fine-grained ancestral gene order inference at tree-of-life scale.&amp;rdquo; bioRxiv (2024): 2024-08. &lt;a href="https://doi.org/10.1101/2024.08.28.610045"&gt;doi:10.1101/2024.08.28.610045&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a name="Dessimoz"&gt;&lt;/a&gt;Dessimoz, Christophe, et al. &amp;ldquo;OMA, a Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data: Introduction and First Achievements.&amp;rdquo; RECOMB 2005 Workshop on Comparative Genomics, LNCS 3678 (pp. 61-72). &lt;a href="/blog/media/2025/01/oma_2005_paper.pdf"&gt;link&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a name="Emms"&gt;&lt;/a&gt;Emms, David M., and Steven Kelly. &amp;ldquo;OrthoFinder: phylogenetic orthology inference for comparative genomics.&amp;rdquo; Genome Biology 20 (2019): 1-14. &lt;a href="https://doi.org/10.1186/s13059-019-1832-y"&gt;doi:10.1186/s13059-019-1832-y&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a name="Fitch"&gt;&lt;/a&gt;Fitch, Walter M. &amp;ldquo;&lt;a href="https://doi.org/10.2307/2412448"&gt;Distinguishing homologous from analogous proteins&lt;/a&gt;.&amp;rdquo; Systematic zoology 19.2 (1970): 99-113.&amp;nbsp;&lt;a href="https://doi.org/10.2307/2412448"&gt;doi:10.2307/2412448&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a name="Majidian"&gt;&lt;/a&gt;Majidian, Sina, et al. &amp;ldquo;Orthology inference at scale with FastOMA.&amp;rdquo; Nature Methods (2025) &lt;a href="https://doi.org/10.1038/s41592-024-02552-8"&gt;doi:10.1038/s41592-024-02552-8&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a name="Majidian2"&gt;&lt;/a&gt;Majidian, Sina, Mohammad Hossein Kahaei, and Dick de Ridder. &amp;ldquo;Minimum error correction-based haplotype assembly: Considerations for long read data.&amp;rdquo; PLOS ONE 15.6 (2020): e0234470. &lt;a href="https://doi.org/10.1371/journal.pone.0234470"&gt;doi:10.1371/journal.pone.0234470&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a name="Moi"&gt;&lt;/a&gt;Moi, David, et al. &amp;ldquo;Structural phylogenetics unravels the evolutionary diversification of communication systems in gram-positive bacteria and their viruses.&amp;rdquo; bioRxiv (2023): 2023-09. &lt;a href="https://doi.org/10.1101/2023.09.19.558401"&gt;doi:10.1101/2023.09.19.558401&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a name="Nicheperovich"&gt;&lt;/a&gt;Nicheperovich, Alina, et al. &amp;ldquo;OMAMO: orthology-based alternative model organism selection.&amp;rdquo; Bioinformatics 38.10 (2022): 2965-2966. &lt;a href="https://doi.org/10.1093/bioinformatics/btac163"&gt;doi:10.1093/bioinformatics/btac163&lt;/a&gt;&amp;nbsp;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a name="Tatusov"&gt;&lt;/a&gt;Tatusov, Roman L., et al. &amp;ldquo;The COG database: a tool for genome-scale analysis of protein functions and evolution.&amp;rdquo; Nucleic acids research 28.1 (2000): 33-36. &lt;a href="https://doi.org/10.1093/nar/28.1.33"&gt;doi:10.1093/nar/28.1.33&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a name="Train"&gt;&lt;/a&gt;Train, Cl&amp;eacute;ment-Marie, et al. &amp;ldquo;Orthologous Matrix (OMA) algorithm 2.0: more robust to asymmetric evolutionary rates and more scalable hierarchical orthologous group inference.&amp;rdquo; Bioinformatics 33.14 (2017): i75-i82. &lt;a href="https://doi.org/10.1093/bioinformatics/btx229"&gt;doi:10.1093/bioinformatics/btx229&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>Ancestral gene order inference at Tree of Life scale</title><link>https://lab.dessimoz.org/blog/2024/08/30/edgehog</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2024/08/30/edgehog</guid><dc:creator>Charles Bernard</dc:creator><pubDate>Fri, 30 Aug 2024 12:35:39 +0100</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;For an evolutionary biologist, tracing today&amp;rsquo;s genomes back to key ancestors on the Tree of Life is a dream come true. With a collection of ancestral genomes, we could unravel the genetic steps that led to Life&amp;rsquo;s diversification from LUCA, the Last Universal Common Ancestor.&lt;/p&gt;

&lt;p&gt;In practice, this means comparing modern genomes to find similar features&amp;mdash;&amp;ldquo;orthologous&amp;rdquo; genes&amp;mdash;passed down from common ancestors. By reversing this thinking, we can use these orthologous features as clues to &amp;ldquo;reconstruct&amp;rdquo; what ancestral genomes might have looked like.&lt;/p&gt;

&lt;p&gt;But while much previous work has focused on reconstructing ancestral gene repertoires, reconstructing ancestral gene orders has been much more elusive.&lt;/p&gt;

&lt;p&gt;In this post, I&amp;rsquo;ll dive into how we&amp;rsquo;ve developed a tool, EdgeHOG (&lt;a href="#References"&gt;1&lt;/a&gt;), to achieve this at a scale and speed never seen before (preprint here: &lt;a href="https://www.biorxiv.org/content/10.1101/2024.08.28.610045v1"&gt;https://www.biorxiv.org/content/10.1101/2024.08.28.610045v1&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;Why do ancestral gene orders matter?&lt;/h2&gt;

&lt;p&gt;A genome isn&amp;rsquo;t just a random collection of genes; it has a structure that&amp;rsquo;s been shaped by evolution. Where a gene sits on a chromosome, and which genes are its neighbours, can matter a lot. Indeed, neighbouring genes often work together (&lt;a href="#References"&gt;2&lt;/a&gt;). Plus, changes in gene order (genomic rearrangements) can lead to new traits and adaptations (&lt;a href="#References"&gt;3&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;So, to understand the evolutionary history of these gene neighborhoods, we need to focus on gene adjacencies, not just the genes themselves. However, figuring out the gene order for every internal node on the Tree of Life is a huge computational challenge (&lt;a href="#References"&gt;4&lt;/a&gt;)&amp;hellip;&lt;/p&gt;

&lt;h2&gt;How did we get into ancestral gene order inference?&lt;/h2&gt;

&lt;p&gt;When we wanted to analyze the link between gene function and gene adjacency across Life, we needed software that could reconstruct ancestral gene orders across the entire Tree of Life in one go, while accurately distinguishing between different copies of a gene in an ancestor.&lt;/p&gt;

&lt;p&gt;But no tool could scale up to this level. Even the best tools, like AGORA (&lt;a href="#References"&gt;5&lt;/a&gt;), require reconstructing gene trees and performing pairwise comparisons of gene orders, which makes them too slow to run on large datasets.&lt;/p&gt;

&lt;p&gt;This is what drove us to create an algorithm with a clear goal: a linear-time approach to reconstructing ancestral gene orders, without sacrificing accuracy.&lt;/p&gt;

&lt;h2&gt;How does EdgeHOG achieve linear-time complexity?&lt;/h2&gt;

&lt;p&gt;To infer ancestral gene orders at scale, our approach uses &lt;a href="https://youtu.be/5p5x5gxzhZA?si=bOtzJZaHlbHK56Ea"&gt;Hierarchical Orthologous Groups (HOGs)&lt;/a&gt;. These model the lineage of genes from their ancestors to today&amp;rsquo;s species, assuming vertical inheritance.&lt;/p&gt;

&lt;p&gt;By leveraging these gene lineages, our method uses &amp;ldquo;tree traversal&amp;rdquo; tricks to propagate or remove gene adjacencies along the species tree without any pairwise comparisons. Thanks to these tricks, our approach scales linearly with the size of the input phylogeny.&lt;/p&gt;

&lt;p&gt;Since the software draws edges (gene adjacencies) between HOGs (proxies for ancestral genes), we called it &lt;em&gt;EdgeHOG&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2024/08/treetraversaltrick.png"&gt;&lt;img alt="" src="https://lab.dessimoz.org/blog/media/2024/08/treetraversaltrick.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;Here are the first two steps of EdgeHOG and the famous tree traversal tricks! The bottom-up phase propagates a gene adjacency up to the parental level of the species tree as long as the HOG framework infers that the parent has both ancestral genes. The top-down phase essentially applies the Fitch algorithm and removes edges not supported by parsimony. Designing these tricks to comply with the constraint of linear time complexity was probably the most fun part of the project!&lt;/p&gt;
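
&lt;p&gt;The two phases can be sketched in a few lines of Python. This is a deliberately simplified toy, not EdgeHOG&amp;rsquo;s implementation: it assumes a bifurcating species tree, one gene per family per genome (no duplications), and that the gene content of each ancestor is given (the &lt;code&gt;content&lt;/code&gt; dictionary, standing in for what the HOGs provide). All names and data are hypothetical. The top-down rule used here keeps an adjacency if both child lineages support it, or if one does and the parent has kept it, which is a parsimony-style filter in the spirit of the Fitch algorithm rather than the exact procedure.&lt;/p&gt;

```python
def adjacencies(order):
    """Unordered pairs of neighbouring families along one contig."""
    return {frozenset(pair) for pair in zip(order, order[1:])}


def bottom_up(node, children, leaf_orders, content, cand):
    """Propagate adjacencies to a parent that has both ancestral genes."""
    if node in leaf_orders:
        cand[node] = adjacencies(leaf_orders[node])
        return
    cand[node] = set()
    for child in children[node]:
        bottom_up(child, children, leaf_orders, content, cand)
        cand[node] |= {
            adj for adj in cand[child] if adj.issubset(content[node])
        }


def top_down(node, children, cand, kept, parent_kept=frozenset()):
    """Drop candidate adjacencies that lack parsimony-style support."""
    if node not in children:  # extant genome, nothing to prune
        return
    kept[node] = set()
    for adj in cand[node]:
        support = sum(adj in cand[child] for child in children[node])
        if support >= 2 or adj in parent_kept:
            kept[node].add(adj)
    for child in children[node]:
        top_down(child, children, cand, kept, kept[node])


# Hypothetical three-species example: ((A, B)anc1, C)root.
children = {"root": ["anc1", "C"], "anc1": ["A", "B"]}
leaf_orders = {
    "A": ["f1", "f2", "f3"],
    "B": ["f1", "f2", "f4"],
    "C": ["f2", "f3", "f1"],
}
content = {"anc1": {"f1", "f2", "f3"}, "root": {"f1", "f2", "f3"}}

cand, kept = {}, {}
bottom_up("root", children, leaf_orders, content, cand)
top_down("root", children, cand, kept)
for node in ("root", "anc1"):
    print(node, sorted(sorted(adj) for adj in kept[node]))
```

&lt;p&gt;Each node is visited once in each phase and no pair of genomes is ever compared directly, which is the property that lets this style of algorithm scale with the size of the tree.&lt;/p&gt;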

&lt;h2&gt;Fast and accurate&lt;/h2&gt;

&lt;p&gt;But EdgeHOG is not only fast, it is also very accurate! We validated it extensively on both simulated and real data. Across all benchmarks, EdgeHOG&amp;rsquo;s precision and recall met or exceeded the state of the art.&lt;/p&gt;

&lt;h2&gt;How to access EdgeHOG&amp;rsquo;s large-scale inference of ancestral genomes?&lt;/h2&gt;

&lt;p&gt;The next step for us was to apply EdgeHOG to the entire OMA database, which currently includes 2,845 genomes from across the Tree of Life! This represents the first tree-of-life scale inference of ancestral gene orders, resulting in 1,133 ancestral genomes. You can explore these genomes on the OMA browser by clicking on &lt;em&gt;Explore &amp;rarr; Quick access to &amp;rarr; Extant and ancestral genomes.&lt;/em&gt; For instance, check out &lt;a href="https://omabrowser.org/oma/ancestralgenome/Mammalia/synteny/"&gt;the ancestral gene order for the last common ancestor of the mammals&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the EdgeHOG paper, we also analysed the functions of the &lt;a href="https://omabrowser.org/oma/ancestralgenome/Eukaryota/synteny/"&gt;first ancestral contigs of genes ever reconstructed for LECA&lt;/a&gt;, the Last Eukaryotic Common Ancestor! These contigs contain genes that highlight core pathways like glycolysis, the pentose-phosphate shunt, amino-acid recycling, and histone organisation.&lt;/p&gt;

&lt;h2&gt;What kind of evolutionary analyses does EdgeHOG unlock?&lt;/h2&gt;

&lt;p&gt;In the lab, we&amp;rsquo;re using EdgeHOG to study the association between gene order conservation and function conservation across different branches of the Tree of Life. We&amp;rsquo;re also dating gene adjacencies to identify old genomic neighbourhoods (like histone clusters in eukaryotes) or newer ones (like gene adjacencies on the sex chromosomes of animals).&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2024/08/ageadj.png"&gt;&lt;img alt="" src="https://lab.dessimoz.org/blog/media/2024/08/ageadj.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt; On these karyotypes, old histone clusters are circled in blue, and sex chromosomes are highlighted by rectangles. The estimated age of adjacencies is shown by the color scale on the right.&lt;/p&gt;

&lt;p&gt;Overall, EdgeHOG opens up new possibilities in comparative genomics. For example, it helps track genomic rearrangements across a species tree, identify conserved gene clusters in clades of interest, or improve genome assembly by integrating gene order data from other species. Ultimately, knowing ancestral gene orders will enhance orthology inference by spotting highly divergent orthologs through their neighboring genes.&lt;/p&gt;

&lt;h2&gt;Do you want to try EdgeHOG on your datasets?&lt;/h2&gt;

&lt;p&gt;EdgeHOG is easy to use and available on GitHub: &lt;a href="https://github.com/DessimozLab/edgehog"&gt;https://github.com/DessimozLab/edgehog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You&amp;rsquo;ll need a species tree, the proteomes of each extant species (in FASTA files), and gene coordinates on contigs (in GFF files). Then, run our superfast &lt;a href="https://github.com/DessimozLab/fastoma"&gt;FastOMA method&lt;/a&gt; to infer the HOGs. Finally, call EdgeHOG with the HOGs (OrthoXML file), the species tree (Newick file), and the path to the GFF files.&lt;/p&gt;

&lt;p&gt;Now, you&amp;rsquo;re ready to perform big data ancestral gene order inferences, even with massive phylogenies of over 1,000 species! Try it on your favorite clade and let us know how it goes!&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Bernard C, Nevers Y, Karampudi NBR, Gilbert KJ, Train C, Warwick Vesztrocy A, Glover N, Altenhoff A, Dessimoz C. EdgeHOG: fine-grained ancestral gene order inference at tree-of-life scale. bioRxiv 2024. &lt;a href="https://doi.org/10.1101/2024.08.28.610045"&gt;https://doi.org/10.1101/2024.08.28.610045&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Overbeek R, Fonstein M, D&amp;rsquo;Souza M, Pusch GD, Maltsev N. The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U S A. 1999. &lt;a href="https://doi.org/10.1073/pnas.96.6.2896"&gt;https://doi.org/10.1073/pnas.96.6.2896&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An X, Mao L, Wang Y, Xu Q, Liu X, Zhang S, Qiao Z, Li B, Li F, Kuang Z, Wan N, Liang X, Duan Q, Feng Z, Yang X, Liu S, Nevo E, Liu J, Storz JF, Li K. Genomic structural variation is associated with hypoxia adaptation in high-altitude zokors. Nat Ecol Evol. 2024. &lt;a href="https://doi.org/10.1038/s41559-023-02275-7"&gt;https://doi.org/10.1038/s41559-023-02275-7&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;El-Mabrouk N. Predicting the Evolution of Syntenies&amp;mdash;An Algorithmic Review. Algorithms. 2021. &lt;a href="https://doi.org/10.3390/a14050152"&gt;https://doi.org/10.3390/a14050152&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Muffato M, Louis A, Nguyen NTT, Lucas J, Berthelot C, Roest Crollius H. Reconstruction of hundreds of reference ancestral genomes across the eukaryotic kingdom. Nat Ecol Evol. 2023. &lt;a href="https://doi.org/10.1038/s41559-022-01956-z"&gt;https://doi.org/10.1038/s41559-022-01956-z&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>Phylogenetics from AI-predicted Protein Structures: it works!!</title><link>https://lab.dessimoz.org/blog/2023/09/24/structural-phylogenetics-works</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2023/09/24/structural-phylogenetics-works</guid><dc:creator>David Moi</dc:creator><pubDate>Sun, 24 Sep 2023 14:04:55 +0100</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;Breakthroughs don&amp;rsquo;t come every day, but AlphaFold largely solving the 3D structure prediction problem has reshaped biology in profound ways. The sudden availability of protein structures for billions of proteins opens up many new possibilities. Last week&amp;rsquo;s two papers on the protein universe provide a compelling glimpse of them (&lt;a href="https://www.nature.com/articles/s41586-023-06510-w"&gt;here&lt;/a&gt; and &lt;a href="https://www.nature.com/articles/s41586-023-06622-3"&gt;here&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;As someone who has been interested in tracing back the evolutionary origins of selected proteins&amp;mdash;such as the cell fusion-mediating proteins &lt;a href="https://www.nature.com/articles/s41467-022-31564-1"&gt;fsx1 in plants, viruses, and archaea&lt;/a&gt;, or &lt;a href="https://elifesciences.org/articles/62507"&gt;odorant receptors&lt;/a&gt; in insects&amp;mdash;I have attempted to reconstruct phylogenies from structure in the past.&lt;/p&gt;

&lt;p&gt;But I have faced two major issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Until AlphaFold came along, there typically weren&amp;rsquo;t enough high-quality structure predictions to serve as &amp;ldquo;starting material&amp;rdquo; for structure-based phylogenetics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Even when I could obtain reasonably high confidence structures, the trees inferred from them were often met with skepticism&amp;mdash;how reliable are these trees?&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So now that high-quality structure predictions are widely available, we could finally ask: &lt;strong&gt;are structures any good as starting material to infer trees? Specifically, how accurate are the reconstructed trees compared to those inferred from sequences?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Today, we are super excited to report that structural phylogenetics works! What&amp;rsquo;s more, we found an approach that doesn&amp;rsquo;t just outperform traditional sequence-based methods for distant relationships; it also excels in resolving phylogenetic trees for closely related proteins. This post gives the gist of what we found&amp;mdash;the full study is released as a &lt;a href="https://www.biorxiv.org/content/10.1101/2023.09.19.558401v2"&gt;preprint&lt;/a&gt; (&lt;a href="#References"&gt;1&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;What&amp;rsquo;s the big deal with structural phylogenetics?&lt;/h2&gt;

&lt;p&gt;Before presenting our results, let&amp;rsquo;s take a step back. Why is structural phylogenetics potentially a big deal? Traditional phylogenetics, the study of evolutionary relationships among species or genes, has long relied on comparing the sequences of DNA, RNA, or proteins. While this approach has been immensely valuable, it has its limitations. The primary challenge is that sequences can change rapidly through mutations and other processes, making it difficult to trace their evolutionary history accurately when divergence is very high. By contrast, the three-dimensional structures of proteins are intricately linked to their functions, so they tend to change more slowly over evolutionary timescales than the underlying amino acid sequences.&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2023/09/fusexins.png"&gt;&lt;img alt="" src="https://lab.dessimoz.org/blog/media/2023/09/fusexins.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt; In this particular example close to my heart, we can see structural homology between functionally homologous proteins at wide evolutionary ranges. The examples shown span plants, metazoans, viruses and archaea. They share virtually no sequence homology. Ref: (&lt;a href="#References"&gt;2&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;When we set out to do our work, however, we were not at all sure that it would work, let alone outperform sequence-based methods. For one thing, sequence-based approaches have benefited from decades of intensive tool and model refinement, unlike their structure-based counterparts. For another, complications related to structure, such as allostery, flexible regions, and functional constraints, could conceivably confound the evolutionary signal that can be extracted from structures.&lt;/p&gt;

&lt;h2&gt;Evidence that structure-based trees &lt;em&gt;can&lt;/em&gt; outperform sequence-based trees&lt;/h2&gt;

&lt;p&gt;We tested a few structural approaches and settled on reconstructing distance trees using Foldseek&amp;rsquo;s &amp;ldquo;local structural alphabet&amp;rdquo;. Foldseek was developed in the lab of our collaborator Martin Steinegger to search for similar structures very rapidly, by encoding local structure motifs in a 20-letter alphabet and repurposing highly optimized alignment software originally developed to align amino acid sequences (&lt;a href="#References"&gt;3&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Testing and comparing the quality of phylogenetic trees empirically is a tricky business. Most comparisons are based on simulated data, or on the fit of the data to different models. But how do you compare trees that are reconstructed from entirely different kinds of input data? Luckily, our lab has accumulated quite some experience with this kind of empirical assessment, used previously to compare the accuracy of alignment (&lt;a href="#References"&gt;4&lt;/a&gt; and &lt;a href="#References"&gt;5&lt;/a&gt;) or orthology (&lt;a href="#References"&gt;6&lt;/a&gt; and &lt;a href="#References"&gt;7&lt;/a&gt;) methods. We used an approach that measures how well inferred trees recapitulate the known taxonomy of the species from which the proteins are sampled.&lt;/p&gt;
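
&lt;p&gt;As a rough illustration of the idea (and not the score defined in the preprint), one can reward a tree each time a clade groups taxonomically related species. In the Python sketch below, the nested-tuple tree encoding, the function names and the four toy lineages are assumptions made for this example. Every internal node earns as many points as the number of taxonomic ranks shared by all leaves beneath it, so a tree that follows the taxonomy scores higher.&lt;/p&gt;

```python
def shared_depth(lineages):
    """Number of leading taxonomic ranks common to all given lineages."""
    depth = 0
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        depth += 1
    return depth


def congruence(tree, lineage_of):
    """Return (leaves, score) for a tree written as nested tuples.

    A leaf is a species name; an internal node is a tuple of subtrees.
    Each internal node adds the number of ranks shared by its leaves.
    """
    if isinstance(tree, str):
        return [tree], 0
    leaves, score = [], 0
    for subtree in tree:
        sub_leaves, sub_score = congruence(subtree, lineage_of)
        leaves += sub_leaves
        score += sub_score
    score += shared_depth([lineage_of[leaf] for leaf in leaves])
    return leaves, score


# Hypothetical, heavily abbreviated lineages (root rank first).
lineage_of = {
    "human": ("cellular", "Eukaryota", "Metazoa"),
    "mouse": ("cellular", "Eukaryota", "Metazoa"),
    "yeast": ("cellular", "Eukaryota", "Fungi"),
    "ecoli": ("cellular", "Bacteria"),
}

plausible = ((("human", "mouse"), "yeast"), "ecoli")
implausible = (("human", "yeast"), ("mouse", "ecoli"))

print(congruence(plausible, lineage_of)[1])
print(congruence(implausible, lineage_of)[1])
```

&lt;p&gt;The useful property of a score like this is that it only needs the tree and the species labels, so trees built from sequences and trees built from structures can be judged with the same yardstick.&lt;/p&gt;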

&lt;p&gt;&lt;a href="/blog/media/2023/09/tcsgraphs.png"&gt;&lt;img alt="" src="https://lab.dessimoz.org/blog/media/2023/09/tcsgraphs.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt; When comparing the taxonomic plausibility of thousands of trees derived from homologous protein families, Foldtree outperforms sequence-based phylogenetics. (In the paper, we show that after filtering the input set to families with high quality structures, the structural phylogenies perform even better!)&lt;/p&gt;

&lt;p&gt;Amazingly, the trees we inferred in this way were more in line with the known taxonomy than those defined by sequence similarity! The input data can be either experimental crystal structures or AI structural models. Using good-quality structures improves the quality of the trees produced, which means that as structure prediction methods get better, so will our structural trees.&lt;/p&gt;

&lt;h2&gt;The RRNPPA family: a first unifying phylogeny for peptidic quorum sensing proteins&lt;/h2&gt;

&lt;p&gt;To put our method to the test, we focused on a particularly complex gene family - the RRNPPA quorum sensing receptors (&lt;a href="#References"&gt;8&lt;/a&gt;). These receptors play a pivotal role in enabling communication and coordination among gram-positive bacteria, plasmids, and bacteriophages for crucial behaviors like sporulation, virulence, antibiotic resistance, conjugation, and phage lysis/lysogeny decisions.&lt;/p&gt;

&lt;p&gt;The complex evolutionary pattern of this family is revealed in its name. Before AI-predicted structures became available, new homologs could only be detected after they had been crystallized, and each subfamily was added piecemeal to the overall picture, resulting in the particularly long acronym. As the family expanded, researchers also attempted to piece together its evolutionary history using a diverse set of methods, some of which relied on structural analysis. Using Foldtree, we decoded the evolutionary diversification of these genes, shedding new light on their intricate history.&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2023/09/rrnppatree.png"&gt;&lt;img alt="" src="https://lab.dessimoz.org/blog/media/2023/09/rrnppatree.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt; Compared to the sequence-based phylogeny, the Foldtree reconstruction of the RRNPPA family&amp;rsquo;s history is remarkably parsimonious. Several events such as domain architecture changes or transfers to the viral world appear only once in the tree.&lt;/p&gt;

&lt;h2&gt;Foldtree: infer a structural phylogeny for your favorite protein family&lt;/h2&gt;

&lt;p&gt;&lt;a href="/blog/media/2023/09/horizontalcomplogo.png"&gt;&lt;img alt="" src="https://lab.dessimoz.org/blog/media/2023/09/horizontalcomplogo.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To make it easy to try this approach and to facilitate methodological improvements, we are releasing it as an open source tool called Foldtree. It&amp;rsquo;s available for download on GitHub (&lt;a href="https://github.com/DessimozLab/fold_tree"&gt;https://github.com/DessimozLab/fold_tree&lt;/a&gt;). Try it on your favorite protein family and let us know how it performs!&lt;/p&gt;

&lt;h2&gt;Exciting new research directions&lt;/h2&gt;

&lt;p&gt;High-accuracy structural phylogenetics has the potential to uncover deeper evolutionary relationships, elucidate unknown protein functions, and even refine the design of bioengineered molecules. The evolutionary histories of protein families in the viral domain, the start of eukaryotic life and the role of Asgard archaea, and the evolution of the prokaryotic mobilome are just a few cases where the fast pace of evolution has confounded sequence-based analyses and that could now be revisited. We believe this work represents an important step in investigating how structures are polished by the processes of evolution and how we can use this signal to peer further into the past than ever before.&lt;/p&gt;

&lt;p&gt;&lt;section id="References"&gt;&lt;/section&gt;&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Moi D, Bernard C, Steinegger M, Nevers Y, Langleib M, Dessimoz C. Structural phylogenetics unravels the evolutionary diversification of communication systems in gram-positive bacteria and their viruses. bioRxiv 2023.09.19.558401; doi: &lt;a href="https://doi.org/10.1101/2023.09.19.558401"&gt;https://doi.org/10.1101/2023.09.19.558401&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Moi D, Nishio S, Li X, Valansi C, Langleib M, Brukman NG, et al. Discovery of archaeal fusexins homologous to eukaryotic HAP2/GCS1 gamete fusion proteins. Nat Commun. 2022;13: 3880. doi:&lt;a href="http://dx.doi.org/10.1038/s41467-022-31564-1"&gt;10.1038/s41467-022-31564-1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol. 2023. doi:&lt;a href="http://dx.doi.org/10.1038/s41587-023-01773-0"&gt;10.1038/s41587-023-01773-0&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tan G, Gil M, L&amp;ouml;ytynoja AP, Goldman N, Dessimoz C. Simple chained guide trees give poorer multiple sequence alignments than inferred trees in simulation and phylogenetic benchmarks. Proceedings of the National Academy of Sciences of the United States of America. 2015. pp. E99&amp;ndash;100. doi:&lt;a href="http://dx.doi.org/10.1073/pnas.1417526112"&gt;10.1073/pnas.1417526112&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dessimoz C, Gil M. Phylogenetic assessment of alignments reveals neglected tree signal in gaps. Genome Biol. 2010;11: R37. doi:&lt;a href="http://dx.doi.org/10.1186/gb-2010-11-4-r37"&gt;10.1186/gb-2010-11-4-r37&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Altenhoff AM, Dessimoz C. Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput Biol. 2009;5: e1000262. doi:&lt;a href="http://dx.doi.org/10.1371/journal.pcbi.1000262"&gt;10.1371/journal.pcbi.1000262&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Altenhoff AM, Boeckmann B, Capella-Gutierrez S, Dalquen DA, DeLuca T, Forslund K, et al. Standardized benchmarking in the quest for orthologs. Nat Methods. 2016;13: 425&amp;ndash;430. doi:&lt;a href="http://dx.doi.org/10.1038/nmeth.3830"&gt;10.1038/nmeth.3830&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bernard C, Li Y, Lopez P, Bapteste E. Large-scale identification of known and novel RRNPP quorum sensing systems by RRNPP_detector captures novel features of bacterial, plasmidic and viral co-evolution. Mol Biol Evol. 2023. doi:&lt;a href="http://dx.doi.org/10.1093/molbev/msad062"&gt;10.1093/molbev/msad062&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;/p&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>The Surprising Uniformity of Protein Length Distribution Across the Tree of Life</title><link>https://lab.dessimoz.org/blog/2023/06/13/surprising_protein_uniformity</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2023/06/13/surprising_protein_uniformity</guid><dc:creator>Yannis Nevers &amp; Christophe Dessimoz</dc:creator><pubDate>Tue, 13 Jun 2023 09:55:06 +0100</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;Proteins are fundamental to all life forms, dictating the complex biochemical interactions that maintain and drive the existence of every species. The functionality of a protein hinges on its structural domain organization, and the protein&amp;rsquo;s length is a direct manifestation of this. Given that every species has evolved under varying evolutionary pressures, one would intuitively expect protein length distribution to differ significantly across species.&lt;/p&gt;

&lt;p&gt;Well, we report in a paper just &lt;a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-023-02973-2"&gt;published in &lt;em&gt;Genome Biology&lt;/em&gt;&lt;/a&gt; that this is not the case.&lt;/p&gt;

&lt;h2&gt;Unexpected Homogeneity in Protein Length Distribution&lt;/h2&gt;

&lt;p&gt;In our study, we examined the protein length distribution across 2,326 species encompassing 1,688 bacteria, 153 archaea, and 485 eukaryotes. Counter to expectations, we observed a striking consistency in protein length distribution across these species. Though eukaryotic proteins were somewhat longer, the variation in protein length distribution was notably low compared to other genomic features such as genome size, gene length, number of proteins, GC content, and isoelectric points of proteins.&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2023/06/protein-length-distribution.png"&gt;&lt;img width="100%" alt="Plots illustrating the high uniformity of protein length-related measures compared to other kinds of summary statistics" src="https://lab.dessimoz.org/blog/media/2023/06/protein-length-distribution.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;Features directly related to protein length are much more conserved than other features.&lt;/p&gt;

&lt;h2&gt;Exceptions: Errors or Biological Peculiarities?&lt;/h2&gt;

&lt;p&gt;We did note a few atypical cases of protein length distribution, but these were typically due to inaccuracies in gene annotation: no well-annotated model species displayed enrichment in small proteins, and those with a high number of small proteins were more likely to have incomplete or fragmented genome annotations.&lt;/p&gt;

&lt;p&gt;Indeed, the outliers tended to include many more genomes with low BUSCO quality scores. The only exception we observed was the prevalence of longer proteins in the Ustilago fungal genus and the Apicomplexa phylum, known for their intracellular parasitic lifestyles.&lt;/p&gt;

&lt;p&gt;This suggests that the actual variation in protein length distribution might be even smaller than what we reported. Hopefully, resequencing and reannotation efforts will help solve this issue in the future: we have already noticed a few species getting updated proteomes in which the length distribution becomes more similar to the typical one!&lt;/p&gt;

&lt;h2&gt;A Universal Selection Force at Play&lt;/h2&gt;

&lt;p&gt;The startling uniformity of protein length distribution across diverse species suggests a strong, universal selective pressure, maintaining a high proportion of the coding sequence within a specific length range. In the discussion part of the paper, we articulate a number of potential explanations, but these remain highly speculative.&lt;/p&gt;

&lt;p&gt;More positively put, the evolutionary forces behind the uniformity of protein distribution and their potential impact on fitness remain exciting areas of exploration!&lt;/p&gt;

&lt;h2&gt;Protein Length Distribution: A New Criterion for Gene Quality?&lt;/h2&gt;

&lt;p&gt;This observation led us to propose the use of protein length distribution as a new quality criterion for protein-coding gene sets at the time of their publication. Considering that an overabundance of spurious proteins could bias downstream analyses, this quality measure could aid in identifying and rectifying annotation errors. We also encourage everyone to take a look at this simple criterion when selecting proteomes for comparative genomics analysis.&lt;/p&gt;

&lt;h2&gt;Story behind the paper&lt;/h2&gt;

&lt;p&gt;The basic premise of the paper, exploring protein length distribution across the tree of life, may seem straightforward at first glance. Not quite. It started as part of Yannis&amp;rsquo;s PhD in &lt;a href="https://cstb.icube.unistra.fr/en/index.php?title=Odile_Lecompte"&gt;Odile Lecompte&lt;/a&gt;&amp;rsquo;s lab in Strasbourg&amp;mdash;and a few questions: what are the characteristics of the thousands of publicly available proteomes?  How to decide which to include in large scale analyses? It took another three years of Yannis&amp;rsquo;s postdoc, with about half of that time spent in the peer-review process.&lt;/p&gt;

&lt;p&gt;Perhaps the most revealing testament to the depth of this work is the supplementary PDF, a &lt;a href="https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-023-02973-2/MediaObjects/13059_2023_2973_MOESM2_ESM.pdf"&gt;68-page document&lt;/a&gt; filled with detailed data and analyses. Moreover, anyone interested in the peer-review history of our paper can delve into the 18-page record available &lt;a href="https://static-content.springer.com/esm/art%3A10.1186%2Fs13059-023-02973-2/MediaObjects/13059_2023_2973_MOESM8_ESM.docx"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The journey is the reward, they say; well in this instance, we are quite happy to have reached our destination!&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;h2&gt;Reference:&lt;/h2&gt;

&lt;p&gt;Nevers, Y., Glover, N.M., Dessimoz, C., Lecompte, O. Protein length distribution is remarkably uniform across the tree of life. Genome Biol 24, 135 (2023). &lt;a href="https://doi.org/10.1186/s13059-023-02973-2"&gt;https://doi.org/10.1186/s13059-023-02973-2&lt;/a&gt;&lt;/p&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>Read2Tree infers phylogenetic trees from raw sequencing reads quick and easy</title><link>https://lab.dessimoz.org/blog/2023/04/23/read2tree-infers-trees-from-raw-reads-behind-the-paper</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2023/04/23/read2tree-infers-trees-from-raw-reads-behind-the-paper</guid><dc:creator>Christophe Dessimoz &amp; Fritz Sedlazeck</dc:creator><pubDate>Sun, 23 Apr 2023 20:56:37 +0100</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;We just &lt;a href="https://www.nature.com/articles/s41587-023-01753-4"&gt;published&lt;/a&gt; a method to build phylogenetic trees directly from raw reads, bypassing time-consuming steps such as genome assembly. This post gives the short story and the backstory. In particular, find out below what Read2Tree has in common with &amp;ldquo;Smoke on the Water&amp;rdquo; from the band Deep Purple.&lt;/p&gt;

&lt;p&gt;In biology, phylogenetic trees are everywhere. They help us understand the relationships between species, genes, or cells&amp;mdash;how they evolved, and how they&amp;rsquo;re related.&lt;/p&gt;

&lt;p&gt;The sequencing revolution provides the raw material to infer phylogenetic trees, but building state-of-the-art phylogenetic trees requires tedious steps from read curation, de novo assembly, gene annotation, ortholog identification to tree inference, which can take many months to run&amp;mdash;millions of CPU hours invested in this process are not uncommon&amp;mdash;and specialised knowledge to oversee this process.&lt;/p&gt;

&lt;p&gt;That&amp;rsquo;s where Read2Tree comes in. Our new approach to tree inference bypasses the usual steps of genome assembly, annotation, and orthology inference. Instead, it uses existing knowledge of the protein sequence universe to directly reconstruct comprehensive sequence alignments from raw sequencing reads.&lt;/p&gt;

&lt;p&gt;The approach is vastly faster than traditional methods and in many cases more accurate&amp;mdash;the exception being when sequencing coverage is high and reference species very distant. Read2Tree is also flexible, working with genome and transcriptome, short and long reads, and sequencing coverage as low as 0.1x.&lt;/p&gt;

&lt;p&gt;We were encouraged by the buzz the Read2Tree manuscript elicited on bioRxiv last year, and are delighted it has now been &lt;a href="https://www.nature.com/articles/s41587-023-01753-4"&gt;published in Nature Biotechnology&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;What is Read2Tree good for?&lt;/h2&gt;

&lt;p&gt;A nice illustration of Read2Tree&amp;rsquo;s potential was the reconstruction of a phylogeny of coronaviruses, which combined in a single tree diverse Coronaviridae sequences as well as 10,000 raw SARS-CoV-2 datasets from the Short Read Archive. The reconstructed tree was consistent with the lineage classification obtained from the UniProt reference proteomes, accurately recovering the main coronavirus genera and all subgenera (Figure 1). At the same time, the phylogeny accurately clustered the sequences according to the CDC variants of concern classification. These results demonstrate the versatility and scalability of Read2Tree, making it suitable for both zoonotic surveillance and human epidemiology.&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2023/04/sarscov2-tree.png"&gt;&lt;img width="70%" alt="10k-sample COVID tree" src="https://lab.dessimoz.org/blog/media/2023/04/sarscov2-tree.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;Figure 1&amp;mdash;Zoomed-in display of a tree inferred using Read2Tree on 10,283 samples whole genome SARS-CoV-2 samples. Classification in colour was obtained from [https://harvestvariants.info](https://harvestvariants.info), where grey leaves are unclassified according to the CDC label. The colour clustering shows that the Read2Tree-based tree recovers consistent classification. Click on the tree to see it full screen.&lt;/p&gt;

&lt;p&gt;The ability to reconstruct phylogenetic trees from raw reads has additional advantages. Some genomes are deposited with poor or even entirely absent protein annotation sets. Processing genomes directly from raw reads can avoid this limitation and also decrease biases that arise from relying too heavily on specific reference genomes. Although some efforts have been made to &lt;a href="https://www.science.org/doi/10.1126/science.aar6343"&gt;&amp;ldquo;dehumanize&amp;rdquo; non-human great ape genomes&lt;/a&gt;, other clades still face similar biases that can be significantly reduced by processing raw reads.&lt;/p&gt;

&lt;h2&gt;Who might find it useful?&lt;/h2&gt;

&lt;p&gt;We think Read2Tree will be especially useful for small labs with limited bioinformatics expertise and computational resources, allowing them to perform state-of-the-art phylogenomics on particular species or environments of interest.&lt;/p&gt;

&lt;p&gt;But it&amp;rsquo;s not just small labs that can benefit from Read2Tree. Large consortia can also use it to regularly update their trees as new genomes are sequenced. This is especially important as more and more projects around comparative genomics are underway, such as the &lt;a href="https://www.earthbiogenome.org/"&gt;Earth BioGenome&lt;/a&gt;, the &lt;a href="https://www.darwintreeoflife.org/"&gt;Darwin Tree of Life&lt;/a&gt;, or the &lt;a href="https://www.erga-biodiversity.eu/"&gt;European Reference Genome Atlas&lt;/a&gt; projects.&lt;/p&gt;

&lt;p&gt;In addition, Read2Tree&amp;rsquo;s ability to infer trees from much lower coverage than traditional methods means it can also be useful for quality control early in the process. This makes it a valuable tool for environmental and metagenomic applications, especially when combined with genome binning techniques.&lt;/p&gt;

&lt;p&gt;Overall, Read2Tree is a powerful method for inferring phylogenetic trees directly from raw sequencing reads. We hope it will help make phylogenetic tree inference faster, more accurate, and more accessible to a wider range of researchers.&lt;/p&gt;

&lt;h2&gt;What&amp;rsquo;s next?&lt;/h2&gt;

&lt;p&gt;Now that the introductory Read2Tree paper is published, we are excited to explore new potential applications that we haven&amp;rsquo;t been able to tackle so far. For instance, we have already received inquiries from researchers interested in using Read2Tree for ancient DNA applications or for monitoring systems that require fast turnaround time and low coverage.&lt;/p&gt;

&lt;p&gt;Moving forward, we have two main goals. First, we aim to expand Read2Tree&amp;rsquo;s capabilities to handle multi-species samples, which will enable an even broader range of applications in the metagenomics field. While long-read applications may offer the most benefit, we are confident that Read2Tree&amp;rsquo;s ability to perform well with short reads will also prove valuable in disentangling multiple species.&lt;/p&gt;

&lt;p&gt;Secondly, we plan to explore the use of Read2Tree in single-cell sequencing. This rapidly growing field involves sequencing individual cells, including cancer cells, and analysing their genetic information. Given Read2Tree&amp;rsquo;s ability to operate with low coverage levels (down to 0.2x), we believe it could facilitate fast and accurate characterization of tumour or cell evolution.&lt;/p&gt;

&lt;p&gt;We hope that Read2Tree will help streamline and democratise comparative genomics analyses. We are excited to see how researchers will apply this tool to further advance our understanding of genetics and evolution.&lt;/p&gt;

&lt;h2&gt;What&amp;rsquo;s the backstory?&lt;/h2&gt;

&lt;p&gt;Our two labs (&lt;a href="https://fritzsedlazeck.github.io/"&gt;Fritz Sedlazeck&lt;/a&gt;&amp;rsquo;s and &lt;a href="https://lab.dessimoz.org/"&gt;Christophe Dessimoz&lt;/a&gt;&amp;rsquo;s) have been collaborating for many years, and we&amp;rsquo;ve always enjoyed exchanging ideas even though our research interests are quite diverse. One of our shared interests over the years has been how to combine our expertise in sequence analysis and ortholog comparison to develop new methodologies and gain new insights into biology.&lt;/p&gt;

&lt;p&gt;It was during one of Fritz&amp;rsquo;s visits to Christophe&amp;rsquo;s lab in Lausanne, Switzerland, that we started brainstorming ideas for a project that led to Read2Tree. Our goal was to overcome the limitations and bottlenecks of comparative genomics. We had some amazing cheese risotto, and the beautiful scenery fueled our discussions further (Figure 2).&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2023/04/montreux.png"&gt;&lt;img width="100%" alt="View on the lake Geneva from Montreux" src="https://lab.dessimoz.org/blog/media/2023/04/montreux.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;Figure 2 &amp;mdash; Fritz alleges that the epiphany of Read2Tree took place with this view from his hotel room in Montreux, Switzerland, during a collaborative visit to Christophe&amp;rsquo;s group. It&amp;rsquo;s not entirely implausible, considering this very view [inspired the song &amp;ldquo;Smoke on the Water&amp;rdquo; by Deep Purple](https://en.wikipedia.org/wiki/Smoke_on_the_Water#History).  &lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;David Dylus, the first author, was convinced that it was possible to bring our ideas to life, although he did not anticipate how much time and effort it would take (Figure 3). Even after he moved on to a new role in the pharmaceutical industry, he continued to work on Read2Tree after regular work hours. And when the COVID-19 pandemic hit, we had to face additional challenges, such as maintaining regular meetings and pushing the manuscript forward while not compromising on quality. We also faced technical issues, such as hard disk crashes and cluster updates that led to data loss, but David hung on.&lt;/p&gt;

&lt;p&gt;Completing the paper was not an easy task, and one of the biggest challenges was organising and identifying all of the SRA data sets, including those related to yeast and COVID-19. Despite these challenges, we were able to bring the project to completion. It was a special joy to present the work at ISMB 2022, where Fritz and Christophe had the wonderful opportunity to meet in person, and we continued to discuss our work while enjoying good food and drinks by beautiful Lake Mendota in Madison, Wisconsin.&lt;/p&gt;

&lt;p&gt;In summary, nice food and lakeside views were instrumental in the making of Read2Tree.&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2023/04/david.png"&gt;&lt;img width="100%" alt="David mining at SIB 20th anniversary party" src="https://lab.dessimoz.org/blog/media/2023/04/david.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;Figure 3 &amp;mdash; First author David Dylus performing on stage (centre, crouching) on the occasion of SIB Swiss Institute of Bioinformatics&amp;rsquo;s 20th anniversary&amp;mdash;a period of rapid progress in the development of Read2Tree. Though no-one is entirely certain, rumour has it that David is miming &amp;ldquo;sipping a cup of tea while looking into the distance&amp;rdquo;, in line with our theme of sustenance, inspiring landscapes, and scientific progress.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;Note this blog post was first published on the Nature Communities blog &lt;a href="https://ecoevocommunity.nature.com/posts/read2tree-infers-phylogenetic-trees-from-raw-sequencing-reads-quick-and-easy?channel_id=behind-the-paper"&gt;here&lt;/a&gt;&lt;/i&gt;.&lt;/p&gt;

&lt;h2&gt;Reference&lt;/h2&gt;

&lt;p&gt;Dylus, D., Altenhoff, A., Majidian, S. et al. Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree. Nat Biotechnol (2023). &lt;a href="https://doi.org/10.1038/s41587-023-01753-4"&gt;doi:10.1038/s41587-023-01753-4&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>OMArk: assessing proteome quality, quick and easy</title><link>https://lab.dessimoz.org/blog/2022/12/12/omark-to-evaluate-proteome-quality</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2022/12/12/omark-to-evaluate-proteome-quality</guid><dc:creator>Yannis Nevers</dc:creator><pubDate>Mon, 12 Dec 2022 09:39:58 +0000</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;I am excited to introduce our preprint for our new tool OMArk. We hope our software will help fill a gap in assessing the quality of gene annotation sets.&lt;/p&gt;

&lt;p&gt;Many studies directly rely on the protein-coding gene repertoires (&amp;ldquo;proteomes&amp;rdquo;) predicted from genome assemblies to perform their comparisons. In doing so, they assume that the predicted gene content of all genomes is of homogeneous quality and an accurate reflection of reality. Yet in practice, this assumption is rarely met, with protein-coding genes often missing or fragmented in the reported proteomes, non-coding sequences wrongly annotated as coding genes by gene predictors, or contamination from other species wrongly included among the reported sequences.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;h2&gt;Why a new proteome quality tool?&lt;/h2&gt;

&lt;p&gt;Our new method, OMArk, provides a way to easily and comprehensively measure different aspects of proteome quality: completeness of the gene repertoire, consistency of the included genes at the taxonomic level, whether they have doubtful gene structures, and the presence or absence of inter- or intra-domain contamination. Furthermore, contrary to existing methods, OMArk does not rely on manual selection of a reference dataset; instead, it automatically identifies the most likely taxonomic classification of the test proteomes. It can thus process any test proteome across the tree of life using a universal reference database.&lt;/p&gt;

&lt;table&gt;&lt;tr style="border-style: hidden!important;"&gt;&lt;td width="55%"&gt;&lt;a href="/blog/media/2022/12/omark_concept.jpg"&gt;&lt;img width="100%" class="nocaption" alt="Conceptual overview of the OMArk tool for genome quality assessment" src="https://lab.dessimoz.org/blog/media/2022/12/omark_concept.jpg"&gt;&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href="/blog/media/2022/12/omark_consistency.png"&gt;&lt;img width="100%" class="nocaption" alt="genome or proteome consistency assessment gives new insights" src="https://lab.dessimoz.org/blog/media/2022/12/omark_consistency.png"&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p class="caption"&gt;Conceptual overview of OMArk (left) and how the innovative consistency assessment is computed (right).&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;h2&gt;OMArk is accurate and provides new insights&lt;/h2&gt;

&lt;p&gt;We performed extensive validation of the method by introducing controlled amounts of noise, fragmentation and contamination to reference proteomes and accurately estimating these amounts using OMArk. We also performed a large-scale analysis of 1805 eukaryotic UniProt Reference Proteomes with our software and were able to detect unambiguous cases of quality issues, caused by incompleteness, contamination, or inclusion of translated non-coding sequences. In the most extreme case, we found a plant proteome with contamination from eight different species&amp;mdash;fungi and bacteria.&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2022/12/omark_reference_proteomes.png"&gt;&lt;img width="100%" class="nocaption" alt="OMArk run on all UniProt Reference Proteomes" src="https://lab.dessimoz.org/blog/media/2022/12/omark_reference_proteomes.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;OMArk results on 1805 Eukaryotic proteomes from UniProt. Interactively check results on the &lt;a href="https://omark.omabrowser.org"&gt;OMArk webserver&lt;/a&gt;, e.g. for the current &lt;a href="https://omark.omabrowser.org/view/8dcce5da36ebb68565e7090ab29912ef/"&gt;cowpea weevil&lt;/a&gt; reference proteome.&lt;/p&gt;

&lt;p&gt;Why does the consistency metric matter? For example, when comparing the Ensembl gene sets for two assemblies of Bombus impatiens, we can detect a major improvement in consistency (including contamination removal) at similar completeness.&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2022/12/omark_improvements.jpg"&gt;&lt;img width="40%" class="nocaption" alt="OMArk can reveal improvements in genome assemblies/annotations even if completeness has not substantially changed" src="https://lab.dessimoz.org/blog/media/2022/12/omark_improvements.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;OMArk can reveal improvements in genome assemblies/annotations even if completeness has not substantially changed.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;h2&gt;OMArk is quick and easy to run&lt;/h2&gt;

&lt;p&gt;OMArk can be easily used as a &lt;a href="https://github.com/DessimozLab/OMArk"&gt;command line tool&lt;/a&gt; or on our &lt;a href="https://omark.omabrowser.org"&gt;OMArk webserver&lt;/a&gt;. On the webserver, you can submit a FASTA file of your proteome and get results in about 30 minutes. Nothing more is required. You can visualize the results and directly compare them to precomputed results from closely related species (UniProt reference proteomes).&lt;/p&gt;

&lt;p&gt;More details can be found in the preprint linked below. Please let us know how &lt;a href="https://omark.omabrowser.org"&gt;the tool&lt;/a&gt; works for you!&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;h2&gt;Reference&lt;/h2&gt;

&lt;p&gt;Yannis Nevers, Alex Warwick Vesztrocy, Victor Rossier, Cl&amp;eacute;ment-Marie Train, Adrian Altenhoff, Christophe Dessimoz, Natasha M Glover
&lt;i&gt;Quality assessment of gene repertoire annotations with OMArk&lt;/i&gt;
Nature Biotechnology 2024 &lt;a href="https://doi.org/10.1038/s41587-024-02147-w"&gt;doi:10.1038/s41587-024-02147-w&lt;/a&gt;&lt;/p&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>“Mathematical and Computational Evolutionary Biology” annual conference, edition 2022</title><link>https://lab.dessimoz.org/blog/2022/11/06/mceb-2022</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2022/11/06/mceb-2022</guid><dc:creator>Charles Bernard</dc:creator><pubDate>Sun, 06 Nov 2022 14:51:04 +0000</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;This year, the MCEB international conference was held in Switzerland for the first time in its history.&lt;/p&gt;

&lt;p&gt;For five days, from June 26th to 30th, 2022, experts in mathematical and computational evolutionary biology
from all over the world shared their passion in Chateau d&amp;rsquo;Oex,
in the heart of the welcoming mountains of the regional natural park of Gruyeres.&lt;/p&gt;

&lt;p&gt;Here is a little summary of this annual not-to-be-missed event for evolutionary biology aficionados.&lt;/p&gt;

&lt;p&gt;The 2022 edition of the international MCEB conference brought together a hundred scientists
from diverse disciplines and at different stages of their careers,
from PhD students to world-renowned senior scientists.
For five consecutive days, Chateau d&amp;rsquo;Oex was thus the improbable scene of inspiring exchanges
about evolution between mathematicians, computational biologists, evolutionary biologists, ecologists,
epidemiologists and cancer biologists.&lt;/p&gt;

&lt;table&gt;&lt;tr style="border-style: hidden!important;"&gt;&lt;td width="62%"&gt;&lt;a href="/blog/media/2022/11/mceb1.jpg"&gt;&lt;img width="100%" class="nocaption" alt="view from conference room" src="https://lab.dessimoz.org/blog/media/2022/11/mceb1.jpg"&gt;&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href="/blog/media/2022/11/mceb2.jpg"&gt;&lt;img width="100%" class="nocaption" alt="cheese making" src="https://lab.dessimoz.org/blog/media/2022/11/mceb2.jpg"&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p class="caption"&gt;View from the conference venue (left) and cheese making social activity (right).&lt;/p&gt;

&lt;p&gt;In total, 6 one-hour talks and 20 short talks were given to present
epistemological perspectives, recent methodological advances and challenges yet to be addressed 
for reconstructing the evolutionary history of the genes, genomes, populations and species 
observed then and today on Earth.&lt;/p&gt;

&lt;p&gt;Reflecting the truly interdisciplinary nature of the event, a wide range of phylogenetic structures 
was discussed during these five days. 
Networks of gene flow, phylogenetic trees, and genealogical trees predicted by coalescent theory 
were all on this year&amp;rsquo;s menu.&lt;/p&gt;

&lt;p&gt;Experts in introgressive events such as horizontal gene transfers or endosymbioses 
provided insights into the methods and challenges of modelling reticulate evolution. 
In this respect, an inspirational talk showed how ghost lineages, which went extinct in the past 
but nonetheless exchanged some DNA with ancestors of extant species, could mislead our interpretation 
of the directionality of gene flow within phylogenetic networks.&lt;/p&gt;

&lt;p&gt;Lectures on the theoretical advances made over the last 50 years in the reconstruction 
of gene and species phylogenetic trees were then given by world leaders in the field of phylogeny. 
In particular, they provided mathematical evidence that, contrary to current practice 
for reconstructing species trees, neither the consensus tree of several gene trees 
nor the tree inferred from the concatenated alignment of these genes 
actually gives a good approximation of the phylogeny of the different species encoding these genes, 
which highlights the urgent need to pursue methodological efforts to better model species evolution.&lt;/p&gt;

&lt;p&gt;On shorter evolutionary timescales, numerous mathematical models for inferring genealogical trees 
of human populations or cancer cell lineages were also presented during the conference.&lt;/p&gt;

&lt;table&gt;&lt;tr style="border-style: hidden!important;"&gt;&lt;td width="50%"&gt;&lt;a href="/blog/media/2022/11/mceb4.jpg"&gt;&lt;img width="100%" class="nocaption" alt="view from conference room" src="https://lab.dessimoz.org/blog/media/2022/11/mceb4.jpg"&gt;&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;a href="/blog/media/2022/11/mceb3.jpg"&gt;&lt;img width="100%" class="nocaption" alt="cheese making" src="https://lab.dessimoz.org/blog/media/2022/11/mceb3.jpg"&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p class="caption"&gt;Poster session (left) and group photo (right).&lt;/p&gt;

&lt;p&gt;Finally, a strong focus was placed this year on methods for coupling phylogenetic inferences with phenotypic, 
ecological, archaeological, geographical, epidemiological and medical data in order to study 
how traits or diseases evolved across space and time. 
Striking examples of these integrative analyses were provided by methodologies that accurately retrace the evolution 
of the recent SARS-CoV-2 and MERS-CoV pandemics over time and space.&lt;/p&gt;

&lt;p&gt;In addition to these talks of exceptional scientific quality, two poster sessions led 
by junior researchers and students took place during the conference and were truly appreciated 
by all participants for the scientific excellence of the posters and the conviviality of the moments.&lt;/p&gt;

&lt;p&gt;Overall, five days of scientific presentations, poster sessions, dinners, parties
and social activities such as hiking in the Alps or visiting a cheese factory 
fostered scientific exchanges and informal talks, and helped create new bonds within this community of researchers.&lt;/p&gt;

&lt;p&gt;The MCEB 2022 conference was a great success, as it enabled a diversity of scientists from all over the world to meet, 
discuss their work and build new collaborations!&lt;/p&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>The Banana Conjecture</title><link>https://lab.dessimoz.org/blog/2020/12/08/human-banana-orthologs</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2020/12/08/human-banana-orthologs</guid><dc:creator>Natasha Glover</dc:creator><pubDate>Tue, 08 Dec 2020 16:24:06 +0000</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;I recently became aware of the memes and popular science articles going around the internet claiming that we share 50% of our DNA with bananas. For example:&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2020/11/banana_meme.jpg"&gt;&lt;img alt="Banana Ortholog Meme" src="https://lab.dessimoz.org/blog/media/2020/11/banana_meme.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;Source: &lt;a href="http://www.quickmeme.com/meme/36gnaz"&gt;http://www.quickmeme.com/meme/36gnaz&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I work in the Dessimoz lab at the University of Lausanne, and here we are in the business of comparing genes. In fact, I&amp;rsquo;ve asked a similar question before: what percentage of our protein-coding genes do we share with another plant, Arabidopsis thaliana? I computed the number as being &lt;a href="http://lab.dessimoz.org/blog/2018/10/01/human-plant-orthologs"&gt;closer to 17%&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I wanted to get to the bottom of this question once and for all: What percentage of a human&amp;rsquo;s &amp;ldquo;genetic material&amp;rdquo; is shared with a banana? There have been several other blog posts from scientists touching on this question (Neil Saunders: &lt;a href="https://nsaunders.wordpress.com/2018/05/09/50-bananas/"&gt;&amp;ldquo;50% bananas&amp;rdquo;&lt;/a&gt;, Stack Exchange skeptics: &lt;a href="https://skeptics.stackexchange.com/questions/35213/do-humans-share-50-of-their-dna-with-bananas"&gt;&amp;ldquo;Do humans share 50% of their DNA with bananas?&amp;rdquo;&lt;/a&gt;, Sanogenetics: &lt;a href="https://sanogenetics.com/blog/are-we-genetically-similar-to-bananas-and-why-is-this-important-for-research-in-disease/"&gt;&amp;ldquo;Are We Genetically Similar To Bananas And Why Is This Important For Research In Disease?&amp;rdquo;&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;However, I wanted to go a little deeper into:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Where this number came from, and the extent of it being spread on the internet.&lt;/li&gt;
&lt;li&gt;What exactly do we mean by &amp;ldquo;shared genetic material&amp;rdquo;?&lt;/li&gt;
&lt;li&gt;Some results I computed in attempts to put this controversy to rest once and for all.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this blog post I will attempt to address these questions.
&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;Where did the mythical 50% come from anyway?&lt;/h2&gt;

&lt;p&gt;A quick Google search shows that the relatedness between a human and a banana is a popular question. From a cursory, non-exhaustive search, the table below lists eight sources that report that 44-60% of the human genome is &amp;ldquo;shared&amp;rdquo; with banana.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
  &lt;th&gt;Source&lt;/th&gt;
  &lt;th&gt;quote&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="https://www.irishnews.com/magazine/science/2018/04/25/news/14-strange-facts-about-your-dna-1313755/"&gt;Irishnews.com&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&amp;ldquo;But we are also genetically related to bananas &amp;ndash; with whom we share 50% of our DNA &amp;ndash; and slugs &amp;ndash; with whom we share 70% of our DNA.&amp;rdquo;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="https://www.getscience.com/biology-explained/how-genetically-related-are-we-bananas"&gt;getscience.com&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&amp;ldquo;Banana: more than 60 percent identical. Many of the &amp;ldquo;housekeeping&amp;rdquo; genes that are necessary for basic cellular function, such as for replicating DNA, controlling the cell cycle, and helping cells divide are shared between many plants (including bananas) and animals.&amp;rdquo;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="https://www.thenakedscientists.com/articles/interviews/mythconception-we-share-half-our-dna-bananas"&gt;thenakedscientists.com&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&amp;ldquo;So where does this banana statistic come from? Is it just complete nonsense? Well, no. We do in fact share about 50% of our genes with plants &amp;ndash; including bananas.&amp;rdquo;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="https://www.popsci.com/humans-genetically-linked-to-bananas"&gt;PopSci.com&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&amp;ldquo;Bananas have 44.1% of genetic makeup in common with humans.&amp;rdquo;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="https://www.facebook.com/MythBusters/posts/sciencefact-humans-share-approximately-98-of-their-dna-with-chimps-70-with-slugs/10150466971768224/"&gt;MythBusters&lt;/a&gt; (tv show) facebook&lt;/td&gt;
  &lt;td&gt;&amp;ldquo;#sciencefact: Humans share approximately 98% of their DNA with chimps, 70% with slugs, and 50% with bananas! http://bit.ly/qsWX8p&amp;rdquo;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="https://www.mirror.co.uk/news/weird-news/humans-share-50-dna-bananas-2482139"&gt;mirror.co.uk&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&amp;ldquo;Humans share 50% of our DNA with a banana.&amp;rdquo;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="https://www.businessinsider.fr/us/comparing-genetic-similarity-between-humans-and-other-things-2016-5"&gt;Business Insider&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&amp;ldquo;The genetic similarity between a human and a banana is 60%.&amp;rdquo; Source: National Human Genome Research Institute (However, no link and when I tried to search)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
  &lt;td&gt;&lt;a href="https://www.sundaypost.com/in10/health/the-honest-truth-were-all-bananas-its-in-our-dna/"&gt;Sundaypost.com&lt;/a&gt;&lt;/td&gt;
  &lt;td&gt;&amp;ldquo;Yes, and we share 50% with bananas. It&amp;rsquo;s not surprising, if you look at the basic mechanism of biochemistry.&amp;rdquo;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;p&gt;What is disconcerting is that at least half of these sources are popular science websites or science sections of newspapers, yet few give any sort of citation at all. The only exceptions were Popular Science, which gave DataScope as a source, and Business Insider, which cites the National Human Genome Research Institute. However, neither of these articles gives a link or further information to follow up on.&lt;/p&gt;

&lt;p&gt;Upon further digging, I found one recent article published on HowStuffWorks entitled &amp;ldquo;Do People and Bananas Really Share 50 Percent of the Same DNA?&amp;rdquo;, which contains an interview with one of the scientists from the National Human Genome Research Institute, in which he explains how they arrived at that number.&lt;/p&gt;

&lt;blockquote&gt; 
&amp;ldquo;Brody says the experiment was not published, as most scientific research is. Instead, it was generated to be included as part of an educational Smithsonian Museum of Natural History video called &amp;lsquo;The Animated Genome.&amp;rsquo; That video noted that DNA between a human and a banana is &amp;lsquo;41 percent similar.&amp;rsquo;&amp;rdquo;
&lt;/blockquote&gt;

&lt;p&gt;The article goes on to explain that this 41% figure comes from a BLAST search between protein sequences of human and banana. They found about 7,000 hits, and the average &lt;strong&gt;percent identity&lt;/strong&gt; of these hits was 41%. He goes on to note:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&amp;ldquo;This is the average similarity between proteins (gene products), not genes&amp;hellip; Of course, there are many, many genes in our genome that do not have a recognizable counterpart in the banana genome and vice versa.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So when we get to the bottom of it, the 50% figure is actually a 41% average amino acid percent identity across about 7,000 BLAST hits between human and banana proteins.
&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
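&lt;p&gt;To make the distinction concrete, here is a minimal Python sketch of how an average percent identity over BLAST hits could be computed. The hit lines and gene names are invented for illustration, and the layout assumes BLAST tabular output with percent identity in the third column (only the first four columns are shown). This is not the script used for the original figure.&lt;/p&gt;

```python
from statistics import mean

# Hypothetical BLASTP tabular hits: query, subject, percent identity, alignment length.
blast_lines = [
    "human_1\tbanana_7\t38.5\t210",
    "human_2\tbanana_3\t45.0\t180",
    "human_3\tbanana_9\t39.5\t320",
]


def mean_percent_identity(lines):
    """Average the third column (percent identity) over all hits."""
    identities = [float(line.split("\t")[2]) for line in lines]
    return mean(identities)


average_identity = mean_percent_identity(blast_lines)
print(round(average_identity, 1))
```

&lt;p&gt;Note that such an average is taken only over proteins that produced a hit, so it says nothing about what fraction of genes have a counterpart in the other genome.&lt;/p&gt;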

&lt;h2&gt;What do we mean when we say &amp;ldquo;we share 50% of our DNA with a banana&amp;rdquo;?&lt;/h2&gt;

&lt;p&gt;All living organisms descended from a common ancestor, and therefore all living organisms have some genes in common. How many genes two species have in common depends on how long ago they shared a common ancestor. For example, humans and chimps share a high percentage of genes because we diverged only ~6 MYA&lt;sup&gt;1&lt;/sup&gt;. However, human and banana (more specifically the common ancestors which led to human and banana) split around 1.5 BILLION years ago&lt;sup&gt;2&lt;/sup&gt;. Talk about a banana split! Therefore we would expect a lot less to be conserved.&lt;/p&gt;

&lt;p&gt;As brought up by Neil Saunders in his &lt;a href="https://nsaunders.wordpress.com/2018/05/09/50-bananas/"&gt;blog post&lt;/a&gt;, &lt;em&gt;&amp;ldquo;What does &amp;lsquo;we share 50% of our DNA&amp;rsquo; really mean?&amp;rdquo;&lt;/em&gt; A non-biologist might not see the nuance in this question. If I were to play devil&amp;rsquo;s advocate, I could say that a child shares 50% of its DNA with each of its parents. Or even that every organism shares 100% of its DNA, as it is all made up of Gs, Cs, As, and Ts. Thus, it is important to be specific about what we&amp;rsquo;re talking about.&lt;/p&gt;

&lt;p&gt;This shared DNA could refer to a number of things: protein-coding &lt;a href="https://en.wikipedia.org/wiki/Gene"&gt;genes&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Non-coding_DNA"&gt;non-coding genes&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Transposable_element"&gt;transposable elements&lt;/a&gt;, the percent that gets aligned in a whole genome alignment&lt;sup&gt;3&lt;/sup&gt;, etc. Each of these features evolves at a different rate, and thus will be more or less conserved between any given species.
&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;Well, how do we know if these genomic features are conserved?&lt;/h3&gt;

&lt;p&gt;Generally, sequences are compared by making an alignment, and then computing the percent identity or evolutionary distance between the two sequences. If the sequences are sufficiently similar, they can be declared as conserved. Thus, &amp;ldquo;conserved&amp;rdquo; can be seen either as categorical (i.e. conserved or not) or as a quantitative value (conserved to a certain degree). For more information, see the Wiki page on &lt;a href="https://en.wikipedia.org/wiki/Conserved_sequence"&gt;conserved sequences&lt;/a&gt;.
&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;Which genomic features are the most likely to be conserved?&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;&amp;ldquo;Conservation indicates that a sequence has been maintained by natural selection&amp;rdquo;&lt;/em&gt; (&lt;a href="https://en.wikipedia.org/wiki/Conserved_sequence"&gt;wiki&lt;/a&gt;). Genes are DNA sequences that encode proteins. Protein sequences change more slowly than the underlying DNA, due to the &lt;a href="https://en.wikipedia.org/wiki/Codon_degeneracy"&gt;redundancy of the genetic code&lt;/a&gt;. Thus proteins are the genomic feature most likely to be conserved between evolutionarily distant species. While it is true that other genomic features such as non-coding regulatory sequences or non-coding RNA can be conserved over long evolutionary distances, they are far more likely to diverge in sequence than proteins&lt;sup&gt;4,5&lt;/sup&gt;. Other genetic features such as transposable elements or intergenic &amp;ldquo;junk DNA&amp;rdquo; are even less likely to be conserved, as their sequences are under less selection pressure and accumulate mutations at an even higher rate.&lt;/p&gt;

&lt;p&gt;It is important to note that while we generally declare sequences to be conserved on the basis of sequence similarity, sequences may still be conserved yet lack similarity. For example, two sequences might be conserved in the structure of the protein, indicating homology&lt;sup&gt;6,7&lt;/sup&gt;. Additionally, sequences might be in a syntenic position, indicating ancestral conservation, but may also lack sequence similarity&lt;sup&gt;8&lt;/sup&gt;. Thus, it is possible for some genes to be shared between evolutionarily distant species while flying under the radar of our current homology-inference tools. So, in order to investigate the 50% shared DNA claim, we can only focus on sequence conservation that we are able to detect.&lt;/p&gt;

&lt;p&gt;To understand how much of the genome is conserved between banana and human, I will look at proteins because it&amp;rsquo;s the feature &lt;strong&gt;&lt;em&gt;most likely&lt;/em&gt;&lt;/strong&gt; to be conserved between human and banana. This is to be as permissive as possible in attempts to give the benefit of the doubt to the 50% meme.&lt;/p&gt;

&lt;p&gt;Now the question is, how do we compare all the proteins in one species to all the proteins in another species and see which ones &amp;ldquo;match&amp;rdquo;, i.e. descended from a common ancestral gene? This is a fundamental problem in the study of evolution. &lt;strong&gt;Orthologs&lt;/strong&gt; are genes in different species that started diverging due to a speciation event, i.e. &amp;ldquo;corresponding&amp;rdquo; genes between species. This is where our lab&amp;rsquo;s expertise comes in: we maintain &lt;a href="https://omabrowser.org/oma/home/"&gt;Orthologous Matrix&lt;/a&gt;, which is a method and database for finding orthologs between many species.
&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;Orthologs in common between human and banana&lt;/h2&gt;

&lt;p&gt;I wanted to see what percent of human genes are orthologous to banana genes and, vice versa, what percent of banana genes are orthologous to human genes. I tested three common methods for finding orthologs: &lt;strong&gt;OMA&lt;/strong&gt;&lt;sup&gt;9&lt;/sup&gt;, &lt;strong&gt;OrthoInspector&lt;/strong&gt;&lt;sup&gt;10&lt;/sup&gt;, and &lt;strong&gt;best-bidirectional hit&lt;/strong&gt; (using BLASTP)&lt;sup&gt;11&lt;/sup&gt;. For each method, I divided the number of orthologs found by the number of genes in the genome to come up with a percentage of each genome that is shared. You can find all the details in this &lt;a href="https://github.com/DessimozLab/blogpost-code/blob/main/banana_conjecture/The_Banana_Conjecture.ipynb"&gt;Jupyter notebook&lt;/a&gt;, but the results are summed up in the graph below:&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2020/11/Percentage_human_banana_orthologs_barplot2.png"&gt;&lt;img width="80%" class="nocaption" alt="Banana Ortholog Comparison" src="https://lab.dessimoz.org/blog/media/2020/11/Percentage_human_banana_orthologs_barplot2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;Comparison of ortholog methods&lt;/p&gt;

&lt;p&gt;As you can see, all the orthology-inference methods tested show a maximum of 25% of human genes to be orthologous to banana genes. Again, these results give the most leeway, as we used protein sequences, which are the genomic elements most likely to be conserved.&lt;/p&gt;
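&lt;p&gt;As a rough illustration of the bookkeeping behind these percentages, the sketch below pairs genes by best-bidirectional hit and divides the number of human genes with a partner by the total gene count. The gene identifiers, the best-hit tables and the total of 8 genes are made up; the real analysis, described in the notebook linked above, works from BLASTP output on full proteomes.&lt;/p&gt;

```python
# Hypothetical best-hit tables: each gene maps to its top-scoring hit in the other proteome.
human_best = {"H1": "B1", "H2": "B2", "H3": "B9", "H4": None}
banana_best = {"B1": "H1", "B2": "H7", "B9": "H3"}


def bidirectional_best_hits(best_a, best_b):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    return [(a, b) for a, b in best_a.items() if b is not None and best_b.get(b) == a]


def percent_with_ortholog(pairs, n_genes):
    """Percentage of a genome's genes that appear in at least one pair."""
    genes = {a for a, _ in pairs}
    return 100.0 * len(genes) / n_genes


pairs = bidirectional_best_hits(human_best, banana_best)
# Assume the toy human genome has 8 genes, 4 of which had no hit at all.
percent_human = percent_with_ortholog(pairs, n_genes=8)
print(pairs)
print(percent_human)
```

&lt;p&gt;The denominator matters: the same set of pairs gives a different percentage for each species, because the two genomes do not contain the same number of genes.&lt;/p&gt;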

&lt;p&gt;Additionally, I investigated the percentage of a whole-genome alignment that would be shared between banana and human. Since this is computationally intensive, I used &lt;a href="http://www.ensembl.org/info/genome/compara/index.html"&gt;Ensembl Compara&lt;/a&gt;, which has precomputed pairwise whole-genome alignments between a number of species. A whole-genome alignment looks at the whole genome, not just genes, and compares DNA rather than proteins. Ensembl Compara didn&amp;rsquo;t have results between human and banana, but here are the results between human and chimp, mouse, and zebrafish:
&lt;br&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2020/11/ensembl_genome_comparisons.png"&gt;&lt;img width="90%" alt="Ensembl Whole Genome Comparisons" src="https://lab.dessimoz.org/blog/media/2020/11/ensembl_genome_comparisons.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;Data obtained from https://uswest.ensembl.org/info/genome/compara/analyses.html#
&lt;/p&gt;

&lt;p&gt;As evolutionary distance increases, a smaller and smaller percentage of the genome can be aligned. We can presume that for plants it would be less than 1%, a far cry from the 50% reported by internet memes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So whichever way you slice it, humans share at most &amp;frac14;, not &amp;frac12;, of their genetic material with banana (at least as far as we are able to detect)!&lt;/strong&gt;
&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;What do these human-banana orthologs DO?&lt;/h2&gt;

&lt;p&gt;Now that we have found the human-banana orthologs, we can try to gain some insight into what these genes do. To do this, I performed a Gene Ontology (GO) enrichment analysis of the human genes. &lt;a href="https://en.wikipedia.org/wiki/Gene_Ontology_Term_Enrichment"&gt;GO enrichment&lt;/a&gt; works by assigning functional annotations to all of the sequences, then looking for a statistical overrepresentation of certain functions in a subset of genes compared to the entire genome.&lt;/p&gt;
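&lt;p&gt;To give an idea of what &amp;ldquo;statistical overrepresentation&amp;rdquo; means, here is a minimal sketch using a one-sided hypergeometric test, one common choice for this kind of analysis. The counts are invented, the function name is hypothetical, and this is not necessarily the exact statistic computed by the web server used for the results in this post.&lt;/p&gt;

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """Probability of seeing at least k annotated genes in a subset of n genes,
    when K of the N genes in the genome carry the GO term."""
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Toy numbers: 20 genes in the genome, 5 carry the term,
# and 3 of the 4 genes in our subset carry it.
p_value = hypergeom_upper_tail(k=3, n=4, K=5, N=20)
print(round(p_value, 4))
```

&lt;p&gt;A small probability means the term appears in the subset more often than expected by chance. In practice, thousands of terms are tested at once, so a correction for multiple testing is needed as well.&lt;/p&gt;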

&lt;p&gt;I used the &lt;a href="http://pantherdb.org/webservices/go/overrep.jsp"&gt;PANTHER Overrepresentation Test&lt;/a&gt; web server for the GO enrichment, then used &lt;a href="https://gitlab.com/evogenlab/GO-Figure"&gt;GO-Figure&lt;/a&gt;&lt;sup&gt;12&lt;/sup&gt; for summarizing and visualizing the most enriched Biological Processes.  All the details are in the &lt;a href="https://github.com/DessimozLab/blogpost-code/blob/main/banana_conjecture/The_Banana_Conjecture.ipynb"&gt;jupyter notebook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The top 10 overrepresented GO terms, i.e. a summary of the most common functions of the human genes with banana orthologs, are shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2020/11/biological_process_gofigure_.png"&gt;&lt;img width="90%" alt="Ensembl Whole Genome Comparisons" src="https://lab.dessimoz.org/blog/media/2020/11/biological_process_gofigure_.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;Top 10 overrepresented GO Biological Processes for human protein-coding genes with banana orthologs
&lt;/p&gt;

&lt;p&gt;We can see that the human-banana orthologs are highly enriched for basic metabolic processes such as &amp;ldquo;cellular metabolic process,&amp;rdquo; &amp;ldquo;gene expression,&amp;rdquo; and &amp;ldquo;RNA processing.&amp;rdquo; These genes likely encode cellular processes that are essential for eukaryotic life!
&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;Take home message&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&amp;ldquo;Humans share 50% of DNA with banana&amp;rdquo; is a statement that has very little meaning.&lt;/li&gt;
&lt;li&gt;We must be careful to be precise in our language. We have to clarify what we mean when we give a percentage of &amp;ldquo;shared genetic material/DNA/genome.&amp;rdquo; I argue that the percentage of protein-coding genes is currently the best way to compare evolutionarily distant species.&lt;/li&gt;
&lt;li&gt;There&amp;rsquo;s no evidence that 50% of human genes have detectable orthologs in banana. In my analysis, I show between 17 and 24%, depending on which method was used. As scientists, we have to do a better job communicating science with each other and with the general public.
&lt;br&gt;
&lt;br&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even though we don&amp;rsquo;t have 50% of our genes in common with banana, we still have ~20%, which is nothing to scoff at! These genes most likely encode basic housekeeping proteins involved in metabolic processes that are necessary for most, if not all, of eukaryotic life. It is amazing that these genes have been conserved over 1.5 billion years of evolution!&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;/p&gt;

&lt;h3&gt;References&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Patterson, N., Richter, D. J., Gnerre, S., Lander, E. S. &amp;amp; Reich, D. Genetic evidence for complex speciation of humans and chimpanzees. Nature 441, 1103&amp;ndash;1108 (2006).&lt;/li&gt;
&lt;li&gt;Wang, D. Y., Kumar, S. &amp;amp; Hedges, S. B. Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi. Proc. Biol. Sci. 266, 163&amp;ndash;171 (1999).&lt;/li&gt;
&lt;li&gt;Armstrong, J., Fiddes, I. T., Diekhans, M. &amp;amp; Paten, B. Whole-Genome Alignment and Comparative Annotation. Annu Rev Anim Biosci 7, 41&amp;ndash;64 (2019).&lt;/li&gt;
&lt;li&gt;Ransohoff, J. D., Wei, Y. &amp;amp; Khavari, P. A. The functions and unique features of long intergenic non-coding RNA. Nat. Rev. Mol. Cell Biol. 19, 143&amp;ndash;157 (2018).&lt;/li&gt;
&lt;li&gt;Diederichs, S. The four dimensions of noncoding RNA conservation. Trends Genet. 30, 121&amp;ndash;123 (2014).&lt;/li&gt;
&lt;li&gt;Illerg&amp;aring;rd, K., Ardell, D. H. &amp;amp; Elofsson, A. Structure is three to ten times more conserved than sequence&amp;mdash;a study of structural response in protein cores. Proteins 77, 499&amp;ndash;508 (2009).&lt;/li&gt;
&lt;li&gt;Zheng, W. et al. Detecting distant-homology protein structures by aligning deep neural-network based contact maps. PLoS Comput. Biol. 15, e1007411 (2019).&lt;/li&gt;
&lt;li&gt;Vakirlis, N., Carvunis, A.-R. &amp;amp; McLysaght, A. Synteny-based analyses indicate that sequence divergence is not the main source of orphan genes. Cold Spring Harbor Laboratory 735175 (2019) doi:10.1101/735175.&lt;/li&gt;
&lt;li&gt;Altenhoff, A. M. et al. OMA orthology in 2021: website overhaul, conserved isoforms, ancestral gene order and more. Nucleic Acids Res. doi:10.1093/nar/gkaa1007.&lt;/li&gt;
&lt;li&gt;Nevers, Y. et al. OrthoInspector 3.0: open portal for comparative genomics. Nucleic Acids Res. 47, D411&amp;ndash;D418 (2019).&lt;/li&gt;
&lt;li&gt;Moreno-Hagelsieb, G. &amp;amp; Latimer, K. Choosing BLAST options for better detection of orthologs as reciprocal best hits. Bioinformatics 24, 319&amp;ndash;324 (2008).&lt;/li&gt;
&lt;li&gt;Reijnders, M. J. &amp;amp; Waterhouse, R. M. Summary Visualisations of Gene Ontology Terms with GO-Figure! Cold Spring Harbor Laboratory 2020.12.02.408534 (2020) doi:10.1101/2020.12.02.408534.&lt;/li&gt;
&lt;/ol&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>Progress in genomic checkers</title><link>https://lab.dessimoz.org/blog/2020/09/23/genomic-checkers</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2020/09/23/genomic-checkers</guid><dc:creator>Nastassia Gobet</dc:creator><pubDate>Wed, 23 Sep 2020 16:53:19 +0100</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;When I started using word processors, the spell checker only caught small, common typing errors and often tried to correct acceptable words because its vocabulary was limited. A few years later, spell checkers are not only better at this, with richer dictionaries, but can also catch grammar mistakes and redundant phrases. A similar story is unfolding with the detection of genomic variants.&lt;/p&gt;

&lt;h2&gt;The genome as a big text&lt;/h2&gt;

&lt;p&gt;The genome can be considered as a big text, written in a 4-letter alphabet (A, C, G, T). When comparing the genomic words from two individuals, we can look at differences of one or a few letters (single nucleotide variants, SNVs) and at longer patterns (structural variants, SVs) such as words, sentences, and paragraphs that are added (insertions), missing (deletions), exchanged (translocations), repeated (duplications and copy number variations, CNVs), inverted (inversions), or combinations of these (complex SVs).&lt;/p&gt;

&lt;h2&gt;Discovering the importance of SVs&lt;/h2&gt;

&lt;p&gt;About ten years ago, the focus was mainly on SNVs, as these are numerous and many methods had been developed to detect them. They were studied in depth and indexed in dictionaries (databases) that also document their frequencies. However, one-letter differences do not necessarily have a significant effect on the meaning of the text (the phenotypes). On the other hand, although SVs were underestimated and consequently understudied, they were discovered to have a profound phenotypic impact on gene regulation, dosage, and function. Therefore, they are important in a wide variety of medical conditions: cancers, neurological diseases (Parkinson&amp;rsquo;s, Huntington&amp;rsquo;s), and mental disorders (autism, schizophrenia).&lt;/p&gt;

&lt;h2&gt;Challenges in SV identification&lt;/h2&gt;

&lt;p&gt;Methods to detect SVs have recently been developed, and more are under development. A number of challenges need to be dealt with. First, short-read sequencing greatly limits the detection of large events exceeding read length. Longer-read technologies (PacBio and ONT) extend the range of detectable SVs, but this comes at the cost of decreased sequencing accuracy and a higher price. Hybrid strategies combining short and long reads are therefore promising. Second, SVs are hard to classify, as the variant type depends on the sequence context: a sequence can be considered an insertion, duplication, or translocation depending on its source (Figure 1). In addition, the number of possible SVs is practically unlimited, whereas for SNVs there are 3 variants per position in the worst case. SVs are thus hard to compare: which criteria should we use to determine if two slightly different calls correspond to the same event or not? This affects SV reporting and frequencies. Due to the relative youth of the field, standards and best practices have yet to be established. Different initiatives (e.g. &lt;a href="https://www.nist.gov/programs-projects/genome-bottle"&gt;Genome in a Bottle&lt;/a&gt; and &lt;a href="https://sites.google.com/view/seqc2"&gt;SEQC2&lt;/a&gt;) aim at better characterizing false positives and false negatives in SV calling. This should help implement more objective benchmarking and comparison between the various detection methods.&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2020/09/sv_overview.png"&gt;&lt;img width="100%" alt="Redesign OMA Browser" src="https://lab.dessimoz.org/blog/media/2020/09/sv_overview.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;Figure 1: An SV was called for a sequence from a sample differing from the reference sequence. Three possible scenarios of formation could explain the SV observed: an insertion, a duplication or a translocation.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;h2&gt;Future of genomic spelling and grammar checkers&lt;/h2&gt;

&lt;p&gt;Standards and objective benchmarking for SV detection are still missing, so one must be careful with results obtained from current methods. However, SVs are increasingly recognized as being important, and technologies to detect them are evolving rapidly. I think their use will become common practice in genomic variation studies in a few years, similar to spelling and grammar checkers in word processors. And you, which genome checker will you use?&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;h2&gt;Reference&lt;/h2&gt;

&lt;p&gt;Mahmoud M, Gobet N, Cruz-D&amp;aacute;valos DI, Mounier N, Dessimoz C, Sedlazeck FJ. 2019. Structural variant calling: the long and the short of it. Genome Biol 20:246. &lt;a href="https://doi.org/10.1186/s13059-019-1828-7"&gt;doi:10.1186/s13059-019-1828-7&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;&lt;i&gt;If you want to get involved in improving SV detection, consider joining &lt;a href="https://www.biostars.org/p/457598/"&gt;this Hackathon&lt;/a&gt;, to be held remotely Oct. 11-14, 2020.&lt;/i&gt;&lt;/p&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>OMA Standalone made easy: a step-by-step guide</title><link>https://lab.dessimoz.org/blog/2020/04/06/omastandalone_guide</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2020/04/06/omastandalone_guide</guid><dc:creator>Natasha Glover</dc:creator><pubDate>Mon, 06 Apr 2020 10:34:17 +0100</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;Got newly sequenced genomes with protein annotations? Need to quickly and easily define the homologous relationships between the genes?&lt;/p&gt;

&lt;p&gt;OMA Standalone is software developed by our lab that can be used to infer homologs from whole genomes, including orthologs, paralogs, and Hierarchical Orthologous Groups &lt;a href="https://genome.cshlp.org/content/29/7/1152.full"&gt;(Altenhoff et al. 2019)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The OMA Standalone algorithm works like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2020/04/omastandalone_pipeline.png"&gt;&lt;img alt="OMA Standalone pipeline" src="https://lab.dessimoz.org/blog/media/2020/04/omastandalone_pipeline.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In short, it takes as input user-contributed custom genomes (with the option of combining them with reference genomes already in the OMA database), and proceeds through three main parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Quality and consistency checks of the genomes that will be used to run OMA Standalone;&lt;/li&gt;
&lt;li&gt;All-against-all alignments of every protein sequence to all other protein sequences;&lt;/li&gt;
&lt;li&gt;Orthology inference, in the form of: pairwise orthologs, OMA Groups, and Hierarchical Orthologous Groups (HOGs). For more information on these types of orthologs output by OMA, see &lt;a href="https://f1000research.com/articles/9-27"&gt;OMA: A Primer (Zahn-Zabal et al. 2020)&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Although OMA Standalone is well documented and straightforward to use, running it on a High Performance Cluster (HPC) can be challenging.&lt;/p&gt;

&lt;p&gt;To cover the bare necessities for running OMA Standalone, we wrote an &lt;a href="/blog/media/2020/04/omastandalone_cheat_sheet.pdf"&gt;OMA Standalone Cheat Sheet&lt;/a&gt;, which you can download for step-by-step instructions on running the software on an HPC. We use the cluster Wally as an example, as it is one of the HPCs here at the University of Lausanne. Wally uses SLURM as the scheduler for submitting jobs, so all the examples are shown with SLURM. In the future, we plan to provide additional information on running with other schedulers, such as LSF or SGE. In the &lt;a href="/blog/media/2020/04/omastandalone_cheat_sheet.pdf"&gt;Cheat Sheet&lt;/a&gt;, you will find tips, hints, commands, and example scripts to run OMA Standalone on Wally.&lt;/p&gt;

&lt;p&gt;Additionally, we prepared a video which walks the user through the process of running OMA Standalone from start to finish, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downloading the software&lt;/li&gt;
&lt;li&gt;Preparing your genomes for running&lt;/li&gt;
&lt;li&gt;Editing the necessary parameters file&lt;/li&gt;
&lt;li&gt;Creating the job scripts and&lt;/li&gt;
&lt;li&gt;Submitting your jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;br&gt;
The video can be found on our lab&amp;rsquo;s YouTube channel, at &lt;a href="https://www.youtube.com/watch?v=a1FqwGZ0WV4&amp;amp;feature=youtu.be"&gt;OMA standalone: how to efficiently identify orthologs using a cluster&lt;/a&gt;, and is also embedded here for your convenience:&lt;/p&gt;

&lt;div class="yt-container"&gt; &lt;iframe src="//www.youtube.com/embed/a1FqwGZ0WV4?html5=1" frameborder="0" allowfullscreen class="video"&gt;&lt;/iframe&gt; &lt;/div&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;We hope these resources can be helpful if you need help getting started running OMA Standalone. But don&amp;rsquo;t forget, there is also plenty of information that can be found on the OMA Standalone &lt;a href="https://omabrowser.org/standalone/"&gt;webpage&lt;/a&gt; or in the OMA Standalone &lt;a href="https://genome.cshlp.org/content/29/7/1152.full"&gt;paper&lt;/a&gt;. If all else fails, don&amp;rsquo;t hesitate to contact us on &lt;a href="https://www.biostars.org/t/oma/"&gt;Biostars&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Altenhoff, A. M. et al. OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Res. 29, 1152&amp;ndash;1163 (2019). &lt;/li&gt;
&lt;li&gt;Zahn-Zabal, M., Dessimoz, C. &amp;amp; Glover, N. M. Identifying orthologs with OMA: A primer. F1000Res. 9, 27 (2020).&lt;/li&gt;
&lt;/ol&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>Creating a bibliography with links to PubMed and PubMedCentral</title><link>https://lab.dessimoz.org/blog/2020/02/24/bibliography-with-links-to-pubmed</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2020/02/24/bibliography-with-links-to-pubmed</guid><dc:creator>Christophe Dessimoz</dc:creator><pubDate>Mon, 24 Feb 2020 12:57:07 +0000</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;&lt;link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/default.min.css"&gt;
&lt;script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"&gt;&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;We just submitted a paper to Nucleic Acids Research Web Server issue. As it 
turns out, the editor requires &lt;a href="https://academic.oup.com/nar/pages/submission_webserver#Manuscripts"&gt;a bibliography with links to DOI, PubMed, and PubMedCentral entries&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is a brief tutorial to generate such a bibliography from a bibtex file 
which contains the relevant entries, loosely based on the explanations 
provided in this &lt;a href="https://tex.stackexchange.com/questions/175776/how-can-i-create-entirely-new-data-types-with-biblatex-biber/175896#175896"&gt;TeX StackExchange entry&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Generating the bibtex file&lt;/h2&gt;

&lt;p&gt;In the lab, we mainly use &lt;a href="https://paperpile.com"&gt;Paperpile&lt;/a&gt; as our bibliography management system, but most systems can export records in bibtex format. When available, Paperpile includes DOIs, PubMed IDs, and PubMedCentral IDs as follows:&lt;/p&gt;

&lt;pre&gt;&lt;code class="tex"&gt;@ARTICLE{Glover2019-xs, 
    title    = "{Assigning confidence scores to homoeologs using fuzzy logic}",  
    author   = "Glover, Natasha M and Altenhoff, Adrian and Dessimoz, Christophe",
    journal  = "PeerJ",
    volume   =  6,
    pages    = "e6231",
    year     =  2019,
    doi      = "10.7717/peerj.6231",
    pmid     = "30648004",
    pmc      = "PMC6330999"
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this example, we store the bibliography in a file named &lt;code&gt;ref.bib&lt;/code&gt;.&lt;/p&gt;
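
&lt;p&gt;The PubMed and PubMedCentral links can only be printed for entries that actually carry the &lt;code&gt;pmid&lt;/code&gt; and &lt;code&gt;pmc&lt;/code&gt; fields, so it can be worth checking the export first. Below is a rough sketch in Python, not a full BibTeX parser: &lt;code&gt;missing_id_fields&lt;/code&gt; is a hypothetical helper, it assumes one field per line as in the entry above, and the second entry in the example is made up for illustration.&lt;/p&gt;

```python
import re

# Hypothetical helper, not part of Paperpile or biblatex: a rough regex scan
# that assumes one "name = value" field per line, as in the entry above.
ENTRY_START = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
FIELD = re.compile(r"^\s*(\w+)\s*=", re.MULTILINE)


def missing_id_fields(bibtex_text, wanted=("doi", "pmid", "pmc")):
    """Map each citation key to the wanted fields its entry lacks."""
    report = {}
    starts = list(ENTRY_START.finditer(bibtex_text))
    for i, match in enumerate(starts):
        # An entry's body runs until the next entry starts (or the text ends).
        end = starts[i + 1].start() if i + 1 != len(starts) else len(bibtex_text)
        body = bibtex_text[match.end():end]
        present = {name.lower() for name in FIELD.findall(body)}
        missing = [name for name in wanted if name not in present]
        if missing:
            report[match.group(2)] = missing
    return report


example = """
@ARTICLE{Glover2019-xs,
    title    = "{Assigning confidence scores to homoeologs using fuzzy logic}",
    doi      = "10.7717/peerj.6231",
    pmid     = "30648004",
    pmc      = "PMC6330999"
}
@ARTICLE{Hypothetical2020-ab,
    title    = "{A made-up entry without PubMed identifiers}",
    doi      = "10.0000/example"
}
"""

# Only the made-up entry is reported, because it lacks pmid and pmc.
print(missing_id_fields(example))
```

&lt;p&gt;To check a real export, pass the contents of &lt;code&gt;ref.bib&lt;/code&gt; to the function instead of the inline example.&lt;/p&gt;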

&lt;h2&gt;Extending biblatex to include PMID and PMC links in the bibliography&lt;/h2&gt;

&lt;p&gt;DOIs are already supported by most bibliography systems. To also include PMIDs and PMCIDs, the trick is to use the flexible &lt;a href="https://ctan.org/pkg/biblatex?lang=en"&gt;BibLaTeX&lt;/a&gt; package.&lt;/p&gt;

&lt;p&gt;In a separate definition file, which we named &lt;code&gt;adn.dbx&lt;/code&gt;, add the 
field definitions for PMIDs and PMCIDs.&lt;/p&gt;

&lt;pre&gt;&lt;code class="tex"&gt;\DeclareDatamodelFields[type=field,datatype=literal]{
    pmid,
    pmc,
}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can now load this file through the &lt;code&gt;datamodel&lt;/code&gt; option of the biblatex package in the main LaTeX file 
(in the preamble, i.e. before &lt;code&gt;\begin{document}&lt;/code&gt;), and define the 
links to PubMed and PubMedCentral entries:&lt;/p&gt;

&lt;pre&gt;&lt;code class="tex"&gt;\usepackage[backend=biber,style=authoryear,datamodel=adn]{biblatex}
\DeclareFieldFormat{pmc}{%
  PubMedCentral\addcolon\space
  \ifhyperref
    {\href{http://www.ncbi.nlm.nih.gov/pmc/articles/#1}{\nolinkurl{#1}}}
    {\nolinkurl{#1}}}
\DeclareFieldFormat{pmid}{%
  PubMed ID\addcolon\space
  \ifhyperref
    {\href{https://www.ncbi.nlm.nih.gov/pubmed/#1}{\nolinkurl{#1}}}
    {\nolinkurl{#1}}}
\renewbibmacro*{finentry}{%
  \printfield{pmid}%
  \newunit
  \printfield{pmc}%
  \newunit
  \finentry}
\addbibresource{ref.bib}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can generate the bibliography by citing every paper using the &lt;code&gt;\cite{}&lt;/code&gt; command, and printing the bibliography.&lt;/p&gt;

&lt;pre&gt;&lt;code class="tex"&gt;\cite{Glover2019-xs}
\newpage
\printbibliography&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;Polishing: highlighting the links with colour&lt;/h2&gt;

&lt;p&gt;To make the links more visible, load the &lt;code&gt;hyperref&lt;/code&gt; package 
with the &lt;code&gt;colorlinks&lt;/code&gt; option:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;\usepackage[colorlinks]{hyperref}
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;OK, thanks but could I just have the files please?&lt;/h2&gt;

&lt;p&gt;Of course! &lt;a href="/blog/media/2020/02/biblio.zip"&gt;Here they are&lt;/a&gt;.&lt;/p&gt;

&lt;script&gt;hljs.initHighlightingOnLoad();&lt;/script&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>How to get published: interview series</title><link>https://lab.dessimoz.org/blog/2019/02/20/how-to-get-published</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2019/02/20/how-to-get-published</guid><dc:creator>Natasha Glover</dc:creator><pubDate>Wed, 20 Feb 2019 08:34:59 +0000</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;You&amp;rsquo;ve wrapped up the research on your project. You&amp;rsquo;ve gotten good results. It&amp;rsquo;s time to publish them&amp;mdash;your PhD/postdoc/career depends on it.&lt;/p&gt;

&lt;p&gt;But you&amp;rsquo;re drawing a blank. Staring at the screen for hours, not knowing how to get started, and the only thing you can produce is an empty document. One of the most daunting tasks for young scientists is writing and publishing a paper, especially the first one.&lt;/p&gt;

&lt;p&gt;In the context of a tutorial of the &lt;a href="https://unil.ch/quantitative-biology"&gt;UNIL Quantitative Biology PhD Program&lt;/a&gt;, I prepared a series of three short videos featuring interviews with professors on tips to successfully publish a scientific paper. These videos were aimed towards PhD students, but contain useful advice for anyone, at any stage of their career. Here&amp;rsquo;s what they had to say:&lt;/p&gt;

&lt;h1&gt;Part 1: the writing process&lt;/h1&gt;

&lt;div class="yt-container"&gt; &lt;iframe src="//www.youtube.com/embed/JQrJyDsddZc?html5=1" frameborder="0" allowfullscreen class="video"&gt;&lt;/iframe&gt; &lt;/div&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;h1&gt;Part 2: the journal selection&lt;/h1&gt;

&lt;div class="yt-container"&gt; &lt;iframe src="//www.youtube.com/embed/xFnIsWrljLQ?html5=1" frameborder="0" allowfullscreen class="video"&gt;&lt;/iframe&gt; &lt;/div&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;h1&gt;Part 3: responding to reviewers&lt;/h1&gt;

&lt;div class="yt-container"&gt; &lt;iframe src="//www.youtube.com/embed/j2cfcH1P8E8?html5=1" frameborder="0" allowfullscreen class="video"&gt;&lt;/iframe&gt; &lt;/div&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>Exclusive: European Tour of Antonis Rokas        </title><link>https://lab.dessimoz.org/blog/2018/10/23/tour-antonis-rokas</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2018/10/23/tour-antonis-rokas</guid><dc:creator>Christophe Dessimoz</dc:creator><pubDate>Tue, 23 Oct 2018 08:54:22 +0100</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;We are delighted to host Prof. Antonis Rokas, Vanderbilt, for two special seminars at University College London and at the University of Lausanne!&lt;/p&gt;

&lt;h2&gt;Genomics and the making of biodiversity across the budding yeast subphylum&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://as.vanderbilt.edu/rokaslab/"&gt;Prof. Antonis Rokas&lt;/a&gt;, Vanderbilt 
University&lt;/p&gt;

&lt;p&gt;&lt;em&gt;London&lt;/em&gt;: Tue 13 Nov 2018, 11am, UCL, &lt;a href="https://www.ucl.ac.uk/maps/roberts-309"&gt;Roberts Building 309&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Lausanne&lt;/em&gt;: Wed 14 Nov 2018, 11am, UNIL, &lt;a href="https://planete.unil.ch/plan/?local='GEN-2003'"&gt;Genopode auditorium A&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Abstract&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Yeasts are unicellular fungi that do not form fruiting bodies. Although the yeast lifestyle has evolved multiple times, most known species belong to the subphylum Saccharomycotina (hereafter yeasts). This diverse group includes the premier eukaryotic model system, Saccharomyces cerevisiae; the common human commensal and opportunistic pathogen, Candida albicans; and over 1,000 other known species (with more continuing to be discovered). Yeasts are found in every biome and continent and are more genetically diverse than either plants or bilaterian animals. Ease of culture, simple life cycles, and small genomes (10&amp;ndash; 20 Mbp) have made yeasts exceptional models for molecular genetics, biotechnology, and evolutionary genomics. Since only a tiny fraction of yeast biodiversity and metabolic capabilities has been tapped by industry and science, expanding the taxonomic breadth of deep genomic investigations will further illuminate how genome function evolves to encode their diverse metabolisms and ecologies. As part of National Science Foundation&amp;rsquo;s Dimensions of Biodiversity program, we have undertaken a large-scale comparative genomic study to uncover the genetic basis of metabolic diversity in the entire Saccharomycotina subphylum. In my talk, I will discuss the team&amp;rsquo;s evolutionary analyses of 332 genomes spanning the diversity of the subphylum. These include establishing a robust genus-level phylogeny and timetree for the subphylum, quantification of the extent of horizontal gene transfer for the subphylum, and characterization of the evolution of approximately 50 metabolic traits (and, in some cases, their underlying genes and pathways). These analyses allow us, for the first time, to infer the key metabolic characteristics of the Last Yeast Common Ancestor (LYCA) and characterize the tempo and mode of genome evolution across an entire subphylum.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;All welcome!&lt;/em&gt;&lt;/p&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>Predicting QTL genes by integrating functional data across species</title><link>https://lab.dessimoz.org/blog/2018/10/05/qtlsearch</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2018/10/05/qtlsearch</guid><dc:creator>Christophe Dessimoz</dc:creator><pubDate>Fri, 05 Oct 2018 07:51:14 +0100</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;h2&gt;The problem in a nutshell&lt;/h2&gt;

&lt;p&gt;Quantitative Trait Loci (QTL) are regions of a genome for which genetic variants correlate with particular traits. To take a simple example in plants, one might observe that the average seed size (trait) is significantly larger when considering the subset of a population which has a C at a particular position in the genome than a subpopulation with a T.&lt;/p&gt;

&lt;p&gt;The reason QTL identifies genomic &lt;em&gt;regions&lt;/em&gt; and not precise positions is that neighbouring variants tend to be inherited together. These regions typically contain hundreds of genes, making it difficult to say which one(s) are causal to the trait variation&amp;mdash;if any at all (the causal genetic variation(s) can be in non-coding regions too).&lt;/p&gt;

&lt;p&gt;Thus, to prioritise candidate causal genes within a QTL region, researchers typically consider previous knowledge on these genes, to see whether a particular gene &amp;ldquo;makes sense&amp;rdquo;. In the case of seed size, it might be a gene previously implicated in growth or regulation, or a gene known to influence seed size in a different species. This process, however, requires substantial manual interpretation, and is thus labour-intensive and haphazard.&lt;/p&gt;

&lt;h2&gt;Enter QTLsearch&lt;/h2&gt;

&lt;p&gt;We realised that our framework of &lt;a href="https://m.youtube.com/watch?v=5p5x5gxzhZA"&gt;hierarchical orthologous groups&lt;/a&gt;, which relates genes across many species, could be extended to integrate QTL results with previous gene function annotations.&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2018/10/qtlsearch.png"&gt;&lt;img width="100%" alt="Conceptual overview of QTLsearch" src="https://lab.dessimoz.org/blog/media/2018/10/qtlsearch.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;&lt;b&gt;Conceptual overview of QTLsearch&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;If we go back to the seed size example, it might be that among the genes in the window, one has an ortholog in a different species previously annotated with the GO term &amp;ldquo;reproductive system development&amp;rdquo;. This could be a good candidate causal gene.&lt;/p&gt;

&lt;p&gt;One risk, however, in integrating lots of previous knowledge across many species is that we might also find some spurious patterns. We therefore had to devise a way of controlling for random associations between QTL regions and evolutionarily propagated knowledge. Such a &amp;ldquo;&lt;a href="https://en.m.wikipedia.org/wiki/Null_distribution"&gt;null distribution&lt;/a&gt;&amp;rdquo; depends on the specificity of the terms in question, the number of annotations, the size of the QTL regions, and the species sampling. To cope with this complexity, we chose to implement a non-parametric permutation test.&lt;/p&gt;
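
&lt;p&gt;To give a feel for what a permutation test does, here is a generic sketch in Python. It is not the QTLsearch implementation: it assumes, purely for illustration, that genes are indexed by their order along a single chromosome, that the statistic is the number of annotated genes inside the QTL window, and that the null distribution comes from placing windows of the same size at random. The function names and toy numbers are made up.&lt;/p&gt;

```python
import random


def window_hits(annotated, start, size):
    """Count annotated gene indices inside the window [start, start + size)."""
    window = range(start, start + size)
    return sum(1 for gene in annotated if gene in window)


def permutation_p_value(n_genes, annotated, start, size, n_perm=1000, seed=0):
    """Empirical p-value for seeing at least as many annotated genes in a
    randomly placed window of the same size as in the observed window."""
    rng = random.Random(seed)
    observed = window_hits(annotated, start, size)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        random_start = rng.randrange(n_genes - size + 1)
        if window_hits(annotated, random_start, size) >= observed:
            at_least_as_extreme += 1
    # Add one to both counts so the estimate can never be exactly zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)


# Toy data: 1000 genes, seven of them annotated with the term of interest,
# five of which cluster inside the QTL window covering genes 98 to 107.
annotated = {100, 101, 102, 103, 104, 500, 900}
p_value = permutation_p_value(n_genes=1000, annotated=annotated, start=98, size=10)
print("empirical p-value:", p_value)
```

&lt;p&gt;Because the null distribution is generated from the data themselves, no assumption about its shape is needed, which is what makes the approach non-parametric.&lt;/p&gt;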

&lt;p&gt;We implemented the tool as an open source package called QTLsearch, &lt;a href="http://lab.dessimoz.org/blog/drafts/qtlsearch"&gt;available here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;QTLsearch infers more candidate causal genes than manual analyses&lt;/h2&gt;

&lt;p&gt;We used QTLsearch to reanalyse two previous studies. In both cases, we could call more candidate genes than the original studies. But more importantly, the evidence behind our calls is fully traceable and statistically supported.&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2018/10/qtlsearch2.png"&gt;&lt;img width="100%" alt="Barchart of QTLsearch performance" src="https://lab.dessimoz.org/blog/media/2018/10/qtlsearch2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;&lt;b&gt;QTLsearch could identify more candidate genes than the original study, but in an automated, reproducible, and statistically meaningful way.&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;Thus we think this will greatly facilitate future QTL analyses, particularly those that are done in non-model species for which the previous experimental knowledge is very limited.&lt;/p&gt;

&lt;h2&gt;Behind the paper&lt;/h2&gt;

&lt;p&gt;This is the third paper that resulted from our collaboration with Bayer CropScience (now BASF CropScience), after &lt;a href="/blog/2016/03/24/what-are-homoeologs"&gt;our work on homoeologs&lt;/a&gt; and on &lt;a href="https://www.biorxiv.org/lookup/doi/10.1101/182550"&gt;detecting split genes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The project was conceived by Henning Redestig, collaborator at Bayer at the start of the project (now at DuPont). Henning had contributed to a QTL study and knew how labor intensive the search for putative causal genes is. He realised that HOGs could provide a natural way of integrating functional knowledge across multiple species, to combine the QTL information with previous functional data.&lt;/p&gt;

&lt;p&gt;Alex Warwick Vesztrocy, PhD student on the project and first author, ran with the idea&amp;mdash;promptly implementing and testing it. Early results looked promising, but Alex soon realised that the mapping between metabolites and GO terms could be improved. He also realised that some terms were quite common, so he devised the approach to compute the significance scores.&lt;/p&gt;

&lt;p&gt;Our manuscript was accepted as a proceedings paper at the European Conference on Computational Biology (ECCB). In our lab, we like proceedings papers. It&amp;rsquo;s nice to be able to present the work &lt;em&gt;and&lt;/em&gt; publish the paper, particularly since the ECCB proceedings appear in a good journal. More importantly, conferences impose hard deadlines. Deadlines for submission of course, but also for peer-reviewing and for deciding acceptance or not!&lt;/p&gt;

&lt;h2&gt;Reference&lt;/h2&gt;

&lt;div&gt;Alex Warwick Vesztrocy, Christophe Dessimoz*, Henning Redestig*, &lt;i&gt;Prioritising Candidate Genes Causing QTL using Hierarchical Orthologous Groups&lt;/i&gt;, Bioinformatics, 2018, 34:17, pp. i612&amp;ndash;i619 (ECCB 2018 proceedings) &lt;a href="http://doi.org/10.1093/bioinformatics/bty615"&gt;[Open Access Full Text]&lt;/a&gt;&lt;/div&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>Over 10 computational biology positions at PhD, postdoc and group leader level </title><link>https://lab.dessimoz.org/blog/2018/10/03/computational-biology-phd-postdoc-jobs</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2018/10/03/computational-biology-phd-postdoc-jobs</guid><dc:creator>Christophe Dessimoz</dc:creator><pubDate>Wed, 03 Oct 2018 14:16:24 +0100</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;&lt;em&gt;(Post updated on 4 Oct 2018 and on 13 Nov 2018)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our lab has an open position, and so do collaborators and colleagues across Switzerland and Europe.&lt;/p&gt;

&lt;p&gt;Please help us spread the word by forwarding this post. If you have computational biology jobs to announce, let me know and I will gladly add a link.&lt;/p&gt;

&lt;h2&gt;Postdoc position in our lab&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://tinyurl.com/y9hx8sgt"&gt;Postdoc position in evolutionary bioinformatics&lt;/a&gt; (closing soon!)&lt;/li&gt;
&lt;/ul&gt;

&lt;hr&gt;

&lt;h2&gt;PhD openings with colleagues&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.unil.ch/quantitative-biology/home/menuinst/join-the-qb-program/joint-phd-call-late-2018.html"&gt;Multiple PhD positions in Quantitative Biology at the University of Lausanne&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ucl.ac.uk/child-health/research/genetics-and-genomic-medicine-programme/genome-biology-and-precision-medicine/dr-sergi-5"&gt;Two PhD positions with Sergi Castellano at UCL&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr&gt;

&lt;h2&gt;Postdoc openings with colleagues&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://career5.successfactors.eu/career?company=universitdP&amp;amp;site=VjItZy84VGQ5U1B5c09CRGlJeTlzUHdlZz09&amp;amp;career_job_req_id=14011&amp;amp;career_ns=job_listing&amp;amp;navBarLevel=JOB_SEARCH"&gt;Postdoc position with Olivier Delaneau at University of Lausanne&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://atsv7.wcn.co.uk/search_engine/jobs.cgi?amNvZGU9MTc1NDM5MSZ2dF90ZW1wbGF0ZT05NjUmb3duZXI9NTA0MTE3OCZvd25lcnR5cGU9ZmFpciZicmFuZF9pZD0wJmpvYl9yZWZfY29kZT0xNzU0MzkxJnBvc3RpbmdfY29kZT0yMjQ%3D=&amp;amp;jcode=1754391&amp;amp;vt_template=965&amp;amp;owner=5041178&amp;amp;ownertype=fair&amp;amp;brand_id=0&amp;amp;job_ref_code=1754391&amp;amp;posting_code=224"&gt;Postdoc position with Natasa Przulj at UCL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://atsv7.wcn.co.uk/search_engine/jobs.cgi?amNvZGU9MTc1MzU4MSZ2dF90ZW1wbGF0ZT05NjUmb3duZXI9NTA0MTE3OCZvd25lcnR5cGU9ZmFpciZicmFuZF9pZD0wJmpvYl9yZWZfY29kZT0xNzUzNTgxJnBvc3RpbmdfY29kZT0yMjQ&amp;amp;jcode=1753581&amp;amp;vt_template=965&amp;amp;owner=5041178&amp;amp;ownertype=fair&amp;amp;brand_id=0&amp;amp;job_ref_code=1753581&amp;amp;posting_code=224"&gt;Postdoc position with Javier Herrero at UCL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ucl.ac.uk/child-health/research/genetics-and-genomic-medicine-programme/genome-biology-and-precision-medicine/dr-sergi-5"&gt;Postdoc position with Sergi Castellano at UCL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lbs.epfl.ch/page-74334-en-html/"&gt;Postdoc position with Paolo de Los Rios at EPFL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://nimwegenlab.org/wp-content/uploads/2018/09/postdoc_image_analysis_final.pdf"&gt;Postdoc position with Erik van Nimwegen at the University of Basel&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr&gt;

&lt;h2&gt;Group leader positions&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.embl.de/jobs/searchjobs/index.php?ref=EBI01295"&gt;Group leader position at EMBL-EBI&lt;/a&gt; (closing soon!)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jobs.sanger.ac.uk/wd/plsql/wd_portal.show_job?p_web_site_id=1764&amp;amp;p_web_page_id=365782"&gt;Head of Tree of Life initiative at Sanger&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr&gt;

&lt;h2&gt;Bonus position (not computational but what the heck)&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.jobs.ac.uk/job/BNC166/research-fellow"&gt;Postdoc position on comparative single cell sequencing with Max Telford at UCL&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>What genes do I have in common with a plant?</title><link>https://lab.dessimoz.org/blog/2018/10/01/human-plant-orthologs</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2018/10/01/human-plant-orthologs</guid><dc:creator>Natasha Glover</dc:creator><pubDate>Mon, 01 Oct 2018 08:38:37 +0100</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;&lt;link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/default.min.css"&gt;
&lt;script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"&gt;&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;All living organisms descended from a common ancestor, and therefore all living organisms have some genes in common. We can thus take any two species on the tree of life and find out which genes were present in their common ancestor by looking for the genes that they still share to this day. When two species are closely related, i.e. close together on the tree of life, we can observe that much of their genome is the same. But what do I mean by &amp;ldquo;same&amp;rdquo;, and what do I mean by &amp;ldquo;shared&amp;rdquo;?&lt;/p&gt;

&lt;p&gt;Genomes are complex, with many layers of control (DNA-level, RNA-level, protein-level, epigenetic level, alternative splicing-level, protein-protein interaction level, transposable element-level, etc). Because of this complexity, there could be many ways to compare the genomes of two different species to determine which portion of the genome is shared.&lt;/p&gt;

&lt;p&gt;In this blog post, when I refer to &amp;ldquo;shared&amp;rdquo; parts of the genome, I&amp;rsquo;m referring to &lt;strong&gt;orthologs&lt;/strong&gt;, which are genes in different species that started diverging due to a speciation event. Orthologs can be inferred from complete genome sequences, and there are many different methods to do this. In particular, I will focus on the Dessimoz lab&amp;rsquo;s method and database, Orthologous MAtrix, or &lt;strong&gt;OMA&lt;/strong&gt;. OMA uses protein sequences from whole genome annotations to compute orthologs between over 2000 species. The OMA algorithm is &lt;a href="https://lab.dessimoz.org/papers/orthology.pdf"&gt;graph-based&lt;/a&gt;, compares all protein sequences to all others to find the closest sequences between two genomes, allows for duplicates, and uses information from related genomes to improve the calling. OMA is available at &lt;a href="https://omabrowser.org"&gt;https://omabrowser.org/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It&amp;rsquo;s quite well-publicized that humans and chimps are extremely similar in terms of gene content, which is why they could be considered our evolutionary cousins&lt;sup&gt;1&lt;/sup&gt;. However, when two species are more distantly related, less of their genomes is the same. But how distant can we go in the tree of life and still see the same genes between different species? For example, &lt;strong&gt;how many genes do I (a human) have in common with a plant?&lt;/strong&gt; Can we even detect orthologs between two species so evolutionarily distant? And if so, what biological/functional role do these genes play?&lt;/p&gt;

&lt;p&gt;In this blog post, &lt;strong&gt;I will show how to use OMA and some of its tools to find out how much of the human genome we share with plants.&lt;/strong&gt; Many plant species have had their genomes sequenced, so I will use the extensively studied model species &lt;em&gt;Arabidopsis thaliana&lt;/em&gt; as the representative plant against which to compare the human protein-coding genes.&lt;/p&gt;

&lt;h1&gt;Setup&lt;/h1&gt;

&lt;p&gt;For the remainder of this blog post, I will show how to use the OMA database, accessed from Python via its REST API, to get ortholog pairs between two genomes.&lt;/p&gt;

&lt;p&gt;In python, I use the requests library to send queries for information to the server housing the OMA database. &lt;a href="https://pandas.pydata.org"&gt;Pandas&lt;/a&gt; is a well-known python library for working with dataframes, so I will use that for analysis. For more information, see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://omabrowser.org/api/docs#"&gt;OMA API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.pythonforbeginners.com/requests/using-requests-in-python"&gt;Requests in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://realpython.com/python-json/"&gt;Handling JSON in Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let&amp;rsquo;s start our python code by importing libraries:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests
import json
import pandas as pd
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The OMA API is accessible from the following link, so we will save it for later use in our code:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;api_url = "https://omabrowser.org/api"
&lt;/code&gt;&lt;/pre&gt;

&lt;h1&gt;Getting pairwise orthologs with the OMA API&lt;/h1&gt;

&lt;p&gt;Pairwise orthologs are computed between all pairs of species in OMA as part of the normal OMA pipeline. Accessing the data will result in a list of pairs of genes, one from one species and the other from the other species. Any one gene may be involved in several different pairs, depending on whether its ortholog has undergone duplication.&lt;/p&gt;
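
&lt;p&gt;As a small illustration of a gene appearing in several pairs, the sketch below groups a toy list of pairs by the gene from the first species. It assumes each pair is a dictionary with &lt;code&gt;entry_1&lt;/code&gt; and &lt;code&gt;entry_2&lt;/code&gt; records that carry an &lt;code&gt;omaid&lt;/code&gt;, which is the layout returned by the API; the identifiers and the helper &lt;code&gt;partners_by_gene&lt;/code&gt; are made up.&lt;/p&gt;

```python
from collections import defaultdict

# Toy pairs with made-up identifiers; real OMA pairs carry many more fields.
toy_pairs = [
    {"entry_1": {"omaid": "HUMAN_A"}, "entry_2": {"omaid": "ARATH_X"}},
    {"entry_1": {"omaid": "HUMAN_A"}, "entry_2": {"omaid": "ARATH_Y"}},
    {"entry_1": {"omaid": "HUMAN_B"}, "entry_2": {"omaid": "ARATH_Z"}},
]


def partners_by_gene(pairs):
    """Collect, for each gene of the first species, its set of orthologs."""
    partners = defaultdict(set)
    for pair in pairs:
        partners[pair["entry_1"]["omaid"]].add(pair["entry_2"]["omaid"])
    return dict(partners)


partners = partners_by_gene(toy_pairs)
for gene, orthologs in sorted(partners.items()):
    print(gene, "has", len(orthologs), "ortholog(s):", sorted(orthologs))
```

&lt;p&gt;Here the made-up gene HUMAN_A is involved in two pairs, as would happen if its ortholog had duplicated in the plant lineage.&lt;/p&gt;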

&lt;p&gt;It is important to note that here we are only using &lt;strong&gt;pairs of orthologs&lt;/strong&gt;. In OMA, we report several types of orthologs, namely pairs of orthologs or groups of orthologs (&lt;a href="/blog/2016/12/08/what-hogs-are"&gt;Hierarchical Orthologous Groups (HOGs)&lt;/a&gt;). We use the pairs as the basis for building HOGs, which aggregate orthologous pairs together into clusters at different taxonomic levels. Some ortholog pairs are removed, and some new pairs are inferred when creating HOGs. Therefore, the set of orthologs deduced from HOGs will be slightly different from the set of orthologs purely from the pairs.&lt;/p&gt;

&lt;p&gt;Here we will use the API to request all the pairwise orthologs between &lt;em&gt;Homo sapiens&lt;/em&gt; and &lt;em&gt;Arabidopsis thaliana&lt;/em&gt;. OMA uses either the NCBI taxon identifier or the 5-letter UniProt species code to identify species. I will use the UniProt codes, which are &amp;ldquo;HUMAN&amp;rdquo; and &amp;ldquo;ARATH.&amp;rdquo; To request the ortholog pairs between HUMAN and ARATH, the request would simply look like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;requests.get(api_url + '/pairs/HUMAN/ARATH/')
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;However, when sending a request that returns long lists, i.e. lists of thousands of genes, the OMA API uses pagination. This means all the results won&amp;rsquo;t be returned at once, but sequentially page by page, with only 100 results per page. This is a safeguard to make sure the OMA servers don&amp;rsquo;t get overloaded with requests. Here I show a simple workaround to get all of the pairs.&lt;/p&gt;

&lt;h2&gt;Get all pairs&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;import math

genome1 = "HUMAN"
genome2 = "ARATH"

#get first page
response = requests.get(api_url + '/pairs/{}/{}/'.format(genome1, genome2))

#use the header to calculate total number of pages (100 results per page);
#round up so that a final, partially filled page is not missed
total_nb_pages = math.ceil(int(response.headers['X-Total-Count'])/100)
print("There are {} pages that need to be requested.".format(total_nb_pages))

&amp;gt; There are 128 pages that need to be requested.
&lt;/code&gt;&lt;/pre&gt;
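
&lt;p&gt;As a small sketch of the page arithmetic, assuming 100 results per page as described above: the number of pages is the total count divided by the page size, rounded up, so that a final, partially filled page is still requested. Rounding to the nearest integer would miss that last page whenever it holds fewer than 50 results. The helper name number_of_pages is made up for this illustration and is not part of the OMA API.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

PAGE_SIZE = 100  # results per page, as described above

def number_of_pages(total_count, page_size=PAGE_SIZE):
    '''Returns how many pages are needed to hold total_count results.'''
    return math.ceil(total_count / page_size)

# 12792 pairs (the HUMAN/ARATH total reported later in this post)
print(number_of_pages(12792))   # 128

# hypothetical total: 127.4 pages' worth of results still needs 128 requests,
# whereas round(127.4) would give 127
print(number_of_pages(12740))   # 128
&lt;/code&gt;&lt;/pre&gt;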

&lt;p&gt;Since we want all the pages, not just the first one, we have to use a loop to go through and request all the pages.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#get responses for all pages
responses = []
for page in range(1, total_nb_pages + 1):
    tmp_response = requests.get(api_url + '/pairs/{}/{}/'.format(genome1, genome2)+"?page="+str(page))
    responses.append(tmp_response.json())

#some basic tidying up so we only have a list of pairs at the end
pairs = []
for response in responses:
    for pair in response:
        pairs.append(pair)

#Example of a pair
pairs[0]

&amp;gt; {'entry_1': {'entry_nr': 8066469,
&amp;gt;   'entry_url': 'https://omabrowser.org/api/protein/8066469/',
&amp;gt;   'omaid': 'HUMAN00009',
&amp;gt;   'canonicalid': 'NOC2L_HUMAN',
&amp;gt;   'sequence_md5': 'dc91b2521daf594037a6c318f3b04d5a',
&amp;gt;   'oma_group': 840151,
&amp;gt;   'oma_hog_id': 'HOG:0420566.4b.15b.5a.3b',
&amp;gt;   'chromosome': '1',
&amp;gt;   'locus': {'start': 944694, 'end': 959240, 'strand': -1},
&amp;gt;   'is_main_isoform': True},
&amp;gt;  'entry_2': {'entry_nr': 12384097,
&amp;gt;   'entry_url': 'https://omabrowser.org/api/protein/12384097/',
&amp;gt;   'omaid': 'ARATH12334',
&amp;gt;   'canonicalid': 'NOC2L_ARATH',
&amp;gt;   'sequence_md5': 'f19ac310a5e56443f2ce0e4e832addb3',
&amp;gt;   'oma_group': 840151,
&amp;gt;   'oma_hog_id': 'HOG:0420566.1c',
&amp;gt;   'chromosome': '2',
&amp;gt;   'locus': {'start': 7928254, 'end': 7931851, 'strand': 1},
&amp;gt;   'is_main_isoform': True},
&amp;gt;  'rel_type': '1:1',
&amp;gt;  'distance': 138.0,
&amp;gt;  'score': 1094.969970703125}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can see that the data is in the form of a big list of pairs, with each pair being a dictionary. In this dictionary, one key is entry_1, which is the human gene, and the second key is entry_2, which is the Arabidopsis gene. The value for each of these keys is itself a dictionary with information about the gene. Additionally, the pair dictionary has other key-value pairs that give information computed about the pair, such as rel_type, distance, and score (will be explained later).&lt;/p&gt;
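
&lt;p&gt;To make the nesting concrete, here is a minimal sketch that pulls a few fields out of one pair. The dictionary is a trimmed copy of the example above (only some keys are kept), and the helper name summarise_pair is made up for this illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# A trimmed copy of the example pair shown above
pair = {
    'entry_1': {'omaid': 'HUMAN00009', 'canonicalid': 'NOC2L_HUMAN'},
    'entry_2': {'omaid': 'ARATH12334', 'canonicalid': 'NOC2L_ARATH'},
    'rel_type': '1:1',
    'distance': 138.0,
    'score': 1094.969970703125,
}

def summarise_pair(pair):
    '''Returns (gene in genome 1, gene in genome 2, cardinality) for one pair.'''
    return (pair['entry_1']['omaid'], pair['entry_2']['omaid'], pair['rel_type'])

print(summarise_pair(pair))   # ('HUMAN00009', 'ARATH12334', '1:1')
&lt;/code&gt;&lt;/pre&gt;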

&lt;h2&gt;Make dataframe with all pairs&lt;/h2&gt;

&lt;p&gt;I personally like to work with &lt;a href="https://pandas.pydata.org"&gt;pandas&lt;/a&gt; because I find it easy to use, so I will import the information about the pairs into a dataframe.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pandas as pd

#use pandas to make dataframe (pairs is a list of dictionaries, one per row)
df = pd.DataFrame(pairs)

def make_columns_from_entry_dict(df, columns_to_make):
    '''Parses out the keys from the entry dictionary and adds them to the dataframe 
    with an appropriate column header.'''

    genome1 = df['entry_1'][0]['omaid'][:5]
    genome2 = df['entry_2'][0]['omaid'][:5]

    df[genome1+"_"+columns_to_make] = df.apply(lambda x: x['entry_1'][columns_to_make], axis=1)
    df[genome2+"_"+columns_to_make] = df.apply(lambda x: x['entry_2'][columns_to_make], axis=1)

    return df

#clean up the columns of the entry_1 and entry_2 dictionaries
df = make_columns_from_entry_dict(df, "omaid")    
df = df[['HUMAN_omaid','ARATH_omaid','rel_type','distance','score']]

#Here's a snippet of the dataframe
df[:10]
&lt;/code&gt;&lt;/pre&gt;

&lt;div&gt;
&lt;style scoped&gt;
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
&lt;/style&gt;
&lt;table border="1" class="dataframe"&gt;
  &lt;thead&gt;
    &lt;tr style="text-align: right;"&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;HUMAN_omaid&lt;/th&gt;
      &lt;th&gt;ARATH_omaid&lt;/th&gt;
      &lt;th&gt;rel_type&lt;/th&gt;
      &lt;th&gt;distance&lt;/th&gt;
      &lt;th&gt;score&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;HUMAN00009&lt;/td&gt;
      &lt;td&gt;ARATH12334&lt;/td&gt;
      &lt;td&gt;1:1&lt;/td&gt;
      &lt;td&gt;138.0&lt;/td&gt;
      &lt;td&gt;1094.969971&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;HUMAN00029&lt;/td&gt;
      &lt;td&gt;ARATH38493&lt;/td&gt;
      &lt;td&gt;1:1&lt;/td&gt;
      &lt;td&gt;168.0&lt;/td&gt;
      &lt;td&gt;246.330002&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;HUMAN00032&lt;/td&gt;
      &lt;td&gt;ARATH38034&lt;/td&gt;
      &lt;td&gt;1:1&lt;/td&gt;
      &lt;td&gt;69.0&lt;/td&gt;
      &lt;td&gt;773.869995&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;HUMAN00034&lt;/td&gt;
      &lt;td&gt;ARATH39850&lt;/td&gt;
      &lt;td&gt;m:n&lt;/td&gt;
      &lt;td&gt;140.0&lt;/td&gt;
      &lt;td&gt;635.900024&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;HUMAN00034&lt;/td&gt;
      &lt;td&gt;ARATH33319&lt;/td&gt;
      &lt;td&gt;m:n&lt;/td&gt;
      &lt;td&gt;144.0&lt;/td&gt;
      &lt;td&gt;655.409973&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;5&lt;/th&gt;
      &lt;td&gt;HUMAN00034&lt;/td&gt;
      &lt;td&gt;ARATH01678&lt;/td&gt;
      &lt;td&gt;m:n&lt;/td&gt;
      &lt;td&gt;137.0&lt;/td&gt;
      &lt;td&gt;723.580017&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;6&lt;/th&gt;
      &lt;td&gt;HUMAN00034&lt;/td&gt;
      &lt;td&gt;ARATH07547&lt;/td&gt;
      &lt;td&gt;m:n&lt;/td&gt;
      &lt;td&gt;134.0&lt;/td&gt;
      &lt;td&gt;761.450012&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;7&lt;/th&gt;
      &lt;td&gt;HUMAN00036&lt;/td&gt;
      &lt;td&gt;ARATH01479&lt;/td&gt;
      &lt;td&gt;1:1&lt;/td&gt;
      &lt;td&gt;103.0&lt;/td&gt;
      &lt;td&gt;554.590027&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;8&lt;/th&gt;
      &lt;td&gt;HUMAN00037&lt;/td&gt;
      &lt;td&gt;ARATH10852&lt;/td&gt;
      &lt;td&gt;1:1&lt;/td&gt;
      &lt;td&gt;60.0&lt;/td&gt;
      &lt;td&gt;2325.889893&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;9&lt;/th&gt;
      &lt;td&gt;HUMAN00052&lt;/td&gt;
      &lt;td&gt;ARATH02626&lt;/td&gt;
      &lt;td&gt;1:1&lt;/td&gt;
      &lt;td&gt;95.0&lt;/td&gt;
      &lt;td&gt;328.959991&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Here you can see that each row is a pair of orthologs, with some information about each. The &lt;em&gt;rel_type&lt;/em&gt; is the cardinality of the orthologous relationship. 1:1 means that a single human gene has a single ortholog in Arabidopsis, whereas 1:m would mean that the gene duplicated in the ARATH lineage, so that one human gene has several Arabidopsis orthologs. m:n means there were lineage-specific duplications on both sides.&lt;/p&gt;
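
&lt;p&gt;As a quick illustration, the relationship types can be tallied. The sketch below uses only the ten rows shown in the table above and the standard library; on the full dataframe, df['rel_type'].value_counts() should give the same kind of summary.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter

# rel_type of the ten rows shown in the table above, in order
rel_types = ['1:1', '1:1', '1:1', 'm:n', 'm:n', 'm:n', 'm:n', '1:1', '1:1', '1:1']

counts = Counter(rel_types)
print(counts['1:1'], counts['m:n'])   # 6 4
&lt;/code&gt;&lt;/pre&gt;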

&lt;h1&gt;How many pairs between HUMAN and ARATH?&lt;/h1&gt;

&lt;pre&gt;&lt;code&gt;print("There are {} pairs of orthologs between {} and {}.".format(len(df), genome1, genome2))

&amp;gt; There are 12792 pairs of orthologs between HUMAN and ARATH.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;What percentage of the human genome is this? What percentage of the Arabidopsis genome? To answer this, we need 1) the number of genes in each genome that these pairs represent, because a gene can be involved in more than one pair of orthologs, and 2) the total number of genes per genome.&lt;/p&gt;

&lt;p&gt;First we get the number of genes in each species that have at least one ortholog, using the dataframe from above. We don&amp;rsquo;t have to worry about alternative splice variants (ASVs) because OMA chooses a representative isoform when computing orthologs.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;human_proteins_with_ortholog = set(df['HUMAN_omaid'].tolist())
print("There are {} genes in {} with at least 1 ortholog in {}.".\
      format(len(human_proteins_with_ortholog), "HUMAN","ARATH"))

arath_proteins_with_ortholog = set(df['ARATH_omaid'].tolist())
print("There are {} genes in {} with at least 1 ortholog in {}.".\
      format(len(arath_proteins_with_ortholog), "ARATH","HUMAN"))

&amp;gt; There are 3769 genes in HUMAN with at least 1 ortholog in ARATH.
&amp;gt; There are 5177 genes in ARATH with at least 1 ortholog in HUMAN.
&lt;/code&gt;&lt;/pre&gt;
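
&lt;p&gt;To see why the number of genes is smaller than the number of pairs, here is a small sketch restricted to the ten rows shown in the table earlier: HUMAN00034 appears in four pairs but counts as a single gene.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# (HUMAN_omaid, ARATH_omaid) for the ten rows shown in the table above
rows = [
    ('HUMAN00009', 'ARATH12334'), ('HUMAN00029', 'ARATH38493'),
    ('HUMAN00032', 'ARATH38034'), ('HUMAN00034', 'ARATH39850'),
    ('HUMAN00034', 'ARATH33319'), ('HUMAN00034', 'ARATH01678'),
    ('HUMAN00034', 'ARATH07547'), ('HUMAN00036', 'ARATH01479'),
    ('HUMAN00037', 'ARATH10852'), ('HUMAN00052', 'ARATH02626'),
]

# sets keep each gene identifier only once
human_genes = {human for human, arath in rows}
arath_genes = {arath for human, arath in rows}

print(len(rows), len(human_genes), len(arath_genes))   # 10 7 10
&lt;/code&gt;&lt;/pre&gt;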

&lt;h1&gt;Get the genomes of HUMAN and ARATH&lt;/h1&gt;

&lt;p&gt;Now we will again use the API, this time to get all the proteins in each genome (see https://omabrowser.org/api/docs#genome-proteins). Here I will write a function using the same procedure as above to deal with pagination.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math

#Use the API to get all the proteins of a genome
def get_all_proteins(genome):
    '''Gets all proteins using OMA REST API'''
    responses = []
    response = requests.get(api_url + '/genome/{}/proteins/'.format(genome))
    #round up so that a final, partially filled page is not missed
    total_nb_pages = math.ceil(int(response.headers['X-Total-Count'])/100)
    for page in range(1, total_nb_pages + 1):
        tmp_response = requests.get(api_url + '/genome/{}/proteins/'.format(genome)+"?page="+str(page))
        responses.append(tmp_response.json())

    proteins = []
    for response in responses:
        for entry in response:
            proteins.append(entry)
    return proteins


human_proteins = get_all_proteins("HUMAN")

#Here is an example entry
human_proteins[0]

&amp;gt; {'entry_nr': 8066461,
&amp;gt;  'entry_url': 'https://omabrowser.org/api/protein/8066461/',
&amp;gt;  'omaid': 'HUMAN00001',
&amp;gt;  'canonicalid': 'OR4F5_HUMAN',
&amp;gt;  'sequence_md5': 'df953df7a11ee7be5484e511551ce8a4',
&amp;gt;  'oma_group': 650473,
&amp;gt;  'oma_hog_id': 'HOG:0361626.2a.7a',
&amp;gt;  'chromosome': '1',
&amp;gt;  'locus': {'start': 69091, 'end': 70008, 'strand': 1},
&amp;gt;  'is_main_isoform': True}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It is important to note that this list of genes/proteins includes ASVs for many species. Therefore we need to take care to remove them so that we have one representative isoform per gene. We can do this by filtering the list of entries based on the &amp;lsquo;is_main_isoform&amp;rsquo; key. Here is a small function to do that.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;#get canonical isoform
def get_canonical_proteins(list_of_proteins):
    '''Keeps only the entries flagged as the main isoform of their gene'''
    proteins_no_ASVs = []
    for protein in list_of_proteins:
        if protein['is_main_isoform']:
            proteins_no_ASVs.append(protein)
    return proteins_no_ASVs

print("HUMAN\n-----")
print("The number of genes before removing alternative splice variants: {}".format(len(human_proteins)))

human_proteins_no_ASVs = get_canonical_proteins(human_proteins)

print("The number of genes after removing alternative splice variants: {}".format(len(human_proteins_no_ASVs)))


&amp;gt; HUMAN
&amp;gt; -----
&amp;gt; The number of genes before removing alternative splice variants: 30700
&amp;gt; The number of genes after removing alternative splice variants: 20152



#Do the same for ARATH
print("ARATH\n-----")
arath_proteins = get_all_proteins("ARATH")

print("The number of genes before removing alternative splice variants: {}".format(len(arath_proteins)))
arath_proteins_no_ASVs = get_canonical_proteins(arath_proteins)
print("The number of genes after removing alternative splice variants: {}".format(len(arath_proteins_no_ASVs)))

&amp;gt; ARATH
&amp;gt; -----
&amp;gt; The number of genes before removing alternative splice variants: 40999
&amp;gt; The number of genes after removing alternative splice variants: 27627
&lt;/code&gt;&lt;/pre&gt;

&lt;h1&gt;Proportion of the genomes which have orthologs in the other&lt;/h1&gt;

&lt;p&gt;Now that we have all the necessary information we can see what percentage of the human and arabidopsis genomes are shared, in terms of proportion of genes with at least 1 ortholog in the other species.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;print("The percentage of genes in {} with at least 1 ortholog in {}: {}%"\
      .format("HUMAN", "ARATH", \
              round((len(human_proteins_with_ortholog)/len(human_proteins_no_ASVs)*100), 2)))

print("The percentage of genes in {} with at least 1 ortholog in {}: {}%"\
      .format("ARATH", "HUMAN", \
              round((len(arath_proteins_with_ortholog)/len(arath_proteins_no_ASVs)*100), 2)))

&amp;gt; The percentage of genes in HUMAN with at least 1 ortholog in ARATH: 18.7%
&amp;gt; The percentage of genes in ARATH with at least 1 ortholog in HUMAN: 18.74%
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So the answer to the original questions is that BOTH human and Arabidopsis share about 18.7% of their genes with each other (18.70% and 18.74%, respectively). A weird coincidence that it turns out to be almost the same proportion of both genomes!&lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Using the OMA database for orthology inference, we found that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There are 12792 pairs of orthologs between HUMAN and ARATH.&lt;/li&gt;
&lt;li&gt;There are 3769 genes in human that have at least 1 ortholog in arabidopsis.&lt;/li&gt;
&lt;li&gt;There are 5177 genes in arabidopsis that have at least 1 ortholog in human.&lt;/li&gt;
&lt;li&gt;These numbers represent about 19% of both genomes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, we can conclude that over the course of evolution since the animal and plant lineages split (~1.5 billion years ago&lt;sup&gt;2&lt;/sup&gt;), about 19% of the protein-coding genes in each genome have retained orthologs that OMA is able to detect.&lt;/p&gt;

&lt;p&gt;Stay tuned for the next blog post, where I will show what the functions of these shared genes are!&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69&amp;ndash;87 (2005). DOI: 10.1038/nature04072&lt;/li&gt;
&lt;li&gt;Wang, D. Y., Kumar, S. &amp;amp; Hedges, S. B. Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi. Proc. Biol. Sci. 266, 163&amp;ndash;171 (1999). DOI: 10.1098/rspb.1999.0617&lt;/li&gt;
&lt;/ol&gt;

&lt;script&gt;
tmp = document.getElementsByTagName('code');
tmp[0].className = 'bash';
for (var i = 1; i &lt; tmp.length; i++) {
tmp[i].className = 'python';
}
hljs.initHighlightingOnLoad();
&lt;/script&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>pyHam: a python package to visualize and process hierarchical orthologous groups (HOGs)</title><link>https://lab.dessimoz.org/blog/2017/06/29/pyham</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2017/06/29/pyham</guid><dc:creator>Clement Train</dc:creator><pubDate>Thu, 29 Jun 2017 17:13:50 +0100</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;&lt;link rel="stylesheet" href="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/default.min.css"&gt;
&lt;script src="//cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"&gt;&lt;/script&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(This entry was updated 19 Sep 2018 to reflect recent feature updates)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;pyHam (&amp;lsquo;python HOG analysis method&amp;rsquo;) makes it possible to extract useful information from HOGs encoded in standard OrthoXML format. It is available both as a python library and as a set of command-line scripts. Input HOGs in OrthoXML format are available from multiple bioinformatics resources, including OMA, Ensembl and HieranoiDB.&lt;/p&gt;

&lt;p&gt;This post is a brief primer on pyHam, with an emphasis on what it can do for you.&lt;/p&gt;

&lt;h1&gt;How to get pyHam?&lt;/h1&gt;

&lt;p&gt;pyHam is available as a python package on the PyPI server and
is compatible with python 2 and python 3. You can easily install it via
pip using the following bash command:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;pip install pyham
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can check the official &lt;a href="http://lab.dessimoz.org/pyham"&gt;&lt;em&gt;pyham website&lt;/em&gt;&lt;/a&gt;  for
further information about  &lt;a href="https://zoo.cs.ucl.ac.uk/tutorials/tutorial_pyHam_get_started.html"&gt;&lt;em&gt;how to use pyham&lt;/em&gt;&lt;/a&gt;, &lt;a href="https://zoo.cs.ucl.ac.uk/doc/pyham/index.html"&gt;&lt;em&gt;documentation&lt;/em&gt;&lt;/a&gt; and the &lt;a href="https://github.com/DessimozLab/pyham"&gt;&lt;em&gt;source code&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;What are Hierarchical Orthologous Groups (HOGs)?&lt;/h1&gt;

&lt;p&gt;If you don&amp;rsquo;t know what HOGs are and are eager to change this, we have an explanatory video about them just for you:&lt;/p&gt;

&lt;div class="yt-container"&gt; &lt;iframe src="//www.youtube.com/embed/5p5x5gxzhZA?html5=1" frameborder="0" allowfullscreen class="video"&gt;&lt;/iframe&gt; &lt;/div&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;You can learn more about this in our &lt;a href="http://lab.dessimoz.org/blog/2016/12/08/what-hogs-are"&gt;&lt;em&gt;previous blog post&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;Where to find HOGs?&lt;/h1&gt;

&lt;p&gt;HOGs inferred on public genomes can be downloaded from the &lt;a href="http://omabrowser.org"&gt;&lt;em&gt;OMA orthology database&lt;/em&gt;&lt;/a&gt;. Other databases, such as &lt;a href="http://eggnogdb.embl.de"&gt;&lt;em&gt;eggNOG&lt;/em&gt;&lt;/a&gt;, &lt;a href="http://orthodb.org"&gt;&lt;em&gt;OrthoDB&lt;/em&gt;&lt;/a&gt; or &lt;a href="http://hieranoidb.sbc.su.se"&gt;&lt;em&gt;HieranoiDB&lt;/em&gt;&lt;/a&gt;, also infer HOGs, but not all of these databases offer them in OrthoXML format. You can check which databases serve HOGs as OrthoXML &lt;a href="https://github.com/DessimozLab/pyham#table-of-compatibility"&gt;&lt;em&gt;here&lt;/em&gt;&lt;/a&gt;. If you want to infer HOGs on your own genomes, you can use the &lt;a href="http://omabrowser.org/standalone"&gt;&lt;em&gt;OMA standalone software&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To facilitate the use of pyHam on a single gene family, we provide the option to let pyHam fetch the required data directly from a compatible database (for now, only OMA is available for this feature). The user simply has to give the ID of a gene inside the gene family (HOG) of interest, along with the name of the compatible database from which to get the data, and pyHam will do the rest.&lt;/p&gt;

&lt;p&gt;For example, if you are interested in the P53 gene in rat (&lt;a href="https://omabrowser.org/oma/info/7916807/"&gt;&lt;em&gt;P53 rat gene page in OMA&lt;/em&gt;&lt;/a&gt;), you simply have to run the following python code to set up your pyHam session:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import pyham

my_gene_query = 'P53_RAT'
database_to_query = 'oma'
pyham_analysis = pyham.Ham(query_database=my_gene_query, use_data_from=database_to_query)
&lt;/code&gt;&lt;/pre&gt;

&lt;h1&gt;How does pyHam help you investigate HOGs?&lt;/h1&gt;

&lt;p&gt;The main features of pyHam are: (i) given a clade of interest, extract all the relevant HOGs, each of which ideally corresponds to a distinct ancestral gene in the last common ancestor of the clade; (ii) given a branch on the species tree, report the HOGs that duplicated on the branch, got lost on the branch, first appeared on that branch, or were simply retained; (iii) repeat the previous point along the entire species tree, and plot an overview of the gene evolution dynamics along the tree; and (iv) given a set of nested HOGs for a specific gene family of interest, generate a local iHam web page to visualize its evolutionary history.&lt;/p&gt;

&lt;h2&gt;What is the number of genes in a particular ancestral genome? (i)&lt;/h2&gt;

&lt;p&gt;In pyHam, each ancestral genome is attached to one specific internal node of the input species tree and is denoted by the name of this taxon. Ancestral genes are then inferred by fetching all the HOGs at the same level.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# ham_analysis is a pyham.Ham object covering the clade of interest
# (assumed here to have been built from a species tree and an OrthoXML file)

# Get the ancestral genome by name
rodents_genome = ham_analysis.get_ancestral_genome_by_name("Rodents")

# Get the related ancestral genes (HOGs)
rodents_ancestral_genes = rodents_genome.genes

# Get the number of ancestral genes at level of Rodents
print(len(rodents_ancestral_genes))
&lt;/code&gt;&lt;/pre&gt;

&lt;h2&gt;How can I figure out the evolutionary history of genes in a given genome? (ii)&lt;/h2&gt;

&lt;p&gt;pyHam provides a feature to trace HOGs/genes along a branch that spans one or multiple taxonomic ranges and to report the HOGs that duplicated on this branch, got lost on this branch, first appeared on this branch, or were simply retained. The &amp;lsquo;vertical map&amp;rsquo; (see further information on maps &lt;a href="https://zoo.cs.ucl.ac.uk/tutorials/tutorial_pyHam_get_started.html#COMPARE-SEVERAL-GENOMES"&gt;&lt;em&gt;here&lt;/em&gt;&lt;/a&gt;) allows for retrieval of all genes and their evolutionary history between the two taxonomic levels (i.e. which genes have been duplicated, which genes have been lost, etc.).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Get the genome of interest
human = ham_analysis.get_extant_genome_by_name("HUMAN")
vertebrates = ham_analysis.get_ancestral_genome_by_name("Vertebrata")

# Instantiate the gene mapping!
vertical_human_vertebrates = ham_analysis.compare_genomes_vertically(human, vertebrates) # The order doesn't matter!

# The retained (identical) genes (that stay single copies)
# one HOG at vertebrates -&amp;gt; one descendant gene in human
vertical_human_vertebrates.get_retained()

# The duplicated genes (that have duplicated)
# one HOG at vertebrates -&amp;gt; list of its descendant genes in human
vertical_human_vertebrates.get_duplicated()

# The gained genes (that emerged in between)
# list of genes that appeared after the vertebrates taxon
vertical_human_vertebrates.get_gained()

# The lost genes (that have been lost in between)
# HOGs at vertebrates that have been lost before the human taxon
vertical_human_vertebrates.get_lost()
&lt;/code&gt;&lt;/pre&gt;
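
&lt;p&gt;To make the four categories concrete, here is a toy sketch in plain Python. It does not use pyHam: the HOG identifiers, the gene names and the classify helper are all hypothetical, and the only assumption is that we know, for each ancestral HOG, which genes of the extant genome descend from it.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy illustration only: plain Python, not the pyHam API.
# Hypothetical mapping: ancestral HOG id -- list of descendant genes in the extant genome
descendants = {
    'HOG_A': ['gene1'],            # one descendant: retained
    'HOG_B': ['gene2', 'gene3'],   # several descendants: duplicated
    'HOG_C': [],                   # no descendant: lost
}
# Hypothetical list of all genes in the extant genome
extant_genes = ['gene1', 'gene2', 'gene3', 'gene4']

def classify(descendants, extant_genes):
    '''Sorts ancestral HOGs and extant genes into the four categories.'''
    retained = {h: g[0] for h, g in descendants.items() if len(g) == 1}
    duplicated = {h: g for h, g in descendants.items() if len(g) not in (0, 1)}
    lost = [h for h, g in descendants.items() if not g]
    # genes without any ancestral HOG emerged after the ancestral level: gained
    inherited = {gene for g in descendants.values() for gene in g}
    gained = [gene for gene in extant_genes if gene not in inherited]
    return retained, duplicated, lost, gained

retained, duplicated, lost, gained = classify(descendants, extant_genes)
print(retained)     # {'HOG_A': 'gene1'}
print(duplicated)   # {'HOG_B': ['gene2', 'gene3']}
print(lost)         # ['HOG_C']
print(gained)       # ['gene4']
&lt;/code&gt;&lt;/pre&gt;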

&lt;h2&gt;How can I get an overview of the gene evolution dynamics that occurred along the tree in my genomic setup? (iii)&lt;/h2&gt;

&lt;p&gt;pyHam includes treeProfile (an extension of the &lt;a href="https://github.com/DessimozLab/phylo-io"&gt;&lt;em&gt;Phylo.io&lt;/em&gt;&lt;/a&gt; tool), a tool to visualise a species tree annotated with evolutionary events (gene duplications, losses, gains) mapped to their related taxonomic range. The aim is to provide a minimalist and intuitive way to visualise the number of evolutionary events that occurred on each branch or the number of ancestral genes along the species tree.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# create a local treeprofile web page
treeprofile = ham_analysis.create_tree_profile(outfile="treeprofile_example.html")
&lt;/code&gt;&lt;/pre&gt;

&lt;iframe width="560" height="700" src="/blog/media/2018/09/tp.html" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;As you can see in the figure above, the tree profile is composed of the reference species tree used
to perform the pyHam analysis. Each internal node is displayed with its related histogram of
phylogenetic events (number of genes duplicated, lost, gained, or retained) that occurred on each branch. The tree profile displays either the number of genes resulting from phylogenetic events or the number of phylogenetic events themselves; you can switch between the two by opening the settings panel (histogram icon at the top right) and selecting &amp;lsquo;genes&amp;rsquo; or &amp;lsquo;events&amp;rsquo;.&lt;/p&gt;

&lt;h2&gt;How can I visualise the evolutionary history of a gene family (HOG)? (iv)&lt;/h2&gt;

&lt;p&gt;pyHam embeds &lt;a href="https://github.com/DessimozLab/iHam"&gt;&lt;em&gt;iHam&lt;/em&gt;&lt;/a&gt;, an interactive tool to visualise gene family evolutionary history. It provides a way to trace the evolution of genes in terms of duplications and losses, from ancient ancestors to modern day species.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Select an HOG
hog_of_interest = pyham_analysis.get_hog_by_id(2)

# create and export the hog vis as .html
output_filename = "hogvis_example.html"
pyham_analysis.create_hog_visualisation(hog=hog_of_interest,outfile=output_filename)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then, you simply have to double click on the .html file to open it in your default internet browser. We provide an example below of what you should see. A brief video tutorial on iHam is available at this &lt;a href="https://www.youtube.com/watch?v=6eAoamP7NLo"&gt;&lt;em&gt;URL&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;iframe width="560" height="400" src="/blog/media/2018/09/HOG3.html" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;

&lt;p&gt;iHam is composed of two panels: a species tree that allows you to select the
taxonomic range of interest, and a gene panel where each grey square represents an extant gene and each row a
species.&lt;/p&gt;

&lt;p&gt;We can see for example that at the level of mammals (click on the related node and select &amp;lsquo;Freeze at this node&amp;rsquo;) all genes of this gene family are descended from a single common ancestral gene.&lt;/p&gt;

&lt;p&gt;Now, if we look at the level of Euarchontoglires (redo the same procedure as for mammals to freeze the vis at this level) we
observe that the genes are now split by a vertical line. This vertical line separates two groups of genes, each descended from a single ancestral gene. This is the result of a duplication between Mammals and Euarchontoglires.&lt;/p&gt;

&lt;p&gt;This small example demonstrates how simple and useful iHam is for identifying evolutionary events that occurred in gene families (e.g. when a duplication occurred, which species have lost genes, or how big gene families evolved).&lt;/p&gt;

&lt;script&gt;
tmp = document.getElementsByTagName('code');
tmp[0].className = 'bash';
for (var i = 1; i &lt; tmp.length; i++) {
    tmp[i].className = 'python';
}
hljs.initHighlightingOnLoad();
&lt;/script&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>Sex, alcohol, and structural variants in fission yeast</title><link>https://lab.dessimoz.org/blog/2017/02/08/sex-alcohol-and-structural-variants-in-fission-yeast</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2017/02/08/sex-alcohol-and-structural-variants-in-fission-yeast</guid><dc:creator>Fritz Sedlazeck, Dan Jeffares &amp; Christophe Dessimoz</dc:creator><pubDate>Wed, 08 Feb 2017 10:36:55 +0000</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;Our latest study just came out (&lt;a href="http://doi.org/10.1038/ncomms14061"&gt;Jeffares &lt;em&gt;et al.&lt;/em&gt;, Nature Comm 2017&lt;/a&gt;). In it, we carefully catalogued high-confidence structural variants among all known strains of the fission yeast population, and assessed their impact on spore viability, winemaking and other traits. This post gives a summary and the &lt;a href="http://lab.dessimoz.org/blog/tagged-story_behind_the_paper.html"&gt;story behind the paper&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Structural variants (SVs) measure genetic variation beyond single nucleotide changes &amp;hellip;&lt;/h2&gt;

&lt;p&gt;Next generation sequencing is enabling the study of genomic diversity at unprecedented levels. While most of this research has focused on single base pair differences (single nucleotide polymorphisms, SNPs), larger genomic differences (called structural variations, SVs) can also have an impact on the evolution of an organism, on traits and on diseases. SVs are usually loosely defined as events that are at least 50 base pairs long. They are often classified into five subtypes: deletions, duplications, new sequence insertions, inversions and translocations.&lt;/p&gt;

&lt;p&gt;In recent years, the impact of SVs has been characterized in many organisms. For example, SVs play a role in cancer, where duplications often lead to multiple copies of important oncogenes. Furthermore, SVs are known to play a role in other human disorders such as autism and obesity.&lt;/p&gt;

&lt;h2&gt;&amp;hellip; but calling structural variants remains challenging&lt;/h2&gt;

&lt;p&gt;In principle, identifying SVs seems trivial: just map paired-end reads to a reference genome, look for any abnormally spaced pairs or split reads (i.e. reads with parts mapping to different regions), and&amp;mdash;boom&amp;mdash;structural variants!&lt;/p&gt;

&lt;p&gt;In practice, things are much harder. This is partly due to the frustrating tendency for SVs to occur in or near repetitive regions, where short read sequencing struggles to disambiguate the reads, or in highly variable regions of the genome such as the chromosome ends, which tend to be the tinkering workshop of the genome.&lt;/p&gt;

&lt;p&gt;As a result, a large proportion of SVs&amp;mdash;typically at least 30-40%&amp;mdash;remain undetected. As for false discovery rates (proportion of wrongly inferred SVs), they are mostly not well known because validating SVs on real data is very laborious.&lt;/p&gt;

&lt;h2&gt;Fission yeast: a compelling model to study structural variants&lt;/h2&gt;

&lt;p&gt;Schizosaccharomyces pombe is especially well suited to studying structural variants because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The genome is small, well-annotated and simple (few repeats, haploid).&lt;/li&gt;
&lt;li&gt;We had 40x or more coverage over 161 genomes covering the worldwide known population of S. pombe. &lt;/li&gt;
&lt;li&gt;We had more than 220 accurate trait measurements for these strains at hand. Since the traits are measured under strictly controlled conditions, they contain little (if any) environmental variance&amp;mdash;in stark contrast to human traits.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;SURVIVOR makes the most out of (imperfect) SV callers&lt;/h2&gt;

&lt;p&gt;To infer accurate SV calls, we introduced SURVIVOR, a consensus method to reduce the false discovery rate while maintaining high sensitivity. Using simulated data, we observed that consensus calls obtained from two to three different SV callers could recover most SVs while keeping the false discovery rate in check. For example, SURVIVOR performed second best with a 70% sensitivity (best was Delly: 75%), while the false discovery rate was significantly reduced to 1% (Delly: 13%). (But remember these figures are based on simulation; performance on real data is likely worse.) Furthermore, we equipped SURVIVOR with different methods to simulate data sets and evaluate callers; merge data from different samples; compute regions of poor mappability (BED file), etc. SURVIVOR is written in C++ so it&amp;rsquo;s fast enough to run on large genomes as well. Since then, we have been running it on multiple human data sets, which takes only a few minutes on a laptop. SURVIVOR is &lt;a href="http://github.com/fritzsedlazeck/SURVIVOR"&gt;available on GitHub&lt;/a&gt;.&lt;/p&gt;
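
&lt;p&gt;The idea of a consensus call can be illustrated with a toy sketch in Python. This is not how SURVIVOR is implemented: the caller names, the calls, the 1 kb binning and the consensus_calls helper are all assumptions made for illustration. An SV is kept only if enough callers report the same type on the same chromosome in the same bin (a real merger would also need to handle calls that fall just either side of a bin boundary).&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import defaultdict

# Hypothetical calls per caller: (chromosome, SV type, start position)
calls_by_caller = {
    'caller_A': [('I', 'DEL', 10200), ('II', 'DUP', 500300)],
    'caller_B': [('I', 'DEL', 10450), ('III', 'INV', 77000)],
    'caller_C': [('I', 'DEL', 10900), ('II', 'DUP', 500100)],
}

def consensus_calls(calls_by_caller, min_support=2, bin_size=1000):
    '''Keeps SVs reported by at least min_support callers in the same bin.'''
    supporters = defaultdict(set)
    for caller, calls in calls_by_caller.items():
        for chrom, sv_type, start in calls:
            # calls of the same type starting in the same bin count as one SV
            supporters[(chrom, sv_type, start // bin_size)].add(caller)
    # acceptable numbers of supporting callers: min_support up to all callers
    enough = range(min_support, len(calls_by_caller) + 1)
    return sorted(key for key, callers in supporters.items() if len(callers) in enough)

print(consensus_calls(calls_by_caller))
# [('I', 'DEL', 10), ('II', 'DUP', 500)]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this toy data, the inversion reported by a single caller is dropped, which is how a consensus trades a little sensitivity for a lower false discovery rate.&lt;/p&gt;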

&lt;h2&gt;SVs: now you see me, now you don&amp;rsquo;t&lt;/h2&gt;

&lt;p&gt;We applied SURVIVOR to our 161 genomic data sets, and then manually vetted all our calls to obtain a trustworthy set of SVs. We then discovered something suspicious. Some groups of strains that were &lt;em&gt;very&lt;/em&gt; closely related (essentially clonal, differing by &amp;lt;150 SNPs) had different numbers of duplications, or different numbers of copies in duplications (1x, 2x, even 6x). This observation was also validated with lab experiments.&lt;/p&gt;

&lt;p&gt;Interestingly, we identified 15 duplications that were shared between the more diverse non-clonal strains (so these must have been shared during evolution) but could not be explained by the tree inferred from SNPs (Figure 1). To confirm this, we compared the local phylogeny of SNPs in 20kb windows up and downstream of the duplications with the variance in copy numbers. Oddly, the copy number variance was not highly correlated with the SNP tree. This led to the conclusion that some SVs are transient and thus are gained or lost faster than SNPs.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2017/02/SV_tree.jpg"&gt;&lt;img width="100%" alt="Tree reconstructed from SNPs, with coloured dots indicating strains with identical SVs." src="https://lab.dessimoz.org/blog/media/2017/02/SV_tree.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;&lt;b&gt;Duplications happen within near-clonal populations&lt;/b&gt; Phylogenetic tree of the strains reconstructed from SNPs data, with eight pairs of very close strains that nonetheless show structural variation. Click to enlarge.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;Though this transience came as a surprise, there is actually supporting evidence from laboratory experiments carried out by Tony Carr back in 1989 that duplications can occur frequently in laboratory-reared S. pombe, and can revert (Carr &lt;em&gt;et al.&lt;/em&gt; 1989). The high turnover raises the possibility that SVs could be an important source for environmental adaptation.&lt;/p&gt;

&lt;h2&gt;SVs affect spore viability and are associated with several traits&lt;/h2&gt;

&lt;p&gt;We then investigated the phenotypic impact of these SVs. We used the 220 trait measurements from previous publications. We observed an inverse correlation between rearrangement distance and spore viability, confirming reports in other species that SVs can contribute to reproductive isolation. We also found a link between copy number variation and two traits relevant to wine making (malic acid accumulation and glucose+fructose utilisation) (Benito et al. PLOS ONE 2016).&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2017/02/SV_pombe.jpg"&gt;&lt;img width="100%" alt="plots from SV paper showing the relationship between structural variants and spore viability, as well as the contribution of SVs to trait heritability" src="https://lab.dessimoz.org/blog/media/2017/02/SV_pombe.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption"&gt;&lt;b&gt;Structural variants, reproductive isolation, and wine&lt;/b&gt;. A) Making crosses between fission yeast strains often results in low offspring survival. The theory is that rearrangements (inversions and translocations) cause errors during meiosis, so we might expect them to affect offspring viability. If we compare offspring viability from crosses with the number of rearrangements that the parents differ by, there is a correlation, and a &amp;lsquo;forbidden triangle&amp;rsquo; in the top right of the plot (it seem impossible to produce high viability spores when parents have many unshared rearrangements). B) SVs also affect traits. For &amp;gt; 200 traits (vertical bars) we used [LDAK](http://dougspeed.com/ldak/) to estimate the proportion of the narrow sense heritability that was caused by copy number variants (red), rearrangements (black) and SNPs (grey). Some traits are very strongly affected by copy number variants, such as the wine-making traits (wine-colored bars along the x-axis). C) Fission yeast wine tasting at UCL&amp;mdash;how much of the taste is due structural variants? (J&amp;uuml;rg B&amp;auml;hler at right).&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;We used the estimation of narrow sense heritability from Doug Speed&amp;rsquo;s &lt;a href="http://dougspeed.com/ldak/"&gt;LDAK program&lt;/a&gt;. &lt;em&gt;Narrow sense heritability&lt;/em&gt; estimates how much of a difference in a trait between individuals can be explained by adding up all the tiny effects of the genomic differences (in our case SNPs; deletions and duplications; inversions and translocations; and all combined). Overall, we found the heritability was better explained when combining the SNP data with the SV data. In 45 traits SVs explained 25% or more of the trait variability. Five traits that were explained by over 90% heritability using SNPs and SVs came from different growth conditions in liquid medium. This may highlight again the influence of environmental conditions on the genomic structure. For 74 traits (~30% of those we analyzed) SVs explain more of the trait than the SNPs. These high SV-affected traits include malic acid, acetic acid and glucose/fructose contents of wine, key components of taste.&lt;/p&gt;

&lt;h2&gt;A collaborative effort&lt;/h2&gt;

&lt;p&gt;On a personal note, the paper concludes a wonderful team effort over two and a half years.&lt;/p&gt;

&lt;p&gt;The project started as a summer project for Clemency Jolly, who had then just completed her 3rd undergraduate year at UCL, in the Dessimoz and B&amp;auml;hler labs. Dan Jeffares and the rest of the B&amp;auml;hler lab had just published their 161 fission yeast genomes, with an in-depth analysis of the association between SNPs and quantitative traits (Jeffares &lt;em&gt;et al.&lt;/em&gt;, Nature Genetics 2015). Studying SVs was the logical next step, but given the challenging nature of reliable SV calling, we also recruited to the team Fritz Sedlazeck, a collaborator and expert in tool development for NGS data analysis, then based in Mike Schatz&amp;rsquo;s lab at Cold Spring Harbor Laboratory.&lt;/p&gt;

&lt;p&gt;At the end of the summer, it was clear that we were onto something, but there was still a lot to be done. Clemency turned the work into her Master&amp;rsquo;s project, with Dan and Fritz redoubling their efforts until Clemency&amp;rsquo;s graduation in summer 2015. It took another year of intense work led by Dan and Fritz to verify the calls, perform the GWAS and heritability analyses, and publish the work. Since then, Clemency has started her PhD at the Crick Institute, Fritz has moved to Johns Hopkins University, and Dan has started his own lab at the University of York.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;h2&gt;References:&lt;/h2&gt;

&lt;p&gt;&lt;span class="Z3988" title="ctx_ver=Z39.88-2004&amp;amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;amp;rft.jtitle=Nature+Communications&amp;amp;rft_id=info%3Adoi%2F10.1038%2Fncomms14061&amp;amp;rfr_id=info%3Asid%2Fresearchblogging.org&amp;amp;rft.atitle=Transient+structural+variations+have+strong+effects+on+quantitative+traits+and+reproductive+isolation+in+fission+yeast&amp;amp;rft.issn=2041-1723&amp;amp;rft.date=2017&amp;amp;rft.volume=8&amp;amp;rft.issue=&amp;amp;rft.spage=14061&amp;amp;rft.epage=&amp;amp;rft.artnum=http%3A%2F%2Fwww.nature.com%2Fdoifinder%2F10.1038%2Fncomms14061&amp;amp;rft.au=Jeffares%2C+D.&amp;amp;rft.au=Jolly%2C+C.&amp;amp;rft.au=Hoti%2C+M.&amp;amp;rft.au=Speed%2C+D.&amp;amp;rft.au=Shaw%2C+L.&amp;amp;rft.au=Rallis%2C+C.&amp;amp;rft.au=Balloux%2C+F.&amp;amp;rft.au=Dessimoz%2C+C.&amp;amp;rft.au=B%C3%A4hler%2C+J.&amp;amp;rft.au=Sedlazeck%2C+F.&amp;amp;rfe_dat=bpr3.included=1;bpr3.tags=Biology%2CBioinformatics%2C+Computational+Biology%2C+Evolutionary+Biology%2C+Genetics%2C+Reproduction"&gt;Jeffares, D., Jolly, C., Hoti, M., Speed, D., Shaw, L., Rallis, C., Balloux, F., Dessimoz, C., B&amp;auml;hler, J., &amp;amp; Sedlazeck, F. (2017). Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast &lt;span style="font-style: italic;"&gt;Nature Communications, 8&lt;/span&gt; DOI: &lt;a rev="review" href="http://dx.doi.org/10.1038/ncomms14061"&gt;10.1038/ncomms14061&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="Z3988" title="ctx_ver=Z39.88-2004&amp;amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;amp;rft.jtitle=Molecular+%26+general+genetics+%3A+MGG&amp;amp;rft_id=info%3Apmid%2F2674650&amp;amp;rfr_id=info%3Asid%2Fresearchblogging.org&amp;amp;rft.atitle=Molecular+cloning+and+sequence+analysis+of+mutant+alleles+of+the+fission+yeast+cdc2+protein+kinase+gene%3A+implications+for+cdc2%2B+protein+structure+and+function.&amp;amp;rft.issn=0026-8925&amp;amp;rft.date=1989&amp;amp;rft.volume=218&amp;amp;rft.issue=1&amp;amp;rft.spage=41&amp;amp;rft.epage=9&amp;amp;rft.artnum=&amp;amp;rft.au=Carr+AM&amp;amp;rft.au=MacNeill+SA&amp;amp;rft.au=Hayles+J&amp;amp;rft.au=Nurse+P&amp;amp;rfe_dat=bpr3.included=1;bpr3.tags=Biology%2CMolecular+Biology"&gt;Carr AM, MacNeill SA, Hayles J, &amp;amp; Nurse P (1989). Molecular cloning and sequence analysis of mutant alleles of the fission yeast cdc2 protein kinase gene: implications for cdc2+ protein structure and function. &lt;span style="font-style: italic;"&gt;Molecular &amp;amp; general genetics : MGG, 218&lt;/span&gt; (1), 41-9 PMID: &lt;a rev="review" href="http://www.ncbi.nlm.nih.gov/pubmed/2674650"&gt;2674650&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="Z3988" title="ctx_ver=Z39.88-2004&amp;amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;amp;rft.jtitle=Nature+Genetics&amp;amp;rft_id=info%3Adoi%2F10.1038%2Fng.3215&amp;amp;rfr_id=info%3Asid%2Fresearchblogging.org&amp;amp;rft.atitle=The+genomic+and+phenotypic+diversity+of+Schizosaccharomyces+pombe&amp;amp;rft.issn=1061-4036&amp;amp;rft.date=2015&amp;amp;rft.volume=47&amp;amp;rft.issue=3&amp;amp;rft.spage=235&amp;amp;rft.epage=241&amp;amp;rft.artnum=http%3A%2F%2Fwww.nature.com%2Fdoifinder%2F10.1038%2Fng.3215&amp;amp;rft.au=Jeffares%2C+D.&amp;amp;rft.au=Rallis%2C+C.&amp;amp;rft.au=Rieux%2C+A.&amp;amp;rft.au=Speed%2C+D.&amp;amp;rft.au=P%C5%99evorovsk%C3%BD%2C+M.&amp;amp;rft.au=Mourier%2C+T.&amp;amp;rft.au=Marsellach%2C+F.&amp;amp;rft.au=Iqbal%2C+Z.&amp;amp;rft.au=Lau%2C+W.&amp;amp;rft.au=Cheng%2C+T.&amp;amp;rft.au=Pracana%2C+R.&amp;amp;rft.au=M%C3%BClleder%2C+M.&amp;amp;rft.au=Lawson%2C+J.&amp;amp;rft.au=Chessel%2C+A.&amp;amp;rft.au=Bala%2C+S.&amp;amp;rft.au=Hellenthal%2C+G.&amp;amp;rft.au=O%27Fallon%2C+B.&amp;amp;rft.au=Keane%2C+T.&amp;amp;rft.au=Simpson%2C+J.&amp;amp;rft.au=Bischof%2C+L.&amp;amp;rft.au=Tomiczek%2C+B.&amp;amp;rft.au=Bitton%2C+D.&amp;amp;rft.au=Sideri%2C+T.&amp;amp;rft.au=Codlin%2C+S.&amp;amp;rft.au=Hellberg%2C+J.&amp;amp;rft.au=van+Trigt%2C+L.&amp;amp;rft.au=Jeffery%2C+L.&amp;amp;rft.au=Li%2C+J.&amp;amp;rft.au=Atkinson%2C+S.&amp;amp;rft.au=Thodberg%2C+M.&amp;amp;rft.au=Febrer%2C+M.&amp;amp;rft.au=McLay%2C+K.&amp;amp;rft.au=Drou%2C+N.&amp;amp;rft.au=Brown%2C+W.&amp;amp;rft.au=Hayles%2C+J.&amp;amp;rft.au=Salas%2C+R.&amp;amp;rft.au=Ralser%2C+M.&amp;amp;rft.au=Maniatis%2C+N.&amp;amp;rft.au=Balding%2C+D.&amp;amp;rft.au=Balloux%2C+F.&amp;amp;rft.au=Durbin%2C+R.&amp;amp;rft.au=B%C3%A4hler%2C+J.&amp;amp;rfe_dat=bpr3.included=1;bpr3.tags=Biology%2CBioinformatics%2C+Computational+Biology%2C+Evolutionary+Biology%2C+Genetics%2C+Reproduction"&gt;Jeffares, D., Rallis, C., Rieux, A., Speed, D., P&amp;#345;evorovsk&amp;yacute;, M., 
Mourier, T., Marsellach, F., Iqbal, Z., Lau, W., Cheng, T., Pracana, R., M&amp;uuml;lleder, M., Lawson, J., Chessel, A., Bala, S., Hellenthal, G., O&amp;rsquo;Fallon, B., Keane, T., Simpson, J., Bischof, L., Tomiczek, B., Bitton, D., Sideri, T., Codlin, S., Hellberg, J., van Trigt, L., Jeffery, L., Li, J., Atkinson, S., Thodberg, M., Febrer, M., McLay, K., Drou, N., Brown, W., Hayles, J., Salas, R., Ralser, M., Maniatis, N., Balding, D., Balloux, F., Durbin, R., &amp;amp; B&amp;auml;hler, J. (2015). The genomic and phenotypic diversity of Schizosaccharomyces pombe &lt;span style="font-style: italic;"&gt;Nature Genetics, 47&lt;/span&gt; (3), 235-241 DOI: &lt;a rev="review" href="http://dx.doi.org/10.1038/ng.3215"&gt;10.1038/ng.3215&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span class="Z3988" title="ctx_ver=Z39.88-2004&amp;amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;amp;rft.jtitle=PLOS+ONE&amp;amp;rft_id=info%3Adoi%2F10.1371%2Fjournal.pone.0151102&amp;amp;rfr_id=info%3Asid%2Fresearchblogging.org&amp;amp;rft.atitle=Selected+Schizosaccharomyces+pombe+Strains+Have+Characteristics+That+Are+Beneficial+for+Winemaking&amp;amp;rft.issn=1932-6203&amp;amp;rft.date=2016&amp;amp;rft.volume=11&amp;amp;rft.issue=3&amp;amp;rft.spage=0&amp;amp;rft.epage=&amp;amp;rft.artnum=http%3A%2F%2Fdx.plos.org%2F10.1371%2Fjournal.pone.0151102&amp;amp;rft.au=Benito%2C+A.&amp;amp;rft.au=Jeffares%2C+D.&amp;amp;rft.au=Palomero%2C+F.&amp;amp;rft.au=Calder%C3%B3n%2C+F.&amp;amp;rft.au=Bai%2C+F.&amp;amp;rft.au=B%C3%A4hler%2C+J.&amp;amp;rft.au=Benito%2C+S.&amp;amp;rfe_dat=bpr3.included=1;bpr3.tags=Biology%2CBioinformatics%2C+Computational+Biology%2C+Evolutionary+Biology%2C+Genetics%2C+Reproduction"&gt;Benito, A., Jeffares, D., Palomero, F., Calder&amp;oacute;n, F., Bai, F., B&amp;auml;hler, J., &amp;amp; Benito, S. (2016). Selected Schizosaccharomyces pombe Strains Have Characteristics That Are Beneficial for Winemaking &lt;span style="font-style: italic;"&gt;PLOS ONE, 11&lt;/span&gt; (3) DOI: &lt;a rev="review" href="http://dx.doi.org/10.1371/journal.pone.0151102"&gt;10.1371/journal.pone.0151102&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;More info&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://danieljeffares.com"&gt;Dan Jeffares&amp;rsquo;s website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://schatzlab.cshl.edu/people/fsedlaze/"&gt;Fritz Sedlazeck&amp;rsquo;s website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.bahlerlab.info/home/"&gt;The B&amp;auml;hler lab&amp;rsquo;s website&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>How to access scientific papers for free?</title><link>https://lab.dessimoz.org/blog/2017/01/12/how-to-access-the-scientific-literature-for-free</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2017/01/12/how-to-access-the-scientific-literature-for-free</guid><dc:creator>Christophe Dessimoz</dc:creator><pubDate>Thu, 12 Jan 2017 14:55:39 +0000</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;You have a reference to a research paper of interest.&lt;/p&gt;

&lt;p&gt;Perhaps this one:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Iantorno S &lt;em&gt;et al&lt;/em&gt;, &lt;em&gt;Who watches the watchmen? an appraisal of benchmarks for multiple sequence alignment&lt;/em&gt;, Multiple Sequence Alignment Methods (D Russell, Editor), Methods in Molecular Biology, 2014, Springer Humana, Vol. 1079. doi:10.1007/978-1-62703-646-7_4&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;How can you retrieve the full article?&lt;/p&gt;

&lt;h1&gt;Gold open access&lt;/h1&gt;

&lt;p&gt;If the paper is published as &lt;a href="https://en.wikipedia.org/wiki/Open_access#Journals:_gold_open_access"&gt;&amp;ldquo;gold&amp;rdquo; open access&lt;/a&gt;, you can download the PDF from the publisher&amp;rsquo;s website. In such cases, it&amp;rsquo;s easiest to &lt;a href="http://lmgtfy.com/?q=Iantorno+Who+watches+the+watchmen%3F"&gt;paste the title and author names in Google&lt;/a&gt; and look for the first hit in the journal.&lt;/p&gt;

&lt;p&gt;Alternatively, if you know the Digital Object Identifier (DOI) of the article, prepending http://doi.org/ should directly lead you to the article. In this case, the
DOI is included at the end of the citation. (If you don&amp;rsquo;t know its DOI, you can find it on the publisher&amp;rsquo;s website or in a database such as &lt;a href="http://pubmed.com"&gt;PubMed&lt;/a&gt;.) We get:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://doi.org/10.1007/978-1-62703-646-7_4"&gt;http://doi.org/10.1007/978-1-62703-646-7_4&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, we see that this particular paper is behind a paywall at the publisher.&lt;/p&gt;

&lt;h1&gt;Green open access&lt;/h1&gt;

&lt;p&gt;There is, however, still a chance that it might be deposited in a preprint server or in an institutional repository. This is referred to as &lt;a href="https://en.wikipedia.org/wiki/Open_access#Self-archiving:_green_open_access"&gt;&amp;ldquo;green&amp;rdquo; open access&lt;/a&gt;. One way to find out is to look at the article record in &lt;a href="https://scholar.google.com/scholar?hl=en&amp;amp;as_sdt=0%2C5&amp;amp;q=who+watches+the+watchmen+benchmarks&amp;amp;btnG="&gt;Google Scholar&lt;/a&gt; and look for a link in the &lt;em&gt;right&lt;/em&gt; margin:&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2017/01/googlescholar_rightmargin.png"&gt;&lt;img width="100%" alt="screenshot of google scholar with link in the right margin circled" src="https://lab.dessimoz.org/blog/media/2017/01/googlescholar_rightmargin.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, the paper is thus available &lt;a href="https://arxiv.org/pdf/1211.2160"&gt;on arXiv.org&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you know the DOI, an even quicker way of looking for a deposited version is 
by using the http://oadoi.org redirection tool, which works analogously to 
doi.org but redirects to a green &lt;strong&gt;o&lt;/strong&gt;pen &lt;strong&gt;a&lt;/strong&gt;ccess version whenever possible:&lt;/p&gt;

&lt;p&gt;&lt;a href="http://oadoi.org/10.1007/978-1-62703-646-7_4"&gt;http://oadoi.org/10.1007/978-1-62703-646-7_4&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, a free version of the article deposited in the UCL institutional repository is found.&lt;/p&gt;
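
&lt;p&gt;The prefix-plus-DOI pattern shared by both redirection services can be written down in a few lines of Python. This is only a sketch: the function name &lt;code&gt;resolver_url&lt;/code&gt; is made up for illustration, and percent-encoding the DOI with &lt;code&gt;urllib.parse.quote&lt;/code&gt; is an added precaution, not something either service is known to require.&lt;/p&gt;

```python
from urllib.parse import quote

# Resolver prefixes mentioned in the post.
DOI_RESOLVER = "http://doi.org/"
OADOI_RESOLVER = "http://oadoi.org/"


def resolver_url(doi, resolver=DOI_RESOLVER):
    """Build a resolver link by prepending the resolver prefix to a DOI."""
    # Keep "/" unescaped so the separator between DOI prefix and suffix survives.
    return resolver + quote(doi.strip(), safe="/")


doi = "10.1007/978-1-62703-646-7_4"
print(resolver_url(doi))                  # publisher landing page
print(resolver_url(doi, OADOI_RESOLVER))  # green open access copy, if one exists
```

&lt;p&gt;For the example DOI, the two printed links are identical to the ones shown above.&lt;/p&gt;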

&lt;p&gt;If you use Chrome or Firefox, you can also use the &lt;a href="http://unpaywall.org/"&gt;Unpaywall&lt;/a&gt; browser extension to automatically get a link to green open access alternatives as you land on paywalled articles.&lt;/p&gt;

&lt;h1&gt;On the author&amp;rsquo;s homepage&lt;/h1&gt;

&lt;p&gt;Sometimes, the paper is available on the homepage of one of the authors.  In 
this case, a link to the preprint is provided on the &lt;a href="http://lab.dessimoz.org/publications"&gt;homepage of the corresponding 
author&lt;/a&gt; (item #36).&lt;/p&gt;

&lt;h1&gt;ResearchGate&lt;/h1&gt;

&lt;p&gt;Instead of an institutional homepage, some authors self-archive their articles on &lt;a href="http://researchgate.net"&gt;ResearchGate&lt;/a&gt;. In the case of our paper, the full-text version is &lt;a href="https://www.researchgate.net/publication/258147986_Who_Watches_the_Watchmen_An_Appraisal_of_Benchmarks_for_Multiple_Sequence_Alignment"&gt;indeed directly available&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And otherwise, if one of the authors is active on ResearchGate, it&amp;rsquo;s also possible to send a full-text request at the click of a button.&lt;/p&gt;

&lt;h1&gt;Pirated version off Sci-Hub&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Sci-Hub"&gt;Sci-Hub&lt;/a&gt; serves bootleg copies of pay-walled articles. This is illegal, so I only mention it for educational purposes. This works most reliably using, again, DOIs:&lt;/p&gt;

&lt;p&gt;http://sci-hub.tw/10.1007/978-1-62703-646-7_4&lt;/p&gt;

&lt;p&gt;If you, purely hypothetically of course, pasted that URL in your browser, you would or would not get a PDF of &lt;em&gt;the entire book&lt;/em&gt; in which the referenced article appears.&lt;/p&gt;

&lt;h1&gt;#icanhazpdf&lt;/h1&gt;

&lt;p&gt;It&amp;rsquo;s also possible to request full-text articles via Twitter. As described in &lt;a href="https://en.wikipedia.org/wiki/ICanHazPDF"&gt;Wikipedia&lt;/a&gt;, this works by tweeting the article title, its DOI, an email address (to indicate to whom the article should be sent), and the hashtag &lt;a href="http://twitter.com/#icanhazpdf"&gt;#icanhazpdf&lt;/a&gt;. Someone with access to the article might send a copy via email. Once the article is received, the tweet is deleted.  Again, I mention this for educational purposes only&amp;mdash;don&amp;rsquo;t break the law.&lt;/p&gt;

&lt;p&gt;&lt;a href="/blog/media/2017/01/icanhazpdf.jpg"&gt;&lt;img width="50%" alt="cat asking whether it can haz pdf because cat is purr" src="https://lab.dessimoz.org/blog/media/2017/01/icanhazpdf.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="caption" style="font-size:80%;padding-top:4px;"&gt;Image credit: &lt;a href="http://www.fieldofscience.com/2015/10/fieldnotes-milk-skurvy-bullying.html"&gt;Field of Science&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Altmetric.com wrote an &lt;a href="https://www.altmetric.com/blog/interactions-the-numbers-behind-icanhazpdf/"&gt;interesting post on #icanhazpdf&lt;/a&gt; a few years ago.&lt;/p&gt;

&lt;h1&gt;Email the corresponding author&lt;/h1&gt;

&lt;p&gt;Finally, you can always ask the corresponding author by email for a copy of their article. They will happily oblige.&lt;/p&gt;

&lt;p&gt;&amp;nbsp;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;[Update (19 Mar 2017): added mention of &lt;a href="http://unpaywall.org"&gt;unpaywall&lt;/a&gt; to seamlessly retrieve green open access]&lt;/em&gt;&lt;/p&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item><item><title>Life as an academic: my 2016 in numbers</title><link>https://lab.dessimoz.org/blog/2016/12/29/my-2016-in-numbers</link><guid isPermaLink="false">https://lab.dessimoz.org/blog/2016/12/29/my-2016-in-numbers</guid><dc:creator>Christophe Dessimoz</dc:creator><pubDate>Thu, 29 Dec 2016 22:25:52 +0000</pubDate><description>&lt;!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"&gt;
&lt;html&gt;&lt;body&gt;&lt;p&gt;Life as an academic is varied and busy. Students sometimes believe that all we do is teach. In fact, we do quite a few other things. Here&amp;rsquo;s my 2016 in numbers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;number of papers published: &lt;a href="http://lab.dessimoz.org/publications"&gt;10&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;number of paper rejections: 7&lt;/li&gt;
&lt;li&gt;number of books edited: &lt;a href="http://gohandbook.org"&gt;1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;number of grant proposals submitted: 8&lt;/li&gt;
&lt;li&gt;number of research contracts negotiated with the industry: 2&lt;/li&gt;
&lt;li&gt;number of blog posts: &lt;a href="http://lab.dessimoz.org/blog"&gt;5&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;number of tweets: &lt;a href="https://twitter.com/cdessimoz"&gt;474&lt;/a&gt; (66% were retweets)&lt;/li&gt;
&lt;li&gt;number of YouTube videos: &lt;a href="https://www.youtube.com/watch?v=uCjcpw99yNo&amp;amp;t=23s"&gt;1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;number of papers reviewed: &lt;a href="https://publons.com/author/727758/year-in-review"&gt;24&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;number of papers edited: 3&lt;/li&gt;
&lt;li&gt;number of grants reviewed: 3&lt;/li&gt;
&lt;li&gt;number of PhD theses examined: 2&lt;/li&gt;
&lt;li&gt;number of emails received (excluding spam and mailing-lists): 12,695&lt;/li&gt;
&lt;li&gt;number of emails written: 4,377 (!)&lt;/li&gt;
&lt;li&gt;number of minutes videoconferencing on GoToMeeting: 13,236 (!!)&lt;/li&gt;
&lt;li&gt;number of Geneva-London-Geneva roundtrips: 12&lt;/li&gt;
&lt;li&gt;number of meetings with &amp;gt;50 attendees co-organised: 6&lt;/li&gt;
&lt;li&gt;number of seminars hosted: 4&lt;/li&gt;
&lt;li&gt;number of conferences attended: 3&lt;/li&gt;
&lt;li&gt;number of talks given: 11&lt;/li&gt;
&lt;li&gt;number of semester-long courses organised: 2&lt;/li&gt;
&lt;li&gt;number of hours lectured: 32&lt;/li&gt;
&lt;li&gt;number of 2000-word student papers marked: 47&lt;/li&gt;
&lt;li&gt;number of summer students supervised: 4&lt;/li&gt;
&lt;li&gt;number of overnight retreats attended: 4&lt;/li&gt;
&lt;li&gt;number of work Christmas dinners attended: 3&lt;/li&gt;
&lt;li&gt;number of annual reports written: 3 (this does not count)&lt;/li&gt;
&lt;li&gt;number of &lt;a href="http://www.cheesesfromswitzerland.com/cheese-assortment/tete-de-moine-aop.html"&gt;T&amp;ecirc;te de Moine&lt;/a&gt; eaten at lab celebrations: 4&lt;/li&gt;
&lt;li&gt;number of times moved home: 0 (noteworthy since we moved 5 times in the preceding 5 years&amp;hellip;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wish you, Dear Reader, all the best in 2017!&lt;/p&gt;
&lt;/body&gt;&lt;/html&gt;
</description></item></channel></rss>
