Insightful Visualization of Bioinformatics Data

Author: Christophe Dessimoz

Bioinformatics analyses often consist of looking for interesting signals in large amounts of data. But in my current work environment (Darwin scripts with occasional gnuplot and R plots), I find it both conceptually difficult and practically tedious to produce insightful visual representations of my data. There are large scientific benefits in finding new visual representations of bioinformatics data, and in simplifying the process of data exploration in general.

This is not to say that there are no such examples. In fact, some excellent representations exist, and tools to easily produce them have been developed. I am listing a few of them off the top of my head here as inspiration and a starting point for future ideas:

Sequence logo

Sequence logos, introduced in 1990 by Schneider and Stephens, are a very clever way of displaying consensus sequences. To take a classical example, the promoter sequence of many eukaryotic genes contains a TATA-box, perhaps the best-known transcription factor recognition site:

Sequence logo

Source: http://www.cbs.dtu.dk/staff/dave/roanoke/genetics980320f.htm

The height of a character depicts its degree of conservation in bits of information. This metric makes sense because it is related to thermodynamic energy. Perhaps more importantly from a visual point of view, the logarithmic nature of bits makes strongly conserved characters stand much taller than they would if their height were proportional to their probability. As a result, the figure concentrates resolutely on signal and wastes no space on noise!
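To make the "bits" concrete, here is a minimal Python sketch of how letter heights can be computed for a DNA logo. It assumes the usual formulation, in which the information at a position is the maximum entropy (2 bits for four nucleotides) minus the observed entropy, and each letter's height is its frequency times that information. The function name `logo_heights` and the toy TATA-like alignment are made up for illustration, and the small-sample correction used in the original paper is left out for simplicity.

```python
import math
from collections import Counter

def logo_heights(sequences, alphabet="ACGT"):
    """Return, for each alignment column, a dict {letter: height in bits}.

    Simplified sketch: no small-sample correction is applied.
    """
    if not sequences:
        raise ValueError("need at least one sequence")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must all have the same length")

    max_bits = math.log2(len(alphabet))  # 2 bits for DNA
    heights = []
    for col in range(length):
        counts = Counter(s[col] for s in sequences)
        total = sum(counts[a] for a in alphabet)
        if total == 0:
            # Column contains no letter from the alphabet (e.g. only gaps).
            heights.append({a: 0.0 for a in alphabet})
            continue
        freqs = {a: counts[a] / total for a in alphabet}
        # Shannon entropy of the column, in bits.
        entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
        info = max_bits - entropy  # total height of the letter stack
        heights.append({a: f * info for a, f in freqs.items()})
    return heights

# Toy alignment (hypothetical data, not taken from a real promoter set).
toy = ["TATAAA", "TATAAA", "TATATA", "TATAAA"]
heights = logo_heights(toy)
```

In this toy alignment, the fully conserved first column gives T the full 2 bits, whereas in the fifth column A occurs in 75% of the sequences yet reaches less than half of that height. A column in which all four nucleotides are equally frequent would get no height at all, which is why the logo spends no space on noise.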

Circular Phylogenetic Trees

Visualizing phylogenetic trees of life using traditional representations becomes difficult for more than about 100 leaves. The circular tree representation has been popularized by iTOL from Letunic and Bork:

itol

Source: Wikipedia

The downside of this representation is that since all leaves are distributed at constant angular intervals, closely related leaves can be far apart, while distant leaves can be adjacent. This problem is partly mitigated by changes in label color, but this can only be effective for the top few levels. 
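The effect can be illustrated with a small Python sketch. It assumes a toy tree written as nested tuples with unit branch lengths, places the leaves at constant angular intervals in traversal order, and compares angular neighbours with their distance in the tree. All names here (`leaf_angles`, `tree_distance`, the leaves A to F) are hypothetical, and this is not how iTOL itself is implemented.

```python
import math

def leaf_order(tree):
    """Flatten a nested-tuple tree into its leaves, left to right."""
    if isinstance(tree, str):
        return [tree]
    leaves = []
    for child in tree:
        leaves.extend(leaf_order(child))
    return leaves

def leaf_angles(tree):
    """Assign each leaf an angle (radians) at constant angular intervals."""
    leaves = leaf_order(tree)
    step = 2 * math.pi / len(leaves)
    return {name: i * step for i, name in enumerate(leaves)}

def path_to(tree, name, prefix=()):
    """Return the tuple of child indices leading from the root to a leaf."""
    if isinstance(tree, str):
        return prefix if tree == name else None
    for i, child in enumerate(tree):
        found = path_to(child, name, prefix + (i,))
        if found is not None:
            return found
    return None

def tree_distance(tree, a, b):
    """Number of edges between leaves a and b, assuming unit branch lengths."""
    pa, pb = path_to(tree, a), path_to(tree, b)
    if pa is None or pb is None:
        raise KeyError("leaf not found in tree")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return len(pa) + len(pb) - 2 * common

# Toy tree with six leaves (hypothetical example).
tree = ((("A", "B"), "C"), ("D", ("E", "F")))
angles = leaf_angles(tree)
```

With this tree, A and B are siblings (tree distance 2) and sit next to each other, but F and A are also angular neighbours because the circle wraps around, even though they are six edges apart, the largest distance in the tree.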

Circos - Genome visualization

The following page shows stunning genome visualizations, also based on the idea of a circular representation:

circos

Circos: visualizing the genome, among other things

Be sure to have a look at their poster too…

Visual Complexity

The Visual Complexity page is a repository of complex representations of networks, and includes a number of examples from biology:

circos

Source: http://www.visualcomplexity.com/vc/

References

T. D. Schneider and R. M. Stephens, Sequence Logos: A New Way to Display Consensus Sequences (1990) Nucl. Acids Res. 18: 6097-6100

I. Letunic and P. Bork, Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation (2006) Bioinformatics 23(1): 127-128



The Dessimoz Lab blog is licensed under a Creative Commons Attribution 4.0 International License.