OMA HOW TOs

Guide 1: How to Find Your Favorite Gene in OMA

Guide 2: How to get basic information on Your Favorite Gene and genome

Guide 3: How to get the orthologs of Your Favorite Gene in OMA

Guide 4: How to get the evolutionary history of Your Favorite Gene in OMA

Guide 1: How to Find Your Favorite Gene in OMA

The first step to using OMA is finding your favorite gene (YFG). YFG might be in a journal article, with its assigned accession number or annotation identifier. Alternatively, you may only have a sequence of YFG, and want to find the closest match in OMA. This guide will show how to find it in the OMA browser. OMA can be found at omabrowser.org.

When working with the OMA database, the main identifier you will see used on the browser is the “OMA ID.” This is a unique identifier for each canonical gene in the database, consisting of the 5-letter UniProt species code followed by a 5-digit number. However, it is also possible to search by external identifiers.
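To make the shape of an OMA ID concrete, here is a minimal Python sketch that splits one into its species code and gene number. The function name and the regular expression are illustrative assumptions, not part of OMA; in particular, allowing digits in the species code is a guess based on how UniProt species codes look in general.

```python
import re

# Assumption: an OMA ID is a 5-character species code followed by exactly
# 5 digits, as described above. Allowing digits after the first letter of
# the species code is a guess, not something OMA documents here.
OMA_ID_PATTERN = re.compile(r"^([A-Z][A-Z0-9]{4})(\d{5})$")


def parse_oma_id(identifier):
    """Return (species_code, gene_number), or None if the shape does not match."""
    match = OMA_ID_PATTERN.match(identifier.strip().upper())
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_oma_id("SOLLC07325"))      # ('SOLLC', 7325)
print(parse_oma_id("Solyc11g011260"))  # None: this is an external identifier
```

A check like this only tells you whether a string looks like an OMA ID; whether the gene actually exists still has to be confirmed in the browser.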

Example A: The Tomato DELLA Protein PROCERA Acts in Guard Cells to Promote Stomatal Closure

1. Search by identifier.

Find the accession number or identifier for YFG in the paper. Here, we take the PRO gene, Solyc11g011260 (http://www.plantcell.org/content/29/12/3186-sec-23).

Copy and paste YFG’s identifier in the search bar on the OMA home page.

Tip: If you are lucky, OMA will autocomplete and suggest genes for you. If there’s an exact match, it will automatically take you to the information page of the gene you searched for.

2. Search by keyword.

Select “Full-text search” in the drop down menu and search “DELLA.”

Many hits are found because many genes have DELLA in their name or description. Choose the appropriate one based on the species.

The first suggestion in Solanum lycopersicum is SOLLC07325 (this is the OMA ID).

3. Search by protein sequence.

Copy and paste the sequence into the search bar. The first hit should be your gene.

Tip: Get the protein sequence from a journal article, in-house sequencing data, or an external database. In this example, we copied and pasted the protein sequence of Solyc11g011260 from the Plant Transcription Factor Database. Don’t worry about copying and pasting spaces or line numbers because OMA will ignore them.

Because of the differences in identifiers across genome sources and versions, it is recommended to search by protein sequence in order to avoid any ambiguity!
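If you prefer to tidy a pasted sequence yourself before searching, the kind of cleanup mentioned in the tip above can be sketched in a few lines of Python. This is only an illustration of stripping spaces and line numbers; it is not OMA's own code, and the accepted alphabet and the example sequence are assumptions.

```python
# One-letter amino acid codes plus common ambiguity codes and the stop symbol.
# Assumption: the exact alphabet OMA accepts may differ from this set.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYBZXUO*")


def clean_pasted_sequence(raw):
    """Drop FASTA header lines, digits, whitespace and any other stray characters."""
    residues = []
    for line in raw.splitlines():
        if line.startswith(">"):
            continue  # skip FASTA header lines
        for char in line.upper():
            if char in ALLOWED:
                residues.append(char)
    return "".join(residues)


# A made-up fragment with line numbers and spaces, as copied from a web page.
pasted = """>hypothetical_example
  1 MKRDH HHQQQ
 11 GLSTA
"""
print(clean_pasted_sequence(pasted))  # MKRDHHHQQQGLSTA
```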

Guide 2: How to get basic information on Your Favorite Gene and genome

For each gene, OMA provides several kinds of information, much of it drawn from external sources. In this section, we describe how to access the General Information, IDs and Cross-references, Domain Architecture, and Gene Ontology annotations.

Example: SOLLC07325, the DELLA protein in tomato from Guide 1 Example A.

1. General information

Get the general information of YFG by clicking on the Information tab. This is the landing page after searching YFG.

The OMA ID is displayed at the top, along with an external UniProt ID if relevant. The OMA ID consists of the 5-letter UniProt species code + a unique 5-digit number.

Description: The description of the gene, usually from the genome annotation.
Organism: 5-letter species code and scientific name.
Locus: The location of the gene on the chromosome, taken from the genome source.

Click on the organism name link to go to that species’ information page.

Tip: Here you can find the common name, the NCBI Taxonomy ID, and the number of sequences in the genome (including alternative splice variants, ASVs), along with other information. The number of proteins in matrix is the number of genes with a homolog detected by OMA. You can also see how the genome ranks among the ~2000 species in OMA. The bottom half of the page shows the species that share the most and the fewest OMA groups with this one. OMA groups are special groups of orthologs in which all genes are inferred to be orthologous with each other, with a maximum of one gene per species. Thus, in terms of conservation of OMA groups, SOLTU (Solanum tuberosum; potato) is the species most closely related to tomato.

2. IDs and cross-references

Get the IDs and Cross-references of YFG. This is the second panel under the Information tab.

Tip: Alternative splice variants are usually denoted in Other IDs, with a trailing .1, .2, etc.

Here the alternative IDs for YFG are listed. These may include the UniProtKB/Swiss-Prot ID, RefSeq, EntrezGene, and Other IDs. Other IDs are usually specific to the genome annotation of that particular organism. All alternative splice variants of a given protein are represented by a single OMA ID.

3. Domain Architecture

The 3rd panel on the information tab displays the domains, if available. OMA obtains the domain annotations from Gene3D, and currently 78% of all proteins in OMA have at least 1 domain annotation. Mouse over or click on the domain to see the name, location (amino acid position) and source.

Note that the color hue corresponds to the domain architecture (according to the CATH classification). Different color intensities are used to differentiate domains of the same architecture.

4. Gene Ontology

The 4th panel of the Information tab gives Gene Ontology (GO) annotations, which are important for inferring the function of YFG. OMA provides the GO annotations from the source, and additionally provides GO annotations inferred from OMA orthology relationships. OMA takes annotations supported by the evidence codes EXP, IDA, IPI, IMP, IGI and IEP (http://geneontology.org/page/guide-go-evidencecodes) and propagates them throughout OMA groups.
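The idea of propagating experimentally supported annotations within an OMA group can be illustrated with a small Python sketch. This is a simplified model under stated assumptions, not OMA's actual pipeline: the function name, the data layout, and the gene and GO term names are all hypothetical placeholders.

```python
# Evidence codes named in the text above as the basis for propagation.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def propagate_in_group(group_members, annotations):
    """Suggest GO terms for each member of one OMA group.

    annotations maps gene -> list of (go_term, evidence_code).
    Returns gene -> set of terms that other members carry with an
    experimental evidence code and that the gene itself lacks.
    """
    trusted = set()
    for gene in group_members:
        for term, code in annotations.get(gene, []):
            if code in EXPERIMENTAL_CODES:
                trusted.add((gene, term))

    inferred = {}
    for gene in group_members:
        own_terms = {term for term, _code in annotations.get(gene, [])}
        from_others = {term for source, term in trusted if source != gene}
        inferred[gene] = from_others - own_terms
    return inferred


# Hypothetical group of three genes; "GO:B" is electronic (IEA) and stays put.
group = ["GENE_A", "GENE_B", "GENE_C"]
source_annotations = {
    "GENE_A": [("GO:A", "IDA"), ("GO:B", "IEA")],
    "GENE_B": [("GO:C", "IMP")],
}
result = propagate_in_group(group, source_annotations)
print(sorted(result["GENE_C"]))  # ['GO:A', 'GO:C']
```

Only annotations with experimental support are passed on in this sketch, which mirrors the restriction to the six evidence codes listed above.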

Why should we use the OMA GO annotations?

There may be a difference between the GO annotations predicted by OMA and other databases. More information on how annotations are propagated in OMA:

https://academic.oup.com/nar/article/43/D1/D240/2438427#87022708

Tip: On the web page, mouse over the Evidence codes to find out what they stand for.

5. Protein sequence

Copy the protein sequence of YFG.
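The same information can also be requested programmatically through the OMA REST API (see the tutorial linked at the end of these guides). The sketch below only builds the request URLs; the base URL, the endpoint paths, and the panel names used as dictionary keys are assumptions that should be checked against the current API documentation.

```python
from urllib.parse import quote

# Assumption: base URL and endpoint paths as understood from the OMA REST API
# tutorial. Verify them against the current API documentation before use.
API_BASE = "https://omabrowser.org/api"

# Hypothetical mapping from the browser panels described above to API paths.
PANEL_ENDPOINTS = {
    "general": "",
    "xrefs": "xref/",
    "domains": "domains/",
    "gene_ontology": "ontology/",
}


def protein_url(entry_id, panel="general"):
    """Build the URL for one information panel of a protein entry."""
    return f"{API_BASE}/protein/{quote(entry_id)}/{PANEL_ENDPOINTS[panel]}"


for panel in PANEL_ENDPOINTS:
    print(panel, protein_url("SOLLC07325", panel))
```

Each URL can then be fetched with any HTTP client; the responses are expected to be JSON.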

Guide 3: How to get the orthologs of Your Favorite Gene in OMA

Once you have found YFG in OMA, you want to retrieve the orthologs. There are several ways to do this. Keep in mind that OMA reports three “types” of orthologs: pairwise-induced orthologs, Hierarchical Orthologous Groups (HOGs), and OMA Groups. These are three different methods, so they might not report identical orthologs. However, HOG and OMA Group orthologs are based on pairwise orthologs so they should be similar. For more information, see: https://omabrowser.org/oma/type/.

Differences between pairwise orthologs, HOGs, and OMA Groups

	Pairwise orthologs	Hierarchical Orthologous Groups (HOGs)	OMA Groups
Algorithm	Built from mutually closest protein sequences within a confidence interval	Built by merging groups of pairwise orthologs at different taxonomic levels using a guide tree	Built by searching for cliques of pairwise orthologs (i.e. all genes that are pairwise orthologs to all others in the group)
Genomes included	Compares 2 genomes at a time	Compares all genomes at once	Compares all genomes at once
Types of homologs	Strictly orthologs, but can be 1:m or m:m	Groups of orthologs and in-paralogs	Strictly orthologs, at most 1 per species reported, although there may be more that are not reported
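The clique criterion for OMA Groups in the table above can be made concrete with a short Python sketch: a set of genes qualifies only if every pair in it is a pairwise ortholog pair. The gene names are hypothetical, and this checks a candidate group rather than reproducing how OMA searches for cliques.

```python
from itertools import combinations


def is_clique(genes, ortholog_pairs):
    """True if every pair of genes is listed among the pairwise orthologs."""
    pairs = {frozenset(pair) for pair in ortholog_pairs}
    return all(frozenset((a, b)) in pairs for a, b in combinations(genes, 2))


# Hypothetical pairwise orthologs between species A, B and C.
# C2 is an ortholog of A1 only, so it cannot join a group containing B1.
pairwise = [("A1", "B1"), ("A1", "C1"), ("B1", "C1"), ("A1", "C2")]

print(is_clique(["A1", "B1", "C1"], pairwise))  # True
print(is_clique(["A1", "B1", "C2"], pairwise))  # False
```

This also shows why an OMA Group can miss genes that appear in the pairwise ortholog list: C2 is an ortholog of A1 but is left out of the group.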

We will discuss the concept of HOGs in more detail below, but meanwhile you may want to check out this 4-min YouTube video, in which we provide a gentle introduction to the concept: https://www.youtube.com/watch?v=5p5x5gxzhZA

Example:

SOLLC07325, the DELLA protein in tomato from Guide 1 Example A.

1. Get orthologs via pairwise orthologs method.

Click on the ortholog tab to list all pairwise orthologs.

This takes you to the pairwise ortholog table, which gives the following columns:

Relation: the cardinality of the orthologous relationship. 1:n means that the ortholog has been duplicated in the other species. For example, this gene is a 1:n ortholog with respect to Populus trichocarpa because there are two copies in P. trichocarpa.

Domain: the domain of life (E for eukaryote)
Protein ID: The OMA identifier
Cross reference: usually UniProt IDs
Domain architecture: Visual representation of the CATH domains

Mouse over or click on the domains to get more information.
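The Relation column can be read as a simple function of how many co-orthologous copies exist on each side. The sketch below spells that out; the function name and its inputs are illustrative assumptions, since OMA derives the relation from its own orthology inference rather than from raw counts supplied by the user.

```python
def relation_cardinality(copies_in_query_species, copies_in_other_species):
    """Label an orthologous relationship as 1:1, 1:n, m:1 or m:n."""
    if copies_in_query_species < 1 or copies_in_other_species < 1:
        raise ValueError("each side needs at least one gene")
    left = "1" if copies_in_query_species == 1 else "m"
    right = "1" if copies_in_other_species == 1 else "n"
    return f"{left}:{right}"


# One tomato gene with two co-orthologs in Populus trichocarpa.
print(relation_cardinality(1, 2))  # 1:n
print(relation_cardinality(1, 1))  # 1:1
```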

2. Get the sequences of the pairwise orthologs.

Click on download fasta.
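Once downloaded, the file can be read with any FASTA parser. As a minimal sketch, the following Python function reads FASTA text into a dictionary keyed by the first word of each header. The header layout and the sequences in the example are placeholders, not real OMA output.

```python
def read_fasta(text):
    """Parse FASTA text into {first word of header: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Placeholder content; real headers and sequences will differ.
example = """>SOLLC07325 placeholder description
MKR
DHH
>OTHER00001
MAA
"""
sequences = read_fasta(example)
print(sequences["SOLLC07325"])  # MKRDHH
```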

3. Get orthologs via HOG method.

HOGs contain genes which descended from a common ancestral gene at a particular taxonomic level. Click on the table viewer subtab of the Hierarchical Orthologous Groups tab. Directly underneath the tabs you can see the different taxonomic levels for which this gene family has HOGs.

After clicking on the taxonomic range you are interested in, OMA displays the number of genes in the HOG at that level. The HOG table displays the Domain, Taxon (species), Protein ID, relevant Cross references, and Domain Architectures. Any of the columns can be sorted or removed from the display.

You can download the OrthoXML, the species tree (in PhyloXML format), the FASTA sequences, or a computed multiple sequence alignment of the HOG members. Additionally, you can export the data in the table as CSV or in another format for later use.

Guide 4: How to get the evolutionary history of Your Favorite Gene in OMA

Example 2: Arabidopsis thaliana gene Q9SRX4 which is involved in pectin production.

Because this section makes extensive use of the concept of Hierarchical Orthologous Groups (HOGs), if you haven’t watched the brief introductory video on YouTube, we recommend that you do so now: https://www.youtube.com/watch?v=5p5x5gxzhZA

1. Find YFG in OMA and click on the Hierarchical Orthologous Groups tab.

Tip 1: Select the node corresponding to the taxonomic level of interest, and click “Freeze tree node”.
Tip 2: Drag and drop the thick lines separating the tree from the boxes to adjust the panel width.

iHam is an interactive visualization tool for OMA. Each box is an extant gene. Our gene of interest is highlighted in green. Hover the mouse over it to get the ID and more information.

We can see that at the Pentapetalae level there were 4 ancestral genes, delineated by the vertical lines.

Tip: Use the scroll bar on the bottom to view extra-large HOGs.

Each group of genes between the lines descended from one of those 4 ancestral genes. Thus, at this taxonomic level, we have 4 HOGs, each containing a cluster of extant genes.

2. Manually curate your gene family.

iHam displays the orthologs and paralogs predicted by OMA. However, some HOGs can be poorly supported. You can manually curate the tree by removing HOGs below a certain threshold of species coverage. In the above example, both species of Musa acuminata are the only species in one of the HOGs at the commelinids level. This represents either a loss at the Poaceae level or a problem with the orthology inference, often due to imperfect genome data. OMA therefore allows you to filter out columns (HOGs/putative ancestral genes) based on species coverage.

Tip: Click on Options and choose the threshold of minimum percent species coverage for a given HOG.
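The effect of the coverage threshold can be sketched in Python as follows. This assumes that coverage means the percentage of species under the selected taxonomic level that have at least one gene in the HOG; that definition, the function name, and the species and gene names are all assumptions made for illustration.

```python
def filter_hogs_by_coverage(hogs, species_at_level, min_percent):
    """Keep HOGs whose species coverage is at least min_percent.

    hogs maps a HOG label to a list of (species, gene) tuples.
    Returns {label: coverage in percent} for the HOGs that pass.
    """
    total = len(species_at_level)
    kept = {}
    for label, members in hogs.items():
        present = {sp for sp, _gene in members if sp in species_at_level}
        coverage = 100.0 * len(present) / total
        if coverage >= min_percent:
            kept[label] = coverage
    return kept


# Hypothetical taxonomic level with four species and two HOGs.
species = {"sp1", "sp2", "sp3", "sp4"}
hogs = {
    "hog_a": [("sp1", "g1"), ("sp2", "g2"), ("sp3", "g3"), ("sp3", "g4")],
    "hog_b": [("sp4", "g5"), ("sp4", "g6")],  # two genes, but only one species
}
print(filter_hogs_by_coverage(hogs, species, 50))  # {'hog_a': 75.0}
```

Note that hog_b is dropped even though it has two genes, because coverage counts species rather than genes.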

Additionally, in HOGvis you can color the genes according to either the protein length or GC content.

Click on Options -> Select Color Scheme to scale the color of each gene based on the GC content or protein length. Here we observe the clear separation between monocots and dicots in percentage of GC content.

3. Investigate when duplication and loss events took place.

Because of the interactivity of HOGvis, traversing along the species tree allows us to infer where OMA predicts duplications and losses of ancestral genes.

In our example, we can see all the genes in our modern extant species which came from a single ancestral gene in the ancient Mesangiosperm.

The number of duplicated genes can be inferred by the number of genes in the HOG.

For example, there are 4 genes in soybean (Glycine max); 3 of these came from duplications which happened after the Mesangiosperm speciation.
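The arithmetic behind this reading is simply copies minus one per species, as the following Python sketch shows. It gives a lower bound only, because duplications followed by losses leave no extra copies to count, and the gene names are hypothetical.

```python
from collections import Counter


def minimum_duplications(hog_members):
    """Per species, the fewest duplications needed to explain its gene count.

    hog_members is a list of (species, gene) tuples from a single HOG, i.e.
    genes that all descend from one ancestral gene at the chosen level.
    """
    copies = Counter(species for species, _gene in hog_members)
    return {species: count - 1 for species, count in copies.items()}


# Hypothetical HOG: four soybean genes and one gene from another species.
members = [
    ("Glycine max", "gm_1"),
    ("Glycine max", "gm_2"),
    ("Glycine max", "gm_3"),
    ("Glycine max", "gm_4"),
    ("species_x", "x_1"),
]
print(minimum_duplications(members))  # {'Glycine max': 3, 'species_x': 0}
```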

Tip 1: Vertical bars separate HOGs, and each HOG represents 1 ancestral gene at the chosen taxonomic level.
Tip 2: Polyploid species like Triticum aestivum or Glycine max tend to have multiples of their ploidy level as number of genes.

We can also see that the genes descending from ancestral genes #2 and #3 have probably been lost in the monocots (at the commelinids taxonomic level). In fact, the dicots in general have many more copies of this gene than the monocots. This makes sense: the literature describes how dicots have different cell wall structures from commelinoid monocots, which are known to be pectin-poor!

4. Explore different taxonomic levels

Dynamically interact with HOGvis by clicking on the different nodes.

In our example, B. napus has undergone a whole-genome duplication, hence the near-systematic doubling of B. oleracea genes within every HOG.

That concludes the 4-part “OMA How-To” guides. Is there something you’d like to see covered in a How-To? Email the OMA team at: contact@omabrowser.org.

More info:

Tutorial on retrieving data via the REST API: https://zoo.cs.ucl.ac.uk/tutorials/rest_api_tutorial.html
OMA 2018 Paper: https://academic.oup.com/nar/article/46/D1/D477/4584623
