ePath Essential Gene Prediction and Database
Dr. Xu's lab

Information on the ePath database:

1. Essential genes are those genes that are critical for an organism's survival. Identifying and predicting the essential genes of a given organism is therefore of great importance, particularly for understanding gene function and evolutionary history and for selecting new drug targets. The ePath database was developed to annotate essential genes in 4000+ prokaryotic strains. With ePath, we aim to provide a comprehensive reference for gene essentiality prediction, together with an easily accessible online search tool, and thereby to facilitate the study of organisms for which no gene essentiality information is available. The essentiality of genes is annotated based on two sets of information (Fig. 1).

Figure 1. Overview of the ePath database.

2. The first part is the gene function annotation drawn from various databases, including KEGG, Gene Ontology (GO), and Clusters of Orthologous Groups (COGs). For any given gene, we collect the corresponding KEGG ortholog (KO) and link the KO to a group of annotations: the KEGG KO annotation, KEGG pathway/module/reaction annotations, GO, and COGs. We then score the essentiality of the gene using all available annotation information, primarily on the principle that a gene should be essential if it plays a role in genetic information processing, cell envelope maintenance, or energy production. The second part of the annotation is based on existing experimental results on gene essentiality. We collected 30 strains listed in the Database of Essential Genes (DEG), linked all the experimentally essential genes to KOs where possible, and summarized the essentiality frequency of these KOs across the 30 strains. In particular, we identified every gene in the 30 strains whose projection on the KEGG metabolic pathway map (ko01100) has experimentally essential genes as neighbors on both sides, and considered these genes ‘gap’ essential genes, that is, false negatives in experiments caused by paralogs or isozymes (Fig. 2, edges 2-3 and 6-7).

Figure 2. An example showing how 'gap' essential genes are identified using metabolic pathways. Edges in red represent experimentally identified essential genes, while edges in blue represent non-essential genes. Nodes are chemical compounds. The linkage matrix is also shown; its highlighted entries trace a depth-first search (DFS) over the topology of the map, starting from Node #1.
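The 'gap' rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ePath implementation: the function name `find_gap_edges`, the edge-dictionary layout, and the toy map are all hypothetical, and the toy map does not reproduce Figure 2. An edge is reported as a gap when it is non-essential and each of its two end nodes also touches some other, essential edge; the map is walked depth-first from a start node.

```python
from collections import defaultdict


def edge_key(u, v):
    """Return an order-independent key for the undirected edge u-v."""
    return (u, v) if u <= v else (v, u)


def find_gap_edges(edges, start):
    """Walk the map depth-first from `start` and return the 'gap' edges.

    `edges` maps edge_key(u, v) -> True if the gene on that edge is
    experimentally essential, False otherwise (assumed layout).
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    def has_other_essential_edge(node, excluded):
        # True if `node` touches an essential edge other than `excluded`.
        return any(
            edges[edge_key(node, other)]
            for other in adjacency[node]
            if edge_key(node, other) != excluded
        )

    gaps = []
    seen_nodes = {start}
    seen_edges = set()
    stack = [start]  # an explicit stack gives an iterative depth-first search
    while stack:
        node = stack.pop()
        for other in sorted(adjacency[node]):
            edge = edge_key(node, other)
            if edge in seen_edges:
                continue
            seen_edges.add(edge)
            if (
                not edges[edge]
                and has_other_essential_edge(node, edge)
                and has_other_essential_edge(other, edge)
            ):
                gaps.append(edge)
            if other not in seen_nodes:
                seen_nodes.add(other)
                stack.append(other)
    return sorted(gaps)


# Hypothetical linear map 1-2-3-4-5; True marks an essential edge.
toy_edges = {
    edge_key(1, 2): True,
    edge_key(2, 3): False,  # essential neighbors on both sides
    edge_key(3, 4): True,
    edge_key(4, 5): False,  # essential neighbor on one side only
}
gap_edges = find_gap_edges(toy_edges, start=1)
```

In this toy map only edge 2-3 qualifies. Edge 4-5 is left out because node 5 touches no essential edge.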

3. For a given gene in a strain, we then score its essentiality based on the essentiality of its orthologs in the 30 strains (see Fig. 3 for E. coli as an example). Finally, the essentiality of any given gene is predicted from the two scores described above: one from annotation and one from experimental inference.
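As a rough illustration of the ortholog-based score, the essentiality frequency of a KO can be computed as the fraction of reference strains in which that KO is experimentally essential. This is a minimal sketch under assumptions: the function name `ko_essentiality_frequency`, the input layout, and the strain and KO labels are hypothetical placeholders, and dividing by the total number of strains (rather than only the strains that carry the KO) is our assumption, not a documented ePath rule.

```python
from collections import Counter


def ko_essentiality_frequency(essential_kos_by_strain):
    """Map each KO to the fraction of strains in which it is essential.

    `essential_kos_by_strain` maps a strain name to the collection of KOs
    that were experimentally essential in that strain (assumed layout).
    """
    n_strains = len(essential_kos_by_strain)
    if n_strains == 0:
        return {}
    # set(kos) makes sure each KO is counted at most once per strain.
    counts = Counter(
        ko for kos in essential_kos_by_strain.values() for ko in set(kos)
    )
    return {ko: count / n_strains for ko, count in counts.items()}


# Placeholder strains and KO labels, not real DEG data.
toy_strains = {
    "strain_1": {"KO_A", "KO_B"},
    "strain_2": {"KO_A"},
    "strain_3": {"KO_A", "KO_B", "KO_C"},
}
frequencies = ko_essentiality_frequency(toy_strains)
```

A gene whose KO is essential in most reference strains would receive a high experimental score under this scheme; how that score is weighted against the annotation score is not specified here.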

Figure 3. Metabolic pathway diagram of E. coli (eco01100 in the KEGG pathway database (Kanehisa and Goto 2000)) with gene essentiality information. Edges in red represent the essential genes (EGs) identified by experiment (Baba et al. 2006). Edges in blue represent the missing EGs identified by the ‘remapping’ algorithm in this study. Edges in black represent the non-EGs. The original metabolic pathway map from KEGG (Kanehisa and Goto 2000) is used with KEGG copyright permission number 190025.

References:
Baba, T., T. Ara, M. Hasegawa, Y. Takai, Y. Okumura, M. Baba, K. A. Datsenko, M. Tomita, B. L. Wanner, and H. Mori. 2006. Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Molecular systems biology 2:2006.0008.
Kanehisa, M., and S. Goto. 2000. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic acids research 28:27-30.