Combrex.org                                       
 


Project Description

RESEARCH DESIGN AND METHODS

Research Area 

   This project integrates 30+ years of transformative computational biology work on gene function prediction with the crucial need to test such computational predictions in the laboratory, drawing on specialized but highly effective expertise in biochemical labs or on novel biotechnology.

    The project will continue to evolve as the community grows and ideas multiply. The novel funding scheme used in the project was proposed by Rich Roberts in his "Call for Community Action" in 2004.


    The full conceptual details of the NIH project, including the computational framework, were conceived and outlined by Simon Kasif and Rich Roberts. Their ideas were influenced by the participants in the workshop organized by the American Academy for Microbiology.

    The computational ideas proposed were influenced by the computational prediction methodologies from the labs of David Eisenberg, David Lipman, Bernhard Palsson and Steven Brenner, by databases such as STRING (Peer Bork Lab), COG (Koonin), CDD (Steve Bryant), PFAM (Alex Bateman), SEED (Overbeek/Stevens) and PREDICTOME/VISANT (Charles DeLisi Lab), and by many other researchers.

The computational framework in the NIH proposal relied on two key concepts:

a) Probabilistic Functional Linkage Graphs proposed by Letovsky/Kasif in 2003.

b) The Bayesian network methodology for integrating predictions from multiple biological data sources, proposed by V. Pavlovic and Simon Kasif (2000) and first demonstrated for bacterial function prediction in 2001/2002 by Yu Zheng working with Rich Roberts and Simon Kasif.
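
To make the idea of integrating several data sources concrete, the sketch below combines independent lines of evidence into one posterior probability that two genes are functionally linked. It is a simplified naive Bayes illustration, not the Bayesian network or linkage-graph method of the cited work. The function name, the evidence source names and every number are hypothetical.

```python
from math import prod


def combine_evidence(prior, likelihood_ratios):
    """Posterior probability of a functional link between two genes.

    Assumes the evidence sources are conditionally independent (naive Bayes),
    which is a simplification made only for this illustration.
    Each likelihood ratio is P(evidence | linked) / P(evidence | not linked).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    # Under independence, each source simply multiplies the odds.
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical likelihood ratios for three evidence sources.
evidence = {
    "gene_neighborhood": 8.0,
    "phylogenetic_profile": 3.0,
    "coexpression": 1.5,
}

posterior = combine_evidence(prior=0.01, likelihood_ratios=evidence.values())
print(f"posterior probability of linkage: {posterior:.3f}")
```

Working in odds keeps the update multiplicative, so a weak source (ratio near 1) barely moves the estimate while several moderately informative sources can together lift a low prior substantially.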

   
This proposal aims to build an interdisciplinary community that enables a direct exchange of functional predictions between experimental and computational scientists, driving rounds and cascades of predictions and experimental validations.
  

    The proposed new model of collaborative research is motivated in part, and in spirit, by eBay, where products are posted in a public forum and consumers are allowed to bid on them. In our case the “products” are functional predictions produced by a wide range of computational methodologies. The “bids” (in the form of short pilot project proposals) are placed by biochemists.  Our model is naturally more conservative and was originally called COMBREX.

    Later we decided to rename it as COMBREX (Computational Bridge to Experiments).
  

    In this model the computational predictions are stored on COMBREX. A prediction can take a number of forms, formalized by the collaborative interchange model to be established by the consortium. A prediction associates a particular biochemical function with a protein family; such a function could be, for example, a partial enzymatic descriptor, RNA binding, amino-peptidase, cellulase, a specific cytochrome, methyltransferase or a specific transporter.  For each gene, carefully and transparently integrated predictions can be used to stimulate and direct bids for experimental validation by the groups most qualified to perform the experiments. Both the experimental groups and the computational groups are evaluated according to their potential to deliver, based on appropriate criteria such as statistical evaluation of the prediction, prior record, cost of a proposed functional assay, priority of the gene family and peer review.


    We originally considered focusing only on the E. coli genes, as it seemed more reasonable to target the few hundred or a thousand genes of a single microorganism. However, this would preclude the discovery of novel products such as the telomerase first identified in Tetrahymena, reverse transcriptase found in a retrovirus, and Taq polymerase found in the thermophile Thermus aquaticus.


    Since current databases are organized into protein clusters, the annotation of 100 proteins will have implications for many more: annotate one of these proteins with experimental evidence, and inferences can be made for thousands of proteins. The impact of any given experimental annotation will be one of the most heavily weighted criteria for prioritizing potential experimental validations. We will provide an extensive set of rankings for every gene family that demonstrate its broad significance. Among these ranking criteria, whenever possible, we will focus on the genes from Helicobacter pylori, Mycobacterium tuberculosis, Pseudomonas aeruginosa and the model organisms E. coli and B. subtilis, but only when doing so will shed light on gene function in diverse taxonomic phyla - a situation we anticipate occurring with regularity.  We will place particular emphasis on the interaction between computational and wet-lab experimental investigators, because cooperation between these two groups can serve to improve both prediction and validation methodologies.
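
As an illustration of impact-based prioritization, the sketch below ranks gene families by how many proteins would inherit an annotation and how many phyla they span, with extra weight for the focus organisms only when the family is taxonomically diverse. The scoring formula, the weight of 2, the field names and the example families are all hypothetical and are not the project's actual ranking criteria.

```python
from dataclasses import dataclass

# Focus organisms named in the text above.
PRIORITY_ORGANISMS = frozenset({
    "Helicobacter pylori",
    "Mycobacterium tuberculosis",
    "Pseudomonas aeruginosa",
    "Escherichia coli",
    "Bacillus subtilis",
})


@dataclass(frozen=True)
class GeneFamily:
    name: str
    cluster_size: int      # proteins that would inherit the annotation
    phyla_covered: int     # number of taxonomic phyla represented
    organisms: frozenset   # organisms with a member gene


def impact_score(family):
    """Hypothetical score: annotation reach times taxonomic breadth."""
    score = family.cluster_size * family.phyla_covered
    # Boost focus organisms only when the family spans several phyla.
    if family.organisms & PRIORITY_ORGANISMS and family.phyla_covered > 1:
        score *= 2
    return score


# Made-up example families.
families = [
    GeneFamily("fam_A", 1200, 5, frozenset({"Escherichia coli"})),
    GeneFamily("fam_B", 3000, 1, frozenset({"Mycobacterium tuberculosis"})),
    GeneFamily("fam_C", 900, 8, frozenset({"Thermus aquaticus"})),
]

ranked = sorted(families, key=impact_score, reverse=True)
for family in ranked:
    print(family.name, impact_score(family))
```

In this made-up example the large single-phylum family ranks last despite containing a pathogen gene, which mirrors the condition that focus organisms are favored only when they shed light on diverse phyla.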

   
   We anticipate that COMBREX will necessarily grow in two stages, reflecting the very different requirements of creating a new community and of sustaining a mature one. Initially, there needs to be greater emphasis on recruiting participants and publicizing the project's operations. This proposal covers the first stage of COMBREX, which will last for the entire two-year funding period. Our vision for the second stage is detailed in Section 5.5, the Long-Term Sustainability Plan. Key to this initial phase was the identification of a large number of skilled biochemists with diverse expertise who are willing to be “founding participants” in this venture, dependent upon the identification of a high-quality prediction within their range of expertise (Table 3). Any lab engaged in this initial period will be granted the equivalent of a small subcontract from us to cover the cost of a technician for a one-year period. This technician will then attempt to experimentally validate the prediction. Brief progress reports will be submitted to the Executive Committee at three and six months to ensure that the funded positions are addressing the agreed-upon experimental task. To bring symmetry to the project, we seeded the computational team with five seasoned computational biologists who will populate the database of initial predictions. We also plan to award five small pilots to computational teams that forge collaborations with biochemists working in the consortium and promise to significantly improve the quality of predictions in a specific functional area that can be validated during the second year of the project. The project will be led by an empowered and broadly based Executive Committee that will include scientists with experience in biochemistry, computation, software resources and community outreach.

5.2 Opportunity and Potential Impact

   The primary goal of this proposal is to increase the pace of experimental discovery of the function of large and high-priority gene families in bacterial genomes. Specifically, we propose to catalyze the formation of a consortium of experimental and computational biologists who would collaborate directly to experimentally determine the function of high-priority genes of unknown function or specificity. Central to this effort would be the creation of a community web-based database (portal) that would allow computational and experimental scientists to communicate easily, assist experimentalists in identifying those high-priority genes for which there are the highest-quality computational predictions of molecular function, and provide feedback to computational biologists, since the insights and experience of the dedicated biochemist can be essential in guiding the development of algorithmic sophistication. Experimental validations of gene function would be reported in a manuscript when successful, or as annotations in the prediction database when negative. Many existing labs, both large and small, have the relevant expertise and could contribute to the overall effort by performing the pertinent gene function determination studies. The challenge thus becomes one of establishing a framework for coordinating efforts and incentivizing participation.
The potential impacts are far-reaching, affecting: the understanding of the pathogenesis of disease and the stimulation of immunological and biodefense strategies; the production of biofuels for energy consumption; the environmental remediation of xenobiotics and pollutants; the resistance of crops to pests and droughts, and the engineering of enhanced nutritive value; the fixation of carbon dioxide; and the reduction of greenhouse gases.


   This project will provide fundamental knowledge for many aspects of infectious disease research.  Unknown genes from several important pathogens such as Mycobacterium tuberculosis and Helicobacter pylori will be high-priority targets, as will genes in model organisms with orthologous genes that are widely distributed in both bacterial pathogens and higher organisms, including humans. A key factor in deciding priorities will be the health implications of successful predictions.

Aim 1 – Function prediction

Details are available by request from the computational team

Aim 2 – Prioritization

Details are available by request from the computational team

Aim 3 – Database, Web Server, and User Interface for Interactive Exploration of Prediction Foundations

Details are available by request from the staff.

Aim 4 – Experimental Validation and Testing of Novel Predictions

   The key feature of our approach is that once the database begins to contain high-quality functional predictions, they will be available for experimental testing by expert biochemists who are already working in an area germane to the prediction.  For instance, if a strong prediction was made that a particular ORF encoded a DNA repair enzyme, such as a potentially novel DNA glycosylase, then Prof. Leona Samson’s laboratory at MIT would be an excellent place to test it explicitly.  Prof. Samson’s laboratory has most of the known substrates and so could quickly determine whether or not it really was new.  Similarly, if a strong prediction was made that an ORF encoded a new type of transporter, then Prof. Milton Saier’s laboratory at UCSD would be an ideal place to try to track down its specificity. Again, Prof. Saier’s laboratory has extensive experience in the field and, most importantly, has the necessary biochemical expertise in terms of assays and substrates that are in constant use, making it relatively easy to test the new prediction.  Both Profs. Samson and Saier have expressed great enthusiasm to work with us to validate such predictions experimentally.
    We have already contacted many other researchers who universally expressed enthusiasm for this project. Among them we have included 24, listed in Table 3, who have specifically agreed to bid on experimentally validating predictions in areas in which they have specific expertise.

 

Table 3: Experimental biochemists who have been contacted. All but the last two have already enthusiastically agreed to participate in the project. Dr. Karen Allen is not listed but agreed to validate Aldolase Isozymes.


    Our approach to recruiting experimentalists to test predictions in other areas will be to contact key laboratories directly and encourage them to bid. We believe that initially a proactive approach will be the most successful in finding the very best collaborators.  However, as time passes and it becomes widely known within the community that this project is underway, we anticipate that potential collaborators will contact us directly.  In this regard we will advertise the project in as many venues as we can.  A letter to Nature and/or Science, as well as other major journals, will be written, and we will advertise the project at major scientific meetings.  We feel that if this grant is funded it will provide an opportunity to build momentum within the biological community for a major commitment to tackle this important problem.  Judging by the enthusiasm with which our initial outreach has been received, we feel there is a unique opportunity to harness the biological community in an exciting collaborative endeavor that will be widely viewed as groundbreaking.
    One important aspect of our approach is that rather than identifying laboratories to test predictions ahead of time, we plan on encouraging laboratories interested in participating, including those listed in Table 1, to submit a short plan to us indicating how they would tackle the problem.  This might include a short description of their special expertise and capabilities in a given area, or it may consist of the application of their specialized knowledge in a field to refine the prediction and hence limit the scope of testing that would be needed.  This short plan, which we call a “bid”, would be reviewed by the evaluation committee, which will be chaired by Dr. Roberts and will consist of investigators recruited specifically for this purpose.  For instance, Drs. Gerlt, Osterman and Wanner, who are submitting related proposals, would be included to ensure that anything funded through this initiative would not be doubly funded through monies they might control.  This evaluation committee would provide its input electronically, and the logistic coordination of its activities would be under the control of Dr. Steffen.  We anticipate soliciting bids from interested individuals by making the predictions freely available through our web site and also by direct email to Principal Investigators who sign up for such alerts.  Initially, we would allow a short period of 2-3 weeks after a prediction became live for bids to come in and would then anticipate making a decision no more than 1-2 weeks after that.  As knowledge of this possibility grows within the community, we might have to readjust the schedules, but we would be guided by past experience and by advice from our Advisory Board. We plan to make the review process of both computational predictions and experimental bids for validations as transparent and simple as possible and largely independent of each other.
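
The bidding timeline described above can be sketched as simple date arithmetic. The snippet is a minimal illustration that takes the upper end of each stated range (3 weeks for bids, 2 weeks for the decision). The function name and the go-live date are hypothetical.

```python
from datetime import date, timedelta

# Upper ends of the ranges stated in the text; both would be tunable.
BID_WINDOW = timedelta(weeks=3)       # bids accepted for 2-3 weeks
DECISION_WINDOW = timedelta(weeks=2)  # decision within 1-2 weeks after that


def bid_schedule(go_live):
    """Return (bids_close, decision_due) for a prediction posted on go_live."""
    bids_close = go_live + BID_WINDOW
    decision_due = bids_close + DECISION_WINDOW
    return bids_close, decision_due


# Hypothetical go-live date, for illustration only.
bids_close, decision_due = bid_schedule(date(2010, 3, 1))
print("bids close:", bids_close.isoformat())
print("decision due:", decision_due.isoformat())
```

Keeping the two windows as named constants makes it easy to readjust the schedule later, as the text anticipates.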


    An annual workshop will be hosted in conjunction with an existing key meeting (Microbial Genomics) to enable direct interaction and cross-education between experimental and computational scientists. Tutorials and workshops in both experimental methods and computation will be provided. Computational biologists will need to understand the validation methodologies, because many current annotation schemes, e.g., GO, do not lend their predictions to a direct functional assay.



1.    Roberts, R.J. Identifying protein function--a call for community action. PLoS Biol 2, E42 (2004).
2.    Roberts, R.J., Karp, P., Kasif, S. & Kim, S. An Experimental Approach to Gene Function, Executive Report. American Academy for Microbiology. (2004).