New Data Driven Research Paradigm for the Post Genome Era:
A Transcription Regulation Example

Charles (Chip) Lawrence


Wadsworth Center, New York State Department of Health
ESP-P.O. Box 509, Albany, New York 12201-0509, USA
In the traditional hypothesis-driven biomedical research paradigm, the design of experiments to address specific prior hypotheses takes center stage. Because the goal has been to design experiments that give crisp answers, many biomedical scientists have believed that "If your experiment needs statistics, you ought to have done a better experiment" (Ernest Rutherford). But with the genome era has come a large mass of fundamental biological data gathered without prior hypothesis, spawning a new "data-driven" paradigm. The coupling of this paradigm with high-throughput experiments stemming from genomic-scale data analytic studies offers the promise of greatly accelerating biomedical research. Studies of transcription regulation in E. coli provide one example of the potential of this approach. Elucidating the transcription regulatory networks of species is a grand challenge of the post-genomic era. Toward that end, we have recently applied Bayesian algorithms with the goal of locating regulatory sites in non-coding sequence via cross-species genome sequence comparison in proteobacteria (McCue et al., NAR, 2001). Application of these technologies to a study set of 184 E. coli genes with documented transcription regulatory sites revealed that 81% of our predictions correspond with the documented sites. That the remaining predictions included bona fide transcription factor (TF) binding sites was proven by affinity purification of a putative transcription factor (YijC) bound to predicted but undocumented sites upstream of the fabA, fabB, and yqfA genes. Through application to the complete set of intergenic regions in E. coli, regulatory sites for 2097 genes were predicted and are available at http://www.wadsworth.org/resnres/bioinfo/. These sites represent a set of testable hypotheses. The challenge now is to scale up validation from three oligomers (fabA, fabB, and yqfA) to thousands.
The emergence of syntenic sequence from multiple vertebrate species offers similar opportunities to generate testable hypotheses via cross-species comparison (Wasserman et al., Nature Genetics, 2000).