Research highlights

The research focus of the Hwa lab has been primarily at the interface of statistical physics and molecular biology. Ideas and methods of statistical physics bring a unique perspective towards making biology quantitative. At the same time, these biological problems stand to challenge and enrich the frontiers of statistical physics. Our theoretical/computational studies have been on the following topics:
I. The development of new bioinformatic methods and tools for various genomic and evolutionary studies
II. Characterization of complex interaction between biopolymers
III. Formulating systemic approaches to bio-molecular networks
IV. Exploring the nonequilibrium dynamics of molecular evolution.
Recently, we have also started a molecular biology lab to evolve novel gene regulatory sequences, signaling pathways, and genetic circuits. The experimental studies provide insight into the evolvability of the information processing capability of biological organisms. They also provide a large amount of data that can lead to the discovery of novel mechanisms of signaling and control in cells. The evolved control elements have numerous potential applications in bioengineering and biomedicine.

Theoretical/computational studies

I. Genomics and bioinformatics:

This section addresses the development of methods and tools to analyze biological data. Statistical physics is involved in the construction of the methods and the statistical evaluation of the results. Most of the methods described are geared towards addressing specific biological questions, in particular the natural history of genomic evolution and the mechanism of gene regulation, two topics which will be discussed in more detail in Sections III and IV.

a) Sequence matching statistics:
My contact with bioinformatics started with article A28, in which I noted the connection between the “sequence matching” problem ubiquitously encountered in bioinformatics and the directed polymer problem widely studied in statistical physics. A number of ensuing articles (A39, A40, A41, A42, A46, A48, A49) exploited this connection to devise more efficient approaches to perform, characterize, and evaluate the statistics of sequence alignment, culminating in a new “hybrid algorithm” (A51, A54, A55) which retains the speed advantage of the existing alignment algorithms while providing well-characterized score statistics for arbitrary gap penalty parameters. The latter feature allows for the real-time adjustment of position-dependent gap penalties in sequence similarity search, and serves as the anchor of the next-generation BLAST program currently being developed by the group of Stephen Altschul at NIH.
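For readers unfamiliar with the sequence matching problem, the following is a minimal sketch of gapped local alignment scoring by dynamic programming. It is the standard textbook (Smith-Waterman) recursion with a linear gap cost, not the hybrid algorithm of A51, A54, A55; the function name and the scoring parameters (`match`, `mismatch`, `gap`) are illustrative assumptions. The recursion makes the analogy to a directed path on a lattice visible: each cell extends the best path arriving from one of three neighboring cells.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b.

    Textbook Smith-Waterman recursion with a linear gap cost.
    The scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the lattice
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                 # start a new local alignment here
                prev[j - 1] + s,   # align a[i-1] with b[j-1]
                prev[j] + gap,     # gap in b
                curr[j - 1] + gap  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

For example, `local_alignment_score("TTACGTTT", "GGACGGG")` picks out the shared segment `ACG`. The statistics of `best` over random sequence pairs, as a function of the gap cost, is the quantity whose characterization is discussed above.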

Statistical physics also stands to benefit considerably from the above connection. Inspired by the statistical behavior of the global sequence alignment problem, we showed that elastic media subject to two independent sources of quenched disorder could be regarded as being embedded in one effective disordered medium in higher spatial dimensions. Applications include tribology (A30, A37), reptation (A38), and polymer adsorption (A35). The local alignment problem also has interesting ramifications in statistical physics, e.g., the nonequilibrium phase transition discussed in article A36 and the glassy dynamics of DNA denaturation bubbles in article B2 (see Section II.b). The latter article introduces the large-deviation theory pedagogically to the statistical physics community while also presenting the theory of disordered systems to the biophysics community; it provides one of the simplest examples of a disordered system whose glassy dynamics can be explicitly characterized.

b) Genomic substitution patterns: We developed a strategy (B6) to quantify the history of nucleotide substitutions in the human genome going far back in time (around 250 Myr) by exploiting the vast number of “repetitive elements” as fossil records. Our analysis (B5) revealed a number of entirely unexpected but biologically meaningful results, in particular the sudden and drastic change of substitutional biases at around the time of mammalian radiation, which collectively revise the accepted picture of mammalian genomic evolution. They also shed light on the evolution of genomic “isochores” (large regions of homogeneous GC-content), the existence of which has been one of the most puzzling features of the human genome. We are currently repeating the analysis in mouse to distinguish events common to mammals from those specific to primates. We are also performing region-specific analysis to characterize the different substitutional biases along the length of the chromosome. In addition to peeking back into the history of genomic evolution, our findings have direct bearing on comparative genomic analysis (e.g., the search for genes and DNA motifs via human/mouse comparison), the success of which depends crucially on the proper description of the evolution of the genomic background.

c) Context-dependent mutation: Existing mutation models describe the mutation of any given nucleotide as a reversible Markov process independent of its neighboring bases. However, this neglects context-dependent mutation processes, e.g., the CpG-methylation-deamination process which is the dominant channel of mutation in all vertebrates. These more complex processes do not satisfy detailed balance and are nonequilibrium in nature; they belong to a class of nonlinear stochastic dynamics problems well studied in statistical physics but unfamiliar to the bioinformatics community. We solved this problem (A59) by applying the “cluster approximation”, which turned out to be very efficient and accurate. A web server (C7) was constructed to quickly extract context-dependent mutation processes from raw genomic sequences. We are currently developing algorithms to reconstruct phylogenetic trees for the repetitive elements in the human genome and across the mammalian species. This effort should yield detailed information on the evolution of mammalian genomes, e.g., duplications and rearrangements. Incorporating the CpG-methylation-deamination process is crucial to our approach since the much faster substitution rate of this process provides excellent statistics unattainable otherwise.

d) Gene expression analysis: We have been collaborating with the laboratory of Bill Loomis (UCSD biology) to analyze large-scale gene expression data obtained from DNA micro-array experiments on the development of the social amoeba Dictyostelium discoideum. We developed two methods: The first is a generic method to cluster genes with similar expression patterns (A50). It is based on the physics of percolation; the probabilistic nature of the method accommodates experimental uncertainties and allows for “multi-parenthood”, making it superior to commonly used methods such as hierarchical clustering and the self-organizing map. The second method (A56) takes advantage of the temporal nature of the data from Loomis’ experiments. We assume a first-order kinetic process and work backward to deduce the onset and cessation of gene expression, thereby extracting a few vital numbers out of the vast raw data. The results can be straightforwardly (e.g., visually) analyzed. A pilot study involving 700 genes has already yielded valuable insights into the biology of Dictyostelium development (A53). Full genome-wide studies (including comparisons to hundreds of mutants) are now underway. We hope to extend our analysis to discover the gene-gene interactions underlying the complicated expression patterns.
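The idea behind the second method can be illustrated with a minimal sketch. The exact model of A56 is not reproduced here; we assume, purely for illustration, that a gene is transcribed at a constant rate between an onset time and a cessation time, and that the transcript decays with first-order kinetics. The function names, the parameters `rate` and `decay`, and the brute-force search over candidate times are all hypothetical choices.

```python
import math

def expression(t, t_on, t_off, rate=1.0, decay=0.5):
    """mRNA level at time t for constant synthesis on [t_on, t_off]
    and first-order decay, i.e. the solution of dm/dt = rate - decay*m
    while the gene is on and dm/dt = -decay*m after it switches off."""
    if t <= t_on:
        return 0.0
    plateau = rate / decay
    if t <= t_off:
        return plateau * (1.0 - math.exp(-decay * (t - t_on)))
    peak = plateau * (1.0 - math.exp(-decay * (t_off - t_on)))
    return peak * math.exp(-decay * (t - t_off))

def fit_on_off(times, values, candidates, rate=1.0, decay=0.5):
    """Work backward from a time series to (t_on, t_off) by least squares
    over a grid of candidate times (a deliberately simple search)."""
    best = None
    for t_on in candidates:
        for t_off in candidates:
            if t_off <= t_on:
                continue
            err = sum((expression(t, t_on, t_off, rate, decay) - v) ** 2
                      for t, v in zip(times, values))
            if best is None or err < best[0]:
                best = (err, t_on, t_off)
    return best[1], best[2]

# Synthetic time course sampled every 2 hours, generated with onset 4, cessation 12.
times = list(range(0, 25, 2))
observed = [expression(t, 4, 12) for t in times]
onset, cessation = fit_on_off(times, observed, times)
```

On this noise-free synthetic series the search recovers the onset and cessation times used to generate it. With real array data one would also have to fit `rate` and `decay` and account for measurement noise.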

e) DNA motif search: Much of the regulatory information is coded in the genome via the relative positions and strengths of the protein-binding DNA sequence motifs. We are analyzing the composition of the DNA motifs for orthologous regulatory regions across different species of bacteria to distinguish aspects which are essential to function from those due to stochastic fluctuations of the evolutionary dynamics (see also Section IV.a). We are also developing a method to detect DNA motifs that localize nucleosomes in eukaryotic genomes. Our approach is based on the biophysics of histone-DNA interaction, combined with the experimental knowledge of exemplary histone-binding DNA sequences. With this method, we wish to test the hypothesis that the positions of nucleosomes mediate indirect interaction between the regulatory proteins and hence can serve important regulatory functions (see Section III.a).

II. Molecular Biophysics

This section addresses the biophysics of biopolymers and their interactions. We introduce the concepts and methods of modern statistical physics to characterize the complex interactions between heterogeneous DNA and RNA sequences with each other and with proteins. The main goal of these studies is to obtain molecular information and provide biophysical constraints on the properties of biomolecules to be used in the system-level studies to be described in Section III. Another objective is to establish biopolymers as interesting model systems to the statistical physics community.

a) Single molecule biophysics: Rapid advances in the technology of single molecule manipulation (e.g., by optical tweezers and AFM) open up many possibilities for detailed experimental studies of biomolecules. One recent thrust pioneered by the groups of Bustamante and Tinoco at Berkeley is to pull a single RNA molecule from its two ends, and learn the structure of the molecule from the way it opens up (through the force-displacement characteristics). We developed a detailed theoretical model (articles A52, A62 and a web server C6) to describe this process by combining the popular RNA folding programs (MFOLD and the “Vienna package”) with the statistical mechanics of pulling. We describe features in the structure of the molecule that can be extracted from the force-displacement characteristics, and propose different experimental strategies to enhance the extractable features. Of special interest is our recent proposal (B9) to pull the RNA through a molecular pinhole. We expect the additional positional information on the pulling process to reveal a great deal more about the structure of the molecule than the experiments currently being carried out.

In addition to RNA, we are working with local biophysics experimentalists (the group of Doug Smith at UCSD) to probe protein-DNA binding by unzipping double-stranded DNA. The approach is based on the effect of DNA-binding proteins on the unzipping characteristics of double-stranded DNA, and the feasibility has been demonstrated recently by the group of Michelle Wang at Cornell. By designing appropriate DNA sequences, we hope to develop an efficient method to detect the binding sequences (and their strengths) for different regulatory proteins. By further placing multiple protein-binding sites on the same sequence, we hope to develop an efficient way of detecting protein-protein interaction.

b) Glassy properties of biopolymers: Heterogeneous biopolymers provide good examples of “disordered” systems, which have been of long-standing interest in statistical physics. Best known among these is the “protein-folding” problem, which is unfortunately too difficult to tackle either analytically or numerically. Over the years, I introduced a number of other biopolymeric problems to the statistical physics community. The simplest non-trivial problem is the localization of DNA denaturation bubbles (A34, B2): This system is a physical realization of the celebrated Random Energy Model. We showed the existence of an (inverted) glass transition, i.e., the denaturation bubble becomes localized to weak (AT-rich) regions of double-stranded DNA upon increase in temperature towards the bulk denaturation point. This phase transition is understood analytically in terms of the large-deviation theory developed in the context of local sequence alignment (see Section I.a). The theory can further be applied to describe dynamic properties of the denaturation bubble. We found the bubble to be sub-diffusive in the “glass” phase, with continuously varying dynamic exponents as predicted by the large-deviation theory.

Another problem of interest is that of RNA secondary structure formation. It has aspects similar to sequence alignment and directed polymers, but is more complex due to the nontrivial topology the RNA molecule can take on. In articles A57 and A58, we studied in detail the folding of random RNA sequences. By a combination of analytical and numerical calculations, we found the system to have a weak glass phase characterized by logarithmic energy barriers. In article A43, we introduced and solved a (Go-type) model to characterize the competition between the formation of native structure favored by designed RNA sequences and the formation of random structures due to configurational entropy.

c) Protein-DNA interaction: The binding of regulatory proteins (e.g., transcription factors) to their specific sequence targets on the genome plays an essential role in the process of gene regulation (see Section III.a below). What type of protein-DNA interaction allows the protein to find its true DNA target(s) without getting trapped at unavoidable pseudo sites contained in the genomic background? In article A61, we examined the thermodynamics and kinetics of this system quantitatively for bacteria, based on the energetic forms of exemplary systems obtained experimentally by von Hippel and Stormo. We found that in order for the protein not to be trapped in the genomic background, the protein-DNA binding energy must be cut off, i.e., become independent of DNA sequence, if the sequence is sufficiently different from the best binding sequence. The predicted cutoff energy is observed for the several transcription factors which have been quantitatively characterized. Together with results from Bob Sauer’s lab showing that the cutoff energy can easily be varied biochemically by a large amount with a few mutations of the protein sequence, this leads us to the hypothesis that the cutoff energy is functionally constrained to be of a specific value (dependent only on the length of the genome).

Given the proper choice of the cutoff energy which solves the kinetic search problem, the thermodynamics of binding to target sequences is determined by the composition of the sequence and the discrimination energy of the protein (i.e., the energetic preference for specific nucleotides at different positions of the binding sequence). A very important result derived from the study reported in A61 is that for a narrow range of discrimination energies (~2 kBT per base), the binding strength of a target sequence, say as measured by the protein concentration for 50% site occupation, can be tuned continuously throughout the typical protein concentration of 1~1000 molecules/cell, by selecting the number of bases of the target sequence matching the best binding sequence. The functional relevance of carefully chosen binding sequences was implicated in a recent study of the E. coli flagella regulation system by the group of Uri Alon. We hypothesize that the ability to “set” the binding strengths of target sequences at will (a property we refer to as “programmability”) is essential to the universality of the regulatory functions a cell can take on; see Section III.a below.
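The arithmetic behind this tuning can be sketched as follows. We assume, for illustration only, a two-state occupancy model in which each mismatch from the best binding sequence costs the same discrimination energy `eps` (in units of kBT), and we ignore the cutoff energy discussed above. The function names and the reference value `k_best` (the half-occupation concentration of the best binding sequence, set here to 1 molecule/cell) are hypothetical.

```python
import math

def half_occupation_concentration(mismatches, eps=2.0, k_best=1.0):
    """Protein concentration (molecules/cell) giving 50% site occupation,
    assuming each mismatch raises the binding energy by eps (in kBT)."""
    return k_best * math.exp(eps * mismatches)

def occupancy(conc, mismatches, eps=2.0, k_best=1.0):
    """Equilibrium probability that the site is occupied at concentration conc."""
    k = half_occupation_concentration(mismatches, eps, k_best)
    return conc / (conc + k)

# Half-occupation concentrations for 0 to 3 mismatches at eps = 2 kBT.
table = [half_occupation_concentration(m) for m in range(4)]
```

Under these assumptions each mismatch shifts the half-occupation concentration by a factor of e^2 (about 7.4), so zero to three mismatches already span roughly 1 to 400 molecules/cell. A much larger `eps` would make the steps too coarse, and a much smaller one would require too many mismatches to cover the physiological range.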

d) Protein-protein interaction: This is another integral component of cellular information processing. As in the protein-DNA interaction discussed above, there is a general problem of how proteins can find their intended protein partners among millions of other cellular proteins which may cross-interact with each other to some degree. A partial solution is spatial partitioning. For example, transcription factors are preferentially attracted to DNA due to electrostatic interaction and are essentially confined to a narrow tube along the DNA backbone. This minimizes their unintended interaction with other cellular proteins. For combinatorial transcription control in bacteria, some degree of cross-interaction among the transcription factors themselves is thought to be desirable (see Section III.a below). We are investigating the degree of interaction the system can tolerate without causing system-wide aggregation.

We have performed another study (B10) on the functional role of protein dimerization. Many proteins from enzymes to transcription factors function in nature as dimers (or higher-order oligomers), although the monomeric versions of a number of these proteins have been shown experimentally to be just as efficient in their immediate functions. In our investigation of gene circuits/networks (see Section III.b below), we encountered a generic system-level problem which can be elegantly solved by protein dimerization: The problem lies in the control of amplification in gene expression. Often the change in gene expression between the ON and OFF states is no more than ~10 fold at the mRNA level, while much larger fold changes (100~10,000) have been observed at the protein level. The ability to control amplification (e.g., to reduce leakage) is essential to the proper operation of gene circuits/networks. The amplification of fold change turns out to be difficult to achieve molecularly if the proteins are monomeric, but straightforward if the proteins form dimers or higher-order oligomers, due to the slower rate of proteolysis for proteins in multimer form. We show that the amplification factor can be readily controlled by the monomer-dimer dissociation constant, which can in turn be adjusted by changes of a few amino acids at the dimer interface. Hence we establish that amplification is another “programmable” feature in the construction of gene circuits.
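A minimal steady-state sketch shows how dimerization can amplify a fold change. The model below is an illustrative assumption and not necessarily that of B10: monomers are synthesized at a rate proportional to the mRNA level and degraded with first-order kinetics, dimers are in fast equilibrium with monomers (dissociation constant `k_d`), and dimers are taken to be fully protected from proteolysis. All names and parameter values are hypothetical.

```python
def total_protein(synthesis, k_d, degradation=1.0):
    """Steady-state total protein (monomer equivalents).

    Only monomers are degraded, so the monomer level is set by
    synthesis/degradation; dimers follow from mass action.
    """
    monomer = synthesis / degradation
    dimer = monomer ** 2 / k_d
    return monomer + 2.0 * dimer

def fold_change(syn_on, syn_off, k_d, degradation=1.0):
    """Ratio of total protein between the ON and OFF states."""
    return (total_protein(syn_on, k_d, degradation)
            / total_protein(syn_off, k_d, degradation))

# A 10-fold change in synthesis rate, for weak, intermediate, and tight dimers.
weak = fold_change(10.0, 1.0, k_d=1e6)    # mostly monomers
mid = fold_change(10.0, 1.0, k_d=1.0)
tight = fold_change(10.0, 1.0, k_d=1e-6)  # mostly dimers
```

In this sketch the protein-level fold change interpolates between the mRNA-level value (about 10, when dimers are rare) and its square (about 100, when dimers dominate), with the dissociation constant selecting the point in between. Higher-order oligomers, which this sketch does not model, would be needed to reach still larger amplification.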

III. Biomolecular Networks

This section addresses issues of molecular information processing at the system level. Characterizing and understanding the genetic circuitry controlling the cell is a major goal in “post-genome” biology. Here we integrate the biophysical properties of protein-DNA and protein-protein interactions discussed in Section II using statistical mechanics, and explore the capabilities of the emerging systems. We will take an “engineering” approach and investigate how one can put together the molecular components to perform various tasks central to gene regulation, rather than modeling possible behaviors of specific existing systems. The reason for taking this unusual approach is not a lack of interest in existing systems. Rather, accurate modeling of a system requires detailed molecular information (binding constants, on/off rates, etc.), knowledge of which is almost always lacking even for some of the best-characterized systems. In the engineering approach, on the other hand, one only needs to constrain parameters within the biologically feasible range. More importantly, the engineering approach allows us to address the underlying rules of gene regulation, which is essential in decoding existing systems. Finally, in vivo synthesis of various gene regulatory systems is of practical interest in its own right, with applications ranging from the detection of complex gene expression patterns to the controlled transcription of therapeutic genes to cure diseases.

a) Combinatorial transcription control: We first focus on one node of a gene network and investigate the complexity of the input/output characteristics, i.e., the complexity of transcription control functions implementable, given the known biophysical constraints on the molecular components for transcription regulation. We show in articles B3 and B4 that cis-regulatory systems with specific protein-DNA interaction and glue-like protein-protein interaction, supplemented by distal activation or repression mechanisms (e.g., DNA looping), are general-purpose molecular computers capable of executing a wide range of control functions encoded in the regulatory DNA sequences. This general result is established by mapping the system of transcription factors and their DNA binding sites to a neural network model: transcription regulation systems satisfying the above conditions belong to the class of “Boltzmann machines”, which are known to be powerful computing machines. [Note that there have been many previous attempts to model gene networks using neural networks. Here we assert that one node of the gene network IS already a neural network, a molecular implementation of the Boltzmann machine!] Within a mechanistic model of the bacterial transcription system, we provide recipes to “program” various control functions, by simply selecting the strengths and positions of the protein-binding DNA sequences in the regulatory region. The emerging architecture is naturally modular and highly evolvable. This opens numerous ways to make synthetic genetic control “devices” with a wide range of bioengineering applications.
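The flavor of such “programming” can be conveyed by a minimal thermodynamic sketch. This is a generic equilibrium occupancy model written for illustration, not the specific model of B3 and B4: two transcription factors A and B and the RNA polymerase P each bind one site, every bound/unbound configuration receives a Boltzmann weight, and pairwise cooperativity factors describe the glue-like contacts. The function names and all numerical values (site strengths `q_*`, cooperativities `w_*`) are assumptions chosen only to make the logic visible.

```python
from itertools import product

def promoter_occupancy(q_a, q_b, q_p=0.01, w_ab=1.0, w_ap=1.0, w_bp=1.0):
    """Equilibrium probability that the polymerase site is occupied.

    q_x = concentration / dissociation constant for species x (site strength);
    w_xy = cooperativity factor when x and y are bound simultaneously.
    """
    z_on = z_off = 0.0
    for a, b, p in product((0, 1), repeat=3):  # enumerate all binding states
        weight = (q_a ** a) * (q_b ** b) * (q_p ** p)
        weight *= (w_ab ** (a * b)) * (w_ap ** (a * p)) * (w_bp ** (b * p))
        if p:
            z_on += weight
        else:
            z_off += weight
    return z_on / (z_on + z_off)

def and_gate(a_present, b_present):
    """Weak sites plus strong A-B contact: expression needs both factors."""
    return promoter_occupancy(0.1 * a_present, 0.1 * b_present,
                              w_ab=1000.0, w_ap=30.0, w_bp=30.0)

def or_gate(a_present, b_present):
    """Strong sites, each recruiting polymerase alone: either factor suffices."""
    return promoter_occupancy(10.0 * a_present, 10.0 * b_present,
                              w_ap=1000.0, w_bp=1000.0)
```

The same molecular components give an AND-like or an OR-like response depending only on the chosen site strengths and contacts, which in a real regulatory region would be set by the sequences and relative positions of the binding sites.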

The above theory is a quantitative formulation and extension of the “principle of regulated recruitment” advocated by Mark Ptashne and collaborators. This principle captures the essential biology of transcription regulation in bacteria and perhaps also in simple eukaryotes like yeast. The applicability of the general scheme to higher eukaryotes has been questioned due to potential problems with the generic, glue-like protein-protein interaction assumed, since extensive protein-protein interaction may lead to unmanageable cross talk. Also, for a number of cases studied, transcription control is insensitive to the precise locations of protein-binding sites, implying that direct protein contact may not be necessary. We are currently exploring a suggestion due to Jon Widom that effective interaction between two proteins may be mediated by the DNA-histone interaction prevalent in eukaryotes. This hypothesis predicts that the positions of some nucleosomes in the regulatory regions are co-localized with binding sites of certain transcription factors. We are developing bioinformatic tools (see Section I.e) to detect sequence motifs for nucleosome localization and thereby test this important hypothesis quantitatively.

More generally, our formulation of transcription control can be taken as an effective description without regard to the actual molecular implementations. Because of the generality of our model (as manifested by the mapping to the Boltzmann machine), we can use it as a tool to relate binding site information and gene expression data in much more powerful ways than the linear models being currently used. We are developing this tool to analyze data obtained from the development of Dictyostelium and Drosophila. The Drosophila work is being carried out in collaboration with the laboratories of Bill McGinnis and Ethan Bier at UCSD.

b) Gene circuits and sequential logic: The power of combinatorial signal control and integration discussed above allows a cell to perform a large class of computations (e.g., logic functions) with individual genes, without the need for “cascades” as is routinely done in electronic circuits. Gene cascades (as in feed-forward and feedback circuits) are on the other hand necessary for the manipulation of temporal information, e.g., to detect temporal correlations. With the combinatorial integration scheme described in Section III.a, one can have very compact circuit constructions. For example, in article B8 we propose a cis-regulatory construct for a write-enabled memory gate using only one gene with self-feedback. (An equivalent integrated circuit would require a dozen or so transistors.) With this circuit, a gene can be used to “memorize” the state of a certain transcription factor (e.g., whether or not it is phosphorylated) only when triggered by certain other conditions (e.g., a specific phase of the cell cycle). This memory gate thus enables the cell to correlate information at different times. Mathematically, the behavior of our memory gate turns out to be very similar to that of a magnetic memory device, with the write-enabling agent playing the role of temperature, which is raised while “writing” and lowered for memory storage. The simplicity of the regulatory construct has led to a collaborative experimental effort to synthesize this device in vivo in E. coli. We are also exploring gene circuits to implement other temporal functions such as differentiation, integration, and counting.

c) Spatial patterns of gene expression: Understanding the mechanism of spatial pattern generation in multicellular organisms is a major challenge of developmental biology. Drosophila embryogenesis has often been used as a prototypical system for these studies. It was widely believed that spatial patterns result from differential gene expression in response to (maternal) morphogen gradients. However, a recent single-embryo experiment by the group of Stan Leibler cast significant doubt on the morphogen-based mechanism, since the latter would predict sensitivity of spatial patterns to cellular protein concentrations, cell size, temperature, etc., while experimentally the patterns formed are insensitive to any of these factors. Recently, we have been exploring various nonequilibrium mechanisms of spatial pattern formation. In our approach, the embryo is treated as a reaction-diffusion medium. Spatial partition and the subsequent pattern formation process do not rely on protein concentrations reaching pre-set thresholds. Instead, they can be controlled temporally in spatio-temporally coupled reaction-diffusion systems, much as the midpoint of a row of dominoes can be found by tipping pieces simultaneously from the two ends. Preliminary studies indicate that the resulting patterns are robust to changes in protein concentrations, cell size, etc., but can nevertheless be controlled by diffusion constants and degradation rates, which are programmable parameters. Regardless of whether Drosophila embryogenesis actually utilizes this or another nonequilibrium mechanism of spatial pattern formation, we believe it is important to inject the nonequilibrium perspective into the collection of developmental models.

IV. Evolutionary Dynamics

It is often said that any progress in understanding biological systems will ultimately have to come from evolution. In the engineering approach to transcription control discussed in Section III, a crucial aspect of the theory is to identify the “programmable” parameters, i.e., those parameters of the system such as the binding strengths of various DNA motifs that can be fine-tuned easily by evolution. In this way, the difficult issue of parameter selection in modeling is turned into a simple optimization process once the function of the system is defined. Also in our planned experimental study of synthetic gene circuits, an important component will be the use of in vitro directed evolution assays.

This section addresses a number of theoretical issues that arise in the context of molecular evolution, e.g., effects due to different modes of mutation and selection. We are well aware of the large body of theoretical studies on evolution throughout the last century. Most of these studies are phenomenological in nature, as there was little concrete knowledge at the time of the genotype and phenotype on which mutation and selection respectively work, nor was there any practical way of controlling the modes of mutation and selection. Rapid advances in molecular biotechnology now allow experimentalists to have very good control over the mutation and selection of populations of molecules using directed in vitro evolution. Also, the genotypes and phenotypes of the resulting molecules can be quantitatively characterized in many cases. It is thus time to reformulate and develop quantitative theories of molecular evolution, and compare them to experiments.

Ultimately, we would of course like to apply these theories to study processes that drive the evolution of molecules in cells. But this would require better knowledge of both the mutation processes in vivo and selection forces in the wild. We are trying to fill gaps in these areas through bioinformatic studies as discussed in Section I.b and also below.

a) DNA binding motif: The evolution of protein-binding DNA motifs serves as a perfect example to illustrate the contact between in vitro molecular evolution and the phenomenological theory of evolution developed in the past. As described in article A60, the mean-field theory of DNA binding sequence evolution is a variant of Eigen's quasi-species theory, with the bases of the DNA sequence corresponding to Eigen’s “genes”, and the sequence-dependent affinity to protein being Eigen’s fitness landscape (in the shape of a “mesa” for this case). We find that the localization-delocalization transition that occurs in this system can be observed in RNA viruses upon varying the selection force. For DNA viruses, bacteria and eukaryotes, mutational forces are negligible for realistic selection forces, and the system is always in the localized phase. However, the motifs obtained are expected to be maximally fuzzy, i.e., the sequences will be at the edge of the mesa landscape, accumulating the maximal number of allowed mismatches compared to the best-binding (or consensus) sequence while maintaining affinity to proteins.
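The “edge of the mesa” picture can be checked with a small mean-field iteration. The sketch below is an illustrative simplification and not the calculation of A60: sequences are grouped by their number of mismatches to the consensus, all classes with at most `r0` mismatches share a fitness advantage `s`, and at most one point mutation occurs per generation (a matched site mutates away with probability `mu`, a mismatched site reverts with probability `mu/3`). All names and parameter values are assumptions.

```python
def mesa_steady_state(length=10, r0=3, s=0.1, mu=1e-3, steps=20000):
    """Stationary mismatch distribution for a mesa-shaped fitness landscape.

    Index r of the returned list is the number of mismatches to the
    consensus sequence; classes with r <= r0 sit on top of the mesa.
    """
    n = [0.0] * (length + 1)
    n[0] = 1.0  # start with every sequence equal to the consensus
    for _ in range(steps):
        # Selection: sequences on the mesa reproduce faster.
        selected = [x * ((1.0 + s) if r <= r0 else 1.0) for r, x in enumerate(n)]
        # Mutation: move at most one mismatch class up or down.
        new = [0.0] * (length + 1)
        for r, x in enumerate(selected):
            up = mu * (length - r)  # a matching base is lost
            down = mu * r / 3.0     # a mismatched base reverts to consensus
            new[r] += x * (1.0 - up - down)
            if r < length:
                new[r + 1] += x * up
            if r > 0:
                new[r - 1] += x * down
        total = sum(new)
        n = [x / total for x in new]  # keep the population normalized
    return n
```

With selection much stronger than mutation, as in these default values, the stationary distribution peaks at `r0`, the last mismatch class still on the mesa: mutation pushes sequences toward the far more numerous mismatched configurations, and selection stops them only at the edge.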

In bioinformatic approaches to finding DNA binding motifs, one often pools all of the known or suspected binding sequences of a particular protein in a genome together to form a “weight matrix”, uses it to search for additional binding sites, and updates the matrix recursively. Implicit in this general approach is the assumption that all binding sequences in the genome are subject to the same selection landscape (e.g., mesas with the same binding threshold). However, in the theory of protein-DNA interaction (A61) and transcription regulation (B3 and B4) discussed in Sections II and III, it is critical to fine-tune the strengths of protein-DNA binding to satisfy different functional demands for different genes. We also mentioned that this feature of protein-DNA interaction is supported by recent experimental findings by Uri Alon’s group on E. coli flagella assembly. If our hypothesis is correct, then the “weight matrix” approach will need to be substantially modified. We are currently testing this hypothesis by examining the DNA binding motifs of well-known transcription factors in “orthologous” regulatory regions across different species of bacteria.

b) Time-dependent selection: A bacterium possesses many genetic circuits to regulate its metabolism. Why are certain genes regulated by negative feedback while others are regulated by positive feedback? Savageau accumulated a large body of evidence supporting the trend that gene products under low/high demand are negatively/positively regulated, respectively. A “use-it-or-lose-it” principle applied to the relevant protein-binding DNA sequence was proposed to explain this effect qualitatively. Quantitatively, this effect is difficult to study due to the vast difference between mutational time scales and the physiological/ecological time scale of demand variations. We recently devised an analytical approach (B11) to solve this problem. Our theory is based on Kimura's neutral evolution idea, which predicts the loss of neutral sequences (during the no-demand periods) due to stochastic genetic drift in finite populations. We find the loss probability to be very small during any given no-demand period. But the accumulated loss probability inevitably becomes significant after many demand cycles, leading to catastrophic extinction phenomena. However, using reasonable estimates of mutation rates and demand variations for bacteria, we find this effect to be quantitatively irrelevant for the very short DNA binding sequences Savageau had in mind.
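The accumulation argument can be made explicit with a one-line estimate. As a simplifying assumption (not taken from B11), suppose each no-demand period carries the same small loss probability $p$, independently of the others. The probability that the sequence has been lost after $n$ demand cycles is then

```latex
P_{\mathrm{loss}}(n) \;=\; 1 - (1 - p)^{n} \;\approx\; 1 - e^{-np}
\qquad (p \ll 1),
```

which remains small only as long as $n \ll 1/p$ and approaches one for $n \gg 1/p$. Whether the effect matters for a given sequence therefore comes down to comparing $1/p$ with the number of demand cycles available.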

Nevertheless, such a rare extinction effect should be relevant to many of the genes which are also subject to similar time-dependent selection forces (e.g., the lacZ gene of E. coli, which is only needed during the brief periods when the bacterium relies on lactose as the principal source of nutrients). This raises a system-level issue of how a cell protects the majority of its genes which are not constantly in use. We are collaborating with the laboratory of Lin Chao at UCSD to test aspects of this effect experimentally.

c) Competitive evolution: In the evolutionary studies above, the fitness landscape was treated as a passive function specified by the external environment. Quite often, the evolutionary dynamics is such that the fitness itself depends on the population being evolved. This is the case, for example, in the in vitro directed evolution of DNA binding sequences, where after every round of evolution one keeps only a small fraction of the better binders from the population. We formulated and solved many aspects of the nonequilibrium dynamics associated with such competitive evolution processes. In article B1, we studied analytically and numerically the competitive dynamics with the mutation process being point substitutions only. The mean-field evolution equation we formulated led to a propagating front, describing the progress of the population in the attribute being competitively selected (e.g., the binding strengths of DNA sequences to a particular protein). The propagation speed of the front is not constant but slows down due to the larger entropic cost associated with the optimal solution. These predictions, together with the expected (weak) correction due to finite population fluctuations, were verified by numerical simulations.
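A toy simulation conveys the flavor of such competitive dynamics. The Python sketch below is not the model of article B1. As simplifying assumptions, each sequence is reduced to a list of 0/1 flags (1 meaning the site matches the best binder), the selected attribute is the number of matching sites, and every round keeps a fixed top fraction of the population. All parameter values are placeholders.

```python
import random

def simulate_competitive_evolution(length=50, pop_size=200, keep_fraction=0.1,
                                   mutation_rate=0.02, rounds=30, seed=0):
    """Return the mean number of matching sites before and after each round."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    n_keep = max(1, int(keep_fraction * pop_size))
    history = [sum(map(sum, population)) / pop_size]
    for _ in range(rounds):
        # Point substitutions: flip each site independently with a small probability.
        mutated = [[1 - s if rng.random() < mutation_rate else s for s in seq]
                   for seq in population]
        # Competitive selection: keep only the best binders of this round.
        survivors = sorted(mutated, key=sum, reverse=True)[:n_keep]
        # Amplify the survivors back to the full population size.
        population = [list(rng.choice(survivors)) for _ in range(pop_size)]
        history.append(sum(map(sum, population)) / pop_size)
    return history

history = simulate_competitive_evolution()
gains = [later - earlier for earlier, later in zip(history, history[1:])]
```

The list `gains` records the advance of the population mean per round. In this toy model, a sequence close to the optimum has few sites left to improve and many that a substitution can spoil, which is one simple way to picture the entropic cost mentioned above.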

In article B7, we tackled the more difficult case allowing for recombination among the evolving sequences. Recombination-based directed evolution (e.g., DNA shuffling) is widely used in industry to breed proteins with novel properties. It is known to work much faster than directed evolution with substitutions only. Mathematically, the multi-parent, multi-crossover version of recombination encountered in the directed evolution experiments turns out to be much simpler to analyze than the traditional two-parent, single-crossover version of recombination much studied in population biology. We were able to solve the evolution dynamics analytically in certain limiting cases, and to obtain for the general case a scaling solution with a qualitative characterization of its behavior. We expect our scaling solution to be of use to experimentalists who need to select a large number of experimental parameters to optimize the evolution process.
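The difference between the two kinds of recombination can be made concrete with a short Python sketch of a multi-parent, multi-crossover operation. This is an idealized caricature of DNA shuffling and not the formulation used in article B7: the function name, the per-site crossover probability, and the uniform choice of the next parent are assumptions made for illustration.

```python
import random

def shuffle_recombine(parents, crossover_prob=0.2, rng=None):
    """Assemble one child by copying from a pool of equal-length parents.

    At every site after the first, the template switches to a randomly
    chosen parent with probability crossover_prob, so a child can carry
    segments from many parents and contain many crossover points.
    """
    rng = rng or random.Random()
    length = len(parents[0])
    if any(len(parent) != length for parent in parents):
        raise ValueError("all parents must have the same length")
    current = rng.randrange(len(parents))
    child = []
    for pos in range(length):
        if pos > 0 and rng.random() < crossover_prob:
            current = rng.randrange(len(parents))
        child.append(parents[current][pos])
    return "".join(child)

# Hypothetical parents, uniform so that the origin of each segment is visible.
parents = ["AAAAAAAA", "CCCCCCCC", "GGGGGGGG"]
child = shuffle_recombine(parents, crossover_prob=0.5, rng=random.Random(1))
```

The traditional two-parent, single-crossover case corresponds to restricting the pool to two parents and allowing exactly one switch per child.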

Experimental studies

Background: The survival and well-being of a cell depend crucially on its ability to coordinate its gene activities in response to a vast number of cellular and environmental signals. This is often accomplished combinatorially through a large number of protein-protein and protein-DNA interactions. A major goal of post-genome biology is to characterize these interactions and decipher the complex regulatory circuits/networks they define.

Objective: Instead of directly dissecting the regulatory program of a specific organism, we investigate different molecular strategies an organism could adopt to integrate and regulate signals. We proceed by challenging an organism with a complex set of regulatory tasks, and then evolve specific molecular components that allow the organism to meet its challenges.
Our initial focus is on combinatorial gene regulation. In a recent theoretical analysis [see PNAS 100: 5136-5141 (2003)], we showed that the bacterial transcription machinery is capable of implementing an unlimited variety of combinatorial control functions (including the rather complex ones better known in higher eukaryotes), merely by appropriate arrangements of the cis-regulatory sequences. In our experimental program, we start with an approximate cis-regulatory construct based on our theoretical analysis, and mutate the regulatory sequences in vitro by mutagenic PCR and DNA shuffling. To “breed” the appropriate regulatory sequences, we insert the regulatory sequence upstream of a reporter gene in a host bacterium (E. coli), then apply positive selection for those combinations of signals for which we wish the gene to be turned on, and apply negative selection for the other combinations. The new regulatory sequences are then extracted for further rounds of mutation and selection until the desired regulatory functions are obtained.
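The selection logic can be summarized schematically in a few lines of Python. This is only a bookkeeping sketch under strong simplifying assumptions: signals and reporter expression are treated as strictly on or off, and the names `selection_plan` and `survives_all_rounds` are hypothetical. It does not describe any experimental protocol.

```python
from itertools import product

def selection_plan(target, n_signals=2):
    """Map each signal combination to the kind of selection applied to the host.

    target is the desired control function: it takes n_signals booleans and
    returns True when the reporter gene should be turned on.
    """
    return {
        combo: "positive" if target(*combo) else "negative"
        for combo in product((False, True), repeat=n_signals)
    }

def survives_all_rounds(response, target, n_signals=2):
    """True if a construct's on/off response matches the target for every combination."""
    return all(
        bool(response(*combo)) == bool(target(*combo))
        for combo in product((False, True), repeat=n_signals)
    )

# Example target: the gene should be on only when both signals are present.
def and_gate(a, b):
    return a and b

plan = selection_plan(and_gate)
```

A construct whose response behaves like an OR gate would pass the positive selection for the (True, True) combination but fail the negative selection whenever only one signal is present.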

We plan to apply similar strategies to study combinatorial signal transduction. We will focus on the better-characterized bacterial two-component system. Starting with two independent signaling pathways each regulating a distinct reporter gene, we can select for “cross talk” between the pathways by requiring that the activation of one pathway turn on both genes. We can also do the opposite: starting with two identical pathways regulating two distinct reporter genes, we can select for the divergence of pathways by, e.g., requiring one gene to be off while the other is on. (This mimics the evolution of genetic diversity via gene duplication.) To avoid parasitic solutions, we will evolve only the interaction domains of the proteins involved.

Significance: The synthetic approach described here gives us quantitative handles on the specific signals and their responses, since they are the very challenges we impose on the organism. In this way, we create in vivo models of complex regulatory systems, which can be used to cement a firm link between modeling and experiment so crucial to the success of quantitative biology. Our approach also allows us to quickly read off “solutions” to the challenges by sequencing the specific molecules, e.g., the cis-regulatory sequences or the amino acid sequences of the interacting protein pair. We can learn a great deal of biology from these solutions, since they reflect the diverse range of strategies available to implement complex regulation. They can be used as a valuable “training set” to aid bioinformatics efforts to dissect complex regulatory systems. In the case of two-component signaling, for instance, the different solutions convey much-needed information on the specificity of protein-protein interaction. Finally, the ability to “program” genetic response at will can lead to potentially lucrative bioengineering and biomedical applications. For example, bacteria equipped with such regulatory systems may be used to recognize unique patterns of detectable traits corresponding to specific chemical pollutants or harmful biological agents. Also, a sensitive cellular reporter of complex transcription patterns can be invaluable in the diagnosis of complex diseases.