Research highlights
The research focus of the Hwa lab has been primarily at the interface of
statistical physics and molecular biology. Ideas and methods of
statistical physics bring a unique perspective toward making biology quantitative.
At the same time, these biological problems stand to challenge and enrich
the frontiers of statistical physics. Our theoretical/computational
studies have been on the following topics:
I. The development of new bioinformatic
methods and tools for various genomic and evolutionary studies;
II. Characterization of complex interactions
between biopolymers;
III. Formulating systemic approaches
to bio-molecular networks;
IV. Exploring the nonequilibrium dynamics
of molecular evolution.
Recently, we have also started a molecular biology lab to evolve novel
gene regulatory sequences, signaling pathways, and genetic circuits. The
experimental studies provide insight into the evolvability
of the information processing capability of biological organisms. They also
provide a large amount of data which can lead to the discovery of novel mechanisms
of signaling and control in cells. The evolved control elements have numerous
potential applications in bioengineering and biomedicine.
Theoretical/computational studies
I. Genomics and bioinformatics:
This section addresses the development of methods and tools to analyze
biological data. Statistical physics is involved in the construction of the
methods and the statistical evaluation of the results. Most of the methods
described are geared towards addressing specific biological questions, in
particular the natural history of genomic evolution and the mechanism of
gene regulation, two topics which will be discussed in more detail in Sections
III and IV.
a) Sequence matching statistics: My contact
with bioinformatics started with article A28 in which I noted the connection
between the “sequence matching” problem ubiquitously encountered in bioinformatics
and the directed polymer problem widely studied in statistical physics. A
number of ensuing articles (A39, A40, A41, A42, A46, A48, A49) exploited
this connection to devise more efficient approaches to perform, characterize,
and evaluate the statistics of sequence alignment, culminating in a new “hybrid
algorithm” (A51, A54, A55) which combines the speed advantage of existing
alignment algorithms with well-characterized score statistics
for arbitrary gap penalty parameters. The latter feature allows for the real-time
adjustment of position-dependent gap penalties in sequence similarity search,
and serves as the anchor of the next generation BLAST program currently being
developed by the group of Stephen Altschul at NIH.
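To make the "sequence matching" problem concrete, the following is a textbook Smith-Waterman local alignment recursion — an illustrative sketch of the class of algorithms whose score statistics are being characterized, not the hybrid algorithm of A51/A54/A55; the scoring parameters are arbitrary choices.

```python
# Plain Smith-Waterman local alignment (illustrative sketch only; scoring
# parameters match/mismatch/gap are arbitrary, not from the articles cited).

def local_align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # local alignment: restart at 0
                          H[i - 1][j - 1] + s,    # match/mismatch step
                          H[i - 1][j] + gap,      # gap in sequence b
                          H[i][j - 1] + gap)      # gap in sequence a
            best = max(best, H[i][j])
    return best

print(local_align_score("GGTTGACTA", "TGTTACGG"))
```

The distribution of this maximal score over random sequences is precisely the quantity that maps onto the directed polymer problem.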
Statistical physics also stands to benefit considerably from the above connection.
Inspired by the statistical behavior of the global sequence alignment problem,
we showed that elastic media subject to two independent sources of quenched
disorders could be regarded as being embedded in a single effective disordered
medium in higher spatial dimensions. Applications include tribology (A30, A37),
reptation (A38), and polymer adsorption (A35). The local alignment problem
also has interesting ramifications in statistical physics, e.g., the nonequilibrium
phase transition discussed in article A36 and the glassy dynamics of DNA
denaturation bubbles in article B2 (see Section II.b). The latter article
introduces the large-deviation theory pedagogically to the statistical physics
community while also presenting the theory of disordered systems to the biophysics
community; it provides one of the simplest examples of a disordered system whose
glassy dynamics can be explicitly characterized.
b) Genomic substitution patterns: We developed a
strategy (B6) to quantify the history of nucleotide substitutions in the
human genome going far back in time (around 250 Myr) by exploiting the vast
number of “repetitive elements” as fossil records. Our analysis (B5) revealed
a number of entirely unexpected but biologically meaningful results, in particular
the sudden and drastic change of substitutional biases at around the time
of mammalian radiation, which collectively revise the accepted picture of
mammalian genomic evolution. They also shed light on the evolution of genomic
“isochores” (large regions of homogeneous GC-content), the existence of which
has been one of the most puzzling features of the human genome. We are currently
repeating the analysis in mouse to establish events common to mammals and
those specific to primates. We are also performing region-specific analysis
to characterize the different substitutional biases along the length of the
chromosome. In addition to peeking back into the history of genomic evolution,
our findings have direct bearing on comparative genomic analysis (e.g., the
search for genes and DNA motifs via human/mouse comparison) the success of
which depends crucially on the proper description of the evolution of the
genomic background.
c) Context-dependent mutation: Existing mutation
models describe the mutation of any given nucleotide as a reversible Markov
process independent of its neighboring bases. However, this neglects context-dependent
mutation processes, e.g., the CpG-methylation-deamination process which is
the dominant channel of mutation in all vertebrates. These more complex processes
do not satisfy detailed balance and are nonequilibrium in nature; they belong
to a class of nonlinear stochastic dynamics problems well studied in statistical
physics but unfamiliar to the bioinformatics community. We solved this problem
(A59) by applying the “cluster approximation” which turned out to be very
efficient and accurate. A web server (C7) was constructed to quickly extract
context-dependent mutation processes from raw genomic sequences. We are currently
developing algorithms to reconstruct phylogenetic trees for the repetitive
elements in the human genome and across the mammalian species. This effort
should yield detailed information on the evolution of mammalian genomes,
e.g., duplications and rearrangements. Incorporating the CpG-methylation-deamination
process is crucial to our approach since the much faster substitution rate
by this process provides excellent statistics unattainable otherwise.
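The context dependence can be illustrated with a toy Monte Carlo simulation (all rates are made up for illustration; this is not the cluster approximation of A59): a C followed by G mutates to T far faster, via the CpG-methylation-deamination channel, than an isolated C, and the process is irreversible.

```python
import random

# Toy context-dependent mutation sketch (rates are illustrative assumptions):
# C in a CpG context mutates to T (methylation-deamination) much faster than
# an isolated C; the process is one-way and so breaks detailed balance.

def evolve(seq, steps, p_bg=0.001, p_cpg=0.02, rng=random.Random(0)):
    seq = list(seq)
    for _ in range(steps):
        for i in range(len(seq) - 1):
            if seq[i] == "C" and seq[i + 1] == "G":
                if rng.random() < p_cpg:      # fast CpG -> TpG channel
                    seq[i] = "T"
            elif seq[i] == "C":
                if rng.random() < p_bg:       # slow background C -> T
                    seq[i] = "T"
    return "".join(seq)

start = "CG" * 20 + "CA" * 20               # 20 CpG sites, 20 isolated C's
end = evolve(start, 100)
cpg_left = sum(end[i:i + 2] == "CG" for i in range(len(end) - 1))
iso_c_left = end.count("CA")
print(cpg_left, iso_c_left)                  # CpG sites decay much faster
```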
d) Gene expression analysis: We have been collaborating
with the laboratory of Bill Loomis (UCSD biology) to analyze large scale
gene expression data obtained from DNA micro-array experiments on the development
of the social amoeba Dictyostelium discoideum. We developed two methods: The
first is a generic method to cluster genes with similar expression patterns
(A50). It is based on the physics of percolation; the probabilistic nature
of the method accommodates experimental uncertainties and allows for “multi-parenthood”,
making it superior to commonly used methods such as hierarchical clustering
and the self-organizing map. The second method (A56) takes advantage of the
temporal nature of the data from Loomis’ experiments. We assume a first-order
kinetic process and work backward to deduce the onset and cessation of gene
expression, thereby extracting a few vital numbers out of the vast raw data.
The results can be straightforwardly (e.g., visually) analyzed. A pilot study
involving 700 genes has already yielded valuable insights into the biology
of Dictyostelium development (A53). Full genome-wide studies (including comparisons
to hundreds of mutants) are now underway. We hope to extend our analysis
to discover gene-gene interaction underlying the complicated expression patterns.
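The idea behind the second method can be sketched as follows (a simplified stand-in for A56, with assumed parameter values): posit first-order kinetics dm/dt = s·[t_on ≤ t < t_off] − k·m for the mRNA level, then recover the onset and cessation times by least squares over a grid of candidates.

```python
import numpy as np

# Simplified sketch of deducing onset/cessation from first-order kinetics
# (not the actual method of A56; s, k and the grid are assumed values).

def model(t, t_on, t_off, s=1.0, k=0.2):
    m = np.zeros_like(t)
    for i in range(1, len(t)):
        dt = t[i] - t[i - 1]
        src = s if t_on <= t[i - 1] < t_off else 0.0   # expression window
        m[i] = m[i - 1] + dt * (src - k * m[i - 1])    # Euler step of dm/dt
    return m

t = np.linspace(0, 24, 97)                  # 24 h development, 15-min grid
data = model(t, t_on=6.0, t_off=14.0)       # synthetic expression profile
grid = np.arange(0.0, 24.0, 1.0)
fits = [(np.sum((model(t, a, b) - data) ** 2), a, b)
        for a in grid for b in grid if b > a]
err, t_on_hat, t_off_hat = min(fits)        # best-fitting (onset, cessation)
print(t_on_hat, t_off_hat)                  # recovers 6.0, 14.0
```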
e) DNA motif search: Much of the regulatory information
is coded in the genome via the relative positions and strengths of the protein-binding
DNA sequence motifs. We are analyzing the composition of the DNA motifs for
orthologous regulatory regions across different species of bacteria to discriminate
aspects which are essential to functions and those due to stochastic fluctuations
of the evolution dynamics (see also Section IV.a). We are also developing
a method to detect DNA motifs for the localization of nucleosomes in
eukaryotic genomes. Our approach is based on the biophysics of histone-DNA
interaction, combined with the experimental knowledge of exemplary histone-binding
DNA sequences. With this method, we wish to test the hypothesis that the
positions of nucleosomes mediate indirect interaction between the regulatory
proteins and hence can serve important regulatory functions (see Section
III.a).
II. Molecular Biophysics
This section addresses the biophysics of biopolymers and their interactions.
We introduce the concepts and methods of modern statistical physics to characterize
the complex interactions between heterogeneous DNA and RNA sequences with
each other and with proteins. The main goal of these studies is to obtain
molecular information and provide biophysical constraints on the properties
of biomolecules to be used in the system-level studies to be described in
Section III. Another objective is to establish biopolymers as interesting
model systems to the statistical physics community.
a) Single molecule biophysics: Rapid advances in
the technology of single molecule manipulation (e.g., by optical tweezers
and AFM) open up many possibilities for detailed experimental studies of
biomolecules. One recent thrust pioneered by the groups of Bustamante and
Tinoco at Berkeley is to pull a single RNA molecule from its two ends, and
learn the structure of the molecule from the way it opens up (through the
force-displacement characteristics). We developed a detailed theoretical
model (articles A52, A62 and a web server C6) to describe this process by
combining the popular RNA folding programs (MFOLD and the “Vienna package”)
together with the statistical mechanics of pulling. We describe features
in the structure of the molecule that can be extracted from the force-displacement
characteristics, and propose different experimental strategies to enhance
the extractable features. Of special interest is our recent proposal (B9)
to pull the RNA through a molecular pinhole. We expect the additional positional
information on the pulling process to yield a great deal more information on the
structure of the molecule than the experiments currently being carried out.
In addition to RNA, we are working with local biophysics experimentalists
(the group of Doug Smith at UCSD) to probe protein-DNA binding by unzipping
double-stranded DNA. The approach is based on the effect of DNA-binding proteins
on the unzipping characteristics of double-stranded DNA, and the feasibility
has been demonstrated recently by the group of Michelle Wang at Cornell.
By designing appropriate DNA sequences, we hope to develop an efficient method
to detect the binding sequences (and their strengths) for different regulatory
proteins. By further placing multiple protein-binding sites on the same sequence,
we hope to develop an efficient way of detecting protein-protein interaction.
b) Glassy properties of biopolymers: Heterogeneous
biopolymers provide nice examples of “disordered” systems which have been
of long-standing interest in statistical physics. Best known among these
is the “protein-folding” problem, which is unfortunately too difficult to
tackle either analytically or numerically. Over the years, I introduced a
number of other biopolymeric problems to the statistical physics community.
The simplest non-trivial problem is the localization of DNA denaturation
bubbles (A34, B2): This system is a physical realization of the celebrated
Random Energy Model. We showed the existence of an (inverted) glass transition,
i.e., the denaturation bubble becomes localized to weak (AT-rich) regions
of double-stranded DNA upon increase in temperature towards the bulk denaturation
point. This phase transition is understood analytically in terms of the large-deviation
theory developed in the context of local sequence alignment (see Section
I.a). The theory can further be applied to describe dynamic properties of
the denaturation bubble. We found the bubble to be sub-diffusive in the “glass”
phase, with continuously varying dynamic exponents as predicted by the large-deviation
theory.
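The Random Energy Model localization itself is easy to demonstrate numerically (a generic REM toy, with assumed Gaussian energies, not the DNA-specific calculation of A34/B2): the participation ratio Y = Σ p_i² of the Boltzmann weights jumps from near zero (delocalized over many sites) to order one (localized on a few low-energy sites) as the effective disorder strength grows. In the bubble problem the effective disorder grows with temperature, which is why the transition there is inverted.

```python
import numpy as np

# Toy REM sketch (assumed iid Gaussian site energies, arbitrary units):
# participation ratio of Boltzmann weights vs. effective inverse temperature.

rng = np.random.default_rng(1)
E = rng.normal(size=100_000)                 # random site energies

def participation(beta_eff):
    w = np.exp(-beta_eff * (E - E.min()))    # shift by E.min() for stability
    p = w / w.sum()
    return float(np.sum(p ** 2))             # Y ~ 0: delocalized; Y = O(1): localized

for beta in (0.5, 2.0, 8.0):
    print(beta, participation(beta))
```

For N sites the transition sits at β_c ≈ √(2 ln N) ≈ 4.8 here, so the β = 8 row is deep in the localized phase.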
Another problem of interest is that of RNA secondary structure formation.
It has aspects similar to sequence alignment and directed polymer, but is
more complex due to the nontrivial topology the RNA molecule can take on.
In articles A57 and A58, we studied in detail the folding of random RNA sequences.
By a combination of analytical and numerical calculations, we found the system
to have a weak glass phase characterized by logarithmic energy barriers.
In article A43, we introduced and solved a (Go-type) model to characterize
the competition between the formation of native structure favored by designed
RNA sequences and the formation of random structures due to configurational
entropy.
c) Protein-DNA interaction: The binding of regulatory
proteins (e.g. transcription factors) to their specific sequence target on
the genome plays an essential role in the process of gene regulation (see
Section III.a below). What type of protein-DNA interaction allows the protein
to find its true DNA target(s) without getting trapped at unavoidable pseudo
sites contained in the genomic background? In article A61, we examined the
thermodynamics and kinetics of this system quantitatively for bacteria, based
on the energetic forms of exemplary systems obtained experimentally by von
Hippel and Stormo. We found that in order for the protein not to be trapped
in the genomic background, the protein-DNA binding energy must be cut off,
i.e., become independent of the DNA sequence, once the sequence is sufficiently
different from the best binding sequence. The predicted cutoff energy is
observed for the several transcription factors which have been quantitatively
characterized. Together with results from Bob Sauer’s lab that biochemically
the cutoff energy can be easily varied by a large amount with a few mutations
of the protein sequence, we establish the hypothesis that the cutoff energy
is functionally constrained to be of a specific value (dependent only on
the length of the genome).
Given the proper choice of the cutoff energy which solves the kinetic search
problem, the thermodynamics of binding to target sequences is determined
by the composition of the sequence and the discrimination energy of the protein
(i.e., the energetic preference for specific nucleotides at different positions
of the binding sequence). A very important result derived from the study
reported in A61 is that for a narrow range of discrimination energies (~2
kBT per base), the binding strength of a target sequence, say as measured
by the protein concentration for 50% site occupation, can be tuned continuously
throughout the typical protein concentration of 1~1000 molecules/cell, by
selecting the number of bases of the target sequence matching the best binding
sequence. The functional relevance of careful choices of binding sequences
was implicated in a recent study of the E. coli flagella regulation system
by the group of Uri Alon. We hypothesize that the ability to “set” the binding
strengths of target sequences at will (a property we refer to as “programmability”)
is essential to the universality of the regulatory functions a cell can take
on; see Section III.a below.
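The arithmetic behind this "programmability" is simple (a back-of-envelope sketch with assumed numbers, not values from A61): if each mismatch from the best-binding sequence costs ε ≈ 2 kBT, the dissociation constant grows by a factor e² ≈ 7.4 per mismatch, so the protein concentration needed for 50% occupancy spans roughly three decades as the number of mismatches goes from 0 to 3.

```python
import math

# Back-of-envelope programmability sketch (numbers are assumptions):
# 50%-occupancy concentration vs. number of mismatches r, with an energy
# penalty of eps kBT per mismatched base.

eps = 2.0          # discrimination energy per mismatched base, in kBT
c0 = 1.0           # 50%-occupancy concentration of the best binder
                   # (taken as ~1 molecule/cell for illustration)
for r in range(4):
    c50 = c0 * math.exp(r * eps)   # each mismatch multiplies c50 by e^eps
    print(r, round(c50, 1))        # 0: 1.0   1: 7.4   2: 54.6   3: 403.4
```

This spans the typical 1~1000 molecules/cell range quoted above.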
d) Protein-protein interaction: This is another
integral component to cellular information processing. As in protein-DNA
interaction discussed above, there is a general problem of how proteins can
find their intended protein-partners among millions of other cellular proteins
which may cross interact with each other to some degree. A partial solution
is spatial partitioning. For example, transcription factors are preferentially
attracted to DNA due to electrostatic interaction and are essentially confined
to a narrow tube along the DNA backbone. This minimizes their unintended
interaction with other cellular proteins. For combinatorial transcription
control in bacteria, some degree of cross interaction among the transcription
factors themselves was thought to be desirable (see Section III.a below).
We are investigating the degree of interaction the system can tolerate without
causing system-wide aggregation.
We have performed another study (B10) on the functional role of protein dimerization.
Many proteins from enzymes to transcription factors function in nature as
dimers (or higher order oligomers), although the monomeric versions of a
number of these proteins have been shown experimentally to be just as efficient
in their immediate functions. In our investigation of gene circuits/networks
(see Section III.b below), we encountered a generic system-level problem
which can be elegantly solved by protein dimerization: The problem lies in
the control of amplification in gene expression. Often the change in gene
expression between the ON and OFF state is no more than ~10 fold at the mRNA
level, while much larger fold changes (100~10,000) have been observed at
the protein level. The ability to control amplification (e.g., to reduce
leakage) is essential to the proper operation of gene circuits/networks.
The amplification of fold change turns out to be difficult to achieve molecularly
if the proteins are monomeric, but straightforward if the proteins form dimers
or higher-order oligomers, due to the slower rate of proteolysis for proteins
in multimeric form. We show that the amplification factor can be readily controlled
by the monomer-dimer dissociation constant, which can in turn be adjusted
by changes of a few amino acids at the dimer interface. Hence we establish
that amplification is another “programmable” feature in the construction
of gene circuits.
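The source of the amplification can be seen from the monomer-dimer equilibrium alone (an illustrative calculation with assumed units, not the full model of B10): with 2M ⇌ D and K_d = M²/D, the dimer concentration scales as M² whenever M ≪ K_d, so a 10-fold change at the monomer level becomes ~100-fold at the dimer level.

```python
import math

# Monomer-dimer equilibrium sketch (parameter values are assumptions):
# total protein T = M + 2D with K_d = M^2 / D; solve for free monomer M
# and return the dimer concentration D.

def dimer(total, kd):
    # 2 M^2/kd + M - total = 0  =>  quadratic in M, positive root:
    m = (-kd + math.sqrt(kd * kd + 8.0 * kd * total)) / 4.0
    return m * m / kd

kd = 1000.0                       # dissociation constant (assumed units)
lo, hi = dimer(1.0, kd), dimer(10.0, kd)
print(hi / lo)                    # ~100-fold output from a 10-fold input change
```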
III. Biomolecular Networks
This section addresses issues of molecular information processing at
the system level. Characterizing and understanding the genetic circuitry
controlling the cell is a major goal in “post-genome” biology. Here we integrate
the biophysical properties of protein-DNA and protein-protein interactions
discussed in Section II using statistical mechanics, and explore the capabilities
of the emerging systems. We will take an “engineering” approach and investigate
how one can put together the molecular components to perform various tasks
central to gene regulation, rather than modeling possible behaviors of specific
existing systems. The reason for taking this unusual approach is not a
lack of interest in existing systems. Rather, accurate modeling of a
system requires detailed molecular information (binding constants, on/off
rates, etc) the knowledge of which is almost always lacking even for some
of the best-characterized systems. On the other hand, one only needs to constrain
parameters within the biologically feasible range in the engineering approach.
More importantly, the engineering approach allows us to address the underlying
rules of gene regulation, which is essential in decoding existing systems.
Finally, in vivo synthesis of various gene regulatory systems is of practical
interest in its own right, with applications ranging from the detection of
complex gene expression patterns to the controlled transcription of therapeutic
genes to cure diseases.
a) Combinatorial transcription control: We first focus on one node
of a gene network and investigate the complexity of the input/output characteristics,
i.e., the complexity of transcription control functions implementable, given
the known biophysical constraints on the molecular components for transcription
regulation. We show in articles B3 and B4 that cis-regulatory systems with
specific protein-DNA interaction and glue-like protein-protein interaction,
supplemented by distal activation or repression mechanisms (e.g., DNA looping),
are general-purpose molecular computers capable of executing a wide range
of control functions encoded in the regulatory DNA sequences. This general
result is established by mapping the system of transcription factors and
their DNA binding sites to a neural network model: transcription regulation
systems satisfying the above conditions belong to the class of “Boltzmann
machines”, which are known to be powerful computing machines. [Note that there
have been many previous attempts to model gene networks using neural networks.
Here we assert that one node of the gene network IS already a neural network,
a molecular implementation of the Boltzmann machine!] Within a mechanistic
model of the bacterial transcription system, we provide recipes to “program”
various control functions, by simply selecting the strengths and positions
of the protein-binding DNA sequences in the regulatory region. The emerging
architecture is naturally modular and highly evolvable. This opens numerous
ways to make synthetic genetic control “devices” with a wide range of bioengineering
applications.
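A minimal thermodynamic-model sketch conveys the flavor of this "programming" (illustrative only, not the model of B3/B4; the site energies and the glue interaction J are the assumed knobs): with two activator sites, the promoter output is a Boltzmann-weighted average over occupation states, and choosing weak sites plus a strong protein-protein glue yields a sharp AND-like gate.

```python
import itertools
import math

# Two-site thermodynamic sketch of combinatorial control (illustrative;
# mu1, mu2 lump TF concentration and site strength, J is the glue energy,
# all in units of kBT -- assumed parameters).

def activity(mu1, mu2, J):
    """Probability that both sites are occupied (AND-like readout)."""
    Z = on = 0.0
    for s1, s2 in itertools.product((0, 1), repeat=2):   # occupation states
        w = math.exp(s1 * mu1 + s2 * mu2 + s1 * s2 * J)  # Boltzmann weight
        Z += w
        if s1 and s2:
            on += w
    return on / Z

# Weak sites + strong glue: ON only when both inputs are present.
print(round(activity(-3, -3, 8), 2))   # both TFs present -> ON (~0.87)
print(round(activity(-3, -9, 8), 2))   # one input absent  -> OFF (~0.02)
```

Retuning the same three numbers (site strengths and glue) reprograms the gate, which is the sense in which the cis-regulatory region acts as a molecular computer.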
The above theory is a quantitative formulation and extension of the “principle
of regulated recruitment” advocated by Mark Ptashne and collaborators. This
principle captures the essential biology of transcription regulation in bacteria
and perhaps also in simple eukaryotes like yeast. The applicability of the
general scheme to higher eukaryotes has been questioned due to potential
problems with the generic, glue-like protein-protein interaction assumed,
since extensive protein-protein interaction may lead to unmanageable cross
talk. Also, for a number of cases studied, transcription control is insensitive
to the precise locations of protein-binding sites, implying that direct protein
contact may not be necessary. We are currently exploring a suggestion due
to Jon Widom that effective interaction between two proteins may be mediated
by DNA-histone interaction prevalent in eukaryotes. This hypothesis predicts
that the positions of some nucleosomes in the regulatory regions are co-localized
with binding sites of certain transcription factors. We are developing bioinformatic
tools (see Section I.e) to detect sequence motifs for nucleosome localization
and thereby test this important hypothesis quantitatively.
More generally, our formulation of transcription control can be taken as
an effective description without regard to the actual molecular implementations.
Because of the generality of our model (as manifested by the mapping to the
Boltzmann machine), we can use it as a tool to relate binding site information
and gene expression data in much more powerful ways than the linear models
being currently used. We are developing this tool to analyze data obtained
from the development of Dictyostelium and Drosophila. The Drosophila work
is being carried out in collaboration with the laboratories of Bill McGinnis
and Ethan Bier at UCSD.
b) Gene circuits and sequential logic: The power of combinatorial
signal control and integration discussed above allows a cell to perform a
large class of computations (e.g., logic functions) with individual genes,
without the need of “cascades” as is routinely done in electronic circuits.
Gene cascades (as in feed forward and feedback circuits) are on the other
hand necessary for the manipulation of temporal information, e.g., to detect
temporal correlations. With the combinatorial integration scheme described in
Section III.a, one can have very compact circuit constructions. For example,
in article B8 we propose a cis-regulatory construct for a write-enabled memory
gate using only one gene with self-feedback. (An equivalent integrated circuit
would require a dozen or so transistors.) With this circuit, a gene can be
used to “memorize” the state of a certain transcription factor (e.g., whether
or not it is phosphorylated) only when triggered by certain other conditions
(e.g., a specific phase of the cell cycle). This memory gate thus enables
the cell to correlate information at different times. Mathematically, the
behavior of our memory gate turns out to be very similar to that of a magnetic
memory device, with the write-enabling agent playing the role of temperature,
which is raised while “writing” and lowered for memory storage. The simplicity
of the regulatory construct has led to a collaborative experimental effort
to synthesize this device in vivo in E. coli. We are also exploring gene
circuits to implement other temporal functions such as differentiation, integration,
and counting.
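The qualitative behavior of such a latch can be sketched with a toy self-activating gene (illustrative dynamics with assumed rate constants, not the construct of B8): cooperative positive feedback makes the gene bistable, an input applied only during a "write" window flips the state, and the state persists after the input is removed.

```python
# Toy one-gene memory sketch (all rates are assumed, illustrative values):
# dx/dt = basal + input(t) + cooperative feedback - decay, integrated by
# the Euler method.  The feedback term makes the gene bistable.

def run(write_input, steps=4000, dt=0.01):
    x = 0.05                                             # start in the OFF state
    for n in range(steps):
        inp = write_input if 1000 <= n < 2000 else 0.0   # "write" window only
        feedback = 4.0 * x**2 / (1.0 + x**2)             # cooperative self-activation
        x += dt * (0.05 + inp + feedback - x)            # basal + input + feedback - decay
    return x

# Without a write pulse the gene stays low; with one, it latches high and
# remains high long after the input is gone.
print(round(run(0.0), 2), round(run(2.0), 2))
```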
c) Spatial patterns of gene expression: Understanding the mechanism
of spatial pattern generation in multicellular organisms is a major challenge
of developmental biology. Drosophila embryogenesis has often been used as
a prototypical system for these studies. It was widely believed that spatial
patterns result from differential gene expression in response to (maternal)
morphogen gradients. However a recent single-embryo experiment by the group
of Stan Leibler cast significant doubts on the morphogen-based mechanism,
since the latter would predict sensitivity of spatial patterns to cellular
protein concentrations, cell size, temperature, etc. while experimentally
the patterns formed are insensitive to any of these factors. Recently, we
have been exploring various nonequilibrium mechanisms of spatial pattern
formation. In our approach, the embryo is treated as a reaction-diffusion
medium. Spatial partition and the subsequent pattern formation process do
not rely on protein concentrations reaching pre-set thresholds. Instead,
they can be controlled temporally in spatio-temporally coupled reaction-diffusion
systems, much as the midpoint of a row of dominoes can be found by tipping
pieces simultaneously from the two ends. Preliminary studies indicate that
the resulting patterns are robust to changes in protein concentrations, cell
size, etc., but can nevertheless be controlled by diffusion constants and
degradation rates which are programmable parameters. Regardless of whether
Drosophila embryogenesis actually utilizes this or another nonequilibrium
mechanism of spatial pattern formation, we believe it is important to inject
the nonequilibrium perspective into the collection of developmental models.
IV. Evolutionary Dynamics
It is often said that any progress in understanding biological systems
will ultimately have to come from evolution. In the engineering approach
to transcription control discussed in Section III, a crucial aspect of the
theory is to identify the “programmable” parameters, i.e., those parameters
of the system such as the binding strengths of various DNA motifs that can
be fine-tuned easily by evolution. In this way, the difficult issue of parameter
selection in modeling is turned into a simple optimization process once the
function of the system is defined. Also in our planned experimental study
of synthetic gene circuits, an important component will be the use of in
vitro directed evolution assays.
This section addresses a number of theoretical issues that arise in the context
of molecular evolution, e.g., effects due to different modes of mutation
and selection. We are well aware of a large body of theoretical studies on
evolution throughout the last century. Most of these studies are phenomenological
in nature, as there was little concrete knowledge at the time on the genotype
and phenotype which mutation and selection work on respectively, nor was
there any practical way of controlling the modes of mutation and selection.
Rapid advances in molecular biotechnology now allow experimentalists to have
very good control over the mutation and selection of populations of molecules
using directed in vitro evolution. Also, the genotypes and phenotypes of
the resulting molecules can be quantitatively characterized in many cases.
It is thus time to reformulate and develop quantitative theories of molecular
evolution, and compare them to experiments.
Ultimately, we would of course like to apply these theories to study processes
that drive the evolution of molecules in cells. But this would require better
knowledge of both the mutation processes in vivo and selection forces in
the wild. We are trying to fill gaps in these areas through bioinformatic
studies as discussed in Section I.b and also below.
a) DNA binding motif: The evolution of protein-binding DNA motifs serves
as a perfect example to illustrate the contact between in vitro molecular
evolution and the phenomenological theory of evolution developed in the past.
As described in article A60, the mean-field theory of DNA binding sequence
evolution is a variant of Eigen's quasi-species theory, with the bases of
the DNA sequence corresponding to Eigen’s “genes”, and the sequence-dependent
affinity to protein being Eigen’s fitness landscape (in the shape of a “mesa”
for this case). We find that the localization-delocalization transition that
occurs in this system can be observed in RNA viruses upon varying the selection
force. For DNA viruses, bacteria and eukaryotes, mutational forces are negligible
for realistic selection forces, and the system is always in the localized
phase. However, the motifs obtained are expected to be maximally fuzzy, i.e.,
the sequences will be at the edge of the mesa landscape, accumulating the
maximal number of allowed mismatches compared to the best-binding (or consensus)
sequence while maintaining affinity to proteins.
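A toy quasi-species calculation illustrates both the localization and the "maximally fuzzy" edge effect (parameters are assumed, not from A60): classify length-L sequences by the number of mismatches r to the best-binding sequence, give mesa fitness 1+s for r ≤ r0 and 1 otherwise, and iterate selection plus single-step mutation to steady state.

```python
import numpy as np

# Toy quasi-species on a mesa landscape (L, r0, s, u are assumed values).
# Mutation is approximated by single-step moves between mismatch classes:
# r -> r+1 at rate u*(L-r), r -> r-1 at rate u*r.

L, r0, s, u = 20, 5, 0.5, 0.002
p = np.ones(L + 1) / (L + 1)                          # uniform initial population
f = np.where(np.arange(L + 1) <= r0, 1.0 + s, 1.0)    # mesa fitness landscape
r = np.arange(L + 1)
for _ in range(2000):
    p = f * p                                         # selection step
    up = u * (L - r) * p                              # r -> r+1 mutation flux
    down = u * r * p                                  # r -> r-1 mutation flux
    q = p - up - down
    q[1:] += up[:-1]
    q[:-1] += down[1:]
    p = q / q.sum()                                   # renormalize
mean_r = float((r * p).sum())
print(round(mean_r, 2))    # localized on the mesa, piled up near the edge r0 = 5
```

The entropic mutation bias (up-flux ∝ L−r) pushes the population toward large r, while selection confines it to the mesa, so the steady state sits against the edge r0.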
In bioinformatic approaches to finding DNA binding motifs, one often pools
all of the known or suspected binding sequences of a particular protein in
a genome together to form a “weight matrix”, uses it to search for additional
binding sites, and updates the matrix recursively. Implicit in this general
approach is the assumption that all binding sequences in the genome are subject
to the same selection landscape (e.g., mesas with the same binding threshold).
However in the theory of protein-DNA interaction (A61) and transcription
regulation (B3 and B4) discussed in Sections II and III, it is critical to
fine-tune the strengths of protein-DNA binding to satisfy different functional
demands for different genes. We also mentioned that this feature of protein-DNA
interaction is supported by recent experimental findings by Uri Alon’s group
on E. coli flagella assembly. If our hypothesis is correct, then the “weight
matrix” approach will need to be substantially modified. We are currently
testing this hypothesis by examining the DNA binding motifs of well-known
transcription factors in “orthologous” regulatory regions across different
species of bacteria.
b) Time-dependent selection: A bacterium possesses many genetic circuits
to regulate its metabolism. Why are certain genes regulated by negative feedback
while others are regulated by positive feedback? Savageau accumulated a large
body of evidence supporting the trend that gene products under low/high
demand are negatively/positively regulated, respectively. A “use-it-or-lose-it”
principle applied to the relevant protein-binding DNA sequence was proposed
to explain this effect qualitatively. Quantitatively, this effect is difficult
to study due to the vast difference between mutational time scales and the
physiological/ecological time scale of demand variations. We recently devised
an analytical approach (B11) to solve this problem. Our theory is based on
Kimura's neutral evolution idea, which predicts the loss of neutral sequences
(during the no-demand periods) due to stochastic genetic drift in finite
populations. We find the loss probability to be very small during any given
no-demand period. But the accumulated loss probability inevitably becomes
significant after many demand cycles, leading to catastrophic extinction
phenomena. However, using reasonable estimates of mutation rates and demand
variations for bacteria, we find this effect to be quantitatively irrelevant
for the very short DNA binding sequences Savageau had in mind.
Nevertheless, such rare extinction effects should be relevant to many of the
genes which are also subject to similar time-dependent selection forces (e.g.,
the lacZ gene of E. coli, which is only needed during the brief periods when
the bacterium relies on lactose as the principal source of nutrients). This
raises a system-level issue of how a cell protects the majority of its genes
which are not constantly in use. We are collaborating with the laboratory
of Lin Chao at UCSD to test aspects of this effect experimentally.
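The compounding argument can be illustrated with a minimal Wright-Fisher simulation (our actual treatment in B11 is analytical; all parameters here are illustrative, not the estimates used in the paper):

```python
import random

def drift_cycle(N, u, gens):
    """One no-demand period: the functional binding sequence is selectively
    neutral. Mutation degrades it at rate u per individual per generation,
    and binomial (Wright-Fisher) resampling models genetic drift in a
    population of size N. Returns True if the functional allele is lost."""
    k = N                                  # individuals carrying a functional sequence
    for _ in range(gens):
        p = (k / N) * (1 - u)              # expected functional fraction after mutation
        k = sum(random.random() < p for _ in range(N))
        if k == 0:
            return True
    return False

def loss_after_cycles(N, u, gens, cycles, trials=100):
    """During each demand period selection purges degraded alleles, so every
    cycle restarts from a fully functional population -- unless the allele
    was lost entirely during a no-demand period, which is irreversible."""
    lost = 0
    for _ in range(trials):
        if any(drift_cycle(N, u, gens) for _ in range(cycles)):
            lost += 1
    return lost / trials

random.seed(0)
p1 = loss_after_cycles(N=40, u=0.02, gens=40, cycles=1)
p20 = loss_after_cycles(N=40, u=0.02, gens=40, cycles=20)
print(p1, p20)   # a small per-cycle loss probability compounds over many cycles
```

With realistic bacterial parameters the per-cycle probability is far smaller than in this toy run, which is why the effect is negligible for short binding sites but can matter for whole genes.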
c) Competitive evolution: In the evolutionary studies above, the fitness
landscape was treated as a passive function specified by the external environment.
Quite often, the evolutionary dynamics is such that the fitness itself depends
on the population being evolved. This is for example the case in the in vitro
directed evolution of DNA binding sequences, where after every round of evolution
one keeps only a small fraction of the better binders from the population.
We formulated and solved many aspects of the nonequilibrium dynamics associated
with such competitive evolution processes. In article B1, we studied analytically
and numerically the competitive dynamics with the mutation process being
point substitutions only. The mean-field evolution equation we formulated
led to a propagating front, describing the progress of the population in the
attribute being competitively selected (e.g., the binding strengths of DNA
sequences to a particular protein). The propagation speed of the front is
not constant but slows down, due to the growing entropic cost of approaching
the optimal solution. These predictions, together with the expected (weak)
correction due to finite population fluctuations were verified by numerical
simulations.
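The qualitative behavior of such a propagating front can be reproduced with a toy stochastic version of the B1 setup (this is a sketch with arbitrary parameters, not the mean-field equation solved in the article):

```python
import random

random.seed(1)
L, POP, MU, KEEP = 50, 200, 0.02, 0.25   # length, population, mutation rate, fraction kept
target = [1] * L
pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(POP)]

def score(s):            # stand-in "binding strength": matches to the target sequence
    return sum(a == b for a, b in zip(s, target))

means = []
for rnd in range(30):
    for s in pop:                        # point substitutions only, as in B1
        for i in range(L):
            if random.random() < MU:
                s[i] ^= 1
    pop.sort(key=score, reverse=True)    # competitive selection: keep top binders
    survivors = pop[: int(POP * KEEP)]
    pop = [list(random.choice(survivors)) for _ in range(POP)]
    means.append(sum(score(s) for s in pop) / POP)

early = means[5] - means[0]              # front speed early in the run
late = means[-1] - means[-6]             # front speed near the optimum
print(means[0], means[-1], early, late)
```

The mean attribute advances like a front and its speed decays as the population nears the optimum, where ever fewer sequences (lower entropy) carry further improvements.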
In article B7, we tackled the more difficult case allowing for recombination
among the evolving sequences. Recombination-based directed evolution (e.g.,
DNA shuffling) is widely used in industry to breed proteins with novel
properties. It is known to work much faster than directed evolution with
substitutions only. Mathematically, the multi-parent, multi-crossover version
of recombination encountered in the directed evolution experiments turns
out to be much simpler to analyze than the traditional two-parent, single-crossover
version of recombination extensively studied in population biology. We were able
to solve the evolution dynamics analytically in certain limiting cases, and
obtain for the general case a scaling solution with a qualitative characterization
of its behavior. We expect our scaling solution to be of use to experimentalists
who need to select a large number of experimental parameters to optimize
the evolution process.
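The multi-parent, multi-crossover operator is easy to state in its many-crossover limit, where each site of an offspring is drawn independently from a randomly chosen parent. A hedged toy sketch (parameters arbitrary, not taken from B7):

```python
import random

random.seed(2)
L, POP = 40, 120

def score(s):                         # matches to an all-ones target sequence
    return sum(s)

def shuffle_offspring(parents):
    """Multi-parent, multi-crossover recombination in the many-crossover
    limit: each site is inherited independently from a random parent,
    mimicking DNA shuffling with many short fragments."""
    return [random.choice(parents)[i] for i in range(L)]

pop = [[random.randint(0, 1) for _ in range(L)] for _ in range(POP)]
best0 = max(score(s) for s in pop)
for rnd in range(15):
    pop.sort(key=score, reverse=True)
    parents = pop[: POP // 4]         # competitive selection: keep the best quarter
    pop = [shuffle_offspring(parents) for _ in range(POP)]
bestT = max(score(s) for s in pop)
print(best0, bestT)   # shuffling rapidly assembles good alleles from different parents
```

Because each site evolves almost independently in this limit, beneficial alleles found in different parents are combined in a single round, which is the intuition behind the speed advantage of shuffling over substitutions alone.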
Experimental studies
Background: The survival and well-being of a cell depend crucially
on its ability to coordinate its gene activities in response to a vast number
of cellular and environmental signals. This is often accomplished combinatorially
through a large number of protein-protein and protein-DNA interactions. A
major goal of post-genome biology is to characterize these interactions and
decipher the complex regulatory circuits/networks they define.
Objective: Instead of directly dissecting the regulatory program of
a specific organism, we investigate different molecular strategies an organism
could adopt to integrate and regulate signals. We proceed by challenging
an organism with a complex set of regulatory tasks and then evolving the
specific molecular components that allow the organism to meet these challenges.
Our initial focus is on combinatorial gene regulation. In a recent theoretical
analysis [see http://matisse.ucsd.edu/~hwa/pub/logic.pdf, PNAS 100: 5136-5141
(2003)], we showed that the bacterial transcription machinery is capable
of implementing an unlimited variety of combinatorial control functions (including
the rather complex ones better known in higher eukaryotes), merely by appropriate
arrangements of the cis-regulatory sequences. In our experimental program,
we start with an approximate cis-regulatory construct based on our theoretical
analysis, and mutate the regulatory sequences in vitro by mutagenic PCR and
DNA shuffling. To “breed” the appropriate regulatory sequences, we insert
the regulatory sequence upstream of a reporter gene in a host bacterium (E. coli),
then apply positive selection for those combinations of signals we wish the
gene to be turned on, and apply negative selection for the other combinations.
The new regulatory sequences are then extracted for further rounds of mutation
and selection until the desired regulatory functions are obtained.
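The kind of combinatorial control function we aim to breed can be illustrated with a toy equilibrium ("recruitment") model in the spirit of the PNAS analysis cited above; the parameter values here are purely illustrative, not fitted to any promoter:

```python
from itertools import product

def promoter_activity(q1, q2, w1=20, w2=20, w12=500, P=0.001):
    """Equilibrium occupancy model of a promoter with two activator sites.
    q1, q2: binding weights of the two TFs (grow with their active
    concentrations); w1, w2: cooperativity of each bound TF with RNA
    polymerase; w12: TF-TF cooperativity; P: bare RNAP binding weight.
    Returns the probability that RNAP occupies the promoter."""
    Z_off = Z_on = 0.0
    for s1, s2 in product((0, 1), repeat=2):       # sum over TF occupancy states
        tf = (q1 ** s1) * (q2 ** s2) * (w12 ** (s1 * s2))
        Z_off += tf                                # RNAP absent
        Z_on += tf * P * (w1 ** s1) * (w2 ** s2)   # RNAP bound
    return Z_on / (Z_off + Z_on)

# AND-like logic: each activator binds too weakly to act alone (q = 0.1),
# but TF-TF cooperativity makes the doubly bound, RNAP-recruiting state dominant.
for c1, c2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(c1, c2, round(promoter_activity(0.1 * c1, 0.1 * c2), 4))
```

Changing only the cis-regulatory arrangement (the q's and w's) converts the same machinery into OR-, NAND-, or more complex gates, which is the degree of freedom our mutation-and-selection scheme explores.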
We plan to apply similar strategies to study combinatorial signal transduction.
We will focus on the better-characterized bacterial two-component system.
Starting with two independent signaling pathways each regulating a distinct
reporter gene, we can select for “cross talk” between the pathways by requiring
that the activation of one pathway turn on both genes. We can also do the
opposite: Starting with two identical pathways regulating two distinct reporter
genes, we can select for the divergence of pathways by, e.g., requiring one
gene to be off while the other is on. (This mimics the evolution of genetic
diversity via gene duplication.) To avoid parasitic solutions, we will evolve
only the interaction domains of the proteins involved.
Significance: The synthetic approach described here gives us quantitative
handles on the specific signals and their responses, since they are the very
challenges we impose on the organism. In this way, we create in vivo models
of complex regulatory systems, which can be used to cement a firm link between
modeling and experiment so crucial to the success of quantitative biology.
Our approach also allows us to quickly read off “solutions” to the challenges
by sequencing the specific molecules, e.g., the cis-regulatory sequences
or the amino acid sequences of the interacting protein pair. We can learn
a great deal of biology from these solutions, since they reflect the diverse
range of strategies available to implement complex regulation. They can be
used as a valuable “training set” to aid bioinformatics efforts to dissect complex
regulatory systems. In the case of two-component signaling for instance,
the different solutions convey much needed information on the specificity
of protein-protein interaction. Finally, the possibility of “programming” genetic
responses at will can lead to potentially lucrative bioengineering and biomedical
applications. For example, bacteria equipped with such regulatory systems
may be used to recognize unique patterns of detectable traits corresponding
to specific chemical pollutants or harmful biological agents. Also, a sensitive
cellular reporter of complex transcription patterns can be invaluable in
the diagnosis of complex diseases.