Overview of molecular biology

INTRODUCTION — Every human began life as a single fertilized egg. This single cell contained all the necessary information to direct development of the various organs and tissues of the body, including germ cells. Understanding the basis of tissue diversity is an ongoing theme of biomedical research. Although many details of this process are unclear, the basic scheme of how tissue-specific functions are established is known.
The basics of molecular biology are reviewed here; this includes the relationship among DNA, RNA, and proteins, as well as the language used in describing molecular and cellular processes. The summarized material is essential to properly understand topics addressed elsewhere within UpToDate. (See "Polymerase chain reaction" and "Repetitive DNA".) Readers who desire additional coverage of any topics discussed here may consult the cited literature, which was chosen to be both sufficiently detailed and written at an appropriate level for physicians.
CELLULAR DIVERSITY AND GENOMIC STABILITY — In general, all cells of an individual possess exactly the same genetic information; this is largely contained in nuclear DNA that is located within 46 discrete chromosomes (22 pairs of autosomes and 1 pair of sex chromosomes).
In contrast to the general constancy of DNA across tissues, the structure and function of different tissues are highly variable. As an example, cardiac muscle clearly differs from skin or liver. These differences in function are achieved by selective activation of genes in each cell type. The mechanisms by which this is accomplished are understood at the general level, but differ in their details in various tissues.
Some exceptions to the constancy of genetic information exist. As examples:
Immunoglobulin and T cell receptor gene rearrangements occur in normal B and T cells, respectively.
The degradation of genetic information, including mutations, chromosomal duplication or loss, and/or rearrangements, is a hallmark of neoplastic disease.
CENTRAL DOGMA OF MOLECULAR BIOLOGY — The discovery of the structures of DNA, RNA, and proteins, and of the genetic code, provides the conceptual framework by which genetic stability and functional diversity are currently understood.
DNA is the information-storing molecule; each molecule is duplicated during the process of replication that accompanies each cell generation [1]. The process by which this information is transferred into cellular function begins when the genetic information residing in DNA is transcribed into messenger RNA (mRNA).
Messenger RNA is then used to direct the translation of genetic information to physiologically active proteins. This is performed by utilizing the information contained within the mRNA sequence to construct a unique polypeptide, which is defined by a linear chain of amino acids. Via the use of molecular machinery, the specific amino acids are placed within the polypeptide as directed by a template defined by unique triplets of bases (codons) found within the mRNA molecule. This process establishes the correspondence between DNA encoded information and its expression in protein via the universal genetic code [1].
Not all RNA molecules function as mRNA. Some act as components of ribosomes, others are involved in RNA splicing, and still others serve as transfer RNA. Finally, some double-stranded RNAs direct targeted degradation of homologous mRNA molecules, preventing their translation into proteins [2].
Exceptions to the central dogma of molecular biology are essential to understanding the biology of viruses, organelles, and bacterial second-site suppression:
Many viruses store their genetic information as RNA rather than as DNA. Several different mechanisms have been incorporated into viral life cycles to accommodate this difference from the biology of their host cells. Retroviruses use reverse transcription to copy their RNA genome into DNA, which integrates into the host genome following infection [3].
The mitochondrial genetic code differs from the universal genetic code in that UGA (see below) is used as a tryptophan codon and not as a termination codon [4,5].
Alteration of the genetic code in bacteria is the mechanism by which suppression of nonsense mutants is achieved [6].
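The mitochondrial exception can be illustrated with a short Python sketch. It assumes the universal codon table written in a common compact one-letter form and models only the UGA difference named above; the names `STANDARD_CODE`, `MITOCHONDRIAL_CODE`, and `translate` are illustrative, the mRNA is hypothetical, and any other differences in the mitochondrial code are deliberately ignored.

```python
from itertools import product

# Universal genetic code in compact form: codons are enumerated with the
# bases in the order U, C, A, G; "*" marks a termination codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Assumption: only the UGA reassignment described in the text is modeled.
MITOCHONDRIAL_CODE = {**STANDARD_CODE, "UGA": "W"}  # W = tryptophan


def translate(mrna: str, code: dict) -> str:
    """Translate from the first base, codon by codon, until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = code[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


message = "AUGUGAGCCUAA"  # hypothetical mRNA: AUG UGA GCC UAA
print(translate(message, STANDARD_CODE))       # M
print(translate(message, MITOCHONDRIAL_CODE))  # MWA
```

Under the universal code the chain ends at UGA, while the mitochondrial reading inserts tryptophan and continues to the next stop codon.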
Essential elements of the central dogma are discussed in each of the following sections. Mechanistic details, however, are not provided.
STRUCTURE OF DNA AND TEMPLATE-DIRECTED NUCLEIC ACID SYNTHESIS — DNA is normally present as an antiparallel polymeric double helix built from four types of nucleotide subunits. Each nucleotide contains one of the following bases: adenine (A), guanine (G), thymine (T), or cytosine (C).
The two strands of the double helix are held together by specific hydrogen bonds that form between A and T (2 hydrogen bonds) or between G and C (3 bonds).
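As a small illustration of the pairing rule, the following Python sketch totals the hydrogen bonds in a duplex from the sequence of one strand. The function name and the example sequence are hypothetical; the per-pair values are the ones stated above.

```python
# Hydrogen bonds contributed by the base pair that each base forms:
# A pairs with T (2 bonds), G pairs with C (3 bonds).
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def count_hydrogen_bonds(strand: str) -> int:
    """Total hydrogen bonds in the duplex formed by `strand` and its complement."""
    return sum(HYDROGEN_BONDS[base] for base in strand.upper())


print(count_hydrogen_bonds("ATGC"))  # 2 + 2 + 3 + 3 = 10
```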
A and G, the larger bases, are purines, while T and C, the smaller bases, are pyrimidines. Double-stranded DNA contains equimolar amounts of purine and pyrimidine. In addition, A and T are present in equimolar amounts, as are G and C [7]. The backbone of the DNA molecule is an alternating copolymer of deoxyribose (a 5-carbon sugar) and phosphate groups, with phosphodiester bonds linking the 5' and 3' carbons of adjacent deoxyribose units.
The hydrogen bonding between the complementary base pairs A and T or G and C provides the chemical basis for DNA's function as the storage medium for genetic information. The genetic information is encoded as the sequence of bases along a DNA strand and is read from the 5' to 3' direction. Since base pairing is specific, it follows that the opposite strands of the molecule carry redundant information, although their sequences are not identical. As a result, given the sequence of a single DNA strand, it is a simple exercise to write down the sequence of its complementary strand.
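The exercise of writing down the complementary strand can be sketched in a few lines of Python. This is a minimal example with a hypothetical sequence and an illustrative function name; both strands are written 5' to 3', so the complement is also reversed.

```python
from collections import Counter

# Pairing rule: A with T, G with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, also written 5' to 3'."""
    return strand.upper().translate(COMPLEMENT)[::-1]


strand = "ATGCCG"  # hypothetical sequence
partner = reverse_complement(strand)
print(partner)  # CGGCAT

# Across both strands of the duplex, A equals T and G equals C.
counts = Counter(strand + partner)
print(counts["A"] == counts["T"], counts["G"] == counts["C"])  # True True
```

The two sequences differ, yet each fully determines the other, which is the sense in which the strands carry redundant information.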
The weakness of hydrogen bonds, which each possess a strength of approximately 2 kcal/mol, is an important feature with regard to nucleic acid function. This weakness allows denaturation, or separation of the DNA strands, to occur at physiologic temperatures. Once the complementary strands are separated, each original strand serves as a template that directs the sequence of a newly synthesized strand, which allows for accurate copying of sequence information.
The website www.accessexcellence.org/RC/VL/GG/dna_replicating.html contains a picture of DNA replicating itself. In general, DNA replication is semi-conservative, in that each daughter molecule contains one old and one newly synthesized strand.
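Semi-conservative replication can be illustrated with a toy bookkeeping model in Python. In this sketch, which is an assumption-laden simplification rather than a model of the enzymology, each strand is represented only by the generation in which it was synthesized, and every duplex is assumed to replicate once per generation.

```python
def replicate(duplexes, generation):
    """Each duplex (a, b) yields two daughters: (a, new) and (b, new)."""
    daughters = []
    for strand_a, strand_b in duplexes:
        daughters.append((strand_a, generation))
        daughters.append((strand_b, generation))
    return daughters


population = [(0, 0)]  # one duplex; both parental strands are labeled 0
for generation in range(1, 4):
    population = replicate(population, generation)

# Count duplexes that still carry one of the original parental strands.
with_parental = sum(1 for duplex in population if 0 in duplex)
print(len(population), with_parental)  # 8 2
```

However many generations pass, exactly two duplexes retain an original strand, because each parental strand is kept intact and paired with a new partner.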
During the S phase of each cell cycle, DNA is replicated by DNA polymerases to provide each daughter cell with a complete genome. The genome is the total genetic complement of an organism. Regulation of the cell cycle and consequences of improper cell cycle regulation are reviewed separately.
There are several important structural differences between DNA and RNA. In general, DNA is double-stranded and RNA is single-stranded. In DNA, the sugar is deoxyribose, while it is ribose in RNA. In DNA, thymine is the pyrimidine complementary to adenine, but uracil replaces thymine in RNA.
Transcription — Template-directed synthesis, in which one strand of DNA provides sequence information, is used in both DNA replication and transcription of DNA to form mRNA. However, some DNA sequences do not encode protein:
Some DNA sequence elements provide control information; they specify the location of an active gene or allow the binding of transcription factors that modulate the rate at which a gene is transcribed. (See "Overview of transcription factors".)
Coding regions of a gene's DNA sequence are characteristically interrupted by introns, noncoding intervening sequences that are spliced out of (removed from) the primary transcript and are therefore absent from mature mRNA. Intron boundaries are marked by splice donor and splice acceptor sites, which provide sequence recognition sites for spliceosomes, the enzymatic ribonucleoprotein complexes that remove introns from the primary transcript to produce mature mRNA.
The 3' end of an mRNA molecule is a tail of adenines that are not present in the DNA, but are added when the transcriptional machinery recognizes a polyadenylation site.
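The effect of splicing and polyadenylation on a primary transcript can be sketched as simple string operations in Python. Everything specific here is hypothetical: the transcript, the intron coordinates (given directly rather than found by recognizing splice sites), the tail length, and the function names.

```python
def splice(primary_transcript: str, introns: list) -> str:
    """Remove introns given as (start, end) pairs: 0-based, end-exclusive,
    and non-overlapping. The remaining exons are joined in order."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(primary_transcript[position:start])
        position = end
    exons.append(primary_transcript[position:])
    return "".join(exons)


def polyadenylate(mrna: str, tail_length: int = 20) -> str:
    """Append a run of adenines that is not encoded in the DNA."""
    return mrna + "A" * tail_length


# Hypothetical primary transcript: exon 1, a 12-base intron, exon 2.
primary = "AUGGCC" + "GUAAGUUUUCAG" + "GAAUAA"
spliced = splice(primary, [(6, 18)])
print(spliced)  # AUGGCCGAAUAA
mature = polyadenylate(spliced)
```

In the cell the boundaries are identified by the spliceosome from the donor and acceptor sites; the sketch only shows what the result looks like once those boundaries are known.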
The transcription initiation complex and various transcription factors recognize sequence signals present in the DNA to identify the presence of an active gene. Local denaturation of the DNA allows RNA polymerase to synthesize an mRNA molecule using one DNA strand (the template, or noncoding, strand) as a template; the resulting RNA therefore has the same sequence as the opposite, coding strand, with uracil in place of thymine. The primary transcript synthesized in this step is spliced and polyadenylated to yield a mature mRNA molecule.
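Template-directed RNA synthesis can be sketched in Python as follows. The sketch assumes the standard convention in which the RNA is complementary to the DNA strand that the polymerase reads (called the template strand here), so that the transcript matches the opposite (coding) strand with U in place of T. The sequences and the function name are hypothetical.

```python
# DNA base on the template strand -> RNA base incorporated opposite it.
DNA_TO_RNA = str.maketrans("ACGT", "UGCA")


def transcribe(template_strand: str) -> str:
    """Return the RNA (5' to 3') made from a template strand given 5' to 3'."""
    return template_strand.upper().translate(DNA_TO_RNA)[::-1]


coding_strand = "ATGGCC"    # hypothetical, written 5' to 3'
template_strand = "GGCCAT"  # its complement, also written 5' to 3'

rna = transcribe(template_strand)
print(rna)  # AUGGCC
print(rna == coding_strand.replace("T", "U"))  # True
```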
There are three major RNA polymerases present in mammalian cells:
RNA polymerase II is responsible for the expression of most genes; the outline of transcription given above applies to this polymerase.
RNA polymerase I functions primarily to transcribe ribosomal RNA.
RNA polymerase III functions primarily to transcribe a variety of small RNAs, such as transfer RNAs (tRNAs) and some of the RNA components of the spliceosomes.
GENETIC CODE AND TRANSLATION — Mature mRNA leaves the nucleus and reaches the ribosomes, where its sequence is recognized and used to direct the synthesis of a polypeptide chain. The ribosomes are complex ribonucleoprotein structures that include the enzymatic machinery for protein synthesis. Protein synthesis is template directed, with the mRNA's sequence information being used to specify the protein's amino acid sequence.
There is a complication concerning the relationship between mRNA and proteins: RNA contains 4 different bases while proteins may contain up to 20 different amino acids (if amino acid modification is excluded). To overcome this numerical difference, the genetic code establishes a correspondence between specific triplets of bases (codons) and specific amino acids. However, since there are 64 ways to combine three bases, the code is redundant: some amino acids are encoded by more than one codon. In addition, there are three codons (UAG, UGA, UAA) that do not encode amino acids; instead, they specify the end of a polypeptide chain.
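The counting argument can be checked with a short Python sketch. It assumes the universal genetic code written in a common compact form (one-letter amino acid symbols, with `*` for a termination codon, listed with the bases in the order U, C, A, G); the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Enumerate all triplets (UUU, UUC, UUA, ...) and pair each with its meaning.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(CODON_TABLE))            # 64 = 4 ** 3 possible codons
print(stop_codons)                 # ['UAA', 'UAG', 'UGA']
print(len(codons_per_amino_acid))  # 20 amino acids
print(codons_per_amino_acid["L"])  # 6 codons for leucine
print(codons_per_amino_acid["W"])  # 1 codon for tryptophan
```

With 61 codons available for 20 amino acids, most amino acids necessarily have more than one codon.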
By convention, the codons are given as the sequence of mRNA, not the sequence of the complementary DNA strand. The nucleic acid sequence is given 5' to 3' and the protein sequence is given N-terminal to C-terminal. These directions correspond to the direction of synthesis.
To synthesize proteins, the following processes must occur sequentially. Beginning with the start codon, each codon is held at the synthetic site in the ribosome; at this site, an amino-acid-charged transfer RNA (tRNA) molecule containing a complementary anticodon base pairs with the mRNA. The carried amino acid is subsequently added to the nascent polypeptide chain. Once the amino acid is added, the ribosome moves processively codon by codon along the mRNA strand, adding an additional amino acid to the polypeptide at each step as dictated by the unique codon. When the ribosome encounters a chain termination, or nonsense, codon it releases both the mRNA and the newly synthesized protein.
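The sequence of steps above can be sketched in Python. This is a simplified illustration: it assumes the universal genetic code in compact one-letter form, treats the first AUG in the message as the start codon, and ignores tRNA charging and ribosome mechanics. The mRNA and the names are hypothetical.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Read codon by codon from the first AUG until a termination codon."""
    start = mrna.find("AUG")  # assumption: first AUG is the start codon
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # chain termination (nonsense) codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical mRNA: GG | AUG GCC GAA UAA | GG
print(translate("GGAUGGCCGAAUAAGG"))  # MAE
```

The result is written N-terminal to C-terminal, matching the 5' to 3' reading of the message.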
The processive movement of the ribosome along the mRNA molecule allows multiple ribosomes to simultaneously synthesize multiple copies of a protein from a single mRNA molecule. This is observed microscopically by the presence of polyribosomes or polysomes, which are tight spatial arrays of ribosomes translating a single mRNA molecule.
IMPLICATIONS OF THE CENTRAL DOGMA FOR MEDICINE — The importance of specific base pairing for the development of a normal organism and/or the maintenance of health cannot be overstated. The mechanisms of template-directed replication and transcription allow preservation of genetic information and its use to encode functional proteins.
Errors in these processes as well as intrinsic properties of this molecular machinery have direct implications for the practice of medicine. As examples:
Errors in replication account for mutations that cause a wide array of diseases, including inherited disorders and malignancies.
Divergence between humans and bacteria in the enzymatic machinery that carries out the functions of transcription and translation provides the molecular targets for an array of antibiotics that are lethal to bacteria but harmless to humans.
Specific base pairing is also central to many routine laboratory methods. Hybridization of DNA or RNA to labeled probes is accomplished by denaturing a specimen and allowing the probe to base pair with the resulting single-stranded nucleic acid. The polymerase chain reaction (PCR) uses cycles of hybridization followed by template-directed DNA synthesis to produce multiple copies of a defined sequence. Recently introduced chip [8-12] and chromosome painting technologies [13,14] are simply refinements of the basic chemistry of hybridization. As these techniques become integrated into clinical laboratory practice, it is useful for practicing physicians to understand the principles that underlie them.
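Both ideas can be sketched in Python under simplifying assumptions. Probe binding is modeled as a perfect match between the probe and the complementary stretch of a single-stranded target, and PCR yield is modeled as idealized growth in which each cycle multiplies the copy number by (1 + efficiency). The function names, the sequences, and the `efficiency` parameter are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the complementary sequence, written 5' to 3'."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def find_probe_sites(target: str, probe: str) -> list:
    """0-based positions on the target (5' to 3') where the probe can
    base pair perfectly with the single-stranded target."""
    site = reverse_complement(probe)
    target = target.upper()
    return [
        i for i in range(len(target) - len(site) + 1)
        if target.startswith(site, i)
    ]


def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized yield: efficiency 1.0 means perfect doubling every cycle."""
    return initial_copies * (1 + efficiency) ** cycles


print(find_probe_sites("TTACGGATTC", "ATCC"))  # [4]
print(pcr_copies(1, 30))  # 1073741824.0, i.e. 2 ** 30
```

Real reactions fall short of perfect doubling, which is why the efficiency is left as an adjustable assumption.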
OTHER LITERATURE AND INFORMATION SOURCES — This topic review only provides a superficial overview of a vast literature. More complete accounts of this material are available from the following books and web sites:
Molecular Cell Biology, 5th edition, by Matthew P Scott, Paul Matsudaira, Harvey Lodish, James Darnell, Lawrence Zipursky, Chris A Kaiser, Arnold Berk, Monty Krieger. W.H. Freeman and Company, 2003. ISBN 0716743663.
Molecular Biology of the Cell, 4th edition, by Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, Peter Walter. Garland Publishing, 2002. ISBN 0815332181.
A Genetic Switch: Phage Lambda Revisited, 3rd edition, by Mark Ptashne. Cold Spring Harbor Laboratory Press, 2004. ISBN 0879697164.
The Massachusetts Institute of Technology's Experimental Study Group has an online Biology Hypertextbook, which includes outstanding text and images. It is located at: web.mit.edu/esgbio/www
The National Health Museum hosts the Access Excellence program, which was initiated by Genentech to provide educational tools for biology and genetic engineering. Its Graphics Gallery includes multiple high-quality images and can be accessed at: www.accessexcellence.org/AB/GG
