Genome Sequencing...
             "the sequencing of the Human genome is comparable to establishment
               of the Periodic Table of elements"
...           eric lander (2001).
    Human Genome Project was begun in 1988 - to sequence entire Human Genome
             1st draft of complete Human DNA sequence was published in February 2001
             you are the first generation to be educated during the GENOMIC ERA
   GenBank is the international repository for the 19 billion nucleotide sequenced so far    
 
   Whole genome sequencing:                automated sequencing machines*  
        1st species (gram negative bacillus) sequenced: 
                                   Haemophilus influenzae - 1 circular chromosome of 1.8 mil bp in 1995
        1st eukaryote was Saccharomyces cerevisiae in 1996
        table (as of 2005)* of species sequenced to date:       Genome sizes*           (summer 2002)
  
   DNA Sequencing Protocols & Procedures               
         

      dideoxy method with labeled tags*  &   reading fluorescent tags     -  dideoxy method
 back                                                                                

   
 

 
 

 

 

 

 

    

Strategies for Whole Genome Sequencing
       map-based sequencing - which sequences mapped fragments (1000kb)
       shotgun sequencing - which sequences random fragments (500kb)
           1. genome is cut into 160Kb fragments
           2. fragments inserted into BAC*g  -  bacterial artificial chromosomes
                    Yeast artificial chromosomes
           3. BAC segments are cut in 1000bp pieces
           4. 1K pieces are put into plasmids & plasmids copied
           5. copies are sequenced
           6. pieces are overlapped in sequence creating the entire 160KB  sequenced

 

 

 

 

 

 Understanding the genome sequence
       the genome sequence is like a parts list or dictionary, except that it has no
               punctuation "doormotorwheeltirewindowradiatorfendernutbolt"
       one needs to annotate the sequence by finding gene (coding segments),
               called Open Reading Frames (ORF's)
                   in prokaryote:  look for start (5'ATG'3) and stop (5'TGA'3) codons. Computers
                                          can also search for promoters, operators, regulators.
                   in eukaryotes:  introns & exons present a problem; known sequences of cDNA
                                          probes can be used by computers to spot ORF's.

       Open reading frames: once found one can check sequence against catalogs of known genes.           

 

 

 

 

 
   Problems in Sequences eukaryotic genomes...
       1.  size:
                Humans contain up to 2,000 more DNA bp's than other organisms (table)
                Human genome expected to have 100,000+ genes, but only has 30,000?
                       if proc's have a strong correlation between genome size, gene number,
                           and metabolic capability (# proteins), why don't eukaryotes?
                                   alternative splicing - on average each coding eucaryotic locus
                                               produces 3 distinct mRNA transcripts = 120,000 proteins?

 

 

 

 

 


       2.  repeat sequences:  > 50% human DNA is repeat sequences 
  

                 a.   most is due to transposable elements - segments capable of moving
                       from one location to another in a genome.
                               LINE elements - long interspersed nuclear elements, which hold
                                       a promoter and 2 genes (a reverse transcriptase & an integrase)
  

                  b. simple repeat sequences: makes up 3% of human genome
                               microsatellites - stretches of 1 to 13 bp - mostly dinucleotide ...ACACAC...
                               minisatellites - repeat units of 14 to 500 bp's
                       locations of satellites in chromosomes are highly variable from individual to individual
                       during replication repeat satellites misalign at synapse and DNA polymerase slips
                                   result is every individual has a UNIQUE number of repeats at satellite loci,
                                   which becomes the basis of DNA fingerprinting.
                                 

       3.  split genes:  in humans only 5% of DNA is exons


           

 

 

 

 

 What have we learned of bacterial & archaeal genomes to date (summer 2002)

    1. there is a general correlation between SIZE of genome & metabolic capability
                 parasites have small genomes and require host metabolism (Mycoplasms)
                 non-parasites have large genomes (E. coli & Pseudomonas 10x larger)

    2. many sequenced genes have NO known function
                 38% of E. coli genome's use is unknown

    3. about 15% of each prokaryotic genome is unique to that species

    4. redundancy of sequence is very common
                 E. coli has 86 gene pairs that are identical (diploidy?)

 

 

 


    5. more than one chromosome in proc's is common
                 many prokaryotes has 2 circular DNA molecules

    6. most prokaryotes have plasmids

    7. many prokaryotes are scavengers by acquiring DNA from other species
                 done via Lateral Transfer, i.e.,
           a) proc's take up raw pieces of DNA from environment
           b) Chlamydia bacteria (cause STD's) contain may euc-like genes
           c) pathogenic E.coli have 1,400 or more genes than non-pathogens,
                        that are similar to gene sequences from phage viruses, thus
                       "prokaryotic pathogenic genes may have a viral origin".

 

 

 

 

 

 

 

    Gene Families - groups of genes clustered along the same chromosome
                probably arose from a common ancestral gene sequence through gene duplication
                        especially via unequal crossing over.
                a Pseudogene - a DNA sequence that closely resembles a working gene,
                        but is not transcribed, & has no known function.

back