Advances In Genome Technology and Bioinformatics

Selected Reading List

The following selected readings include both classic and recent papers. We cannot provide PDFs because of copyright restrictions, but many of the full-text articles are freely available through PubMed (links provided). Publications by TIGR faculty may also be available at http://www.tigr.org/tigr-scripts/publications/listing.pl.

Maxam, A. M. and W. Gilbert (1977). "A new method for sequencing DNA." Proc Natl Acad Sci U S A 74(2): 560-4.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=265521

            DNA can be sequenced by a chemical procedure that breaks a terminally labeled DNA molecule partially at each repetition of a base. The lengths of the labeled fragments then identify the positions of that base. We describe reactions that cleave DNA preferentially at guanines, at adenines, at cytosines and thymines equally, and at cytosines alone. When the products of these four reactions are resolved by size, by electrophoresis on a polyacrylamide gel, the DNA sequence can be read from the pattern of radioactive bands. The technique will permit sequencing of at least 100 bases from the point of labeling.
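
As a rough illustration of how a sequence is read from the band pattern, the sketch below assumes an idealized gel with one lane per reaction and treats each band length as the position of the cleaved base. The function name `read_gel`, the lane labels, and the band positions are hypothetical and chosen only for illustration; real reactions are preferential rather than exclusive, so actual gels need more careful interpretation.

```python
def read_gel(lanes):
    """Read a sequence from idealized Maxam-Gilbert band positions.

    lanes maps a reaction label ("G", "A", "C+T", "C") to the set of
    band positions seen in that lane (an assumed, simplified input).
    """
    positions = sorted(set().union(*lanes.values()))
    seq = []
    for p in positions:
        if p in lanes["G"]:
            seq.append("G")
        elif p in lanes["A"]:
            seq.append("A")
        elif p in lanes["C"]:
            seq.append("C")  # band appears in both the C and C+T lanes
        elif p in lanes["C+T"]:
            seq.append("T")  # band appears in the C+T lane only
    return "".join(seq)


# Hypothetical band positions for a 7-base fragment
example_lanes = {
    "G": {1},
    "A": {2, 6},
    "C+T": {3, 4, 5, 7},
    "C": {5, 7},
}
print(read_gel(example_lanes))
```

A thymine is inferred wherever a band appears in the C+T lane but not in the C lane, which is why four reactions suffice for four bases.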

Sanger, F., S. Nicklen, et al. (1977). "DNA sequencing with chain-terminating inhibitors." Proc Natl Acad Sci U S A 74(12): 5463-7.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=271968

Birnboim, H. C. and J. Doly (1979). "A rapid alkaline extraction procedure for screening recombinant plasmid DNA." Nucleic Acids Res 7(6): 1513-23.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=388356

            A procedure for extracting plasmid DNA from bacterial cells is described. The method is simple enough to permit the analysis by gel electrophoresis of 100 or more clones per day yet yields plasmid DNA which is pure enough to be digestible by restriction enzymes. The principle of the method is selective alkaline denaturation of high molecular weight chromosomal DNA while covalently closed circular DNA remains double-stranded. Adequate pH control is accomplished without using a pH meter. Upon neutralization, chromosomal DNA renatures to form an insoluble clot, leaving plasmid DNA in the supernatant. Large and small plasmid DNAs have been extracted by this method.

Mullis, K. B. and F. A. Faloona (1987). "Specific synthesis of DNA in vitro via a polymerase-catalyzed chain reaction." Methods Enzymol 155: 335-50.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=3431465

Coulson, A., R. Waterston, et al. (1988). "Genome linking with yeast artificial chromosomes." Nature 335(6186): 184-6.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=3045566

            The haploid genome of Caenorhabditis elegans consists of some 80 x 10^6 base pairs of DNA contained in six chromosomes. The large number of interesting loci that have been recognized by mutation, and the accuracy of the genetic map, mean that a physical map of the genome is highly desirable, because it will facilitate the molecular cloning of chosen loci. The first steps towards such a map used a fingerprinting method to link cosmid clones together. This approach reached its practical limit last year, when 90-95% of the genome had been cloned into 17,500 cosmids assembled into some 700 clusters (contigs), but the linking clones needed were either non-existent or extremely rare. Anticipating this, we had planned to link by physical means--probably by hybridization to NotI fragments separated by pulse field gel electrophoresis. NotI recognizes an eight base sequence of GC pairs; thus the fragments should be large enough to bridge regions that clone poorly in cosmids, and, with no selective step involved, would necessarily be fully representative. However, with the availability of a yeast artificial chromosome (YAC) vector, we decided to use this alternative source of large DNA fragments to obtain linkage. The technique involves the ligation of large (50-1,000 kilobase) genomic fragments into a vector that provides centromeric, telomeric and selective functions; the constructs are then introduced into Saccharomyces cerevisiae, and replicate in the same manner as the host chromosomes.

Altschul, S. F., W. Gish, et al. (1990). "Basic local alignment search tool." J Mol Biol 215(3): 403-10.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=2231712

            A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
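
To make the maximal segment pair (MSP) score concrete, the sketch below computes it exactly by brute force: it scans every diagonal of the comparison between two sequences and finds the best-scoring ungapped run. This is not the BLAST heuristic itself, which approximates this quantity far more quickly; the function name `msp_score` and the +5/-4 match/mismatch scores are assumptions made for illustration.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Exact MSP score of two sequences by brute force.

    An MSP is the highest-scoring pair of equal-length, ungapped
    segments, one from each sequence. Each diagonal is scanned with
    a maximum-subarray pass.
    """
    best = 0
    # A diagonal is fixed by the offset d = j - i between positions.
    for d in range(-(len(a) - 1), len(b)):
        running = 0
        i = max(0, -d)
        j = i + d
        while i < len(a) and j < len(b):
            s = match if a[i] == b[j] else mismatch
            running = max(0, running + s)  # restart when the run goes negative
            best = max(best, running)
            i += 1
            j += 1
    return best


print(msp_score("ACGTACGT", "TTACGTAA"))
```

The brute-force scan takes time proportional to the product of the sequence lengths, which is the cost that a faster heuristic search is designed to avoid.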

Hawkins, T. L., T. O'Connor-Morin, et al. (1994). "DNA purification and isolation using a solid-phase." Nucleic Acids Res 22(21): 4543-4.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=7971285

Fleischmann, R. D., M. D. Adams, et al. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd." Science 269(5223): 496-512.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=7542800

            An approach for genome analysis based on sequencing and assembly of unselected pieces of DNA from the whole chromosome has been applied to obtain the complete nucleotide sequence (1,830,137 base pairs) of the genome from the bacterium Haemophilus influenzae Rd. This approach eliminates the need for initial mapping efforts and is therefore applicable to the vast array of microbial species for which genome maps are unavailable. The H. influenzae Rd genome sequence (Genome Sequence DataBase accession number L42023) represents the only complete genome sequence from a free-living organism.

Schena, M., D. Shalon, et al. (1995). "Quantitative monitoring of gene expression patterns with a complementary DNA microarray." Science 270(5235): 467-70.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=7569999

            A high-capacity system was developed to monitor the expression of many genes in parallel. Microarrays prepared by high-speed robotic printing of complementary DNAs on glass were used for quantitative expression measurements of the corresponding genes. Because of the small format and high density of the arrays, hybridization volumes of 2 microliters could be used that enabled detection of rare transcripts in probe mixtures derived from 2 micrograms of total cellular messenger RNA. Differential expression measurements of 45 Arabidopsis genes were made by means of simultaneous, two-color fluorescence hybridization.

Velculescu, V. E., L. Zhang, et al. (1995). "Serial analysis of gene expression." Science 270(5235): 484-7.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=7570003

            The characteristics of an organism are determined by the genes expressed within it. A method was developed, called serial analysis of gene expression (SAGE), that allows the quantitative and simultaneous analysis of a large number of transcripts. To demonstrate this strategy, short diagnostic sequence tags were isolated from pancreas, concatenated, and cloned. Manual sequencing of 1000 tags revealed a gene expression pattern characteristic of pancreatic function. New pancreatic transcripts corresponding to novel tags were identified. SAGE should provide a broadly applicable means for the quantitative cataloging and comparison of expressed genes in a variety of normal, developmental, and disease states.

Devine, S. E., S. L. Chissoe, et al. (1997). "A transposon-based strategy for sequencing repetitive DNA in eukaryotic genomes." Genome Res 7(5): 551-63.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=9149950

            Repetitive DNA is a significant component of eukaryotic genomes. We have developed a strategy to efficiently and accurately sequence repetitive DNA in the nematode Caenorhabditis elegans using integrated artificial transposons and automated fluorescent sequencing. Mapping and assembly tools represent important components of this strategy and facilitate sequence assembly in complex regions. We have applied the strategy to several cosmid assembly gaps resulting from repetitive DNA and have accurately recovered the sequences of these regions. Analysis of these regions revealed six novel transposon-like repetitive elements, IR-1, IR-2, IR-3, IR-4, IR-5, and TR-1. Each of these elements represents a middle-repetitive DNA family in C. elegans containing at least 3-140 copies per genome. Copies of IR-1, IR-2, IR-4, and IR-5 are located on all (or most) of the six nematode chromosomes, whereas IR-3 is predominantly located on chromosome X. These elements are almost exclusively interspersed between predicted genes or within the predicted introns of these genes, with the exception of a single IR-5 element, which is located within a predicted exon. IR-1, IR-2, and IR-3 are flanked by short sequence duplications resembling the target site duplications of transposons. We have established a website database (http://www.welch.jhu.edu/~devine/RepDNAdb.html) to track and cross-reference these transposon-like repetitive elements that contains detailed information on individual element copies and provides links to appropriate GenBank records. This set of tools may be used to sequence, track, and study repetitive DNA in model organisms and humans.

Fraser, C. M. and R. D. Fleischmann (1997). "Strategies for whole microbial genome sequencing and analysis." Electrophoresis 18(8): 1207-16.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=9298642

            The introduction of methods for automated DNA sequence analysis nearly a decade ago, together with more recent advances in the field of bioinformatics, have revolutionized biology and medicine and have ushered in a new era of genomic science, the study of genes and genomes. These new technologies have had an impact on many areas of research, including the association between genes and disease, in DNA-based diagnostics, and in the sequencing of genomes from human and other model organisms. The demonstration in 1995, that automated DNA sequencing methods could be used to decipher the entire genome sequence of a free-living organism, Haemophilus influenzae, was a milestone in both the genomics and microbial fields [1]. Since the first report of the complete sequence of H. influenzae, these methodologies have been adopted by laboratories around the world. The complete genomic sequence of five eubacterial species [1-5], one archaea [6], and the eukaryote, Saccharomyces cerevisiae [7], have been reported in the last 18 months. At the beginning of 1997 more than a dozen microbial genome projects are at or near completion, with many others in progress. It is likely that in the next few years we will see the complete sequence of perhaps as many as 30-40 microbial genomes. In this article, we will review methods for whole genome sequencing and analysis and examine how this information can be exploited to better understand microbial physiology and evolution.

Ewing, B. and P. Green (1998). "Base-calling of automated sequencer traces using phred. II. Error probabilities." Genome Res 8(3): 186-94.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=9521922

            Elimination of the data processing bottleneck in high-throughput sequencing will require both improved accuracy of data processing software and reliable measures of that accuracy. We have developed and implemented in our base-calling program phred the ability to estimate a probability of error for each base-call, as a function of certain parameters computed from the trace data. These error probabilities are shown here to be valid (correspond to actual error rates) and to have high power to discriminate correct base-calls from incorrect ones, for read data collected under several different chemistries and electrophoretic conditions. They play a critical role in our assembly program phrap and our finishing program consed.
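
The per-base error probabilities described here are commonly reported as phred quality values, defined as Q = -10 log10(P), where P is the estimated probability that the base-call is wrong. The sketch below shows the conversion in both directions; the function names are hypothetical and are not part of phred itself.

```python
import math


def error_prob_to_quality(p):
    """Convert an error probability P into a quality value Q = -10 log10(P)."""
    return -10.0 * math.log10(p)


def quality_to_error_prob(q):
    """Convert a quality value Q back into an error probability P."""
    return 10.0 ** (-q / 10.0)


def expected_errors(qualities):
    """Expected number of wrong base-calls in a read, given its quality values."""
    return sum(quality_to_error_prob(q) for q in qualities)


# A quality of 20 corresponds to a 1-in-100 chance of error,
# and a quality of 30 to a 1-in-1000 chance.
print(quality_to_error_prob(20), quality_to_error_prob(30))
```

Summing the per-base probabilities gives an expected error count for a read or region, which is one way such values can serve as an objective criterion in assembly and finishing.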

Ewing, B., L. Hillier, et al. (1998). "Base-calling of automated sequencer traces using phred. I. Accuracy assessment." Genome Res 8(3): 175-85.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=9521921

            The availability of massive amounts of DNA sequence information has begun to revolutionize the practice of biology. As a result, current large-scale sequencing output, while impressive, is not adequate to keep pace with growing demand and, in particular, is far short of what will be required to obtain the 3-billion-base human genome sequence by the target date of 2005. To reach this goal, improved automation will be essential, and it is particularly important that human involvement in sequence data processing be significantly reduced or eliminated. Progress in this respect will require both improved accuracy of the data processing software and reliable accuracy measures to reduce the need for human involvement in error correction and make human review more efficient. Here, we describe one step toward that goal: a base-calling program for automated sequencer traces, phred, with improved accuracy. phred appears to be the first base-calling program to achieve a lower error rate than the ABI software, averaging 40%-50% fewer errors in the data sets examined independent of position in read, machine running conditions, or sequencing chemistry.

Gordon, D., C. Abajian, et al. (1998). "Consed: a graphical tool for sequence finishing." Genome Res 8(3): 195-202.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=9521923

            Sequencing of large clones or small genomes is generally done by the shotgun approach (Anderson et al. 1982). This has two phases: (1) a shotgun phase in which a number of reads are generated from random subclones and assembled into contigs, followed by (2) a directed, or finishing phase in which the assembly is inspected for correctness and for various kinds of data anomalies (such as contaminant reads, unremoved vector sequence, and chimeric or deleted reads), additional data are collected to close gaps and resolve low quality regions, and editing is performed to correct assembly or base-calling errors. Finishing is currently a bottleneck in large-scale sequencing efforts, and throughput gains will depend both on reducing the need for human intervention and making it as efficient as possible. We have developed a finishing tool, consed, which attempts to implement these principles. A distinguishing feature relative to other programs is the use of error probabilities from our programs phred and phrap as an objective criterion to guide the entire finishing process. More information is available at http://www.genome.washington.edu/consed/consed.html.

Lukashin, A. V. and M. Borodovsky (1998). "GeneMark.hmm: new solutions for gene finding." Nucleic Acids Res 26(4): 1107-15.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=9461475

            The number of completely sequenced bacterial genomes has been growing fast. There are computer methods available for finding genes but yet there is a need for more accurate algorithms. The GeneMark.hmm algorithm presented here was designed to improve the gene prediction quality in terms of finding exact gene boundaries. The idea was to embed the GeneMark models into naturally derived hidden Markov model framework with gene boundaries modeled as transitions between hidden states. We also used the specially derived ribosome binding site pattern to refine predictions of translation initiation codons. The algorithm was evaluated on several test sets including 10 complete bacterial genomes. It was shown that the new algorithm is significantly more accurate than GeneMark in exact gene prediction. Interestingly, the high gene finding accuracy was observed even in the case when Markov models of order zero, one and two were used. We present the analysis of false positive and false negative predictions with the caution that these categories are not precisely defined if the public database annotation is used as a control.

Osoegawa, K., P. Y. Woon, et al. (1998). "An improved approach for construction of bacterial artificial chromosome libraries." Genomics 52(1): 1-8.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=9740665

            Presented here are improved methodologies that enable the generation of highly redundant bacterial artificial chromosome/P1-derived artificial chromosome libraries, with larger and relatively uniform insert sizes. Improvements in vector preparation and enhanced ligation conditions reduce the number of background nonrecombinant clones. Preelectrophoresis of immobilized high-molecular-weight DNA removes inhibitors of the cloning process, while sizing DNA fragments twice within a single gel effectively eliminates small restriction fragments, thus increasing the average insert size of the clones. The size-fractionated DNA fragments are recovered by electroelution rather than the more common melting of gel slices with subsequent beta-agarase treatment. Concentration of the ligation products yields a 6- to 12-fold reduction in the number of electroporations required in preparing a library of desirable size. These improved methods have been applied to prepare PAC and BAC libraries from the human, murine, rat, canine, and baboon genomes with average insert sizes ranging between 160 and 235 kb.

Badger, J. H. and G. J. Olsen (1999). "CRITICA: coding region identification tool invoking comparative analysis." Mol Biol Evol 16(4): 512-24.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=10331277

            Gene recognition is essential to understanding existing and future DNA sequence data. CRITICA (Coding Region Identification Tool Invoking Comparative Analysis) is a suite of programs for identifying likely protein-coding sequences in DNA by combining comparative analysis of DNA sequences with more common noncomparative methods. In the comparative component of the analysis, regions of DNA are aligned with related sequences from the DNA databases; if the translation of the aligned sequences has greater amino acid identity than expected for the observed percentage nucleotide identity, this is interpreted as evidence for coding. CRITICA also incorporates noncomparative information derived from the relative frequencies of hexanucleotides in coding frames versus other contexts (i.e., dicodon bias). The dicodon usage information is derived by iterative analysis of the data, such that CRITICA is not dependent on the existence or accuracy of coding sequence annotations in the databases. This independence makes the method particularly well suited for the analysis of novel genomes. CRITICA was tested by analyzing the available Salmonella typhimurium DNA sequences. Its predictions were compared with the DNA sequence annotations and with the predictions of GenMark. CRITICA proved to be more accurate than GenMark, and moreover, many of its predictions that would seem to be errors instead reflect problems in the sequence databases. The source code of CRITICA is freely available by anonymous FTP (rdp.life.uiuc.edu in /pub/critica) and on the World Wide Web (http://rdpwww.life.uiuc.edu).

Delcher, A. L., D. Harmon, et al. (1999). "Improved microbial gene identification with GLIMMER." Nucleic Acids Res 27(23): 4636-41.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=10556321

            The GLIMMER system for microbial gene identification finds approximately 97-98% of all genes in a genome when compared with published annotation. This paper reports on two new results: (i) significant technical improvements to GLIMMER that improve its accuracy still further, and (ii) a comprehensive evaluation that demonstrates that the accuracy of the system is likely to be higher than previously recognized. A significant proportion of the genes missed by the system appear to be hypothetical proteins whose existence is only supported by the predictions of other programs. When the analysis is restricted to genes that have significant homology to genes in other organisms, GLIMMER misses <1% of known genes.

Tettelin, H., D. Radune, et al. (1999). "Optimized multiplex PCR: efficiently closing a whole-genome shotgun sequencing project." Genomics 62(3): 500-7.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=10644449

            A new method has been developed for rapidly closing a large number of gaps in a whole-genome shotgun sequencing project. The method employs multiplex PCR and a novel pooling strategy to minimize the number of laboratory procedures required to sequence the unknown DNA that falls in between contiguous sequences. Multiplex sequencing, a novel procedure in which multiple PCR primers are used in a single sequencing reaction, is used to interpret the multiplex PCR results. Two protocols are presented, one that minimizes pipetting and another that minimizes the number of reactions. The pipette optimized multiplex PCR method has been employed in the final phases of closing the Streptococcus pneumoniae genome sequence, with excellent results.

Ashburner, M., C. A. Ball, et al. (2000). "Gene ontology: tool for the unification of biology. The Gene Ontology Consortium." Nat Genet 25(1): 25-9.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=10802651

Carninci, P., Y. Shibata, et al. (2000). "Normalization and subtraction of cap-trapper-selected cDNAs to prepare full-length cDNA libraries for rapid discovery of new genes." Genome Res 10(10): 1617-30.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=11042159

            In the effort to prepare the mouse full-length cDNA encyclopedia, we previously developed several techniques to prepare and select full-length cDNAs. To increase the number of different cDNAs, we introduce here a strategy to prepare normalized and subtracted cDNA libraries in a single step. The method is based on hybridization of the first-strand, full-length cDNA with several RNA drivers, including starting mRNA as the normalizing driver and run-off transcripts from minilibraries containing highly expressed genes, rearrayed clones, and previously sequenced cDNAs as subtracting drivers. Our method keeps the proportion of full-length cDNAs in the subtracted/normalized library high. Moreover, our method dramatically enhances the discovery of new genes as compared to results obtained by using standard, full-length cDNA libraries. This procedure can be extended to the preparation of full-length cDNA encyclopedias from other organisms.

Myers, E. W., G. G. Sutton, et al. (2000). "A whole-genome assembly of Drosophila." Science 287(5461): 2196-204.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=10731133

            We report on the quality of a whole-genome assembly of Drosophila melanogaster and the nature of the computer algorithms that accomplished it. Three independent external data sources essentially agree with and support the assembly's sequence and ordering of contigs across the euchromatic portion of the genome. In addition, there are isolated contigs that we believe represent nonrepetitive pockets within the heterochromatin of the centromeres. Comparison with a previously sequenced 2.9-megabase region indicates that sequencing accuracy within nonrepetitive segments is greater than 99.99% without manual curation. As such, this initial reconstruction of the Drosophila sequence should be of substantial value to the scientific community.

Pearson, W. R. (2000). "Flexible sequence similarity searching with the FASTA3 program package." Methods Mol Biol 132: 185-219.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=10547837

            The FASTA3 and FASTA2 packages provide a flexible set of sequence-comparison programs that are particularly valuable because of their accurate statistical estimates and high-quality alignments. Traditionally, sequence similarity searches have sought to ask one question: "Is my query sequence homologous to anything in the database?" Both FASTA and BLAST can provide reliable answers to this question with their statistical estimates; if the expectation value E is < 0.001-0.01 and you are not doing hundreds of searches a day, the answer is probably yes. In general, the most effective search strategies follow these rules: 1. Whenever possible, compare at the amino acid level, rather than the nucleotide level. Search first with protein sequences (blastp, fasta3, and ssearch3), then with translated DNA sequences (fastx, blastx), and only at the DNA level as a last resort (Table 5). 2. Search the smallest database that is likely to contain the sequence of interest (but it must contain many unrelated sequences for accurate statistical estimates). 3. Use sequence statistics, rather than percent identity or percent similarity, as your primary criterion for sequence homology. 4. Check that the statistics are likely to be accurate by looking for the highest-scoring unrelated sequence, using prss3 to confirm the expectation, and searching with shuffled copies of the query sequence [randseq, searches with shuffled sequences should have E approx 1.0]. 5. Consider searches with different gap penalties and other scoring matrices. Searches with long query sequences against full-length sequence libraries will not change dramatically when BLOSUM62 is used instead of BLOSUM50 (20), or a gap penalty of -14/-2 is used in place of -12/-2. However, shallower or more stringent scoring matrices are more effective at uncovering relationships in partial sequences (3,18), and they can be used to sharpen dramatically the scope of the similarity search. 
However, as illustrated in the last section, the E value is only the first step in characterizing a sequence relationship. Once one has confidence that the sequences are homologous, one should look at the sequence alignments and percent identities, particularly when searching with lower quality sequences. When sequence alignments are very short, the alignment should become more significant when a shallower scoring matrix is used, e.g., BLOSUM62 rather than BLOSUM50 (remember to change the gap penalties). Homology can be reliably inferred from statistically significant similarity. Whereas homology implies common three-dimensional structure, homology need not imply common function. Orthologous sequences usually have similar functions, but paralogous sequences often acquire very different functional roles. Motif databases, such as PROSITE (21), can provide evidence for the conservation of critical functional residues. However, motif identity in the absence of overall sequence similarity is not a reliable indicator of homology.

Stekel, D. J., Y. Git, and F. Falciani (2000). "The comparison of gene expression from multiple cDNA libraries." Genome Res 10: 2055-61.

Das, M., I. Harvey, et al. (2001). "Full-length cDNAs: more than just reaching the ends." Physiol Genomics 6(2): 57-80.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=11459922

            The development of functional genomic resources is essential to understand and utilize information generated from genome sequencing projects. Central to the development of this technology is the creation of high-quality cDNA resources and improved technologies for analyzing coding and noncoding mRNA sequences. The isolation and mapping of cDNAs is an entree to characterizing the information that is of significant biological relevance in the genome of an organism. However, a bottleneck is often encountered when attempting to bring to full-length (or at least full-coding) a number of incomplete cDNAs in parallel, since this involves the nonsystematic, time consuming, and labor-intensive iterative screening of a number of cDNA libraries of variable quality and/or directed strategies to process individual clones (e.g., 5' rapid amplification of cDNA ends). Here, we review the current state of the art in cDNA library generation, as well as present an analysis of the different steps involved in cDNA library generation.

Lander, E. S., L. M. Linton, et al. (2001). "Initial sequencing and analysis of the human genome." Nature 409(6822): 860-921.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=11237011

            The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence.

Venter, J. C., M. D. Adams, et al. (2001). "The sequence of the human genome." Science 291(5507): 1304-51.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=11181995

            A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies--a whole-genome assembly and a regional chromosome assembly--were used, each combining sequence data from Celera and the publicly funded genome effort. The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history.
Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.

 

Wendl, M. C., M. A. Marra, et al. (2001). "Theories and applications for sequencing randomly selected clones." Genome Res 11(2): 274-80.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=11157790

            Theory is developed for the process of sequencing randomly selected large-insert clones. Genome size, library depth, clone size, and clone distribution are considered relevant properties and perfect overlap detection for contig assembly is assumed. Genome-specific and nonrandom effects are neglected. Order of magnitude analysis indicates library depth is of secondary importance compared to the other variables, especially as clone size diminishes. In such cases, the well-known Poisson coverage law is a good approximation. Parameters derived from these models are used to examine performance for the specific case of sequencing random human BAC clones. We compare coverage and redundancy rates for libraries possessing uniform and nonuniform clone distributions. Results are measured against data from map-based human-chromosome-2 sequencing. We conclude that the map-based approach outperforms random clone sequencing, except early in a project. However, simultaneous use of both strategies can be beneficial if a performance-based estimate for halting random clone sequencing is made. Results further show that the random approach yields maximum effectiveness using nonbiased rather than biased libraries.
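
As a rough illustration of the Poisson coverage law mentioned in the abstract above, the sketch below computes the expected fraction of a genome covered at a given sequencing depth. It is not taken from the paper: the function names are hypothetical, and the model assumes idealized, uniformly random placement of reads or clones, ignoring edge effects, minimum overlap requirements, repeats, and cloning bias. The depths 5.11 and 8.0 are simply the coverage figures quoted in the human genome abstract earlier in this list.

```python
import math


def coverage_depth(n_reads, read_length, genome_size):
    """Average depth c = N * L / G (hypothetical helper for illustration)."""
    return n_reads * read_length / genome_size


def expected_covered_fraction(depth):
    """Poisson approximation: a base is hit by zero reads with probability
    exp(-c), so the expected covered fraction is 1 - exp(-c)."""
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return 1.0 - math.exp(-depth)


def expected_uncovered_bases(depth, genome_size):
    """Expected number of bases left unsequenced under the same model."""
    return genome_size * math.exp(-depth)


if __name__ == "__main__":
    # Example depths; 5.11 and 8.0 are the values quoted in the
    # human genome abstract, used here only as inputs to the idealized model.
    for depth in (1.0, 5.11, 8.0):
        fraction = expected_covered_fraction(depth)
        print(f"depth {depth:5.2f}: expected covered fraction {fraction:.5f}")
```

Under this model the uncovered fraction falls off exponentially with depth, which is one way to see why raising the effective coverage reduces the number and size of gaps.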

 

Stein, L. D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day, A., Nickerson, E., Stajich, J. E., Harris, T. W., Arva, A., and Lewis, S. (2002) "The generic genome browser: a building block for a model organism system database." Genome Res 12, 1599-1610

 

Bateman, A., E. Birney, et al. (2002). "The Pfam protein families database." Nucleic Acids Res 30(1): 276-80.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=11752314

            Pfam is a large collection of protein multiple sequence alignments and profile hidden Markov models. Pfam is available on the World Wide Web in the UK at http://www.sanger.ac.uk/Software/Pfam/, in Sweden at http://www.cgb.ki.se/Pfam/, in France at http://pfam.jouy.inra.fr/ and in the US at http://pfam.wustl.edu/. The latest version (6.6) of Pfam contains 3071 families, which match 69% of proteins in SWISS-PROT 39 and TrEMBL 14. Structural data, where available, have been utilised to ensure that Pfam families correspond with structural domains, and to improve domain-based annotation. Predictions of non-domain regions are now also included. In addition to secondary structure, Pfam multiple sequence alignments now contain active site residue mark-up. New search tools, including taxonomy search and domain query, greatly add to the functionality and usability of the Pfam resource.

Batzoglou, S., D. B. Jaffe, et al. (2002). "ARACHNE: a whole-genome shotgun assembler." Genome Res 12(1): 177-89.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=11779843

            We describe a new computer system, called ARACHNE, for assembling genome sequence using paired-end whole-genome shotgun reads. ARACHNE has several key features, including an efficient and sensitive procedure for finding read overlaps, a procedure for scoring overlaps that achieves high accuracy by correcting errors before assembly, read merger based on forward-reverse links, and detection of repeat contigs by forward-reverse link inconsistency. To test ARACHNE, we created simulated reads providing approximately 10-fold coverage of the genomes of H. influenzae, S. cerevisiae, and D. melanogaster, as well as human chromosomes 21 and 22. The assemblies of these simulated reads yielded nearly complete coverage of the respective genomes, with a small number of contigs joined into a smaller number of supercontigs (or scaffolds). For example, analysis of the D. melanogaster genome yielded approximately 98% coverage with an N50 contig length of 324 kb and an N50 supercontig length of 5143 kb. The assembly accuracy was high, although not perfect: small errors occurred at a frequency of roughly 1 per 1 Mb (typically, deletion of approximately 1 kb in size), with a very small number of other misassemblies. The assembly was rapid: the Drosophila assembly required only 21 hours on a single 667 MHz processor and used 8.4 Gb of memory.
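
For readers unfamiliar with the N50 statistic quoted in the abstract above, the following minimal sketch shows one common way to compute it: sort the contig lengths from longest to shortest and report the length at which the running total first reaches half of the assembled bases. The function name and the toy contig lengths are hypothetical, and conventions differ slightly between tools (for example, whether the threshold must be reached or exceeded), so this should be read as an illustration, not as the paper's own procedure.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    N50 here is the largest length L such that contigs of length >= L
    together contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    if any(length <= 0 for length in lengths):
        raise ValueError("contig lengths must be positive")

    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2

    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Toy example with made-up contig lengths (in kb).
    contigs_kb = [100, 80, 50, 30, 20, 20]
    print("N50 (kb):", n50(contigs_kb))
```

Because the statistic depends only on the length distribution, it says nothing about whether the contigs are correct, which is why the abstract reports misassembly rates separately.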

Ruijter, J. M., Van Kampen, A. H., and Baas, F. (2002) "Statistical evaluation of SAGE libraries: consequences for experimental design." Physiol Genomics 11, 37-44

 

Saha, S., Sparks, A. B., Rago, C., Akmaev, V., Wang, C. J., Vogelstein, B., Kinzler, K. W., and Velculescu, V. E. (2002) "Using the transcriptome to annotate the genome." Nat Biotechnol 20, 508-512

 

Eisen, J. A. and C. M. Fraser (2003). "Phylogenomics: intersection of evolution and genomics." Science 300(5626): 1706-7.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=12805538

            Much has been gained from genomic and evolutionary studies of species. Combining the perspectives of these different approaches suggests that an integrated phylogenomic approach will be beneficial.

Salzberg, S., E. Birney, et al. (2003). "Unrestricted free access works and must continue." Nature 422(6934): 801.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=12712164  

           

Doolan DL, Aguiar JC, Weiss WR, Sette A, Felgner PL, Regis DP, Quinones-Casas P, Yates JR 3rd, Blair PL, Richie TL, Hoffman SL, Carucci DJ. (2003 Nov) "Utilization of genomic sequence information to develop malaria vaccines." J Exp Biol. 206(Pt 21):3789-802. Review. PMID: 14506214

 

Jaffe, D. B., Butler, J., Gnerre, S., Mauceli, E., Lindblad-Toh, K., Mesirov, J. P., Zody, M. C., and Lander, E. S. (2003) "Whole-genome sequence assembly for mammalian genomes: Arachne 2." Genome Res 13, 91-96

 

Pleasance, E. D., Marra, M. A., and Jones, S. J. (2003) "Assessment of SAGE in Transcript Identification." Genome Res

 

Skuce PJ, Yaga R, Lainson FA, Knox DP. (2005 May) "An evaluation of serial analysis of gene expression (SAGE) in the parasitic nematode, Haemonchus contortus." Parasitology. 130(Pt 5): 553-9. PMID: 15991498

 

Wheeler DB, Carpenter AE, Sabatini DM. (2005 Jun) "Cell microarrays and RNA interference chip away at gene function." Nat Genet. 37 Suppl:S25-30. Review. PMID: 15920526

 

Chen K, Pachter L. (2005 Jul) "Bioinformatics for whole-genome shotgun sequencing of microbial communities." PLoS Comput Biol. 1(2):e24. PMID: 16110337

 

Makarova KS, Koonin EV. (2005) "Evolutionary and functional genomics of the Archaea." Curr Opin Microbiol. 2005 Aug 16 PMID: 16111915

 

Rong J, Bowers JE, Schulze SR, Waghmare VN, Rogers CJ, Pierce GJ, Zhang H, Estill JC, Paterson AH. "Comparative genomics of Gossypium and Arabidopsis: Unraveling the consequences of both ancient and recent polyploidy." Genome Res. 2005 Aug 18; PMID: 16109973

 


 
