Whole Genome Assemblies of the Drosophila and Human Genomes.

 

GENE MYERS

Celera Genomics

 

Shotgun sequence assembly is a classic inverse problem: given a set of segments randomly sampled from a target sequence, reconstruct the target.  Early programs assisted a user by finding potentially overlapping segments, which were then assembled by hand.  As the programs grew more sophisticated, the software came to solve the problem completely, although users still followed it with a manual curatorial pass.  Until 1995 it was believed that the practical limit on the size of problems that could be solved was on the order of 30 to 50 Kbp, due to the intrinsic difficulties posed by repetitive sequence in the target.  In 1995 the assembly of a whole genome shotgun dataset for H. influenzae dispelled the notion of such a barrier.  While the process involved significant human curation, and bacterial genomes are less repetitive than those of higher organisms, it still portended an economy of effort unmatched by the more laborious map-based approaches then being pursued for large genomes.  In 1996, Weber and Myers proposed a whole genome shotgun approach for the human genome, suggesting a protocol that involved sampling several individuals in order to simultaneously obtain polymorphism information.  Critics claimed that the computation would require an impossible amount of computer time, that the size and repetitiveness of the genome would confound all attempts at assembly even if sufficient computational efficiency were achieved, and that even if an assembly were produced it would be partial and of extremely poor quality.
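To make the inverse problem concrete, the sketch below shows a toy version of the overlap-and-merge idea: find the longest suffix-prefix overlap between any two segments and merge them, repeating until no overlaps remain.  This is only an illustration under simplifying assumptions (error-free reads, a single strand, no repeats); it is not the Celera assembler, and the names overlap, greedy_assemble and min_len are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no overlap of at least min_len characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of segments with the longest overlap.

    Toy assumptions: reads are error-free, come from one strand, and the
    target has no repeats long enough to create false overlaps.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    target = "ATGGCGTACGTTAGC"
    reads = [target[0:8], target[5:12], target[9:15]]
    print(greedy_assemble(reads))
```

A repeat in the target longer than the overlap threshold can make two unrelated segments appear to overlap, which is the kind of difficulty the size limit mentioned above was attributed to.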

 

In 1999 the informatics research team at Celera produced an assembly of the Drosophila genome from a whole genome shotgun dataset consisting of 3.2 million reads, 72% of which were paired-end reads from 2 Kbp and 10 Kbp inserts in a 1 to 1.32 mix.  The assembly consisted of completely ordered and oriented contigs covering an estimated 97.2% of the genome, with only 1,630 gaps of average size 1,415 bp.  The smaller gaps were closed by PCR by the Berkeley Drosophila project in the three months following the publication of the assembly, and the remaining gaps were closed in the ensuing six months.  The assembly is consistent with STS maps and physical clone maps, and was compared against 24% of the genome that had been independently sequenced by other groups.  The sequence-level comparison revealed that the sequence is better than 99.998% accurate within non-repetitive regions of the assembly and 99.62% accurate within repetitive constructs.  The basic conclusion is that whole genome assembly is not only feasible but produces a high-quality result that requires comparatively little finishing work.

 

In this talk, we will cover the approaches to sequencing whole genomes, illustrate the key computational steps of Celera’s whole genome assembler in an attempt to explain what the critics didn’t understand, and describe our current strategies and progress towards a penultimate assembly of the human genome.