The first draft
The human genome unveiled
The completion of the first draft of the human genome is an awesome technological feat. How did the researchers of the Human Genome Project consortium go about the task?
When the Human Genome Project was launched in 1990, decoding the ‘book of life’ was a controversial and far-off goal. But now, with the announcement on 26 June that 90 per cent of the human genome - the ‘working draft’ - is in the public databases, the main chapters of the book have been deciphered. Not a bedtime read maybe, but the first draft of the human genome sequence gives researchers access to an invaluable medical reference book. For the next three years, the Human Genome Project will tackle an even more challenging task - filling in the missing paragraphs and rigorously checking the spelling and grammar to produce the final ‘gold standard’ sequence.
The acceleration in human genome sequencing has been remarkable. The original target date for completion was 2005, but in 1998 the timetable was revised to 2003, with a ‘first draft’ sequence to be produced in 2001. Even in October 1998, only about 6 per cent of the human genome had been completed. And then in 1999, with researchers hungry for useful sequence and lots of it, the publicly funded Human Genome Project announced that the draft would be produced in 2000.
At first, some researchers were worried that a draft sequence would lower standards, but these concerns were allayed by the rapid flow of useful information into the DNA databases - all sequence is released onto the Internet within 24 hours - and by the ‘tasters’ of the final product provided by the publication of chromosomes 22 and 21 in December 1999 and May 2000, respectively. Any worries were put to rest when, on 26 June, the Human Genome Project consortium announced that the milestone of the working draft had been reached.
What is the first draft?
The working draft covers about 90 per cent of the euchromatic part of the genome (which contains most of the genes), has some gaps in the sequence and an error rate of about 1 in 1000. Each part of the working draft has been sequenced four or five times. To reach the ‘gold standard’, the sequencing is repeated about ten times - producing a final sequence with almost no gaps and an error rate of less than 1 in 10 000. Indeed, the largest sequencing factories such as the Sanger Centre achieve error rates of fewer than one mistake in 100 000 bases of finished sequence.
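To put these error rates side by side, the short sketch below converts each quoted figure into expected errors per megabase and a percentage accuracy. It is plain arithmetic on the numbers above; the function and variable names are illustrative, and the two finished-sequence figures are upper bounds in the text ("less than", "fewer than") that are treated here as exact values.

```python
# Error rates quoted in the text, written as "one error per N bases".
# The last two are upper bounds in the article; they are used as exact
# values here purely for illustration.
BASES_PER_ERROR = {
    "working draft": 1_000,
    "gold standard": 10_000,
    "best finished sequence": 100_000,
}


def expected_errors(n_bases, bases_per_error):
    """Expected number of wrong bases in a stretch of n_bases."""
    return n_bases / bases_per_error


def accuracy_percent(bases_per_error):
    """Per-base accuracy as a percentage."""
    return 100 * (1 - 1 / bases_per_error)


for label, n in BASES_PER_ERROR.items():
    per_megabase = expected_errors(1_000_000, n)
    print(f"{label}: about {per_megabase:.0f} errors per megabase, "
          f"{accuracy_percent(n):.3f} per cent accurate")
```

On these figures, moving from the working draft to the gold standard means going from roughly a thousand errors per megabase to roughly a hundred.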
The method used for much of the genome is ‘shotgun sequencing’, which, in essence, involves breaking the genome up into conveniently sized chunks. The total size of the human genome is estimated to be about 3 billion base pairs, arrayed in 23 chromosomes. The chromosomes themselves are 50-250 megabases (million bases) long, too large to be sequenced directly (automated machines sequence fragments of between 400 and 700 bases), so the Human Genome Project fragments them into chunks of about 150 kilobases. Each of these large clones is then ‘shotgunned’ - broken into pieces of perhaps 1500 base pairs, either by enzymes or by physical shearing - and the fragments are sequenced separately. Shotgunning the original large clone randomly several times ensures that some of the fragments will overlap; computers then analyse the sequences of these small fragments, looking for end sequences that overlap - indicating neighbouring fragments - and assembling the original sequence of the clone.
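The overlap-and-merge idea can be sketched with a toy greedy assembler. This is a minimal illustration under stated assumptions, not the software used by the Human Genome Project: the 40-base ‘clone’ is made up, the reads are 10 bases long rather than hundreds, they are taken at a regular 5-base spacing (real shotgun fragments start at random positions), and all function names are hypothetical.

```python
import random

# Made-up 40-base "clone". Every 4-base window in it is unique, so any
# overlap of 4 or more bases between two reads is a genuine one.
CLONE = "ATGCGTACCTAGGATTCAACGGTCTTGAGCCATCGAAGTG"
READ_LEN = 10


def overlap(a, b, min_len):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=4):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates
    # Drop reads that are wholly contained in another read.
    contigs = [r for r in contigs
               if not any(r != o and r in o for o in contigs)]
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left that overlaps: a gap remains
        merged = contigs[best_i] + contigs[best_j][best_n:]
        rest = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs = [c for c in rest if c not in merged] + [merged]
    return contigs


# Overlapping reads every 5 bases, then shuffled so that the original
# order is lost, as it is when a clone is shotgunned.
reads = [CLONE[s:s + READ_LEN] for s in range(0, len(CLONE) - READ_LEN + 1, 5)]
random.Random(0).shuffle(reads)

contigs = greedy_assemble(reads)
print(len(contigs), "contig(s); clone recovered:", contigs == [CLONE])
```

If the reads did not cover the whole clone, the loop would stop with more than one contig, which is the toy equivalent of a gap in the draft.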
An alternative approach, ‘whole-genome shotgun sequencing’, was first used in 1982 by the inventor of shotgun sequencing, Fred Sanger, while working on phages (viruses of bacteria). In this technique, which has been used by the commercial company Celera Genomics, the whole genome is broken into small fragments that can be sequenced then reassembled. Although this approach can be highly automated and efficient - and has been very successful for the sequencing of the genomes of microorganisms and the fruitfly Drosophila - reassembling the fragments from the human genome is far more difficult and requires powerful computers. Not only is it not clear where in the 3 billion base pairs a particular fragment might belong, but this approach also faces the problem that the human genome contains many repetitive regions (short stretches of DNA repeated many times) and many repeats exceed the length of a single sequence read - confusing the computer programs trying to piece together the sequences.
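Why repeats longer than a read are so troublesome can be shown with a small, made-up example. The sketch below builds two different toy ‘genomes’ that share a 12-base repeat occurring three times, with the two stretches between the repeats swapped. All sequences and names are invented for illustration. For reads no longer than the repeat plus one base, the two genomes yield exactly the same collection of reads, so no program could tell them apart from the reads alone; reads long enough to span the whole repeat plus a base on each side do distinguish them.

```python
from collections import Counter

# Invented sequences: one 12-base repeat and four unique flanking stretches.
REPEAT = "GCATTACAATCG"
A, B, C, D = "ACTTCA", "TTCAT", "AACCT", "TACA"

# Two different genomes: the stretches between the repeats are swapped.
genome_1 = A + REPEAT + B + REPEAT + C + REPEAT + D
genome_2 = A + REPEAT + C + REPEAT + B + REPEAT + D


def read_spectrum(genome, read_len):
    """Count every read of length read_len that could be drawn from genome."""
    return Counter(genome[i:i + read_len]
                   for i in range(len(genome) - read_len + 1))


short_len = len(REPEAT) + 1  # a read cannot span the repeat with a base either side
long_len = len(REPEAT) + 2   # a read can span the repeat plus one base either side

same_with_short = read_spectrum(genome_1, short_len) == read_spectrum(genome_2, short_len)
same_with_long = read_spectrum(genome_1, long_len) == read_spectrum(genome_2, long_len)

print("genomes identical:", genome_1 == genome_2)
print("indistinguishable from short reads:", same_with_short)
print("indistinguishable from long reads:", same_with_long)
```

Knowing in advance where each large clone sits on the genome sidesteps much of this ambiguity, because only repeats within a single clone can then be confused with one another.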
The Human Genome Project gets around this problem by using the shotgun method on large clones (bacterial artificial chromosomes) that have been previously mapped so that it is known exactly where they are located on the genome. Reassembling the sequenced fragments to reflect their original position in the genome is thus easier and more accurate. Nevertheless, assembled shotgun data may still contain gaps and ambiguities (problems that the whole-genome shotgun approach can never resolve). So, in the ‘finishing’ phase, which is more labour intensive than the shotgun phase, the gaps are filled in and discrepancies resolved.
To improve the accuracy to the end goal of 99.99 per cent, each portion of the sequence has to be sequenced about ten times. So the finishing phase involves further sequencing from the clones used in the preparation of the working draft, closing gaps and resolving ambiguities. Using different cloning strategies or different clones from the same region can help to finish off difficult parts of the genome, but in many cases shotgun sequencing cannot provide the answers. Researchers turn instead to ‘directed sequencing’, a laborious process where the clone is sequenced step by step, from end to end.
When is a chromosome finished? The international consortium has agreed on three criteria: more than 95 per cent of the chromosome must be sequenced, the number, location and size of remaining gaps must be pinned down, and individual gaps must be shorter than about 150 000 bases. The first two chromosomes to be finished surpassed these criteria. The 33.5 megabase sequence of chromosome 21 covers 99.7 per cent of the long arm, has only ten gaps totalling 100 000 bases, and has an estimated accuracy of 99.995 per cent.
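The three criteria can be written down as a simple check. The sketch below is only an illustration of the rule as stated above; the function and parameter names are hypothetical, and because the text gives the total gap length for chromosome 21 rather than individual gap sizes, the total is used as an upper bound on the largest gap.

```python
def meets_finishing_criteria(fraction_sequenced, gaps_characterised,
                             largest_gap_bases):
    """Apply the three agreed criteria for a finished chromosome.

    fraction_sequenced -- share of the chromosome sequenced (0 to 1)
    gaps_characterised -- True if the number, location and size of the
                          remaining gaps are known
    largest_gap_bases  -- size of the largest remaining gap, in bases
    """
    return (fraction_sequenced > 0.95
            and gaps_characterised
            and largest_gap_bases < 150_000)


# Chromosome 21, using the figures in the text: 99.7 per cent of the long
# arm, and ten gaps totalling 100 000 bases, so no single gap can be larger
# than 100 000 bases.
chr21_finished = meets_finishing_criteria(0.997, True, 100_000)
print("chromosome 21 meets the criteria:", chr21_finished)
```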
Making sense of the genome
To be useful, the sequence needs to be ‘annotated’ - new sequences have to be mapped onto their locations in the genome, genes identified, and genes linked to any known information about what they do or might do. But in the last year or so, the flow of human genome sequence has swelled from a stream to a tidal wave, and the DNA databases are being flooded with 10 000 DNA letters every minute. Manual approaches would be overwhelmed, so groups at the Sanger Centre and its next-door neighbour, the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI, which hosts some of the world’s largest databases of genomic information), have developed a new software system and database called Ensembl. This program automatically annotates new sequence by searching other databases for identical or similar sequences, and then either predicts a gene or confirms the presence of one.
Using Ensembl, the bioinformaticians have confirmed the location of more than 35 000 genes in the human genome and have identified a further 150 000 potential gene fragments (estimates of the final number of human genes range from 40 000 to 140 000). Both the data and the program source code for Ensembl are free and distributed via the Internet - a model similar to that used to develop the popular Linux computer operating system. Updates planned for the summer include adding the locations of single nucleotide polymorphisms (SNPs), positions in the human genome sequence that vary between individuals.
The final push
The Sanger Centre’s sequencing teams have already made the transition from producing working draft sequence to producing the finished, gold standard sequence. In October 1998, Francis Collins and colleagues wrote of the completion of the human genome by 2003: "This is a highly ambitious, even audacious goal…sequence completion by the end of 2003 is a major challenge but within reach and well worth the risks and effort." Now that 90 per cent of the human genome is in the public databases, researchers can look forward with confidence to the finishing line in 2003 - which, appropriately, is the 50th anniversary of the discovery of the double helix structure of DNA by James Watson and Francis Crick.

