Home

What is the Assemblathon?

The Assemblathon is a series of periodic collaborative efforts to improve methods of genome assembly. We hope it will become an annual event that spurs improvements in this computationally intensive field. In each Assemblathon, participating groups use their own software to assemble one or more genomes that the organizers make available (see the rules page for details of the latest challenge). All participants have the same amount of time to assemble the genomes, and the organizers then evaluate each group's efforts. In March 2011, a genome assembly workshop was held in Santa Cruz, where Assemblathon participants and organizers met to discuss what they had learned from the first event. This meeting was also used to plan the details of Assemblathon 2, which started in June 2011. The final results of Assemblathon 1 were announced at the CSHL Biology of Genomes meeting and were published in Genome Research in late 2011. Apart from the ongoing Assemblathon 2, other Assemblathon events are in the planning stages.

 

Why do we need to have Assemblathons?

There are many genome assembly programs out there, but it is not always clear which is the best. Part of the problem is that it is not easy to define what 'best' means: an assembler that works well in one situation (e.g. assembling a high-repeat-content genome) might not fare as well in others. Part of the reason for organizing the Assemblathon is to see if we can produce newer metrics for assessing the quality of a genome assembly that will complement existing statistics such as N50 contig size.
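For readers unfamiliar with the N50 statistic mentioned above, here is a minimal sketch of how it is commonly computed: sort the contig lengths from longest to shortest and report the length of the contig at which the running total first reaches half of the total assembly size. The function name and the example lengths are made up for illustration and are not taken from any Assemblathon evaluation code.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all bases in the assembly.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical assembly of eight contigs (32 bases in total).
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_lengths))  # prints 8
```

Note that N50 only summarizes contig size; it says nothing about whether the contigs are correct, which is one reason complementary metrics are of interest.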

The ever-changing landscape of sequencing technology also means that it is important to continually appraise new methods as well as re-appraise old ones. Assemblers that work well with the short reads from 'next generation' sequencers (e.g. Illumina and SOLiD) might not work as well (or at all) with reads from even newer technologies, such as the sequencers from PacBio. There are also separate, but related, needs to assemble transcriptomes from RNA-Seq data and to assemble metagenome datasets.

Another, more fundamental, need for a project such as the Assemblathon is that even when we believe that an assembler has done a good job of assembling a genome, we are never entirely sure what the actual solution is. This is a bit like putting a jigsaw together where the pieces are all one of four different colors; how do you know what the final picture is supposed to look like? To tackle this issue, the first Assemblathon provided participants with simulated reads from synthetic genomes. By starting with a complete genome that has been generated in silico, we know what the final 'answer' should be. In Assemblathon 2, there are 'real' genome sequences to assemble for three different species, one of which is represented by three different sequencing technologies.

 

Who is organizing the Assemblathon?

The idea to have a genome assembly challenge was conceived by David Haussler (UCSC), Joe DeRisi (UCSF), Oliver Ryder (UCSD), and Stephen O'Brien (NIH CCR). The project organization was then handed over to the Genome Center at UC Davis, with collaboration from the Haussler lab at UC Santa Cruz. The Haussler lab was also responsible for producing the synthetic genome data used by the first Assemblathon. Both groups were then involved in evaluating the results of Assemblathon 1:

Genome Center personnel: Ian Korf, Keith Bradnam, Dawei Lin, Aaron Darling, Joseph Fass, Ken Yu, Andrew Tritt, Vince Buffalo, Richard Michelmore, and the UC Davis Bioinformatics Core

Haussler lab personnel: Benedict Paten, Dent Earl, John St. John, Ngan Nguyen, Mark Diekhans, David Haussler

 

Is the Assemblathon connected to dnGASP and GAGE?

There is a similar, independently organized event that is also trying to improve the methods of genome assembly. The de Novo Genome Assessment Project (dnGASP) is a European-organized event that is being run as part of the Sequence Mapping and Assembly Assessment Project (SMAAP). This project will also culminate in a workshop hosted by the International Center for Scientific Debate (ICSD) in Barcelona (April 4–7, 2011). Given the obvious similarities between dnGASP and the Assemblathon, we are currently discussing possible collaborations, and we encourage people who are interested in the Assemblathon to also enter dnGASP. We hope that papers from both events will appear in a journal in the near future.

Another project, called Genome Assembly Gold-standard Evaluations (GAGE), has also been set up to investigate different genome assemblers. This project is led by Steven Salzberg and will run various genome assemblers 'in house' against a set of five whole-genome shotgun datasets.

 

How is the Assemblathon connected to the Genome 10K project?

The first Assemblathon culminated in the Genome Assembly Workshop that was held in March 2011. This workshop was sponsored, and inspired, by the Genome 10K project. This project is an international effort that aims to produce a genomic zoo: sequences that represent the genomes of 10,000 vertebrate species. Such an ambitious project has become possible due to advances in sequencing technology, which have also hugely driven down the cost of producing a draft genome sequence from an organism. The Genome 10K project is a collaboration that represents zoos, museums, research centers and universities around the world. Improving the methods of genome assembly will clearly be important for the Genome 10K project, but the Assemblathon will have broader applications as well, such as improving the assembly of human cancer genomes and invertebrate genomes.


 

Keith Bradnam, UC Davis Genome Center