The Assemblathon


Slides from an Assemblathon 2 talk

Here are some slides that I made for a talk that I gave to a general audience at UC Davis. These slides were also used by Prof. Ian Korf for his presentation at the Genome 10K workshop (5/25/2013). Notes are included for each slide.

The first part of this talk presents a simple overview of genome assembly and introduces many of the terms that are used. The next part of the talk discusses results from Assemblathon 2.

Assemblathon 2 talk from Keith Bradnam
    • #assemblathon 2
  • 1 month ago

Assemblathon has moved to a new home

Goodbye Posterous.

Hello tumblr.

Please let me know if you spot any errors or broken links. If you had any bookmarks to pages on the Assemblathon site, these most likely will need to be updated.

  • 3 months ago

Feedback and analysis of the Assemblathon 2 pre-print

There has already been some discussion of the pre-print of the Assemblathon 2 manuscript. Although a pre-print is not the same thing as a peer-reviewed, accepted paper — I don’t want us to get too ahead of ourselves! — I thought it useful to start collecting together some of the online commentaries:

  • Homologus blog post 1: highlights a few conclusions from the paper
  • Homologus blog post 2: delves into the results and attempts to estimate some of the costs of genome assembly. Assemblathon co-author Sébastien Boisvert adds some useful comments.
  • Haldane’s Sieve post: an invited blog post by lead author Keith Bradnam that summarizes what the Assemblathons are all about by way of a pizza-themed analogy
  • Reevaluating Assembly Evaluations with Feature Response Curves: GAGE and Assemblathons: this is not a blog post, but a recently published paper that evaluates some of the Assemblathon 2 data
  • Thoughts on the Assemblathon 2 paper: by C. Titus Brown (a reviewer of the manuscript)
  • Homologus blog post 3: reactions to the previous post by C. Titus Brown
  • Assemblathon 2 review, round 1, parts thereof: a concise version of C. Titus Brown’s formal manuscript review (minus the specific suggestions)
  • On assembly uncertainty (inspired by the Assemblathon 2 debate): blog post by Lex Nederbragt in response to post by C. Titus Brown

Remember, the Assemblathon twitter account has also tweeted about these pieces, as well as many other articles relating to genome assembly.

    • #assemblathon 2
  • 4 months ago

The Assemblathon 2 paper has been submitted!

The manuscript has been submitted to the GigaScience journal and a pre-print of the paper is now available on arXiv.org. Additional data files that support the paper are currently available here (these are also part of the manuscript submission).

Please note that the pre-print has not undergone peer review and we do not assume that the manuscript will be automatically accepted by the journal.

If all goes well, the submitted Assemblathon 2 assemblies, along with sets of CEGMA gene predictions for each assembly, will be available from GigaScience’s GigaDB database. These are also currently available from the Korflab website.

    • #assemblathon 2
  • 4 months ago

Image taken from http://www.flickr.com/photos/incrediblehow/5577200173/

Why we need the Assemblathon

This short essay was originally written in Fall 2011 for a Genome 10K newsletter that never came to pass so I’m posting it here. The views in this essay are my own and should not necessarily be taken to represent official views of the Assemblathon.

Keith Bradnam, Jan 20th 2012

Note: article updated 2/21/12 to include mention of Meraculous among the best performing assemblers in Assemblathon 1

There was a time when scientists were grateful for any sort of genome sequence no matter its level of quality or completeness (let alone its N50 length). In the late 1990s there were still only a handful of completed eukaryotic genome sequences in existence and many of these had been painstakingly completed using the ‘clone-by-clone’ sequencing approach. If you had a detailed genetic map for your species of interest – which is what the clone-by-clone approach ideally required – then you also had a good chance of knowing how complete your genome sequence might be. End users of these genome sequences were mostly grateful for what they had been given. Even a ‘working draft’ version of a genome which might contain lots of errors was better than no sequence at all.
 
This era of ‘genomic innocence’ did not last long. As sequencing projects moved away from the clone-by-clone approach and instead started to employ the whole genome shotgun (WGS) strategy, it became clear that there are many different ways to put a genome sequence together. Sometimes you could produce a drastically different assembly from the same sequence data just by using a different version of the same assembly program (e.g. the N50 length of scaffolds from the dog genome was seen to double just by using a later version of the ARACHNE assembler).
 
It also became clear that just because you can assemble a genome in a way that increases its N50 size (or whatever metric du jour you care to use), it may not necessarily be a better genome. For example, the v1.95 assembly of the sea squirt Ciona intestinalis had an N50 length of 234 Kbp. A later v2.0 assembly produced over a tenfold increase in this metric, but in the process the newer assembly removed several highly conserved genes that were present in the earlier assembly. One step forward, two steps back.
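
For readers who have not met it before, N50 is usually defined as the length of the shortest sequence in the smallest set of longest contigs (or scaffolds) that together make up at least half of the total assembly length. The sketch below is a minimal illustration of that definition; the function name and the two toy assemblies are made up for this example and are not taken from any real genome.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, stopping once at least half
    # of the total assembly length has been accounted for.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Two hypothetical assemblies with the same total length (1,000 bp).
fragmented = [100] * 10
merged = [400, 300, 100, 100, 100]
print(n50(fragmented), n50(merged))
```

Joining pieces together raises the N50 whether or not the joins are correct, which is why a bigger number on its own says little about the quality of an assembly.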
 
These challenges were happening in an era when a genome sequence was something that represented a species rather than an individual (let alone a tissue type); an era when sequence reads were typically longer than your email signature; and when a genome sequencing project was expected to take years not days. In short, these challenges took place before the dark times, before the Empire; that is, before next generation sequencing (NGS) technologies took off.
 
How times change. Whereas the human genome project took about ten years before it was published – longer before all chromosomes were actually finished – we can now generate enough raw sequence data to produce a similar genome sequence in just ten days. And that’s only considering the output from a single NGS machine. Of course that’s assuming that you can put it all together in the correct order. 
 
The question of ‘how do I assemble a genome from NGS data?’ is not always an easy question to answer. It can be made easier if you have some expectation of what the final genome should look like. However, it is one thing to try assembling a genome sequence with NGS data when you already have a completed sequence from a closely-related species to compare it to. It’s another thing altogether when there is no such ‘reference’ genome. Imagine trying to put a jigsaw puzzle together when there is no picture on the box to guide you, no edge pieces, and only a ‘best guess’ as to how many pieces there are. Oh, and there may be up to 100 copies of every piece. 
 
The biggest headache about dealing with NGS data is that things are probably going to get worse before they get better. It is true that read lengths are increasing year-on-year, and newer sequencing chemistries continue to claim ever greater accuracy. These improvements should make genome assembly a much easier problem in the very near future. However, we still have to deal with the explosion of sequence data that is happening now. Everybody has genomes that need to be sequenced, and without genome assembly the raw output from a genome project is pretty much a meaningless set of very large files taking up space on a disk somewhere. 
 
Projects such as Genome 10K are aiming to sequence the genomes from – wait for it – 10,000 vertebrate species (the clue was in the name) and all those genomes will need assembling. It was thanks to Genome 10K that the Assemblathon project was born. The Assemblathon is more than just a made-up word…it is an ongoing project that is attempting to help drive improvements in the field of genome assembly. It will try to do this by bringing all of the major players in the genome assembly field together for regular ‘contests’. To date, one such contest has been completed, which we oh-so-cleverly named ‘Assemblathon 1’. We are currently in the middle of the inventively named ‘Assemblathon 2’. We plan – subject to our Assemblathon grant application being successful – to organize more Assemblathons in future. So how exactly can the Assemblathon help to assemble genomes?
 
In many ways, a more pertinent question to ask about genome assembly is not ‘how do I assemble a genome?’ but rather ‘how will I know if my assembly is any good?’. This is a very easy question to ask, but not such an easy question to answer. Genome assembly programs – like most bioinformatics software – will always spit out some output…the challenge is to make sense of that output. There is a range of fairly simple statistical metrics that are commonly used to describe genome assemblies. In most cases we like to believe that higher values for those metrics indicate a better genome assembly, but as the earlier example with C. intestinalis showed, this is not always the case. 
 
Hopefully though, there are always some intelligent questions that we can ask of any genome assembly, and many of these will involve leveraging any other sequence information that may exist for that species (finished BACs, optical maps, ESTs etc.). For example, we should expect that a good genome assembly will contain most of the genes that are present in the genome. We often do not know what those genes will look like, but we know that a large handful of important genes remain essentially unchanged across millions of years of evolution and so we should be able to find those at least. This approach has previously been used by our group to assess the ‘gene space’ in a variety of new genome sequences, and we are currently employing this technique as just one of many strategies in the evaluation of Assemblathon 2 entries.
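
The bookkeeping behind this kind of ‘gene space’ check can be sketched in a few lines. This is only an illustration under stated assumptions: it assumes you already have a list of expected core genes and a list of the genes detected in an assembly, and it does not attempt to show how any real gene-finding tool produces that second list. The function name and gene identifiers are hypothetical.

```python
def gene_space_completeness(core_genes, genes_found):
    """Fraction of an expected core gene set that was detected in an assembly."""
    core = set(core_genes)
    if not core:
        raise ValueError("core gene set is empty")
    # Only genes that belong to the core set count towards completeness.
    return len(core & set(genes_found)) / len(core)


# Hypothetical identifiers, purely for illustration.
core = ["geneA", "geneB", "geneC", "geneD"]
found_in_assembly = ["geneA", "geneC", "geneD", "geneX"]
print(gene_space_completeness(core, found_in_assembly))
```

A score like this covers only the genes you chose to look for, so it complements other assembly metrics rather than replacing them.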
 
In the world of genome assembly, beauty is very much in the eye of the beholder. Some people consider a genome assembly that is full of genes to be a very beautiful thing. However, a genome assembly can theoretically contain 100% of the genes, but still only represent ~5–10% of the genome. Some researchers want genome assemblies that capture the maximum amount of the underlying genome. Other researchers don’t care for getting all of the genome – or even all of the genes – just as long as what they do get is highly accurate. If there is a lot of heterozygosity in your genome of interest, then you may prefer an assembly that attempts to best resolve that heterozygosity and produce two haplotype sequences. Or maybe your research concerns repetitive elements and you just want to assemble a genome in a way that still captures a lot of the hard-to-assemble repetitive regions.
 
If you want the best genome assembly possible, you may have to accept some trade-offs. The assemblers that perform well in one area may not perform as well in other areas. Everybody wants to be told ‘genome assembler X will give you the best assembly’ but at the moment it doesn’t seem fair to make such bold assertions. What we did find out from Assemblathon 1 was that a number of genome assemblers performed admirably across many, but not all, of the different metrics. For example, the assembler that did the best job at increasing coverage (the amount of the known genome present in the assembly) ranked 9th when considering the number of substitution errors present in the assembly. Conversely, the assembler that did the best job at minimizing substitution errors ranked 8th in terms of coverage. You pays your money and you takes your choice. We should mention, however, that these two assemblers (SOAPdenovo and SGA), along with ALLPATHS and Meraculous, were consistently ranked highly for many of the metrics that we used and were the four best overall assemblers in Assemblathon 1. 
 
One of the unusual features of Assemblathon 1 was that it involved synthetic genome data. We actually made a small artificial genome, let it evolve for millions of years, and then produced synthetic reads from the final genome. We did this so that we would know what the final answer should look like (a luxury in the world of genome assembly). However, many people who are developing the latest and greatest algorithms for putting genomes together do not necessarily want to devote too much time to such ‘toy’ problems. They would much rather get their hands dirty with real world data, so their efforts can be of actual use to many researchers around the world. Ask, and you shall receive. In Assemblathon 2 we provided real data from three (count ‘em!) vertebrate genomes: a snake, a fish, and a bird. For the last of these (a parrot), we actually provided sequencing data from three different technologies (Illumina, 454, and PacBio) to see if groups would try a ‘mix & match’ strategy.
 
Assemblathon 2 is now in the ‘evaluation phase’ where the 43 genome assemblies that we received are being prodded, poked, and subjected to expert scrutiny. We hope that we can produce new metrics to complement those that are currently used to assess genome assemblies. We also hope that we can provide accurate assessments of how good these assemblies are with respect to what different users might want from a genome sequence. But most of all, we hope that we can continue to be useful by drawing the genome assembly community together – both physically and virtually – in order to share ideas and to start occasional arguments, or rather, to promote vigorous discussion. The Assemblathon will be maximally useful if we help improve the state of genome assembly to the point where we no longer need to organize Assemblathons.
 
  • 1 year ago

About

An offshoot of the Genome 10K project, and primarily organized by the UC Davis Genome Center, Assemblathons are contests to assess state-of-the-art methods in the field of genome assembly.

Assemblathon 1 occurred at the end of 2010 and the results were published in late 2011. A second Assemblathon, using real data from three vertebrate species, started in June 2011 and a manuscript has recently been submitted.
