Thoughts on Assemblathon 3
The Assemblathon 2 paper has finally been published and in the process it generated a lot of discussion. I have previously written some thoughts on the open nature of the project, but would now like to say a little about the prospects of Assemblathon 3, and how the lessons learned from the last Assemblathon might change how we would do things in future.
But first I’ll get straight to the point and say that there are no immediate plans for an Assemblathon 3 contest. There are two main reasons for this:
1. Assemblathon fatigue
I think that some of the organizers as well as the previous participants would like a bit of a break before we even consider doing this all again.
2. Low expectations
Many of the software tools used by teams in Assemblathon 2 have been superseded by newer versions, and there are also several new assemblers out there. However, it is not at all certain that an Assemblathon 3 contest run today would produce a different outcome; it seems likely that we would still see a lack of consistency in the results between different assemblers.
There are other reasons as well (e.g. lack of funding), but I think that the above two points are why there will not be another Assemblathon in the immediate future. That’s not to stop anyone else organizing a genome assembly assessment exercise, and so the next two sections offer some thoughts on what could be improved.

What did we learn from Assemblathon 2?
Assemblathon 2 potentially suffered from having too many species, with too much sequence data for some species. Not all participants had the time, resources, and/or inclination to assemble all three genomes, meaning that we couldn’t fairly compare each assembler’s performance across multiple species (only two teams assembled all three genomes).
Furthermore, very few teams utilized all of the diverse parrot sequencing data (Illumina, 454, and PacBio), with most teams opting to just use some of the Illumina reads. It could be argued that, while laudable, providing 285x coverage of the parrot genome is hardly a real-world scenario.
The final metrics that were used to judge assemblies were modified slightly throughout the evaluation process. This was in response to feedback from the participants. In hindsight — always a wonderful thing — we should have worked hard to finalize the metrics before the evaluation process started. Any time a single metric was changed, many downstream changes had to be made (e.g. a dozen or so figures had to be redrawn) and this slowed down work on the manuscript.
The first question which occurs to me is one that Mick Watson has already raised: what was the purpose of this international effort? What were the authors trying to achieve? Was it:
1) To catalogue available assemblers?
2) To compare available assemblers?
3) To develop best practice?
4) To develop a set of guidelines, i.e. ‘which assembler should I use on my data?’
5) To compare assembly metrics?
6) To develop better assembly metrics?
This is valid criticism and I personally think that while we were trying to do a bit of everything on Mick’s list, we ended up spreading ourselves too thinly. Perhaps, if there had only been assemblies from one species to focus on, it may have been possible to push a bit harder at answering these questions.
Another weakness of Assemblathon 2 was that we were not rigorous enough in collecting information on how participating teams generated their assemblies. Most teams provided instructions, though the amount of detail in those instructions varied a lot.

Proposed differences for Assemblathon 3
In light of what we learnt from Assemblathon 2, I propose a set of constraints/rules/guidelines for any potential Assemblathon 3 contest. Hopefully, these would help produce a speedier competition, with more of an emphasis on ‘real world’ genome assembly, and a more informative analysis.
- Focus on only one species and make it clear why we have chosen this species (something that was not obvious for Assemblathon 2).
- Generate a mixture of sequences based on all commonly used NGS technologies (Illumina, 454, Ion Torrent, PacBio, and possibly Moleculo).
- Sequence information will (initially) be kept private from participants.
- Ideally, sequences would all be sourced from a single sequencing facility (for consistency of pre/post sequencing steps), or from the NGS companies themselves.
- Teams have to ‘buy’ sequence resources to better reflect real-world usage, where a typical research lab has limited resources (see the sketch after this list):
- Teams would be allocated a fictional budget, e.g. $20,000.
- Teams could opt to use a mixture of $10,000 of PacBio sequence & $10,000 of Illumina, or just $20,000 of Illumina etc.
- Only once a team has ‘placed their order’ would we make sequence data available to that team.
- All input sequence read data should be submitted to SRA/ENA/DDBJ as soon as it is generated, to prevent delays later on.
- As soon as the submission deadline has passed, the final set of metrics should be agreed with participants before analysis starts. This will also save lots of time later on.
- Require, as a condition of participation, that full assembly instructions be submitted at the same time that a team submits its assembly (potentially using a detailed form to ensure we collect all necessary data).
- A virtual cost for computing time/resources could also be factored into the budget if desired (e.g. an assembly that used a high-CPU cluster for a week would incur a higher cost than an assembly run on a desktop computer for a day). Assigning such costs might be difficult, though.
- Optionally, request that all assemblies use the new FASTG format, in order to allow us to reward teams that better capture heterozygosity in their assemblies (this will probably require new analysis tools).
- Maximum of two assemblies per participant (with teams encouraged to use experimental ideas for the second assembly; in Assemblathon 2, these sometimes turned out to be better than the competitive entries).
- Potentially seek sponsorship for some sort of official prize/trophy for the winner(s) in order to encourage participation.
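To make the ‘sequence budget’ idea concrete, here is a minimal sketch of how placing an order could be priced and validated. To be clear, this is purely illustrative: the platform names, the per-Gbp prices, and the compute_surcharge parameter are hypothetical placeholders of my own, not figures from any actual proposal.

```python
# A toy model of the proposed 'sequence budget' rules. All prices are
# invented placeholders for illustration only.

BUDGET = 20_000  # the fictional budget, in dollars

# Hypothetical price list: dollars per Gbp of raw sequence data.
PRICE_PER_GBP = {
    "illumina": 50,
    "454": 400,
    "ion_torrent": 150,
    "pacbio": 300,
}

def price_order(order):
    """Total cost of an order given as {platform: Gbp requested}."""
    return sum(PRICE_PER_GBP[platform] * gbp for platform, gbp in order.items())

def place_order(order, compute_surcharge=0):
    """Validate an order (plus any virtual compute cost) against the budget.
    Sequence data would only be released to a team once this succeeds."""
    total = price_order(order) + compute_surcharge
    if total > BUDGET:
        raise ValueError(f"order costs ${total:,}, over the ${BUDGET:,} budget")
    return total

# Example: roughly $10,000 of PacBio plus $10,000 of Illumina.
print(place_order({"pacbio": 33, "illumina": 200}))  # 19900
```

The compute_surcharge argument marks where a virtual computing cost, as suggested above, could be folded into the same budget.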
Every now and then, people have made suggestions as to what species or genomes would be good candidates for an Assemblathon 3 contest. These suggestions often reflect the desire to have the community assemble a genome for someone’s favorite species, or sometimes just reflect the idea that we should be assembling something with a genome that is large, complex, polyploid, etc. For example:
[Embedded tweet from Mario Caccamo (@mcaccamo)]

Will @assemblathon 3 attack some small but difficult genomes? e.g. Plasmodium?
— Jason Chin (@infoecho)

[Embedded tweet from Steven Robbens (@stevenrobbens)]
Such feedback leads me to believe that if there is an Assemblathon 3, then it will probably disappoint many people as soon as a species is chosen! Personally, I still like the idea of using synthetic genome data (as in Assemblathon 1). It not only helps to know what the answer is meant to look like, but this approach could be broadened to include multiple (small) genome assemblies that all differ in a controlled fashion. E.g. we could make a series of small genomes that progressively differ in their heterozygosity and/or repeat content. This would allow us to see how well different assemblers fare under a range of test conditions that we get to control precisely.
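As a rough sketch of what that might look like, the Python below builds a series of small diploid test genomes in which heterozygosity (the SNP rate between the two haplotypes) and repeat content are each dialled up in controlled steps. It is a toy model under simplifying assumptions I’ve chosen for illustration (SNPs only, a single repeat family, a fixed genome size); a real benchmark generator would also want indels, structural variation, and more realistic repeat landscapes.

```python
import random

BASES = "ACGT"

def random_genome(length, repeat_fraction=0.0, repeat_unit_len=500, seed=0):
    """Build a haploid test genome: random sequence in which a chosen fraction
    of the length is overwritten with copies of a single repeat unit."""
    rng = random.Random(seed)
    genome = [rng.choice(BASES) for _ in range(length)]
    if repeat_fraction > 0:
        unit = [rng.choice(BASES) for _ in range(repeat_unit_len)]
        n_copies = int(length * repeat_fraction) // repeat_unit_len
        for _ in range(n_copies):
            start = rng.randrange(length - repeat_unit_len)
            genome[start:start + repeat_unit_len] = unit
    return "".join(genome)

def diploidize(haplotype, heterozygosity, seed=1):
    """Derive a second haplotype by adding SNPs at the given per-base rate, so
    'heterozygosity' controls how much the two copies differ."""
    rng = random.Random(seed)
    other = list(haplotype)
    for i, base in enumerate(other):
        if rng.random() < heterozygosity:
            other[i] = rng.choice([b for b in BASES if b != base])
    return haplotype, "".join(other)

# A series of 100 kbp test genomes, each varying one parameter in a controlled way.
series = {}
for het in (0.001, 0.005, 0.01, 0.02):      # progressively more heterozygous
    series[f"het={het}"] = diploidize(random_genome(100_000, repeat_fraction=0.1), het)
for rep in (0.0, 0.1, 0.25, 0.5):           # progressively more repetitive
    series[f"rep={rep}"] = diploidize(random_genome(100_000, repeat_fraction=rep), 0.005)
```

Reads simulated from each pair of haplotypes could then be given to each assembler, and because the true sequence is known, every assembly could be scored directly against it.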
Conclusion

If there is to be an Assemblathon 3, then you’ll most likely hear about it on this blog or on the Assemblathon’s twitter account. But even if there isn’t another Assemblathon, I will probably continue to use the twitter account to highlight news and developments in the field of genome assembly.