Assemblathon 1 results
Most results were first made available during the Genome Assembly Workshop, though some were not fully available until after the meeting. The paper describing the official set of Assemblathon 1 results was published in Genome Research in late 2011.
Collectively, the Assemblathon evaluation groups at UC Santa Cruz and UC Davis generated over 100 different genome assembly metrics, many of which make use of the fact that we had a definitive picture of what the genome of species A actually looks like.
This page links to various results, submitted assemblies, and Assemblathon-related talks. We advise against using any single metric to claim that one assembly is 'better' than another. As we will hopefully make clear, there are many ways of describing an assembly, and an assembly can perform very well by one metric but very badly by a different, equally valid, metric.
Note that most of the files described below are contained in one directory on the Korf Lab website. You might find it easier to access them from that directory.
Assemblathon Talks
- March 2011 - Genome Assembly Workshop: UC Davis Assemblathon talk - 1.8 MB PDF with notes
- March 2011 - Genome Assembly Workshop: UC Santa Cruz Assemblathon talk - 33 MB PDF
- May 2011 - CSHL Biology of Genomes: Assemblathon talk - 2.6 MB PDF with notes
The Assemblies
Each submitted assembly is available here as a gzipped file of scaffolds:
- A1
- B1 B2
- C1 C2
- D1 D2 D3 D4 D5
- E1 E2 E3
- F1 F2 F3 F4 F5
- G1
- H1 H2 H3 H4 H5
- I1 I2
- J1
- K1 K2 K3
- L1
- M1 M2 M3 M4 M5
- N1 N2 N3
- O1
- P1
- Q1
Results and reports
The Perl script that generated these results is available here (requires FAlite.pm). Most of these basic measures are hopefully self-explanatory, but here are some further notes:
- All sequences in submitted assemblies are initially treated as 'scaffolds'.
- Each scaffold was split on runs of 25 or more 'N' characters to form 'contigs'. Contigs produced by splitting a scaffold in this way were regarded as 'scaffolded contigs'.
- Scaffolds that didn't contain any run of 25 or more 'N' characters could not be split, and were counted as 'unscaffolded contigs'.
- N50 scaffold/contig length is calculated by summing lengths of scaffolds/contigs from the longest to the shortest and determining at what point you reach 50% of the total assembly size. The length of the scaffold/contig at that point is the N50 length.
- The L50 measure is the number of scaffolds/contigs that are greater than, or equal to, the N50 length.
- The NG50 and LG50 measures are the same as the N50 and L50 measures, except that instead of comparing against the total assembly size, we compare against the known genome size of species A (using the average size of haplotypes 1 and 2). These measures permit fairer comparisons between assemblies of different sizes.
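The notes above can be sketched in a few lines of Python. This is only an illustration of the definitions as written here, not the Perl script itself: the function names `split_scaffold` and `n50_l50` and the `target_size` parameter are made up for this example, and it assumes gaps are written as uppercase 'N' only.

```python
import re

# A run of 25 or more 'N' characters marks a gap between contigs
# (threshold taken from the notes above; uppercase only is an assumption).
GAP_PATTERN = re.compile(r"N{25,}")


def split_scaffold(seq):
    """Split one scaffold sequence into contigs on runs of 25+ 'N'."""
    return [piece for piece in GAP_PATTERN.split(seq) if piece]


def n50_l50(lengths, target_size=None):
    """Return (N50, L50) for a list of scaffold or contig lengths.

    If target_size is given (for example the known genome size), the 50%
    threshold is taken from it instead of the assembly total, so the
    result is (NG50, LG50). Returns (None, None) if the summed lengths
    never reach 50% of the reference size.
    """
    total = sum(lengths) if target_size is None else target_size
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            # L50 as defined above: how many sequences are >= the N50 length.
            return length, sum(1 for x in lengths if x >= length)
    return None, None


# A scaffold with one real gap (25 Ns) and one run too short to split on.
scaffold = "ACGT" + "N" * 25 + "GG" + "N" * 10 + "TT"
contigs = split_scaffold(scaffold)
print([len(c) for c in contigs])                  # [4, 14]

lengths = [10, 8, 6, 4, 2]                        # total assembly size = 30
print(n50_l50(lengths))                           # (8, 2)
print(n50_l50(lengths, target_size=40))           # (6, 3)
```

In the last two lines the same set of lengths gives an N50 of 8 but an NG50 of 6 once a hypothetical genome size of 40 is used as the reference, which is the reason the NG50 and LG50 measures are more comparable across assemblies of different total size.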
- Preliminary UC Davis report - version 0.6 (4/28/11)
This PDF contains provisional findings from some of the analysis performed by the Assemblathon group at the UC Davis Genome Center. It will be updated in subsequent days and some of this information will end up in the main Assemblathon paper that is being written.
This PDF report was prepared by Aaron Darling and uses the Mauve multiple genome alignment software to map genome assemblies to the known genome of species A. As with the previous report, some of these results will be included in the main Assemblathon paper.
- UC Santa Cruz Assemblathon analysis - A summary of all of their results, 16 MB PDF. N.B. still being updated!
- UC Santa Cruz Assemblathon 1 materials - a comprehensive set of documents, data, and code



Keith Bradnam