Present research activities result from the group's involvement in the German plant genome project (GABI-FUTURE and "Pflanzenbiotechnologie der Zukunft") and from technology-driven cooperations with groups at the CRG and other institutions.
BeetSeq - A reference genome for sugar beet (Beta vulgaris)
A genome sequence for sugar beet is needed to fully exploit the species' value for evolutionary genomics and as a crop plant. Completed genomes are currently available for representatives of several genera of flowering plants, e.g. Arabidopsis, poplar, grapevine, papaya, soybean, cucumber, apple, sorghum, maize, and rice. Sequencing of other plant genomes is underway, including Solanum (potato, tomato), Lotus, and melon. Since sugar beet is not a close relative of any of these taxa, its genome sequence (~800 Mbp) will provide essential information on plant genome evolution. We are aware of the intraspecific variation between sugar beet accessions. In a pilot genomic sequencing project (Dohm JC et al., Plant J, 2009), sugar beet BAC sequences from two haplotypes differed in exons by 1% (nucleotide level) and in non-coding regions by 9% (6% mismatches, 3% gaps; alignable regions only). Large indels or regions of high sequence divergence comprised 10% of either sequence. A large proportion of these indels could be attributed to haplotype-specific integration of transposable elements. Sequencing therefore focuses on a single genotype, the doubled haploid (DH) line KWS2320. The sequencing strategy of the BeetSeq project is chiefly based on whole-genome shotgun approaches using next-generation sequencing technologies (454, Solexa). Long-range continuity of scaffolds is established by integrating Sanger end sequences from BACs and fosmids, 454 paired-end reads, and Solexa mate-pair reads. An assembly from Oct. 2010 has a size of 900 Mb with a scaffold N50 size exceeding 0.8 Mb. Annotation of the genome draft is ongoing, and current data suggest the existence of about 28,000 genes in the sugar beet genome. A pre-publication draft of the sugar beet genome sequence was first made publicly available from our Berlin webpages at http://bvseq.molgen.mpg.de on January 14th, 2012.
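The scaffold N50 quoted above is a standard summary of assembly continuity: it is the length of the shortest scaffold in the smallest set of largest scaffolds that together cover at least half of the total assembly size. The following is a minimal sketch of that calculation, not the code used in the BeetSeq project; the function name n50 and the example lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and their lengths are
    accumulated until at least half of the total length is covered; the
    length of the sequence that crosses this threshold is the N50.
    """
    if not lengths:
        raise ValueError("no lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths in bp (not BeetSeq data).
example_scaffolds = [100, 200, 300, 400, 500]
print(n50(example_scaffolds))  # prints 400
```

In this toy example the total length is 1500, so the threshold of 750 is reached after adding the two longest scaffolds (500 + 400), which gives an N50 of 400.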
Strand-specific transcriptome sequencing
Several studies suggest that antisense-mediated regulation may affect a large proportion of genes. Using the Solexa next-generation sequencing platform, we developed DSSS (Direct Strand Specific Sequencing), a strand-specific protocol for transcriptome sequencing. We tested DSSS with RNA from two samples, one prokaryotic (Mycoplasma pneumoniae) and one eukaryotic (Mus musculus), and obtained data containing strand-specific information, using single-read and paired-end sequencing. We validated our results by comparison with a strand-specific tiling array dataset for strain M129 of the simple prokaryote M. pneumoniae, and by quantitative PCR (qPCR). The results of DSSS were very well supported by the results from tiling arrays and qPCR. Moreover, DSSS provided higher dynamic range and single-base resolution, thus enabling efficient antisense detection and the precise mapping of transcription start sites and untranslated regions. DSSS data for mouse confirmed the strand specificity of the protocol and the general applicability of the approach to studying eukaryotic transcription. We propose DSSS as a simple and efficient strategy for strand-specific transcriptome sequencing and as a tool for genome annotation that exploits the increased read lengths that next-generation sequencing technology can now deliver.
Publication: Strand-specific deep sequencing of the transcriptome (Genome Res)
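To illustrate how strand-specific reads enable antisense detection, the sketch below labels a mapped read as sense, antisense, ambiguous, or intergenic relative to annotated genes. It is not part of the DSSS protocol or its published analysis. The function classify_read, the tuple layout of the annotation, and the coordinates are hypothetical, and the read strand is assumed to already reflect the strand of the originating transcript.

```python
def classify_read(read_start, read_end, read_strand, genes):
    """Label a read relative to the annotated genes it overlaps.

    genes is a list of (start, end, strand) tuples with 1-based,
    inclusive coordinates; strand is '+' or '-'.
    """
    labels = set()
    for gene_start, gene_end, gene_strand in genes:
        overlaps = read_start <= gene_end and read_end >= gene_start
        if overlaps:
            labels.add("sense" if read_strand == gene_strand else "antisense")
    if not labels:
        return "intergenic"
    if len(labels) == 2:
        # The read overlaps genes on both strands, so it cannot be assigned.
        return "ambiguous"
    return labels.pop()


# Hypothetical annotation: one gene on each strand.
annotation = [(100, 200, "+"), (300, 400, "-")]
print(classify_read(150, 180, "-", annotation))  # prints antisense
```

Without strand information, the read in the example could only be counted as overlapping the gene at 100-200; with it, the read counts as evidence for an antisense transcript.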
Analysis of errors and biases in Illumina sequencing data (Genome Analyzer and HiSeq2000)
The generation and analysis of high-throughput sequencing data is becoming a major component of many studies in molecular biology and medical research. Illumina's Genome Analyzer (GA) and HiSeq instruments are currently the most widely used sequencing devices. Here, we comprehensively evaluate properties of genomic HiSeq and GAIIx data derived from two plant genomes and one virus (read length 95-150 bases). We provide quantifications and evidence for GC bias, error rates, error sequence context, effects of quality filtering, and the reliability of quality values. By combining different filtering criteria, we reduced error rates 7-fold at the expense of discarding 12.5% of alignable bases. While overall error rates are low in HiSeq data, we observed regions of accumulated wrong base calls. Only 3% of all error positions accounted for 24.7% of all substitution errors. Analyzing the forward and reverse strands separately revealed error rates of up to 18.7%. Insertions and deletions occurred at very low rates on average but increased to up to 2% in homopolymers. A positive correlation between read coverage and GC content was found, depending on the GC content range. The errors and biases we report have implications for the use and interpretation of Illumina sequencing data. GAIIx and HiSeq data sets show slightly different error profiles. Quality filtering is essential to minimize downstream analysis artifacts. In support of previous recommendations, strand specificity provides a criterion to distinguish sequencing errors from low-abundance polymorphisms.
Publication: Evaluation of genomic high-throughput sequencing data generated on Illumina HiSeq and Genome Analyzer systems (Genome Biology)
Publication: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing (Nucleic Acids Res)
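The strand criterion mentioned in the abstract above can be sketched as follows: a true polymorphism should appear in reads from both strands, whereas a mismatch seen almost exclusively on one strand is more likely a sequencing error. This is a minimal illustration, not the method of the cited publications. The function names and all thresholds (min_rate, max_other, min_cov) are assumptions chosen for the example.

```python
def mismatch_rate(mismatches, coverage):
    """Fraction of reads at a position that disagree with the reference."""
    return mismatches / coverage if coverage else 0.0


def looks_strand_specific(fwd_mm, fwd_cov, rev_mm, rev_cov,
                          min_rate=0.05, max_other=0.01, min_cov=20):
    """Flag a position whose mismatches are confined to one strand.

    Returns True when one strand shows a mismatch rate of at least
    min_rate while the other stays at or below max_other. Positions
    with fewer than min_cov reads on either strand are never flagged,
    because the comparison would not be informative.
    """
    if fwd_cov < min_cov or rev_cov < min_cov:
        return False
    fwd = mismatch_rate(fwd_mm, fwd_cov)
    rev = mismatch_rate(rev_mm, rev_cov)
    return max(fwd, rev) >= min_rate and min(fwd, rev) <= max_other


# Hypothetical counts: 18 of 100 forward reads mismatch, no reverse reads do.
print(looks_strand_specific(18, 100, 0, 100))  # prints True
```

A position with similar mismatch rates on both strands, for example 10 of 100 forward and 9 of 100 reverse reads, is not flagged and would remain a candidate polymorphism.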
De novo contig assembly of short-read data sets using SHARCGS
Integrated genetic and physical mapping
of the genome of sugar beet (Beta vulgaris