Processing and analysis of Illumina sequencing data

Illumina Sequencing Analysis Pipeline

Illumina provides a standard pipeline used to conduct image analysis and base calling for all types of samples. This process converts the images taken by the Illumina GA sequencer to digital intensity levels, and subsequently uses these intensity levels for base calling.

Data were processed with the following Illumina pipeline versions:

GA :

  • 2008-01-07 to 2008-03-12: version 0.2.6

  • 2008-04-25 to 2008-11-20: version 0.3.0

  • 2008-12-23: version 1.0

  • 2008-11-25 to 2009-05-28: version 1.3.2

  • 2009-05-28 to 2009-11-19: version 1.4.0 (within SCS 2.3)

  • 2009-11-19 to 2010-04-07: version 1.5.0 (within SCS 2.5)

  • 2010-04-07 to 2010-07-30: version 1.6.0 (within SCS 2.6)

  • 2010-07-30 to 2010-08-05: version 1.7.0 (within SCS 2.6)

  • 2010-08-05 to 2011-03-22: version 1.7.0 (within SCS 2.8)

  • 2011-03-22 to 2011-07-18: version 1.7.0 (within SCS 2.9)

  • 2011-07-18 to 2011-10-14: version 1.7.0 (within SCS 2.8)

  • 2011-10-14 to present: version 1.8.1 (within SCS 2.8)

HiSeq :

  • 2010-08-05 to 2010-11-17: version 1.7 (within HCS 1.1.37)

  • 2010-11-17 to 2011-02-17: version 1.7.0 (within HCS 1.1.37.8)

  • 2011-02-17 to 2011-03-22: version 1.7.0 (within HCS 1.1.37.19)

  • 2011-03-22 to 2011-06-03: version 1.7.0 (within HCS 1.3.8)

  • 2011-06-03 to 2011-07-12: version 1.7.0 (within HCS 1.4.5)

  • 2011-07-12 to 2011-10-07: version 1.7.0 (within HCS 1.4.8)

  • 2011-10-07 to 2012-06-19: version 1.8.1 (within HCS 1.4.8)

  • 2012-06-19 to 2013-02-04: version 1.8.2 (within HCS 1.4.8)

  • 2013-02-04 to 2013-07-05: version 1.8.2 (within HCS 1.5.15.1)

  • 2013-07-05 to present: version 1.8.4 (within HCS 2.0)

For pipeline versions 1.7 and earlier, the output consists of qseq.txt files for each tile for each lane. The qseq file contains both sequence and base by base quality information, and is provided to the users. The quality values encoded are Phred + 64. From pipeline versions 1.8.1 onwards, fastq files are generated instead, which are provided to the users. The quality values encoded are Phred + 33 ( Sanger quality values). A summary of standard files is provided below.

NOTE: For applications (e.g. miRNA sequencing) where the fragment length is lower than the read length, the adapter sequence must be located and trimmed. For this reason we recommend using the sequence files we provide with the adapter sequence removed, as described below.

Read data are passed through the Illumina alignment and sequence analysis program, GERALD, where they are aligned to a reference sequence using the Eland software. Depending on the sample, the reference sequence will be a genome sequence, RefSeq dataset, or mirbase sequences, etc. However, for pipeline versions 1.8.1 onwards, for the HiSeq runs, the Quality control is performed by randomly selecting 10% of the reads and aligning them to the reference using the GEM mapper. Additionally, FastQC reports for the sequences are generated and provided to the users.

Eland creates files containing alignment locations and information, which can be provided to the user at their request. One file of interest may be the s_[lane#]_sequence.txt file, which contains only reads passing quality filtering, in FastQ format, and is available upon request. Reads pass filtering when the base-calling software is confident in the first X bases (12 by default) of the read.

Before providing any data, we review the results of the GERALD analysis to be sure the reads are of high quality and sufficient in number. We do so by reviewing intensity level and sequencing error reports, as well as reviewing the number of reads which match the reference unambiguously, ambiguously, and not at all. If all these standards are passed, we then pass the sequence data on to the user. Eland is a convenient tool to assess the quality of read datasets. However, it has limitations (e.g. it tolerates only two mismatches, does not allow indel errors, has read length limitation, and trims all reads to the same length). We therefore recommend to users to consider the analysis of their data employing further mapping algorithms without these restrictions (e.g. BWA, Bowtie, GEM, or others).

Details regarding the Illumina Sequencing Analysis Software were provided by the Illumina Sequencing Analysis Software User Guide for CASAVA Version 1.8 by Illumina, Inc.

Read processing with non-Illumina software

We generally use the output from the MIRO package which uses Eland, GEM, or SeqMap as alignment programs.

The first step involved is to trim the adapter for samples with small insert sizes (e.g. miRNA and small RNA samples, NlaIII or DpnII Digital Gene Expression). No sequence in the resulting file contains the adapter sequence, apart from rare cases where the adapter sequence was not detected. In this file, only one copy of identical reads is kept (along with the the read count).

After preprocessing, the reads are aligned using the most appropriate alignment software (Eland, GEM, SeqMap). Three separate files are created containing reads which mapped unambiguously (1 place in the reference), ambiguously (2 or more places), or not at all, as well as a statistics file describing the results. MIRO also possesses an option to map reads to multiple references, and a set of these 3 files is created for each reference used. All of these files are provided to users.

A URL from which the aforementioned files can be downloaded is emailed to the user once available. If such an email is not received, please contact Debayan Datta.

Application Specific Files

Below is a description of standard files available for a given application. Any additional analysis should be discussed with Heinz Himmelbauer.

ChIP-Seq and whole-genome sequencing

These reads are generally aligned using the GEM alignment program (Ribeca, Guigo et al., in preparation), called from within MIRO. Bed and wig files are then generated for all ChIP-Seq samples. These files can be loaded into various genome browsers to view the location of the reads sequenced. They can also be used in conjunction with other software, such as that of Genomatix, and various programs designed for peak calling.

miRNA sequencing

Several different files are created by MIRO during analysis of miRNA samples. By default, the following files are created for quantile normalization, linear-scaling normalization and for non-normalized data:

1. A heatmap pdf with a colour coded representation of miRNA abundance, and a file containing a legend for the heatmap.
2. A lineplot pdf. This is primarily intended for viewing miRNA abundance for samples which represent progressive time points.
3. Scatterplot pdfs showing the correlation between samples.
4. Tables containing a list of miRNAs, and the number of reads found for each miRNA.

The following files may be created upon request:

1. A text file containing clusters of overlapping reads.
2. A text file containing clustered reads using a sliding window approach
3. Logos may be created for selected miRNAs (see WebLogo 3.0)
4. miRNA profiles which show the position of the reads on the reference sequence. The RNA folding prediction is also displayed.
5. 'bed' and 'wig' files for display with the UCSC Genome Browser
6. Files which are compatible with miRDeep for de novo miRNA identification

In depth descriptions of all these files, and the columns in each table, can be found in the MIRO documentation

File Formats

fastq (Gerald)

@HWI-ST227:163:D0DV5ACXX:5:2105:3514:72813 1:N:0:

GTCATTGTGAGCATTTTCATCCCGAAGTTGCGGCTCATTCTGATTCTGAA

+

CCCFFFFFHHHHHJJJJJIJJJJJJJJJJJIJJJIJBFIGDD>FHBDD@F

The first line contains various identifiers such as the instrument name, lane, x- and y- coordinates of the cluster, whether the read passed filtering or not etc. The second line contains the sequence, while the fourth line contains the quality string

qseq.txt (Gerald)

SOLEXA 90403 4 1 23 1566 0 1 ACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCG `aaaaa```aZa^`]a``a``a]a^`a\Y^`^^]V` 1

Relevant information is contained in the last 4 columns. They are:

  1. read number (1 or 2 for paired end runs)

  2. sequence

  3. quality string

  4. filtering (1 for passed, 0 for failed).

These files typically range between 500 Megabytes and 1.5 Gigabytes in size each. A single file will be created for each lane if requested.

seq.txt (Gerald)

4 1 23 1566 ACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCG

The first 4 columns are an ID that provides location details of the cluster in the flow cell, followed by the read sequence. Seq.txt file size typically ranges between 250 Megabytes and 1 Gigabyte. One file is generated per lane.

s_[lane]_sequence.txt (Gerald)

@SOLEXA:4:1:22:769#0/1 ATTTCCTGAGCTTAATGCTTGGGAGCGTGCTGGTGC +SOLEXA:4:1:22:769#0/1 aaba_^``^^\_`_`Z_^_^Z^_^[^\WZW`^UW_Z

Essentially the same information as in the qseq file, but in FastQ format and only including reads passing filtering. These files typically range between 250 Mb and 1 Gigabyte in size each.

unambiguous/ambiguous.txt (Miro)

3070||Count=45 GCCAAGCCCGTTCCCTTGGCTGTGGTTTCGCTTGATAGTA chr8 1 2 F 70764945

The columns in these files are as follows:

  1. ID||Count, where count equals the total number of identical reads found

  2. Sequence

  3. Reference sequence (Chromosome number, transcript name, etc)

  4. Number of locations in the reference to which the read was mapped (1 for unambiguous, 2 or more for ambiguous)

  5. Number of mismatches in alignment

  6. Strand

  7. Location within reference sequence

These files typically range between 50 and 100 Megabytes in size each. Typically these are created for each lane, but can be pooled by sample if desired.

Bed files (Miro)

track name=region 1 description="unknown" useScore=1  chr22  23901283        23901298        id:715          600     -

The columns in these files are as follows:

  1. Reference ID (eg chromosome)

  2. Start

  3. End

  4. Sequence ID

  5. Score

  6. Strand (+ or -)

These files typically range between 100 and 500 Megabytes. For more details on bed files created, visit, http://seq.crg.es/download/software/Miro/doc/Hit2Bed.html .

Wig files (Miro)

track type=wiggle_0 name="unknown" description="variableStep format" visibility=full  variableStep chrom=chr7 span=1000  156001         30 

The columns in these files are as follows:

  1. Start

  2. Number of sequences found at Start + span (ie 156001 to 15700 0)

These files typically range between 5 and 50 Megabytes. For more details on bed files created, visit http://seq.crg.es/download/software/Miro/doc/Hit2Wiggle.html .

Data Storage

As of November 7th, 2011, data will be deleted six months after they have been made available to the user. This policy includes the sequence data which the user has received, as well as other files (such as intensities) which we make available to users upon request. Beyond six months, the Unit will resume no responsibility on data that have not been collected.

 
 Ultrasequencing Unit
Centre for Genomic Regulation (CRG)
Barcelona, Spain