NAME

run_Preprocessing.pl - Preprocesses solexa reads into a map-able format

SYNOPSIS

 # Start preprocessing using the default settings file: Settings/Settings_Preprocessing.txt
 run_Preprocessing.pl
 
 # Start preprocessing using an alternative settings file
 run_Preprocessing.pl alternative_settings.txt
 
 # Start preprocessing using the command line
 # The displayed parameters are the default params (except input and output file)
 run_Preprocessing.pl --input lane1 --output preprocessed.fa --outputNoAdaptor noadaptor.fa
 --temp "/tmp" --max_N 1 --min_length 15 --max_length 32 --min_counts 1
 --adaptor "TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAA" --trim 1 --min_complexity 0.5
 --remove_adaptor 1 --logfile preproc.log --logconsole 1
 
 # Start preprocessing using the minimum number of required arguments
 run_Preprocessing.pl --input lane1 --output preprocessed.fa

OPTIONS

--input

the input file; has to be a solexa seq file; Mandatory parameter

 1       1       26      688     AAAACACCAACAAAACAACCAAAAATAATACAACAA
 1       1       117     645     GCACCAACAACAAGCAAAAAACGACTAAACACACAA
 1       1       391     248     GTAAGCACTCCCCTATCCTGTCAGTTGCCTAGTATA
 1       1       24      746     ACTCAACACGAAACAAACCAAAACGACAAAAACACA

--output

the output file; will be multiple fasta format; Preprocessing aggregates reads having an identical sequence. The counts for each sequence will be displayed in the fasta-header; Mandatory parameter

 >1293||Count=1
 TAAGATGTTGATATATGNTTTGTCCGAATNCG
 >1294||Count=177
 GCATTGAAGGGACTGCA
 >1295||Count=1
 TGTGATGTGCATGCGTGCTTTCTACTTTCCCC
 >1296||Count=4
 TGTGATGTGCATTTGTTCTTTTTCTTTTTCTT

--outputNoAdaptor

Similar to --output. Reads for which no adaptor could be identified will be written into a separate output file. This is optional, if not provided reads without identified adaptor will be written into --output

--temp

The temporary directory which will be used; Make sure that the temporary directory contains about 3Gb of free disk space. default: "/tmp"

--max_N

The maximum number of 'N'-letters in a row. Solexa reads having a higher number of 'N'-letters will be discarded. default: 1

--min_length

The minimum length of the reads (after adaptor removal). Shorter reads will be discarded. default: 15

--max_length

The maximum length of reads (after adaptor remvoal). Longer reads will be discarded. default: 100

--min_counts

The minimum number of counts for a sequence (after aggregating). Sequences having less counts will be discarded. default: 1

--adaptor

The adaptor sequence which will be removed (if identified); default: TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAA

--trim

Should reads which exceed --max_length be trimmed to length --max_length. The 3'-end will be trimmed. This implies that no reads will be discarded because of a length exceeding --max_length; 0..no, 1..yes; default: 0

--min_complexity

Low complexity sequences like AAAAAAAGAAAAAAAAAAAA should be removed prior to mapping since they distort the mapping statistics. This low complexity sequences can be mapped to almost any genome and they are thus not informative. The script calculates the complexity of each sequence using a simple equation. Reads having a complexity lower than the specified min. value will be discarded. default: 0.5

 The complexity for a read is calculated as:
 c=1-fA^2-fT^2-fC^2-fG^2
 Wheras fA, fT, fC, fG is the frequency of the corresponding nucleotide in the read
 
 Examples:
 AAAAAAAAAAAAAAAA c = 0.00  # this is the lowest possible complexity
 GAAAAAAAAAAAAAAG c = 0.21
 AGAGAGAGAGAGAGAG c = 0.50
 ACGTACGTACGTACGT c = 0.75 # this is the highest possible complexity

--remove_adaptor

Indicate whether the adaptor sequence should be removed. For certain applications like ChipSeq there is no point in removing an adaptor sequence. 0..no, 1..yes; default: 1

--logfile

Write the preprocessing log to the specified file. no default (no file)

--logconsole

Write the preprocessing log to STDOUT (console); 0..no, 1..yes; default: 1

--help

Display the help pages

DESCRIPTION

Input Output

The input file has to be a solexa sequence file for example:

 1       1       26      688     AAAACACCAACAAAACAACCAAAAATAATACAACAA
 1       1       117     645     GCACCAACAACAAGCAAAAAACGACTAAACACACAA
 1       1       391     248     GTAAGCACTCCCCTATCCTGTCAGTTGCCTAGTATA
 1       1       24      746     ACTCAACACGAAACAAACCAAAACGACAAAAACACA

The output file will be a multiple fasta file containing only the tallied unique sequences. The number of actually counted reads for the given sequence have to be extracted from the fasta ID: Example:

 >1293||Count=1
 TAAGATGTTGATATATGNTTTGTCCGAATNCG
 >1294||Count=177
 GCATTGAAGGGACTGCA
 >1295||Count=1
 TGTGATGTGCATGCGTGCTTTCTACTTTCCCC
 >1296||Count=4
 TGTGATGTGCATTTGTTCTTTTTCTTTTTCTT

Workflow

Preprocessing proceeds in several distinct steps:

First the character '.' will be converted to 'N' and reads having more than --max_N in a row will be discarded

Identical sequences will be aggregated and counts will be tallied.

 # For example: the two sequences
 >5
 ACTCCCGTGCTGGCCTCCCTGCGCCCCCCCCC
 >6
 ACTCCCGTGCTGGCCTCCCTGCGCCCCCCCCC
 
 will be aggregated to:
 
 >1||Count=2
 ACTCCCGTGCTGGCCTCCCTGCGCCCCCCCCC

Attention, unique read identifiers are lost during aggregating. New IDs will be assigned

Adaptor sequences will be removed, if identified. The algorithm is especially adapted to the idiosyncracies of Solexa sequencing, i.e an increasing error rate towards the 3'-end of the reads. Thus the mismatch penalty during adaptor identification decreases lineary with distance from the 5'-end. Adaptor removal is computationally very demanding and requires a lot of time. This is an optional step
If the --trim property is set sequences exceeding in length --max_length will be trimmed to --max_length. This is an optional step.
After adaptor removal and trimming, sequences which have not been identical before might be identical. It is therefore necessary to repeat aggregating identical reads. This step is only executed if adaptors have been removed or trimming has been done. Reads for which the adaptor has been identified will be kept separated from reads for which the adaptor has not been identified, even if they have identical sequences (e.g.: after trimming).
Low complexity sequences will be removed. See --min_complexity
Discard reads which do not meet the specified criteria. Discard all reads which are shorter than --min_length, longer than --max_length or have less than --min_counts counts.
Distribute the reads into the different output files
Calculate the preprocessing statistics, for example the read length distribution.

The Settings file

Alternatively to the OPTIONS it is possible to specify a settings file. The different parameters will be read from the settings file. You may either use the default settings file Settings/Settings_Preprocessing.txt or you may specify an alternative settings file. If you want to specify an alternative settings file we recommend to copy the default file and edit the parameters accordingly. The parameters in the Settings file exactly mirror the options in OPTIONS. Comprehensive details are provided in the settings file

Preprocessing Statistics

Miro automatically calculates preprocessing statistics. This statistics are included into the log. The statistics contain the number of reads initially found in the solexa file, number of reads passing the preprocessing and the read length distribution.

Example of preprocessing statistics:

 PREPROCESS STATISTICS:
 Total reads initialy found: 999
 Total reads passing preprocessing: 991
 Unique sequences: 758
 
 Statistics for reads WITH an identified adaptor
 Total reads: 259
 Unique sequences: 181
 Average counts per sequence: 1.43
 
 Read length profile:
 length seq     reads
 21     2       2
 22     2       2
 23     7       16
 25     3       3
 26     4       4
 27     5       5
 28     12      13
 29     44      64
 30     23      23
 31     53      96
 32     26      31 
 
 Statistics for reads WITHOUT an identified adaptor
 Total reads: 732
 Unique sequences: 577
 Average counts per sequence: 1.27 
 
 Read length profile:
 length seq     reads
 36     577     732

length: The length of the reads after preprocessing. If no adaptor removing step was chosen or no adaptors have been found all reads should have exactly the same length.
seq: The number of unique sequences having the given length.
reads: The total number of reads having the given length.

Logging

Independent of the console log --logconsole and the file-log --logfile the log of preprocessing is always sent to the MIRO logger.

Subsequent Steps

run_Mapping.pl

Map fasta formated input file (one lane) to multiple reference sequences.

REQUIREMENTS

Perl 5.8 or higher

AUTHORS

Robert Kofler

Juliane Dohm

Matthew Ingham

Heinz Himmelbauer

CONTACT

robert.kofler at crg.es