run_Preprocessing.pl - Preprocesses solexa reads into a map-able format
# Start preprocessing using the default settings file: Settings/Settings_Preprocessing.txt run_Preprocessing.pl # Start preprocessing using an alternative settings file run_Preprocessing.pl alternative_settings.txt # Start preprocessing using the command line # The displayed parameters are the default params (except input and output file) run_Preprocessing.pl --input lane1 --output preprocessed.fa --outputNoAdaptor noadaptor.fa --temp "/tmp" --max_N 1 --min_length 15 --max_length 32 --min_counts 1 --adaptor "TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAA" --trim 1 --min_complexity 0.5 --remove_adaptor 1 --logfile preproc.log --logconsole 1 # Start preprocessing using the minimum number of required arguments run_Preprocessing.pl --input lane1 --output preprocessed.fa
the input file; has to be a solexa seq file; Mandatory parameter
1 1 26 688 AAAACACCAACAAAACAACCAAAAATAATACAACAA 1 1 117 645 GCACCAACAACAAGCAAAAAACGACTAAACACACAA 1 1 391 248 GTAAGCACTCCCCTATCCTGTCAGTTGCCTAGTATA 1 1 24 746 ACTCAACACGAAACAAACCAAAACGACAAAAACACA
the output file; will be multiple fasta format; Preprocessing aggregates reads having an identical sequence. The counts for each sequence will be displayed in the fasta-header; Mandatory parameter
>1293||Count=1 TAAGATGTTGATATATGNTTTGTCCGAATNCG >1294||Count=177 GCATTGAAGGGACTGCA >1295||Count=1 TGTGATGTGCATGCGTGCTTTCTACTTTCCCC >1296||Count=4 TGTGATGTGCATTTGTTCTTTTTCTTTTTCTT
Similar to --output
. Reads for which no adaptor could be identified will be written into a separate output file.
This is optional, if not provided reads without identified adaptor will be written into --output
The temporary directory which will be used; Make sure that the temporary directory contains about 3Gb of free disk space. default: "/tmp"
The maximum number of 'N'-letters in a row. Solexa reads having a higher number of 'N'-letters will be discarded. default: 1
The minimum length of the reads (after adaptor removal). Shorter reads will be discarded. default: 15
The maximum length of reads (after adaptor remvoal). Longer reads will be discarded. default: 100
The minimum number of counts for a sequence (after aggregating). Sequences having less counts will be discarded. default: 1
The adaptor sequence which will be removed (if identified); default: TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAA
Should reads which exceed --max_length
be trimmed to length --max_length
.
The 3'-end will be trimmed. This implies that no reads will be discarded because of a length exceeding --max_length
;
0..no, 1..yes; default: 0
Low complexity sequences like AAAAAAAGAAAAAAAAAAAA
should be removed prior to mapping since they distort the mapping statistics.
This low complexity sequences can be mapped to almost any genome and they are thus not informative.
The script calculates the complexity of each sequence using a simple equation. Reads having a complexity lower than the specified min. value will be discarded.
default: 0.5
The complexity for a read is calculated as: c=1-fA^2-fT^2-fC^2-fG^2 Wheras fA, fT, fC, fG is the frequency of the corresponding nucleotide in the read Examples: AAAAAAAAAAAAAAAA c = 0.00 # this is the lowest possible complexity GAAAAAAAAAAAAAAG c = 0.21 AGAGAGAGAGAGAGAG c = 0.50 ACGTACGTACGTACGT c = 0.75 # this is the highest possible complexity
Indicate whether the adaptor sequence should be removed. For certain applications like ChipSeq there is no point in removing an adaptor sequence. 0..no, 1..yes; default: 1
Write the preprocessing log to the specified file. no default (no file)
Write the preprocessing log to STDOUT (console); 0..no, 1..yes; default: 1
Display the help pages
The input file has to be a solexa sequence file for example:
1 1 26 688 AAAACACCAACAAAACAACCAAAAATAATACAACAA 1 1 117 645 GCACCAACAACAAGCAAAAAACGACTAAACACACAA 1 1 391 248 GTAAGCACTCCCCTATCCTGTCAGTTGCCTAGTATA 1 1 24 746 ACTCAACACGAAACAAACCAAAACGACAAAAACACA
The output file will be a multiple fasta file containing only the tallied unique sequences. The number of actually counted reads for the given sequence have to be extracted from the fasta ID: Example:
>1293||Count=1 TAAGATGTTGATATATGNTTTGTCCGAATNCG >1294||Count=177 GCATTGAAGGGACTGCA >1295||Count=1 TGTGATGTGCATGCGTGCTTTCTACTTTCCCC >1296||Count=4 TGTGATGTGCATTTGTTCTTTTTCTTTTTCTT
Preprocessing proceeds in several distinct steps:
First the character '.' will be converted to 'N' and reads having more than --max_N
in a row will be discarded
Identical sequences will be aggregated and counts will be tallied.
# For example: the two sequences >5 ACTCCCGTGCTGGCCTCCCTGCGCCCCCCCCC >6 ACTCCCGTGCTGGCCTCCCTGCGCCCCCCCCC will be aggregated to: >1||Count=2 ACTCCCGTGCTGGCCTCCCTGCGCCCCCCCCC
Attention, unique read identifiers are lost during aggregating. New IDs will be assigned
Adaptor sequences will be removed, if identified. The algorithm is especially adapted to the idiosyncracies of Solexa sequencing, i.e an increasing error rate towards the 3'-end of the reads. Thus the mismatch penalty during adaptor identification decreases lineary with distance from the 5'-end. Adaptor removal is computationally very demanding and requires a lot of time. This is an optional step
If the --trim
property is set sequences exceeding in length --max_length
will be trimmed to --max_length
. This is an optional step.
After adaptor removal and trimming, sequences which have not been identical before might be identical. It is therefore necessary to repeat aggregating identical reads. This step is only executed if adaptors have been removed or trimming has been done. Reads for which the adaptor has been identified will be kept separated from reads for which the adaptor has not been identified, even if they have identical sequences (e.g.: after trimming).
Low complexity sequences will be removed. See --min_complexity
Discard reads which do not meet the specified criteria.
Discard all reads which are shorter than --min_length
, longer than --max_length
or have less than --min_counts
counts.
Distribute the reads into the different output files
Calculate the preprocessing statistics, for example the read length distribution.
Alternatively to the OPTIONS
it is possible to specify a settings file.
The different parameters will be read from the settings file.
You may either use the default settings file Settings/Settings_Preprocessing.txt or you may specify an alternative settings file.
If you want to specify an alternative settings file we recommend to copy the default file and edit the parameters accordingly.
The parameters in the Settings file exactly mirror the options in OPTIONS
.
Comprehensive details are provided in the settings file
Miro automatically calculates preprocessing statistics. This statistics are included into the log. The statistics contain the number of reads initially found in the solexa file, number of reads passing the preprocessing and the read length distribution.
Example of preprocessing statistics:
PREPROCESS STATISTICS: Total reads initialy found: 999 Total reads passing preprocessing: 991 Unique sequences: 758 Statistics for reads WITH an identified adaptor Total reads: 259 Unique sequences: 181 Average counts per sequence: 1.43 Read length profile: length seq reads 21 2 2 22 2 2 23 7 16 25 3 3 26 4 4 27 5 5 28 12 13 29 44 64 30 23 23 31 53 96 32 26 31 Statistics for reads WITHOUT an identified adaptor Total reads: 732 Unique sequences: 577 Average counts per sequence: 1.27 Read length profile: length seq reads 36 577 732
The length of the reads after preprocessing. If no adaptor removing step was chosen or no adaptors have been found all reads should have exactly the same length.
The number of unique sequences having the given length.
The total number of reads having the given length.
Independent of the console log --logconsole
and the file-log --logfile
the log of preprocessing is always sent to the MIRO logger.
Map fasta formated input file (one lane) to multiple reference sequences.
See also run_Mapping.plMap several fasta formated input files (several lanes) to multiple reference sequences. run_Multimapper.html
See also run_Multimapping.pl
Perl 5.8 or higher
Robert Kofler
Juliane Dohm
Matthew Ingham
Heinz Himmelbauer
robert.kofler at crg.es