run_Multimapper.pl - A Perl script for mapping several Solexa lanes to multiple reference sequences
    # Start mapping using the default settings file
    run_Multimapper.pl

    # Start mapping using an alternative settings file
    run_Multimapper.pl alternative_settings.txt
Display help for the MIRO-multimapper
A solexa sequence file. For example:
    1 1 26 688 AAAACACCAACAAAACAACCAAAAATAATACAACAA
    1 1 117 645 GCACCAACAACAAGCAAAAAACGACTAAACACACAA
    1 1 391 248 GTAAGCACTCCCCTATCCTGTCAGTTGCCTAGTATA
    1 1 24 746 ACTCAACACGAAACAAACCAAAACGACAAAAACACA
A short ID for the reads which allows the user to identify the reads, e.g.: time0, time3, time6 etc
The output directory
The temporary directory used by the Perl modules and by Gem
The short-read mapper to be used by MIRO. Either Eland, Gem, SOAP or Seqmap.
The path to the mapper specified above. This has to be the directory which contains the mapper. (GEM has to be added to the environment variable PATH; when using GEM this option is therefore ignored.)
The number of mismatches allowed for mapping. Attention: not all mappers support this feature, and different mappers react differently. Eland allows only two mismatches. Consult the documentation of the different mappers for more information.
The path to the reference sequence which will be used by gem. May either be the path to a fasta file or a preindexed .map file. If no .map file is specified, the fasta file will be moved to the temporary directory and indexed there.
A short identifier for the reference sequence, e.g.: mRNA, mirnastar, whole-genome etc
The mapping mode. Specifies what should be mapped to the given reference sequence. The mode may either be i (input), pnm (previous no-matches), pm (previous matches), pi (previous input).
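The four mapping modes can be read as a rule for choosing which reads are passed on to the next reference sequence. The following is a minimal sketch in Python, used here only for illustration. The function name `select_reads` and the dictionary layout are hypothetical and do not come from run_Multimapper.pl.

```python
# Hypothetical sketch: none of these names are taken from run_Multimapper.pl.

def select_reads(mode, original_input, previous):
    """Return the reads to map against the current reference sequence.

    mode           -- one of "i", "pnm", "pm", "pi"
    original_input -- the reads of the input file
    previous       -- dict with the keys "input", "matches" and "no_matches"
                      describing the previous mapping step, or None
    """
    if mode == "i":
        return original_input
    if previous is None:
        raise ValueError("mode %r needs a previous mapping step" % mode)
    if mode == "pnm":
        return previous["no_matches"]
    if mode == "pm":
        return previous["matches"]
    if mode == "pi":
        return previous["input"]
    raise ValueError("unknown mapping mode: %r" % mode)


# Example: the first reference is mapped with mode i; the reads that did not
# match are then passed to the second reference with mode pnm.
reads = ["r1", "r2", "r3"]
step1 = {"input": reads, "matches": ["r1"], "no_matches": ["r2", "r3"]}
second_input = select_reads("pnm", reads, step1)
print(second_input)
```

Because pnm, pm and pi all refer to a previous step, only mode i makes sense for the first reference sequence.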
Write the preprocessing log to the specified file. no default (no file)
Write the preprocessing log to STDOUT (console); 0..no, 1..yes; default: 1
The maximum number of 'N'-letters in a row. Solexa reads having a higher number of 'N'-letters will be discarded. default: 1
The minimum length of the reads (after adaptor removal). Shorter reads will be discarded. default: 15
The maximum length of reads (after adaptor removal). Longer reads will be discarded. default: 32
The minimum number of counts for a sequence (after aggregating). Sequences having fewer counts will be discarded. default: 1
The adaptor sequence which will be removed (if identified); default: TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAA
Should reads which exceed MAX_LENGTH be trimmed to length MAX_LENGTH? The 3'-end will be trimmed. 0..no, 1..yes; default: 1
The script calculates the complexity of each sequence using a simple equation. Reads having a complexity lower than the specified min. value will be discarded. default: 0.5
Indicate whether the adaptor sequence should be removed. For certain applications like ChIP-Seq there is no point in removing an adaptor sequence. 0..no, 1..yes; default: 1
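The length and 'N' filters described above can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not the code of run_Preprocessing.pl. The function names are hypothetical, the order of the checks is assumed, and adaptor removal, complexity filtering and count aggregation are left out.

```python
import re

# Defaults copied from the option descriptions above.
MAX_N = 1
MIN_LENGTH = 15
MAX_LENGTH = 32
TRIM = 1


def longest_n_run(seq):
    """Length of the longest run of consecutive 'N' letters in seq."""
    runs = re.findall(r"N+", seq)
    return max((len(run) for run in runs), default=0)


def filter_read(seq, max_n=MAX_N, min_length=MIN_LENGTH,
                max_length=MAX_LENGTH, trim=TRIM):
    """Return the (possibly trimmed) read, or None if it is discarded.

    Assumption: the read has already had its adaptor removed.
    """
    if longest_n_run(seq) > max_n:
        return None  # too many 'N' letters in a row
    if len(seq) < min_length:
        return None  # too short
    if len(seq) > max_length:
        if not trim:
            return None  # too long and trimming is switched off
        seq = seq[:max_length]  # cut at the 3'-end, keep the 5'-end
    return seq


# The example from the settings file: three 'N' in a row exceed a MAX_N of 2.
print(filter_read("GAANNNGAGAGTCT", max_n=2, min_length=1))
```

With TRIM set to 0 a read longer than MAX_LENGTH is discarded, with TRIM set to 1 it is shortened at the 3'-end and kept.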
The MIRO-multimapper maps any number of Solexa read files to any number of reference sequences.
The MIRO-multimapper also preprocesses the reads.
The MIRO-multimapper should be regarded as a wrapper for the two scripts run_Preprocessing.pl and run_Mapping.pl.
The MIRO-multimapper is merely a convenience script which does not provide new functionality; instead, it allows convenient processing of similar Solexa lanes.
The main disadvantage of the MIRO-multimapper is that the Solexa lanes are processed and mapped sequentially, so a run may take a long time to finish.
Since the multimapper is a wrapper for the two scripts run_Preprocessing.pl and run_Mapping.pl, we refer to the documentation of these scripts for details.
A number of Solexa sequence files, where usually one file represents one lane. Example:
    1 1 26 688 AAAACACCAACAAAACAACCAAAAATAATACAACAA
    1 1 117 645 GCACCAACAACAAGCAAAAAACGACTAAACACACAA
    1 1 391 248 GTAAGCACTCCCCTATCCTGTCAGTTGCCTAGTATA
    1 1 24 746 ACTCAACACGAAACAAACCAAAACGACAAAAACACA
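Each line of the example consists of four numeric columns followed by the read sequence. The sketch below shows one way to split such a line in Python. The column names lane, tile, x and y are an assumption based on the usual Solexa sequence file layout and are not stated in this documentation; the function name is hypothetical.

```python
def parse_solexa_line(line):
    """Split one line of a Solexa sequence file into its five fields.

    Assumption: the four numeric columns are lane, tile, x and y.
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError("expected 5 whitespace-separated fields: %r" % line)
    lane, tile, x, y = (int(field) for field in fields[:4])
    return {"lane": lane, "tile": tile, "x": x, "y": y,
            "sequence": fields[4]}


record = parse_solexa_line("1 1 26 688 AAAACACCAACAAAACAACCAAAAATAATACAACAA")
print(record["sequence"])
```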
The MIRO-multimapper generates five files for each specified reference sequence. If, for example, 4 Solexa read-lanes should successively be mapped to 3 reference sequences, the MIRO-multimapper will generate 60 (4x3x5) output files.
A file containing the hits which could unambiguously be mapped to the reference sequence. A hit is considered to be unambiguously mapped if no other hit could be identified having an equal (or lower) number of mismatches.
A file containing the hits which could only ambiguously be mapped to the reference sequence. A hit is considered to be ambiguously mapped if at least one additional hit could be identified having an identical number of mismatches.
The reads which could not be mapped to the reference sequence in fasta format.
The reads which could be mapped to the reference sequence in fasta format.
A plain text file which contains mapping-statistics in human readable format. See Mapping Statistics
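The number of output files follows directly from the number of lanes, the number of reference sequences and the five files listed above. A small Python sketch of this arithmetic (the function name is hypothetical):

```python
# Five files per lane and reference sequence: unambiguous hits, ambiguous hits,
# unmapped reads, mapped reads and the mapping statistics.
FILES_PER_MAPPING = 5


def expected_output_files(n_lanes, n_references):
    """Number of output files for n_lanes mapped to n_references."""
    return n_lanes * n_references * FILES_PER_MAPPING


print(expected_output_files(4, 3))  # 60
```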
    #
    # Settings file for miRNA-pipeline Multimapper
    # This file is only relevant for the script: run_Multimapper.pl
    #
    # MULTIMAPPER MULTIMAPPER MULTIMAPPER MULTIMAPPER MULTIMAPPER MULTIMAPPER MULTIMAPPER
    #
    # To run the mapping, edit this file, specify the input and output files, and
    # simply type: perl run_Multimapper.pl #uses the default settings file
    # or: perl run_Multimapper.pl alternative_settings_file.txt
    #
    # This setting file contains:
    # 1: comments which are marked with a #
    #    e.g.: # this is a comment
    # 2: options which are in uppercase, ending with the symbol '=' followed by the argument
    #    e.g.: INPUT_FILE=/home/usr/solexa_reads
    #
    # The settings file tolerates:
    # 1: any number of newlines between the different options
    # 2: any number of comments marked with a #
    # 3: tabs or spaces before and after each option
    # 4: rearrangement of the options within the file
    #
    # The setting file does not tolerate
    # 1: space, newline, tab before or after the symbol '='
    #    e.g: INPUT_FILE = /home/usr/solexareads -> this will not be accepted
    # 2: more than one option in one line
    # 3: arguments should not contain whitespaces or special characters
    # 4: arguments without values
    #
    # In case of an error or bug please contact robert.kofler@crg.es

    ##
    ## INPUT - OUTPUT
    ##
    # Specify the input files for the Multimapper
    # have to be a file containing solexa read sequences
    INPUT_FILE_1=/home/usr/solexa/D.mel_embryo_day1
    INPUT_FILE_ID_1=day1
    INPUT_FILE_2=/home/usr/solexa/D.mel_embryo_day2
    INPUT_FILE_ID_2=day2
    INPUT_FILE_3=/home/usr/solexa/D.mel_embryo_day3
    INPUT_FILE_ID_3=day3
    INPUT_FILE_4=
    INPUT_FILE_ID_4=
    INPUT_FILE_5=
    INPUT_FILE_ID_5=
    INPUT_FILE_6=
    INPUT_FILE_ID_6=
    INPUT_FILE_7=
    INPUT_FILE_ID_7=

    # Specify the output file. This file will contain the new miRNA candidates
    OUTPUT_DIR=/home/rkofler/playground/pipe/multimapper

    # Specify the temporary directory of your machine
    TEMP_DIR=/home/rkofler/tmp

    ##
    ## REFERENCE SEQUENCES
    ##
    #Specify what should be mapped to the reference sequence
    # i..the input file specified above
    # pnm..previous no matches, i.e. the reads which could not be mapped to the previous reference sequence
    # pm...previous matches, i.e. the reads which could be mapped to the previous reference sequence
    # pi...previous input, i.e. the input file for the previous reference sequence
    REFERENCE_SEQUENCE_FILE_1=/home/usr/mature.fa
    REFERENCE_SEQUENCE_ID_1=mat
    REFERENCE_SEQUENCE_MODE_1=i #Attention the first mode has always to be i
    REFERENCE_SEQUENCE_FILE_2=/home/usr/hairpin.fa
    REFERENCE_SEQUENCE_ID_2=hp
    REFERENCE_SEQUENCE_MODE_2=pnm
    REFERENCE_SEQUENCE_FILE_3=
    REFERENCE_SEQUENCE_ID_3=
    REFERENCE_SEQUENCE_MODE_3=
    REFERENCE_SEQUENCE_FILE_4=
    REFERENCE_SEQUENCE_ID_4=
    REFERENCE_SEQUENCE_MODE_4=
    REFERENCE_SEQUENCE_FILE_5=
    REFERENCE_SEQUENCE_ID_5=
    REFERENCE_SEQUENCE_MODE_5=
    REFERENCE_SEQUENCE_FILE_6=
    REFERENCE_SEQUENCE_ID_6=
    REFERENCE_SEQUENCE_MODE_6=
    REFERENCE_SEQUENCE_FILE_7=
    REFERENCE_SEQUENCE_ID_7=
    REFERENCE_SEQUENCE_MODE_7=

    ##
    ## Preprocessing
    ##
    #specify whether an adaptor should be removed; 1..yes 0..no
    REMOVE_ADAPTOR=1

    # Specify the sequence of the adaptor
    ADAPTOR_SEQ=TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAA

    # Specify the maximum number of 'N's in a row, which should be tolerated
    # This is a kind of quality control, for example the following sequence will be removed
    # using a MAX_N of 2: GAANNNGAGAGTCT
    MAX_N=1

    # Specify the minimum length of a read
    MIN_LENGTH=15

    # Specify the maximum length of a read
    MAX_LENGTH=32

    # Trim sequences to max_length if they exceed max_length 0..no 1..yes
    TRIM=1

    # Specify the minimum complexity of a read (for details see below)
    MIN_COMPLEXITY=0.5

    # Specify the minimum number of counts for a read
    # At this step it is recommended to leave this at 1
    MIN_COUNTS=1

    ##
    ## Mapping
    ##
    #Specify the mapper which should be used
    #Mappers supported at the moment: Eland, Seqmap, Soap, Gem
    MAPPER=eland

    #Specify the path of the directory in which the mapper can be found. The directory!!
    #see Additional comments at bottom
    MAPPER_PATH=/home/usr/programs/eland

    ##
    ## Pipeline Logging
    ##
    # output log to console 0..no 1..yes
    CONSOLE=1

    # output log into file; leave blank if you do not want to log the process
    LOGFILE=

    __END__ #do not remove this entry; Options coming after this line will be ignored!
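The rules stated in the header of the settings file are simple enough to sketch a reader for the format. The following Python sketch is an illustration under assumptions, not the parser used by run_Multimapper.pl. The function name `parse_settings` is hypothetical, comments after an option on the same line are assumed to be allowed, and blank arguments are kept as empty strings because the sample file leaves unused slots blank.

```python
def parse_settings(text):
    """Parse settings-file text into a dict mapping OPTION to its argument."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("__END__"):
            break  # everything after this entry is ignored
        # Assumption: a '#' starts a comment anywhere in the line.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or pure comment
        if "=" not in line:
            raise ValueError("not an option: %r" % raw_line)
        key, value = line.split("=", 1)
        # No whitespace is tolerated around '=' or inside the argument.
        if any(char.isspace() for char in key + value):
            raise ValueError("whitespace not accepted: %r" % raw_line)
        settings[key] = value
    return settings


example = """\
# this is a comment
INPUT_FILE_1=/home/usr/solexa/D.mel_embryo_day1
  INPUT_FILE_ID_1=day1
REFERENCE_SEQUENCE_MODE_1=i #Attention the first mode has always to be i
LOGFILE=
__END__ #do not remove this entry
IGNORED=1
"""
settings = parse_settings(example)
print(settings["INPUT_FILE_ID_1"])
```

A line such as `INPUT_FILE = /home/usr/solexareads` raises a `ValueError` in this sketch, which mirrors the rule that no whitespace is accepted around the '=' symbol.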
Usually one Solexa lane represents one sample, for example a stage of cell development or a tissue. The input has to be provided in the following format:
    INPUT_FILE_1=/home/usr/solexa/D.mel_embryo_day1
    INPUT_FILE_ID_1=day1
In contrast to the script run_Mapping.pl, the MIRO-multimapper allows several input files to be specified, for example:
    INPUT_FILE_1=/home/usr/solexa/D.mel_embryo_day1
    INPUT_FILE_ID_1=day1
    INPUT_FILE_2=/home/usr/solexa/D.mel_embryo_day2
    INPUT_FILE_ID_2=day2
    INPUT_FILE_3=/home/usr/solexa/D.mel_embryo_day3
    INPUT_FILE_ID_3=day3
If not all specified input files are required just leave them blank. Note that both the ID and the input_file have to be blank. For example:
    INPUT_FILE_1=/home/usr/solexa/D.mel_embryo_day1
    INPUT_FILE_ID_1=day1
    INPUT_FILE_2=
    INPUT_FILE_ID_2=
    INPUT_FILE_3=
    INPUT_FILE_ID_3=
If the number of specified input files is not sufficient, just add new ones and increase the counter. For example:
    INPUT_FILE_1=/home/usr/solexa/D.mel_embryo_day1
    INPUT_FILE_ID_1=day1
    INPUT_FILE_2=/home/usr/solexa/D.mel_embryo_day2
    INPUT_FILE_ID_2=day2
    INPUT_FILE_3=/home/usr/solexa/D.mel_embryo_day3
    INPUT_FILE_ID_3=day3
    ...
    ...
    ...
    INPUT_FILE_99=/home/usr/solexa/D.mel_embryo_week4
    INPUT_FILE_ID_99=week4
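The numbered INPUT_FILE_n and INPUT_FILE_ID_n options can be collected by their counter, skipping slots in which both entries are blank. The Python sketch below illustrates this. The function name `collect_inputs` is hypothetical, and raising an error for a half-filled slot is an assumption about how such a case could be handled.

```python
import re


def collect_inputs(settings):
    """Return a sorted list of (counter, file, id) for all used input slots."""
    counters = set()
    for key in settings:
        match = re.fullmatch(r"INPUT_FILE(?:_ID)?_(\d+)", key)
        if match:
            counters.add(int(match.group(1)))

    inputs = []
    for n in sorted(counters):
        path = settings.get("INPUT_FILE_%d" % n, "")
        file_id = settings.get("INPUT_FILE_ID_%d" % n, "")
        if not path and not file_id:
            continue  # unused slot: both entries are blank
        if not path or not file_id:
            raise ValueError(
                "INPUT_FILE_%d and INPUT_FILE_ID_%d must both be set "
                "or both be blank" % (n, n))
        inputs.append((n, path, file_id))
    return inputs


settings = {
    "INPUT_FILE_1": "/home/usr/solexa/D.mel_embryo_day1",
    "INPUT_FILE_ID_1": "day1",
    "INPUT_FILE_2": "",
    "INPUT_FILE_ID_2": "",
    "INPUT_FILE_99": "/home/usr/solexa/D.mel_embryo_week4",
    "INPUT_FILE_ID_99": "week4",
}
inputs = collect_inputs(settings)
print(inputs)
```

The counters do not need to be consecutive in this sketch; only the pairing of file and ID under the same counter matters.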
Independent of the console log (--logconsole) and the file log (--logfile), the log of preprocessing is always sent to the MIRO logger.
Perl 5.8 or higher
At least one of the following mappers: Eland, Seqmap, SOAP, GEM
Robert Kofler
Ana Vivancos
Matt Ingham
Heinz Himmelbauer
robert.kofler at crg.es