NAME

run_Multimapper.pl - A Perl script for mapping several Solexa lanes to multiple reference sequences


SYNOPSIS

 # Start mapping using the default settings file
 run_Multimapper.pl
 
 # Start mapping using an alternative settings file
 run_Multimapper.pl alternative_settings.txt


OPTIONS

--help

Display help for the MIRO-multimapper

INPUT_FILE_N

A Solexa sequence file. For example:

 1       1       26      688     AAAACACCAACAAAACAACCAAAAATAATACAACAA
 1       1       117     645     GCACCAACAACAAGCAAAAAACGACTAAACACACAA
 1       1       391     248     GTAAGCACTCCCCTATCCTGTCAGTTGCCTAGTATA
 1       1       24      746     ACTCAACACGAAACAAACCAAAACGACAAAAACACA

INPUT_ID_N

A short ID for the reads which allows the user to identify them, e.g. time0, time3, time6. In the example settings file this option is written as INPUT_FILE_ID_N.

OUTPUT_DIR

The output directory

TEMP_DIR

The temporary directory used by these Perl modules and by GEM.

MAPPER

The short-read mapper to be used by MIRO. Either Eland, Gem, SOAP or Seqmap.

MAPPER_PATH

The path to the mapper specified above. This has to be the folder which contains the mapper. (GEM has to be added to the environment variable PATH instead; when using GEM this option is therefore ignored.)

MISMATCHES

The number of mismatches allowed for mapping. Attention: not all mappers support this feature, and different mappers react differently. Eland allows at most two mismatches. Consult the documentation of the different mappers for more information.

REFERENCE_SEQUENCE_FILE_N

The path to the reference sequence which will be used by GEM. May be either the path to a FASTA file or the path to a preindexed .map file. If no .map file is specified, the FASTA file will be moved to the temporary directory and indexed there.

REFERENCE_SEQUENCE_ID_N

A short identifier for the reference sequence, e.g. mRNA, mirnastar, whole-genome.

REFERENCE_SEQUENCE_MODE_N

The mapping mode. Specifies what should be mapped to the given reference sequence. The mode has to be one of i (input), pnm (previous no-matches), pm (previous matches) or pi (previous input). The mode of the first reference sequence always has to be i.
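
The four modes can be read as a rule for choosing which reads are passed to each reference sequence in turn. The following Python sketch illustrates that rule only; it is not the script's own code, and the function name select_reads as well as the layout of the previous dictionary are hypothetical.

```python
def select_reads(mode, original_input, previous):
    """Return the reads to map against the current reference sequence.

    mode           -- one of 'i', 'pnm', 'pm', 'pi'
    original_input -- the reads of the current input file
    previous       -- dict with the keys 'input', 'matches' and 'no_matches'
                      describing the previous mapping step (hypothetical
                      layout), or None for the first reference sequence
    """
    if mode == "i":
        return original_input
    if previous is None:
        raise ValueError("the first reference sequence must use mode 'i'")
    key = {"pnm": "no_matches", "pm": "matches", "pi": "input"}.get(mode)
    if key is None:
        raise ValueError("unknown mode: %r" % mode)
    return previous[key]


# Hypothetical two-step run: map everything, then map the leftovers.
reads = ["AAAC", "GGGT", "TTTA"]
step1 = {"input": reads, "matches": ["AAAC"], "no_matches": ["GGGT", "TTTA"]}
first = select_reads("i", reads, None)       # all reads
second = select_reads("pnm", reads, step1)   # reads not mapped in step 1
```

With the modes i and pnm, as in the example settings file, only the reads that found no match on the first reference sequence are passed on to the second one.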

LOGFILE

Write the preprocessing log to the specified file. default: none (no log file is written)

CONSOLE

Write the preprocessing log to STDOUT (console); 0..no, 1..yes; default: 1

MAX_N

The maximum number of consecutive 'N' letters tolerated in a read. Solexa reads containing a longer run of 'N' letters will be discarded. default: 1

MIN_LENGTH

The minimum length of the reads (after adaptor removal). Shorter reads will be discarded. default: 15

MAX_LENGTH

The maximum length of reads (after adaptor removal). Longer reads will be discarded, or trimmed if TRIM is enabled. default: 32

MIN_COUNTS

The minimum number of counts for a sequence (after aggregating). Sequences having fewer counts will be discarded. default: 1

ADAPTOR_SEQ

The adaptor sequence which will be removed (if identified); default: TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAA

TRIM

Specifies whether reads which exceed MAX_LENGTH should be trimmed to length MAX_LENGTH. The 3'-end will be trimmed. 0..no, 1..yes; default: 1

MIN_COMPLEXITY

The script calculates the complexity of each sequence using a simple equation. Reads having a complexity lower than the specified min. value will be discarded. default: 0.5

REMOVE_ADAPTOR

Indicates whether the adaptor sequence should be removed. For certain applications like ChIP-Seq there is no point in removing an adaptor sequence. 0..no, 1..yes; default: 1
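
To make the interplay of MAX_N, MIN_LENGTH, MAX_LENGTH and TRIM concrete, here is a minimal Python sketch of these three filters. It is an illustration under assumptions, not the code of run_Preprocessing.pl: the function names are hypothetical, the order of the checks is assumed, and adaptor removal and the complexity filter are left out because their exact rules are not given here.

```python
import re


def longest_n_run(seq):
    """Length of the longest run of consecutive 'N' letters in seq."""
    runs = re.findall(r"N+", seq)
    return max((len(run) for run in runs), default=0)


def filter_read(seq, max_n=1, min_length=15, max_length=32, trim=True):
    """Return the (possibly trimmed) read, or None if it is discarded.

    The defaults mirror the documented defaults of MAX_N, MIN_LENGTH,
    MAX_LENGTH and TRIM.
    """
    if longest_n_run(seq) > max_n:
        return None                  # too many 'N' letters in a row
    if len(seq) < min_length:
        return None                  # too short
    if len(seq) > max_length:
        if not trim:
            return None              # too long and trimming is disabled
        seq = seq[:max_length]       # trim the 3'-end
    return seq


# The example from the settings file: three 'N' in a row with MAX_N=2.
removed = filter_read("GAANNNGAGAGTCT", max_n=2, min_length=5)   # None
trimmed = filter_read("ACGT" * 10)                               # 32 letters
```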


DESCRIPTION

General

The MIRO-multimapper maps any number of Solexa read files to any number of reference sequences and also preprocesses the reads. It should be regarded as a wrapper for the two scripts run_Preprocessing.pl and run_Mapping.pl: it is merely a convenience script which does not provide new functionality but allows convenient processing of similar Solexa lanes. Its main disadvantage is that preprocessing and mapping of the Solexa lanes are done sequentially, so a run may take a long time to finish.

Since the multimapper is a wrapper for the two scripts run_Preprocessing.pl and run_Mapping.pl, we refer to the documentation of these scripts for details.

run_Preprocessing.pl

run_Mapping.pl

Input

A number of Solexa sequence files, where usually one file represents one lane. Example:

 1       1       26      688     AAAACACCAACAAAACAACCAAAAATAATACAACAA
 1       1       117     645     GCACCAACAACAAGCAAAAAACGACTAAACACACAA
 1       1       391     248     GTAAGCACTCCCCTATCCTGTCAGTTGCCTAGTATA
 1       1       24      746     ACTCAACACGAAACAAACCAAAACGACAAAAACACA
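
As a minimal illustration of this format, the following Python sketch splits one such line into its columns. The meaning of the first four columns is not stated here, so they are kept as plain integers (in Solexa pipeline output they are commonly lane, tile, x and y, but that is an assumption); only the last column is taken to be the read. The function name parse_solexa_line is hypothetical.

```python
def parse_solexa_line(line):
    """Split one line into (four leading integers, read sequence)."""
    fields = line.split()            # columns are separated by whitespace
    if len(fields) != 5:
        raise ValueError("expected 5 columns, got %d" % len(fields))
    *leading, sequence = fields
    return tuple(int(value) for value in leading), sequence


example = " 1       1       26      688     AAAACACCAACAAAACAACCAAAAATAATACAACAA"
leading, sequence = parse_solexa_line(example)
```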

Output

The MIRO-multimapper generates five files for each combination of input file and reference sequence. If, for example, 4 Solexa lanes are successively mapped to 3 reference sequences, the MIRO-multimapper will generate 60 (4x3x5) output files.

Unambiguous hits

A file containing the hits which could unambiguously be mapped to the reference sequence. A hit is considered to be unambiguously mapped if no other hit could be identified having an equal (or lower) number of mismatches.

Ambiguous hits

A file containing the hits which could only ambiguously be mapped to the reference sequence. A hit is considered to be ambiguously mapped if at least one additional hit could be identified having an identical number of mismatches.

No matches

The reads, in FASTA format, which could not be mapped to the reference sequence.

Matches

The reads, in FASTA format, which could be mapped to the reference sequence.

Statistics file

A plain text file which contains mapping-statistics in human readable format. See Mapping Statistics
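
The distinction between unambiguous and ambiguous hits described above can be sketched in a few lines of Python. This follows the definitions given for the two hit files only; the function name classify_hits and the representation of hits as a plain list of mismatch counts are hypothetical, and the mappers themselves may report hits differently.

```python
def classify_hits(mismatch_counts):
    """Classify one read from the mismatch counts of all its hits.

    mismatch_counts -- one integer per hit of the read on the reference
    """
    if not mismatch_counts:
        return "no match"
    best = min(mismatch_counts)
    # The best hit is unambiguous only if no other hit has an equal
    # (or lower) number of mismatches.
    if mismatch_counts.count(best) == 1:
        return "unambiguous"
    return "ambiguous"


single_best = classify_hits([0, 1, 2])   # one hit with 0 mismatches
tied_best = classify_hits([1, 1, 2])     # two hits with 1 mismatch
```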

Example of a Multimapper Settings file

 #
 # Settings file for miRNA-pipeline Multimapper
 # This file is only relevant for the script: run_Multimapper.pl
 #
 # MULTIMAPPER MULTIMAPPER MULTIMAPPER MULTIMAPPER MULTIMAPPER MULTIMAPPER MULTIMAPPER
 #
 # To run the mapping, edit this file, specify the input and output files, and
 # simply type:   perl run_Multimapper.pl           #uses the default settings file
 # or:            perl run_Multimapper.pl alternative_settings_file.txt
 #
 # This setting file contains:
 # 1: comments which are marked with a #
 #    e.g.: # this is a comment
 # 2: options which are in uppercase, ending with the symbol '=' followed by the argument
 #    e.g.: INPUT_FILE=/home/usr/solexa_reads
 #
 # The settings file tolerates:
 # 1: any number of newlines between the different options
 # 2: any number of comments marked with a #
 # 3: tabs or spaces before and after each option
 # 4: rearrangement of the options within the file
 #
 # The settings file does not tolerate:
 # 1: space, newline or tab before or after the symbol '='
 #    e.g.: INPUT_FILE = /home/usr/solexareads -> this will not be accepted
 # 2: more than one option in one line
 # 3: arguments containing whitespace or special characters
 # 4: arguments without values
 #
 # In case of an error or bug please contact robert.kofler@crg.es
 
 ##
 ## INPUT - OUTPUT
 ##
 
 # Specify the input files for the Multimapper
 # each has to be a file containing Solexa read sequences
 INPUT_FILE_1=/home/usr/solexa/D.mel_embryo_day1
 INPUT_FILE_ID_1=day1
 
 INPUT_FILE_2=/home/usr/solexa/D.mel_embryo_day2
 INPUT_FILE_ID_2=day2
 
 INPUT_FILE_3=/home/usr/solexa/D.mel_embryo_day3
 INPUT_FILE_ID_3=day3
 
 INPUT_FILE_4=
 INPUT_FILE_ID_4=
 
 INPUT_FILE_5=
 INPUT_FILE_ID_5=
 
 INPUT_FILE_6=
 INPUT_FILE_ID_6=
 
 INPUT_FILE_7=
 INPUT_FILE_ID_7=
 
 # Specify the output directory. The output files will be written to this directory
 OUTPUT_DIR=/home/rkofler/playground/pipe/multimapper
 
 # Specify the temporary directory of your machine
 TEMP_DIR=/home/rkofler/tmp
 
 ##
 ## REFERENCE SEQUENCES
 ##
 
 # Specify what should be mapped to the reference sequence
 # i.....the input file specified above
 # pnm...previous no matches, i.e. the reads which could not be mapped to the previous reference sequence
 # pm....previous matches, i.e. the reads which could be mapped to the previous reference sequence
 # pi....previous input, i.e. the input file for the previous reference sequence
 
 REFERENCE_SEQUENCE_FILE_1=/home/usr/mature.fa
 REFERENCE_SEQUENCE_ID_1=mat
 REFERENCE_SEQUENCE_MODE_1=i
 # Attention: the first mode always has to be i
 
 REFERENCE_SEQUENCE_FILE_2=/home/usr/hairpin.fa
 REFERENCE_SEQUENCE_ID_2=hp
 REFERENCE_SEQUENCE_MODE_2=pnm
 
 REFERENCE_SEQUENCE_FILE_3=
 REFERENCE_SEQUENCE_ID_3=
 REFERENCE_SEQUENCE_MODE_3=
 
 REFERENCE_SEQUENCE_FILE_4=
 REFERENCE_SEQUENCE_ID_4=
 REFERENCE_SEQUENCE_MODE_4=
 
 REFERENCE_SEQUENCE_FILE_5=
 REFERENCE_SEQUENCE_ID_5=
 REFERENCE_SEQUENCE_MODE_5=
 
 REFERENCE_SEQUENCE_FILE_6=
 REFERENCE_SEQUENCE_ID_6=
 REFERENCE_SEQUENCE_MODE_6=
 
 REFERENCE_SEQUENCE_FILE_7=
 REFERENCE_SEQUENCE_ID_7=
 REFERENCE_SEQUENCE_MODE_7=
 
 ##
 ## Preprocessing
 ##
 
 #specify whether an adaptor should be removed; 1..yes 0..no
 REMOVE_ADAPTOR=1
 
 # Specify the sequence of the adaptor
 ADAPTOR_SEQ=TCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAAAAA
 
 # Specify the maximum number of 'N's in a row, which should be tolerated
 # This is a kind of quality control, for example the following sequence will be removed
 # using a MAX_N of 2: GAANNNGAGAGTCT
 MAX_N=1
 
 # Specify the minimum length of a read
 MIN_LENGTH=15
 
 # Specify the maximum length of a read
 MAX_LENGTH=32
 
 # Trim sequences to max_length if they exceed max_length 0..no 1..yes
 TRIM=1
 
 # Specify the minimum complexity of a read (for details see below)
 MIN_COMPLEXITY=0.5
 
 # Specify the minimum number of counts for a read
 # At this step it is recommended to leave this at 1
 MIN_COUNTS=1 
 
 ##
 ## Mapping
 ##
 
 #Specify the mapper which should be used
 #Mappers supported at the moment: Eland, Seqmap, Soap, Gem
 MAPPER=eland
 
 #Specify the path of the directory in which the mapper can be found. The directory!!
 #see Additional comments at bottom
 MAPPER_PATH=/home/usr/programs/eland
 
 ##
 ## Pipeline Logging
 ## 
 
 # output log to console 0..no 1..yes
 CONSOLE=1
 
 # output log into file; leave blank if you do not want to log the process
 LOGFILE=
 
 __END__ #do not remove this entry; options coming after this line will be ignored!
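
The rules stated in the header of the settings file can be illustrated with a small parser. The following Python sketch is not the parser used by run_Multimapper.pl; the name parse_settings is hypothetical, and empty values are accepted here only because the example file leaves unused slots blank.

```python
import re

# An option: uppercase name, '=', then a value without whitespace.
OPTION_RE = re.compile(r"^([A-Z0-9_]+)=(\S*)$")


def parse_settings(text):
    """Parse the text of a settings file into a dict of name -> value."""
    settings = {}
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()           # tabs or spaces around an option are fine
        if line.startswith("__END__"):
            break                    # everything after __END__ is ignored
        if not line or line.startswith("#"):
            continue                 # blank lines and comments are skipped
        match = OPTION_RE.match(line)
        if match is None:
            # e.g. whitespace around '=' or whitespace inside the value
            raise ValueError("line %d: cannot parse %r" % (number, raw))
        name, value = match.groups()
        settings[name] = value
    return settings


sample = "# a comment\n\tMAPPER=eland \nLOGFILE=\n__END__ #end\nIGNORED=1\n"
parsed = parse_settings(sample)
```

A line such as INPUT_FILE = /home/usr/solexareads raises a ValueError in this sketch, which mirrors the rule that no whitespace is tolerated around the symbol '='.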

Input files in detail

Usually one Solexa lane represents one sample, for example a different stage of cell development or a different tissue. The input has to be provided in the following format:

 INPUT_FILE_1=/home/usr/solexa/D.mel_embryo_day1
 INPUT_FILE_ID_1=day1

In contrast to the script run_Mapping.pl, the MIRO-multimapper allows several input files to be specified, for example:

 INPUT_FILE_1=/home/usr/solexa/D.mel_embryo_day1
 INPUT_FILE_ID_1=day1
 
 INPUT_FILE_2=/home/usr/solexa/D.mel_embryo_day2
 INPUT_FILE_ID_2=day2
 
 INPUT_FILE_3=/home/usr/solexa/D.mel_embryo_day3
 INPUT_FILE_ID_3=day3

If not all of the input file slots are required, just leave the unused ones blank. Note that both the ID and the input file have to be blank. For example:

 INPUT_FILE_1=/home/usr/solexa/D.mel_embryo_day1
 INPUT_FILE_ID_1=day1
 
 INPUT_FILE_2=
 INPUT_FILE_ID_2=
 
 INPUT_FILE_3=
 INPUT_FILE_ID_3=

If the number of input file slots is not sufficient, just add new ones and increase the counter. For example:

 INPUT_FILE_1=/home/usr/solexa/D.mel_embryo_day1
 INPUT_FILE_ID_1=day1
 
 INPUT_FILE_2=/home/usr/solexa/D.mel_embryo_day2
 INPUT_FILE_ID_2=day2
 
 INPUT_FILE_3=/home/usr/solexa/D.mel_embryo_day3
 INPUT_FILE_ID_3=day3
 
 ...
 ...
 ...
 
 INPUT_FILE_99=/home/usr/solexa/D.mel_embryo_week4
 INPUT_FILE_ID_99=week4
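
The numbering scheme above (any counter value, blank slots skipped, file and ID always given together) can be sketched as follows in Python. The function name collect_inputs is hypothetical, and the sketch assumes the options have already been read into a dictionary of option name to value; it is an illustration, not the script's implementation.

```python
import re


def collect_inputs(settings):
    """Return [(counter, path, read_id), ...] sorted by counter."""
    inputs = []
    for key, path in settings.items():
        match = re.fullmatch(r"INPUT_FILE_(\d+)", key)
        if match is None:
            continue                 # not an INPUT_FILE_N option
        counter = int(match.group(1))
        read_id = settings.get("INPUT_FILE_ID_%d" % counter, "")
        if not path and not read_id:
            continue                 # unused slot: file and ID both blank
        if not path or not read_id:
            raise ValueError(
                "INPUT_FILE_%d and INPUT_FILE_ID_%d must both be set "
                "or both be blank" % (counter, counter))
        inputs.append((counter, path, read_id))
    return sorted(inputs)


settings = {
    "INPUT_FILE_1": "/home/usr/solexa/D.mel_embryo_day1",
    "INPUT_FILE_ID_1": "day1",
    "INPUT_FILE_2": "",
    "INPUT_FILE_ID_2": "",
    "INPUT_FILE_99": "/home/usr/solexa/D.mel_embryo_week4",
    "INPUT_FILE_ID_99": "week4",
}
inputs = collect_inputs(settings)
```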

Logging

Independent of the console log (--logconsole) and the file log (--logfile), the preprocessing log is always sent to the MIRO logger.

See also the MIRO-logger

Subsequent Steps

See run_Mapping


REQUIREMENTS

Perl 5.8 or higher

At least one of the following mappers: Eland, Seqmap, SOAP, GEM


AUTHORS

Robert Kofler

Ana Vivancos

Matt Ingham

Heinz Himmelbauer


CONTACT

robert.kofler at crg.es