NAME

run_Mapping.pl - A perl script for mapping fasta formated sequences to multiple reference sequences using one of several alignment programs.


SYNOPSIS

 # Start mapping using the default settings file
 run_Mapping.pl
 
 # Start mapping using an alternative settings file
 run_Mapping.pl alternative_settings.txt
 
 # Start mapping using the default settings file and additional options.
 # The provided command line parameters override the parameters from the file!
 # This is useful to speed up mapping of several files as it is not necessary to always edit the settings file.
 run_Mapping.pl --input /home/usr/reads.fa --id day1
 
 # Start mapping using an alternative settings file and additional options
 run_Mapping.pl --input /home/usr/reads.fa --id day1 alternative_settings.txt
 
 # Note that it is not possible to start run_Mapping.pl without an Settings file


OPTIONS

Command line parameters:

--input

the input file; has to be a multiple fasta file; For example:

 >5||Count=3
 ACTCCCGTGCTGGCCTCCCTGCGCCCCCCCCC
 >6||Count=5
 ACTCCCGTGCTGGCCTCCCTGCGAAAACCCCC

The count feature - as produced by MIRO-Preprocessing - is optional, therefore the following is also accepted:

 >5
 ACTCCCGTGCTGGCCTCCCTGCGCCCCCCCCC
 >6
 ACTCCCGTGCTGGCCTCCCTGCGAAAACCCCC

Note, however that the mapping statistics (especially the number of unique sequences) are based on the count feature. Therefore, if the count feature is not provided the number of unique sequences in the mapping statistics will not reflect the true values

--id

A short ID for the reads which allows the user to identify the reads, e.g.: time0, time3, time6 etc. The id must not contain the character '_' which is used in the output file to separate different logical units.

Settingsfile parameters:

INPUT_FILE

See --input

INPUT_ID

See --id

OUTPUT_DIR

The output directory; The directory has to already exist!

TEMP_DIR

The temporary directory used by this perl modules and by Gem

MAPPER

The short-read mapper to be used by MIRO. Either Eland, Gem, SOAP or Seqmap.

MAPPER_PATH

The path to the specified mapper above. Has to be the folder which contains the mapper. (GEM requires to be added to the environment variable PATH. When using GEM this option is thus ignored)

MISMATCHES

The number of mismatches allowed for mapping. Attention not all mappers support this feature and different mappers react differently. Eland only allows only two mismatches. Consult the documentation of the different mappers for more information.

REFERENCE_SEQUENCE_FILE_N

The path to the reference sequence which will be used by gem. May either be the path to a fasta file or a preindexed .map file. If no .map file is specified the fasta file will be moved to the temporary directory and indexed there.

REFERENCE_SEQUENCE_ID_N

A short identifier for the reference sequence, e.g.: mRNA, mirnastar, whole-genome etc

REFERENCE_SEQUENCE_MODE_N

The mapping mode. Specifies what should be mapped to the given reference sequence. The mode may either be i (input), pnm (previous no-matches), pm (previous matches), pi (previous input).

LOGFILE

Write the preprocessing log to the specified file. no default (no file)

CONSOLE

Write the preprocessing log to STDOUT (console); 0..no, 1..yes; default: 1


DESCRIPTION

General

The MIRO-mapper allows to map a fasta formated input file to multiple reference sequences using one of four mappers interchangeably (Eland, Seqmap, Soap, GEM). Short reads may be mapped to many different reference sequences (eg. mature-miRNA, hairpin, refSeq, whole-genome). The number of reference sequences is not limited. The MIRO-mapper furthermore allows to flexibly custom-tailor the mapping process. It is, for example, possible to map only sequences which could not be matched in a previous mapping step or to map only those sequences which could be mapped in a previous mapping step. The reads have to be in a single multiple fasta file. The reference sequences have also to be in multiple fasta format (for GEM a preindexed .map file is also accepted).

Input

As described at --input

Output

The MIRO-mapper generates five files for each specified reference sequence. If, for example, the preprocessed reads should successively be mapped to five reference sequences, the MIRO-mapper will generate 25 output files.

Unambiguos hits

A file containing the hits which could unambiguously be mapped to the reference sequence. A hit is considered to be unambiguously mapped if no other hit could be identified having an equal (or lower) number of mismatches.

Ambiguous hits

A file containing the hits which could only ambiguously be mapped to the reference sequence. A hit is considered to be ambiguously mapped if at least one additional hit could be identified having a identical number of mismatches.

No matches

The reads which could not be mapped to the reference sequence in fasta format.

Matches

The reads which could be mapped to the reference sequence in fasta format.

Statistics file

A plain text file which contains mapping-statistics in human readable format. See Mapping Statistics

Examples of mapping schemes

The multimapper allows for different mapping schemes. Examples:

map all

Map reads to 1: miRNAs, 2: hairpins, 3: mRNA and 4: whole_genome

successive no-matches

Map reads to 1: miRNAs, 2: the previous no-matches to hairpins, 3: the previous no-matches to mRNA and 4: the previous no-matches to the whole_genome

positive selection

Map reads to 1: whole genome, 2: the wg-matches to miRNA, 3: the wg-matches to hairpins and 4: the wg-matches to mRNA

negative selection

Map reads to 1: mycoplasma, 2: the myco. no-matches to miRNA, 3: the myco. no-matches to hairpin and 4: the myco. no-matches to mRNA

Mapping modes

The user has to specify the mapping mode REFERENCE_SEQUENCE_MODE_N for each reference sequence REFERENCE_SEQUENCE_FILE_N in the settings file. The mapping mode specifies which fraction of the input sequences should be mapped against the reference sequence.

i

the whole input file

pnm

the previous no matches, i.e the reads which could not be mapped to the previous reference sequence

pm

the previous matches, i.e the reads which could be mapped to the previous reference sequence

pi

the previous input, i.e the input file for the previous reference sequence

Reference sequence settings in detail

Mapping to a single reference sequence (not pre-indexed)
 #
 # Example mirbase hairpin
 #
 REFERENCE_SEQUENCE_FILE_1=/data/mirbase/hsa_hairpin.fa   
 REFERENCE_SEQUENCE_ID_1=hairpin
 REFERENCE_SEQUENCE_MODE_1=i
 
 # The first mode has to be "i", always!!!
Mapping to a single reference sequence (pre-indexed)
 #
 # Example mirbase hairpin
 #
 REFERENCE_SEQUENCE_FILE_1=/data/mirbase/hsa_hairpin.dna_csa_32.map   
 REFERENCE_SEQUENCE_ID_1=hairpin
 REFERENCE_SEQUENCE_MODE_1=i
Mapping to multiple reference sequences

Example all H. sapiens chromosomes

 #
 # Example whole genome
 # 
 REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr1.fa
 REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr2.fa
 REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr3.fa
 REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr4.fa
 REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr5.fa
 REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr6.fa
 REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr7.fa
 # and so on for all chromosomes.....
 
 REFERENCE_SEQUENCE_ID_1=wholeGenome
 REFERENCE_SEQUENCE_MODE_1=i
 # leave blank if not required
 REFERENCE_SEQUENCE_FILE_2=                                  
 REFERENCE_SEQUENCE_ID_2=                                    
 REFERENCE_SEQUENCE_MODE_2=                                  
 # make sure to leave everything blank!!!
 # otherwise you will have an error

Example: mapping to mature miRNA and miRNA-star concurrently

 #
 # Example mirbase mature + mirstar
 #
 REFERENCE_SEQUENCE_FILE_1=/data/mirbase/mature.fa
 REFERENCE_SEQUENCE_FILE_1=/data/mirbase/mirstar.fa
 REFERENCE_SEQUENCE_ID_1=miRmiRst
 REFERENCE_SEQUENCE_MODE_1=i
Mapping of no-matches
 #
 # Example of successive no match mapping
 #
 REFERENCE_SEQUENCE_FILE_1=/data/mirbase/mature.fa
 REFERENCE_SEQUENCE_ID_1=mat
 REFERENCE_SEQUENCE_MODE_1=i
 # the first mode has always to be i
 
 REFERENCE_SEQUENCE_FILE_2=/data/mirbase/hairpin.fa
 REFERENCE_SEQUENCE_ID_2=hp
 REFERENCE_SEQUENCE_MODE_2=pnm
 # previous no matches
 
 REFERENCE_SEQUENCE_FILE_3=/data/mirbase/refMrna.fa
 REFERENCE_SEQUENCE_ID_3=mRNA
 REFERENCE_SEQUENCE_MODE_3=pnm                  
 
 REFERENCE_SEQUENCE_ID_4=wholeGenome
 REFERENCE_SEQUENCE_MODE_4=pnm                   
 REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr1.fa
 REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr2.fa
 REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr3.fa
 REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr4.fa
 REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr5.fa
 REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr6.fa
 REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr7.fa
 # and so on for all chromosomes..... 
 
 ....
 ....
 ....
 
 # just add new entries if required
 REFERENCE_SEQUENCE_FILE_91=/data/whole_genome/mus_musculus.fa     
 REFERENCE_SEQUENCE_ID_91=wgMusMus
 REFERENCE_SEQUENCE_MODE_91=pm
 # pm, only try to map those which did also map to the previous reference sequence

Mapping Statistics

The mapping statistics file contains some general information about a successful mapping step like the number of reads in the input file, the number of reads which could be mapped ot the number of unambiguously mapped reads. For example:

 MAPPING STATISTICS
 Attempted to map 7794973 reads; Unique sequences: 3490353; Average length: 25.1
 Mapped 373112 reads; Unique sequences: 111505; Average length: 18.1
 No matches for 7421861 reads; Unique sequences: 3378848; Average length 25.5
 Average mismatches for mapped hit: 1.6; Effective length of mapped hit: 16.5
 Reference sequences: 3021; Total length in bp: 2555356
 Expected hits (Allowing 2 mismatches): 2381656284525; Observed hits: 373112
 Unambiguously mapped reads: 222489
 Ambiguously mapped reads: 150623

The expected hits are calculated according to the following equation

 exp_hits = Sum[read_counts(length) * (referenceSequenceLength / 4 ^ (readLength(length) - allowedMismatches))][for each read length]

The mapping statistics file furthermore contains the read length distribution of the input file. leng refers to the length of the reads. seq referes to the number of unique sequences (Attention: this only works if the reads have been preprocessed with MIRO). reads referes to the number of reads. This read length distribution is calculated from the input file which might be a Preprocessed fasta file, the No matches of the previous mapping step or the Matches of the previous mapping step. For example:

 Length profile of input sequences (length - count)
 leng   seq     reads
 15     37136   125513
 16     36781   120286
 17     42091   158056
 18     84598   357524
 19     100702  341725
 20     106505  349425
 21     135237  470599
 22     281329  1271956
 23     299836  1041257
 24     145365  396100
 25     61377   113133
 26     44001   84372
 27     40904   76005
 28     51937   114913
 29     74140   124277
 30     70028   124765
 31     138659  250140
 32     1739727 2274927

The mapping statistics file also contains the read length distribution of the reads which could be mapped to the reference sequence. Where ua_seq refers to the unambiguously mapped unique sequences and ua_reads to the unambiguously mapped reads. The read length distribution of the matches is calculated from two files the Matches and the Unambiguously mapped reads.

 Length profile of matching sequences (length - count)
 leng   seq     reads   ua_seq  ua_reads
 15     29308   92067   9372    29331
 16     21329   69583   9671    39738
 17     12698   41295   7931    27789
 18     10982   63774   7803    43502
 19     6109    37009   4305    31826
 20     3549    12109   2267    9649
 21     3259    8406    1727    6388
 22     3608    13348   1777    11194
 23     2877    8471    1569    6793
 24     1944    3127    1064    2135
 25     1159    2514    624     1922
 26     829     2346    456     1909
 27     775     2014    396     1594
 28     825     1138    425     582
 29     812     1133    416     536
 30     777     1044    378     465
 31     829     1283    432     598
 32     9836    12451   5080    6538

The mapping statistics file also contains the read length distribution of the reads which could not be mapped to the reference sequence. This is calculated from the No matches file. Although this read length distribution could easily be calculated from just subtracting read length distribution of the matches from the input file, we decided to calculate it from the No matches file to provide a positive control. For example:

 Length profile of not-matching sequences (length - count)
 leng   seq     reads
 15     7828    33446
 16     15452   50703
 17     29393   116761
 18     73616   293750
 19     94593   304716
 20     102956  337316
 21     131978  462193
 22     277721  1258608
 23     296959  1032786
 24     143421  392973
 25     60218   110619
 26     43172   82026
 27     40129   73991
 28     51112   113775
 29     73328   123144
 30     69251   123721
 31     137830  248857
 32     1729891 2262476

Finally the mapping statistics file also contains all files which were used in the mapping process. This ensures a higher reproducibility. For example:

 Files
 Input file: /home/usr/analysis/ppday1
 Output unambiguous: /home/usr/analysis/Mapping_day1_1_i_Eland_against_wg_unambiguous.txt
 Output ambiguous: /home/usr/analysis/Mapping_day1_1_i_Eland_against_wg_ambiguous.txt
 Reference sequences used
 /home/usr/H.sapiens/chr1
 /home/usr/H.sapiens/chr2
 /home/usr/H.sapiens/chr3
 /home/usr/H.sapiens/chr4
 /home/usr/H.sapiens/chr5
 ...

Idiosyncracies of different Mappers

Eland

Eland causes the most problems as it has several bugs and a inconsistent behaviour. The Eland-wrapper tries to hide this complexity from the user by providing an easy to use inteface which counteracts some of this idiosyncracies. Examples:

Eland does not provide the reference-id when using reference files containing only one fasta entry

Eland does not use the full reference-id only, everything after the first backspace is discarded

Eland is only capable of mapping reads with the same length in one run

As mentioned previously the user does not need to worry about this problems as the wrapper tries to compensate. However one problem remains which the user must consider. Eland requires a preprocessing of the sequences called squashing. The wrapper is doing this automatically for each specified reference sequence. Unfortunatelly Eland can not squash large reference sequence - like the whole Human genome - at once. This has to be squashed chromosome by chromosome. Therefore instead of providing only one sequence:

 REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_whole_genome.wg

the user has to provide each chromsome separately:


 REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_chr1
 REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_chr2
 REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_chr3
 REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_chr4
 REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_chr5
 REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_chr6
 ...

Seqmap

Seqmap is very robust and reliable. The results are of high quality, unfortunatelly the tool is also extremly slow.

SOAP

SOAP does not allow short reference sequences, it should therefore not be used for a mapping to mirbase.

Gem

Gem is a very promising alternative to the other mappers as Gem is very fast, produces reliable results and the read length is not restricted. The path to Gem has to be added to the environmental variable PATH. Generally the idea of the pipeline is to allow only the path to a fasta file as REFERENCE_SEQUENCE_FILE_N, this should make it more easy for the user to exchange the different mappers. Since indexing using gem-do-index may take a very long time, we also allow to specify the path to a .gem file as a reference sequence when using the mapper GEM.

Logging

Independent of the console log --logconsole and the file-log --logfile the log of preprocessing is always sent to the MIRO logger.

See also the MIRO-logger

Subsequent Steps

Mapped reads can further be processed with MIRO

run_DGE.pl

Create a differential miRNA expression profile

See also run_Mapping.pl

Converter/Hit2Bed.pl

Convert the hit-files (the ambiguous hits, the unambiguous hits or both) into a .bed file which may be used at the UCSC Genome browser.

See also Hit2Bed.pl

Converter/Hit2Mideep.pl

Convert the hit-files (the ambiguous hits, the unambiguous hits or both) into files which may be used as input for the miRdeep pipeline.

See also Hit2Mirdeep.pl

Converter/Hit2Wiggle.pl

Convert the hit-files (the ambiguous hits, the unambiguous hits or both) into a .wig file which may be used at the UCSC Genome browser.

See also Hit2Wiggle.pl

GenExp/DGE_BayesSign.pl

Calculates significant differences in miRNA expression using a bayesian approach.

See also DGE_BayesSign.pl

GenExp/DGE_Heatmap.pl

Creates a heatmap depicting miRNA expression levels

See also DGE_Heatmap.pl

GenExp/DGE_Lineplots.pl

Creates a lineplot displaying the course of miRNA expression for a specified subset of the miRNAs

See also DGE_Lineplots.pl

GenExp/DGE_Scatterplots.pl

Creates several scatterplots to visualise differences in miRNA expression.

See also DGE_Scatterplots.pl

GenExp/DGE_Table.pl

Creates a table containing the expression levels of all miRNAs.

See also DGE_Table.pl

Clustering/cluster_overlapping_hits.pl

Aggregates hits which are overlapping into separate clusters.

See also cluster_overlapping_hits.pl

Clustering/create_easy_screenable.pl

Creates files which are optimised for an easy screening of the read distribution along the reference sequence.

See also create_easy_screenable.pl

Clustering/create_miRNA_profile.pl

Displayes the position of all reads in the reference sequence graphicaly.

See also create_miRNA_profile.pl

Clustering/create_miRNA_logo.pl

Creates a logo using Weblogo 3.0

See also create_miRNA_logo.pl

Clustering/run_sliding_window.pl

Uses a sliding window approach to aggreate reads into larger clusters.

See also run_sliding_window.pl


REQUIREMENTS

Perl 5.8 or higher

At least one of the following mappers: Eland, Seqmap, SOAP, GEM


AUTHORS

Robert Kofler

Matthew Ingham

Heinz Himmelbauer


CONTACT

robert.kofler at crg.es