run_Mapping.pl - A perl script for mapping fasta formated sequences to multiple reference sequences using one of several alignment programs.
# Start mapping using the default settings file run_Mapping.pl # Start mapping using an alternative settings file run_Mapping.pl alternative_settings.txt # Start mapping using the default settings file and additional options. # The provided command line parameters override the parameters from the file! # This is useful to speed up mapping of several files as it is not necessary to always edit the settings file. run_Mapping.pl --input /home/usr/reads.fa --id day1 # Start mapping using an alternative settings file and additional options run_Mapping.pl --input /home/usr/reads.fa --id day1 alternative_settings.txt # Note that it is not possible to start run_Mapping.pl without an Settings file
Command line parameters:
the input file; has to be a multiple fasta file; For example:
>5||Count=3 ACTCCCGTGCTGGCCTCCCTGCGCCCCCCCCC >6||Count=5 ACTCCCGTGCTGGCCTCCCTGCGAAAACCCCC
The count feature - as produced by MIRO-Preprocessing - is optional, therefore the following is also accepted:
>5 ACTCCCGTGCTGGCCTCCCTGCGCCCCCCCCC >6 ACTCCCGTGCTGGCCTCCCTGCGAAAACCCCC
Note, however that the mapping statistics (especially the number of unique sequences) are based on the count feature. Therefore, if the count feature is not provided the number of unique sequences in the mapping statistics will not reflect the true values
A short ID for the reads which allows the user to identify the reads, e.g.: time0, time3, time6 etc. The id must not contain the character '_' which is used in the output file to separate different logical units.
Settingsfile parameters:
See --input
See --id
The output directory; The directory has to already exist!
The temporary directory used by this perl modules and by Gem
The short-read mapper to be used by MIRO. Either Eland, Gem, SOAP or Seqmap.
The path to the specified mapper above. Has to be the folder which contains the mapper. (GEM requires to be added to the environment variable PATH. When using GEM this option is thus ignored)
The number of mismatches allowed for mapping. Attention not all mappers support this feature and different mappers react differently. Eland only allows only two mismatches. Consult the documentation of the different mappers for more information.
The path to the reference sequence which will be used by gem. May either be the path to a fasta file or a preindexed .map
file.
If no .map
file is specified the fasta file will be moved to the temporary directory and indexed there.
A short identifier for the reference sequence, e.g.: mRNA, mirnastar, whole-genome etc
The mapping mode. Specifies what should be mapped to the given reference sequence. The mode may either be i (input), pnm (previous no-matches), pm (previous matches), pi (previous input).
Write the preprocessing log to the specified file. no default (no file)
Write the preprocessing log to STDOUT (console); 0..no, 1..yes; default: 1
The MIRO-mapper allows to map a fasta formated input file to multiple reference sequences using one of four mappers interchangeably (Eland, Seqmap, Soap, GEM). Short reads may be mapped to many different reference sequences (eg. mature-miRNA, hairpin, refSeq, whole-genome). The number of reference sequences is not limited. The MIRO-mapper furthermore allows to flexibly custom-tailor the mapping process. It is, for example, possible to map only sequences which could not be matched in a previous mapping step or to map only those sequences which could be mapped in a previous mapping step. The reads have to be in a single multiple fasta file. The reference sequences have also to be in multiple fasta format (for GEM a preindexed .map file is also accepted).
As described at --input
The MIRO-mapper generates five files for each specified reference sequence. If, for example, the preprocessed reads should successively be mapped to five reference sequences, the MIRO-mapper will generate 25 output files.
A file containing the hits which could unambiguously be mapped to the reference sequence. A hit is considered to be unambiguously mapped if no other hit could be identified having an equal (or lower) number of mismatches.
A file containing the hits which could only ambiguously be mapped to the reference sequence. A hit is considered to be ambiguously mapped if at least one additional hit could be identified having a identical number of mismatches.
The reads which could not be mapped to the reference sequence in fasta format.
The reads which could be mapped to the reference sequence in fasta format.
A plain text file which contains mapping-statistics in human readable format. See Mapping Statistics
The multimapper allows for different mapping schemes. Examples:
Map reads to 1: miRNAs, 2: hairpins, 3: mRNA and 4: whole_genome
Map reads to 1: miRNAs, 2: the previous no-matches to hairpins, 3: the previous no-matches to mRNA and 4: the previous no-matches to the whole_genome
Map reads to 1: whole genome, 2: the wg-matches to miRNA, 3: the wg-matches to hairpins and 4: the wg-matches to mRNA
Map reads to 1: mycoplasma, 2: the myco. no-matches to miRNA, 3: the myco. no-matches to hairpin and 4: the myco. no-matches to mRNA
The user has to specify the mapping mode REFERENCE_SEQUENCE_MODE_N
for each reference sequence REFERENCE_SEQUENCE_FILE_N
in the settings file.
The mapping mode specifies which fraction of the input sequences should be mapped against the reference sequence.
the whole input file
the previous no matches, i.e the reads which could not be mapped to the previous reference sequence
the previous matches, i.e the reads which could be mapped to the previous reference sequence
the previous input, i.e the input file for the previous reference sequence
# # Example mirbase hairpin # REFERENCE_SEQUENCE_FILE_1=/data/mirbase/hsa_hairpin.fa REFERENCE_SEQUENCE_ID_1=hairpin REFERENCE_SEQUENCE_MODE_1=i # The first mode has to be "i", always!!!
# # Example mirbase hairpin # REFERENCE_SEQUENCE_FILE_1=/data/mirbase/hsa_hairpin.dna_csa_32.map REFERENCE_SEQUENCE_ID_1=hairpin REFERENCE_SEQUENCE_MODE_1=i
Example all H. sapiens chromosomes
# # Example whole genome # REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr1.fa REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr2.fa REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr3.fa REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr4.fa REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr5.fa REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr6.fa REFERENCE_SEQUENCE_FILE_1=/data/whole_genome/chr7.fa # and so on for all chromosomes..... REFERENCE_SEQUENCE_ID_1=wholeGenome REFERENCE_SEQUENCE_MODE_1=i
# leave blank if not required REFERENCE_SEQUENCE_FILE_2= REFERENCE_SEQUENCE_ID_2= REFERENCE_SEQUENCE_MODE_2= # make sure to leave everything blank!!! # otherwise you will have an error
Example: mapping to mature miRNA and miRNA-star concurrently
# # Example mirbase mature + mirstar # REFERENCE_SEQUENCE_FILE_1=/data/mirbase/mature.fa REFERENCE_SEQUENCE_FILE_1=/data/mirbase/mirstar.fa REFERENCE_SEQUENCE_ID_1=miRmiRst REFERENCE_SEQUENCE_MODE_1=i
# # Example of successive no match mapping # REFERENCE_SEQUENCE_FILE_1=/data/mirbase/mature.fa REFERENCE_SEQUENCE_ID_1=mat REFERENCE_SEQUENCE_MODE_1=i # the first mode has always to be i REFERENCE_SEQUENCE_FILE_2=/data/mirbase/hairpin.fa REFERENCE_SEQUENCE_ID_2=hp REFERENCE_SEQUENCE_MODE_2=pnm # previous no matches REFERENCE_SEQUENCE_FILE_3=/data/mirbase/refMrna.fa REFERENCE_SEQUENCE_ID_3=mRNA REFERENCE_SEQUENCE_MODE_3=pnm REFERENCE_SEQUENCE_ID_4=wholeGenome REFERENCE_SEQUENCE_MODE_4=pnm REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr1.fa REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr2.fa REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr3.fa REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr4.fa REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr5.fa REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr6.fa REFERENCE_SEQUENCE_FILE_4=/data/whole_genome/chr7.fa # and so on for all chromosomes..... .... .... .... # just add new entries if required REFERENCE_SEQUENCE_FILE_91=/data/whole_genome/mus_musculus.fa REFERENCE_SEQUENCE_ID_91=wgMusMus REFERENCE_SEQUENCE_MODE_91=pm # pm, only try to map those which did also map to the previous reference sequence
The mapping statistics file contains some general information about a successful mapping step like the number of reads in the input file, the number of reads which could be mapped ot the number of unambiguously mapped reads. For example:
MAPPING STATISTICS Attempted to map 7794973 reads; Unique sequences: 3490353; Average length: 25.1 Mapped 373112 reads; Unique sequences: 111505; Average length: 18.1 No matches for 7421861 reads; Unique sequences: 3378848; Average length 25.5 Average mismatches for mapped hit: 1.6; Effective length of mapped hit: 16.5 Reference sequences: 3021; Total length in bp: 2555356 Expected hits (Allowing 2 mismatches): 2381656284525; Observed hits: 373112 Unambiguously mapped reads: 222489 Ambiguously mapped reads: 150623
The expected hits are calculated according to the following equation
exp_hits = Sum[read_counts(length) * (referenceSequenceLength / 4 ^ (readLength(length) - allowedMismatches))][for each read length]
The mapping statistics file furthermore contains the read length distribution of the input file.
leng
refers to the length of the reads.
seq
referes to the number of unique sequences (Attention: this only works if the reads have been preprocessed with MIRO).
reads
referes to the number of reads. This read length distribution is calculated from the input file which might
be a Preprocessed fasta file
, the No matches
of the previous mapping step or the Matches
of the previous mapping step.
For example:
Length profile of input sequences (length - count) leng seq reads 15 37136 125513 16 36781 120286 17 42091 158056 18 84598 357524 19 100702 341725 20 106505 349425 21 135237 470599 22 281329 1271956 23 299836 1041257 24 145365 396100 25 61377 113133 26 44001 84372 27 40904 76005 28 51937 114913 29 74140 124277 30 70028 124765 31 138659 250140 32 1739727 2274927
The mapping statistics file also contains the read length distribution of the reads which could be mapped to the reference sequence.
Where ua_seq
refers to the unambiguously mapped unique sequences and ua_reads
to the unambiguously mapped reads.
The read length distribution of the matches is calculated from two files the Matches
and the Unambiguously mapped reads
.
Length profile of matching sequences (length - count) leng seq reads ua_seq ua_reads 15 29308 92067 9372 29331 16 21329 69583 9671 39738 17 12698 41295 7931 27789 18 10982 63774 7803 43502 19 6109 37009 4305 31826 20 3549 12109 2267 9649 21 3259 8406 1727 6388 22 3608 13348 1777 11194 23 2877 8471 1569 6793 24 1944 3127 1064 2135 25 1159 2514 624 1922 26 829 2346 456 1909 27 775 2014 396 1594 28 825 1138 425 582 29 812 1133 416 536 30 777 1044 378 465 31 829 1283 432 598 32 9836 12451 5080 6538
The mapping statistics file also contains the read length distribution of the reads which could not be mapped to the reference sequence.
This is calculated from the No matches
file.
Although this read length distribution could easily be calculated from just subtracting read length distribution of the matches from the input file,
we decided to calculate it from the No matches
file to provide a positive control. For example:
Length profile of not-matching sequences (length - count) leng seq reads 15 7828 33446 16 15452 50703 17 29393 116761 18 73616 293750 19 94593 304716 20 102956 337316 21 131978 462193 22 277721 1258608 23 296959 1032786 24 143421 392973 25 60218 110619 26 43172 82026 27 40129 73991 28 51112 113775 29 73328 123144 30 69251 123721 31 137830 248857 32 1729891 2262476
Finally the mapping statistics file also contains all files which were used in the mapping process. This ensures a higher reproducibility. For example:
Files Input file: /home/usr/analysis/ppday1 Output unambiguous: /home/usr/analysis/Mapping_day1_1_i_Eland_against_wg_unambiguous.txt Output ambiguous: /home/usr/analysis/Mapping_day1_1_i_Eland_against_wg_ambiguous.txt Reference sequences used /home/usr/H.sapiens/chr1 /home/usr/H.sapiens/chr2 /home/usr/H.sapiens/chr3 /home/usr/H.sapiens/chr4 /home/usr/H.sapiens/chr5 ...
Eland causes the most problems as it has several bugs and a inconsistent behaviour. The Eland-wrapper tries to hide this complexity from the user by providing an easy to use inteface which counteracts some of this idiosyncracies. Examples:
Eland does not provide the reference-id when using reference files containing only one fasta entry
Eland does not use the full reference-id only, everything after the first backspace is discarded
Eland is only capable of mapping reads with the same length in one run
As mentioned previously the user does not need to worry about this problems as the wrapper tries to compensate. However one problem remains which the user must consider. Eland requires a preprocessing of the sequences called squashing. The wrapper is doing this automatically for each specified reference sequence. Unfortunatelly Eland can not squash large reference sequence - like the whole Human genome - at once. This has to be squashed chromosome by chromosome. Therefore instead of providing only one sequence:
REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_whole_genome.wg
the user has to provide each chromsome separately:
REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_chr1 REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_chr2 REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_chr3 REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_chr4 REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_chr5 REFERENCE_SEQUENCE_FILE_1=/home/usr/H_sapiens_chr6 ...
Seqmap is very robust and reliable. The results are of high quality, unfortunatelly the tool is also extremly slow.
SOAP does not allow short reference sequences, it should therefore not be used for a mapping to mirbase.
Gem is a very promising alternative to the other mappers as Gem is very fast, produces reliable results and the read length is not restricted.
The path to Gem has to be added to the environmental variable PATH
.
Generally the idea of the pipeline is to allow only the path to a fasta file as REFERENCE_SEQUENCE_FILE_N
, this should make it more easy for the user to exchange the different mappers.
Since indexing using gem-do-index
may take a very long time, we also allow to specify the path to a .gem
file as a reference sequence when using the mapper GEM.
Independent of the console log --logconsole
and the file-log --logfile
the log of preprocessing is always sent to the MIRO logger.
Mapped reads can further be processed with MIRO
Create a differential miRNA expression profile
See also run_Mapping.plConvert the hit-files (the ambiguous hits, the unambiguous hits or both) into a .bed file which may be used at the UCSC Genome browser.
See also Hit2Bed.plConvert the hit-files (the ambiguous hits, the unambiguous hits or both) into files which may be used as input for the miRdeep pipeline.
See also Hit2Mirdeep.plConvert the hit-files (the ambiguous hits, the unambiguous hits or both) into a .wig file which may be used at the UCSC Genome browser.
See also Hit2Wiggle.plCalculates significant differences in miRNA expression using a bayesian approach.
See also DGE_BayesSign.plCreates a heatmap depicting miRNA expression levels
See also DGE_Heatmap.plCreates a lineplot displaying the course of miRNA expression for a specified subset of the miRNAs
See also DGE_Lineplots.plCreates several scatterplots to visualise differences in miRNA expression.
See also DGE_Scatterplots.plCreates a table containing the expression levels of all miRNAs.
See also DGE_Table.plAggregates hits which are overlapping into separate clusters.
See also cluster_overlapping_hits.plCreates files which are optimised for an easy screening of the read distribution along the reference sequence.
See also create_easy_screenable.plDisplayes the position of all reads in the reference sequence graphicaly.
See also create_miRNA_profile.plCreates a logo using Weblogo 3.0
See also create_miRNA_logo.plUses a sliding window approach to aggreate reads into larger clusters.
See also run_sliding_window.pl
Perl 5.8 or higher
At least one of the following mappers: Eland, Seqmap, SOAP, GEM
Robert Kofler
Matthew Ingham
Heinz Himmelbauer
robert.kofler at crg.es