NAME

run_sliding_window.pl - Clusters reads using a sliding window approach


SYNOPSIS

 # Minimal argument call specifying all required parameters.
 run_sliding_window.pl --input Mapping_day0_1_i_Eland_against_mature_unambiguous.txt
                       --output sliding_cluster_file.txt
 # Extended call specifying most optional parameters; several different input files may be specified.
 # Note that the file containing the ambiguous hits may also be specified.
 run_sliding_window.pl --output sliding_cluster_file.txt --min_length 15 --max_length 32 --max_mm 2 --strand RF
                       --min_count 10 --max_ambiguity 2 --tempdir "/tmp"
                       --input Mapping_day0_1_i_Eland_against_mature_unambiguous.txt
                       --input Mapping_day0_1_i_Eland_against_mature_ambiguous.txt


OPTIONS

--input

The input files. Several files may be specified, e.g.: --input file1 --input file2. The input files have to be output files of the script run_Mapping.pl or run_Multimapper.pl. Note that both unambiguously and ambiguously mapped reads may be provided to this script. Mandatory parameter.

--output

The output file. Mandatory parameter.

--strand

Only reads mapping to the specified strand will be used. Reads from different strands will, however, never be clustered into the same sliding window. Possible values: R (reverse strand), F (forward strand), RF (both strands); default=RF

--min_length

The minimum length of reads. Shorter reads will be ignored. default=15

--max_length

The maximum length of reads. Longer reads will be ignored. default=100

--max_mm

The maximum number of mismatches. Reads having more mismatches will not be used. default=2

--max_ambiguity

The maximum ambiguity of the hits. Hits having a higher ambiguity will be ignored. The ambiguity is an integer value which states how often a read could be mapped to the reference sequence with an equally good score (number of mismatches). Examples:

A read which could be mapped to the H. sapiens genome only once, having two mismatches, will have an ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times, always having one mismatch, will have an ambiguity of "3".

A read which could be mapped to the H. sapiens genome three times having one mismatch and one time having zero mismatches will have an ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times having one mismatch and two times having zero mismatches will have an ambiguity of "2".

default=5

--min_count

Minimum read count for a sliding window cluster. Clusters having fewer reads will not be reported. default=1

--window_size

The size of the sliding window; default=1000

--step_size

The size of the sliding window steps. Must be smaller than or equal to --window_size. If the step_size is smaller than the window_size, the sliding windows will overlap. default=--window_size

--tempdir

The path to the temporary directory. default=/tmp

--help

Display the help pages.


DESCRIPTION

General

The script clusters hits using a sliding window approach. The sliding windows may be overlapping or abutting. Details for each sliding window cluster are finally provided in the output file.
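
The relationship between --window_size and --step_size can be illustrated with a minimal sketch. It is not taken from the script itself and rests on two assumptions: windows start at position 1 of each reference sequence (consistent with the start/end values in the output examples), and a read is assigned by a single 1-based position. The function name windows_containing is hypothetical.

```python
def windows_containing(pos, window_size=1000, step_size=None):
    """Return (start, end) of every sliding window covering 1-based position pos.

    Assumption: window k (k = 0, 1, 2, ...) spans
    [1 + k * step_size, k * step_size + window_size].
    """
    if step_size is None:
        step_size = window_size  # documented default: abutting windows
    if not 1 <= step_size <= window_size:
        raise ValueError("step_size must be between 1 and window_size")
    if pos < 1:
        raise ValueError("positions are 1-based")

    # Last window that starts at or before pos.
    last = (pos - 1) // step_size
    # First window whose end still reaches pos (ceiling division, floored at 0).
    first = max(0, -(-(pos - window_size) // step_size))

    return [(1 + k * step_size, k * step_size + window_size)
            for k in range(first, last + 1)]


# Abutting windows: each position falls into exactly one window.
print(windows_containing(512500))            # [(512001, 513000)]
# Overlapping windows (step 500): each position can fall into two windows.
print(windows_containing(1500, 1000, 500))   # [(501, 1500), (1001, 2000)]
```

With a step size equal to the window size every position belongs to one window; halving the step size makes most positions belong to two.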

Input

Mapping results of the script run_Mapping.pl or run_Multimapper.pl. Note that unambiguous and ambiguous mapping results may be provided.

For example:


 24688||Count=3         TACCCTGTAGATCCGAATTTGT          hsa-miR-10a MIMAT0000253 Homo sapiens miR-10a   1       0       F       1
 128318||Count=2        TACCCTGTAGATCCGAATTTGTG         hsa-miR-10a MIMAT0000253 Homo sapiens miR-10a   1       0       F       1
 150952||Count=1        TACCCTGTAGATCCTAATTTGTGT        hsa-miR-10a MIMAT0000253 Homo sapiens miR-10a   1       2       R       1
 212857||Count=1        TACCCTGTAGATCCAAATTTGT          hsa-miR-10a MIMAT0000253 Homo sapiens miR-10a   1       1       F       1
 317801||Count=1        TACCTTGTAGATCCGAATTTGTG         hsa-miR-10a MIMAT0000253 Homo sapiens miR-10a   1       1       F       1
 389805||Count=1        TACCCTGTATATCCGAATTTGTGG        hsa-miR-10a MIMAT0000253 Homo sapiens miR-10a   1       2       F       1
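
The thresholds listed under OPTIONS decide which of these hits are used for clustering. The following is a minimal sketch of that screening step, not the script's actual code. The Hit record and the function passes_filters are hypothetical; parsing of the mapping file is deliberately left out because the column layout is not specified in this document. The default values are those stated under OPTIONS.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical, already-parsed mapping hit."""
    length: int       # read length
    mismatches: int   # number of mismatches of this hit
    strand: str       # "F" or "R"
    ambiguity: int    # number of equally good mapping positions


def passes_filters(hit, min_length=15, max_length=100, max_mm=2,
                   strand="RF", max_ambiguity=5):
    """Return True if the hit survives all documented thresholds."""
    return (min_length <= hit.length <= max_length
            and hit.mismatches <= max_mm
            and hit.strand in set(strand)      # "RF" accepts both strands
            and hit.ambiguity <= max_ambiguity)


hits = [Hit(22, 0, "F", 1), Hit(24, 2, "R", 1), Hit(14, 0, "F", 1)]
kept = [h for h in hits if passes_filters(h, strand="F")]
print(len(kept))  # only the first hit passes: reverse strand and short read are dropped
```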

Ambiguity

Ambiguity is an important concept in the MIRO-pipeline; it is therefore crucial that this concept is properly understood. In a nutshell, ambiguity is the number of equally good mapping positions for a single Solexa-read. Equally good in this context refers to the number of mismatches. In the MIRO-pipeline all unambiguously mapped reads have an ambiguity of "1" and are provided in a separate output-file. All ambiguously mapped reads, on the other hand, have an ambiguity of ">=2".

Examples:

A read which could be mapped to the H. sapiens genome only once, having two mismatches, will have an ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times, always having only one mismatch, will have an ambiguity of "3".

A read which could be mapped to the H. sapiens genome three times having one mismatch and one time having zero mismatches will have an ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times having one mismatch and two times having zero mismatches will have an ambiguity of "2".
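
The four examples above follow one rule: count the mapping positions that share the lowest number of mismatches. A minimal sketch of that rule, assuming the mismatch count of every mapping position of a read is available as a list (the function name ambiguity is hypothetical and not part of the pipeline):

```python
def ambiguity(mismatch_counts):
    """Number of mapping positions that share the best (lowest) mismatch count."""
    if not mismatch_counts:
        raise ValueError("a mapped read needs at least one mapping position")
    best = min(mismatch_counts)
    return sum(1 for m in mismatch_counts if m == best)


print(ambiguity([2]))              # one position, two mismatches
print(ambiguity([1, 1, 1]))        # three positions, one mismatch each
print(ambiguity([1, 1, 1, 0]))     # a single perfect match wins
print(ambiguity([1, 1, 1, 0, 0]))  # two perfect matches
```

The four calls reproduce the ambiguities "1", "3", "1" and "2" of the examples.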

Output

A "Cluster" file which contains detailed information for each cluster. This is the same output format as used by the script cluster_overlapping_hits.pl. Since some parameters, like mean_start, are meaningless for sliding window clustering, these parameters are left blank (indicated with "-"). The parameters are, however, kept for reasons of compatibility and to provide a single output format. An example:

 query_id     count  mean_length mean_ambiguity reference_id                                          strand start      end mean_start  std_start mean_end std_end std_length mean_mismatches   sequence
 slide_3        10      15.9            1.1     hsa-mir-1233 MI0006323 Homo sapiens miR-1233 stem-loop  F       1       1000    -       -       -               -       1.29    2.0             -
 slide_5        32      22.0            1.0     hsa-mir-1292 MI0006433 Homo sapiens miR-1292 stem-loop  F       1       1000    -       -       -               -       3.81    0.7             -
 slide_6        97      16.0            1.1     hsa-mir-1181 MI0006274 Homo sapiens miR-1181 stem-loop  F       1       1000    -       -       -               -       1.40    1.9             -
 slide_8        101     21.9            1.0     hsa-mir-548k MI0006354 Homo sapiens miR-548k stem-loop  F       1       1000    -       -       -               -       0.54    0.3             -

Another example:

 query_id       count   mean_length     mean_ambiguity  reference_id    strand  start   end     mean_start      std_start       mean_end        std_end std_length      mean_mismatches sequence
 slide_28       11      20.0    2.0     chrX    R       512001  513000  -       -       -       -       0.45    2.0     -
 slide_47       11      25.2    2.0     chrX    R       1465001 1466000 -       -       -       -       4.79    0.3     -
 slide_49       12      29.6    2.0     chrX    R       1468001 1469000 -       -       -       -       3.40    0.7     -
 slide_138      10      20.9    1.8     chrX    R       3641001 3642000 -       -       -       -       3.38    0.7     -

query_id

The ID of the sliding window. IDs are assigned as successive numbers.

count

The total number of reads assigned to this sliding window.

mean_length

The average length of the reads assigned to this sliding window.

mean_ambiguity

The average ambiguity of the reads assigned to this sliding window. See also Ambiguity.

reference_id

The ID of the reference sequence on which this sliding window is located.

strand

The strand on which this sliding window is located, either F or R.

start

The start position of the sliding window.

end

The end position of the sliding window.

mean_start

Meaningless for sliding window clustering and therefore left blank (indicated by "-").

std_start

Meaningless for sliding window clustering and therefore left blank (indicated by "-").

mean_end

Meaningless for sliding window clustering and therefore left blank (indicated by "-").

std_end

Meaningless for sliding window clustering and therefore left blank (indicated by "-").

std_length

The standard deviation of the lengths of the reads assigned to this sliding window.

mean_mismatches

The average number of mismatches of the reads assigned to this sliding window.

sequence

Meaningless for sliding window clustering and therefore left blank (indicated by "-").
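
How the populated columns relate to the reads of one window can be sketched as follows. This is an illustration under assumptions, not the script's implementation: each read is given as a hypothetical (length, ambiguity, mismatches) tuple and counted once, and the population standard deviation is used, whereas the script may use the sample standard deviation instead.

```python
from statistics import mean, pstdev


def summarize_window(reads, min_count=1):
    """Summarize one window; reads is a list of (length, ambiguity, mismatches).

    Returns None when the window holds fewer than min_count reads,
    mirroring the documented --min_count behaviour.
    """
    if len(reads) < max(min_count, 1):
        return None
    lengths = [r[0] for r in reads]
    return {
        "count": len(reads),
        "mean_length": mean(lengths),
        "mean_ambiguity": mean(r[1] for r in reads),
        "std_length": pstdev(lengths),  # assumption: population std
        "mean_mismatches": mean(r[2] for r in reads),
    }


reads = [(20, 1, 0), (22, 1, 2), (24, 2, 1)]
summary = summarize_window(reads)
print(summary["count"], summary["mean_length"], summary["mean_mismatches"])
print(summarize_window(reads, min_count=10))  # None: window is not reported
```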


REQUIREMENTS

Perl 5.8 or higher


AUTHORS

Robert Kofler

Heinz Himmelbauer


CONTACT

robert.kofler at crg.es