NAME

cluster_overlapping_hits.pl - Clusters overlapping reads and calculates specifics for the cluster

SYNOPSIS

 # Minimal argument call specifying all required parameters.
 cluster_overlapping_hits.pl --input Mapping_day0_1_i_Eland_against_mature_unambiguous.txt
                             --output clusterfile.txt

 # Maximum argument call specifying all possible parameters; Several different input files may be specified
 # Note that also the file containing the ambiguous hits may be specified
 cluster_overlapping_hits.pl --output clusterfile.txt --min_length 15 --max_length 32 --max_mm 2 --strand RF
                             --threshold 3 --min_count 10 --min_clusterlength 10 --max_clusterlength 30
                             --max_5_smear 1.5 --max_3_smear 4.5 --max_ambiguity 2 --tempdir "/tmp"
                             --input Mapping_day0_1_i_Eland_against_mature_unambiguous.txt
                             --input Mapping_day0_1_i_Eland_against_mature_ambiguous.txt

OPTIONS

--input

The input files; Several files may be specified, e.g.: --input file1 --input file2. The input files have to be output files of the script run_Mapping or run_Multimapper. Note that unambiguously and ambiguously mapped reads may be provided for this script. Mandatory parameter.

--output

The output file. Mandatory parameter

--strand

Only reads mapping to the specified strand will be used. Possible values: R (reverse strand), F (forward strand), RF (both strands); default=RF

--min_length

The minimum length of reads. Shorter reads will be ignored. default=15

--max_length

The maximum length of reads. Longer reads will be ignored. default=100

--max_mm

The maximum number of mismatches. Reads having more mismatches will not be used. default=2

--max_ambiguity

The maximum ambiguity of the hits. Hits having a higher ambiguity will be ignored. The ambiguity is an integer value which relates how often a read could be mapped with an equal good score (number of mismatches) to the reference sequence. Examples:

A read which could be mapped to the H. sapiens genome only once having two mismatches, will have a ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times, always having one mismatch, will have a ambiguity of "3".

A read which could be mapped to the H. sapiens genome three times having one mismatch and one time having zero mismatches, will have a ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times having one mismatch and two times having zero mismatches, will have a ambiguity of "2".

default=5

--threshold

Threshold for clustering of reads. Clustering of the reads commences as soon as the read-coverage of a certain nucleotide exceeds this threshold. Everthing below this threshold will be regarded as background noise. default=3

--min_count

Minimum counts for a read-cluster. Clusters having less reads will not be reported. default=--threshold

--min_clusterlength

Minimum length of a cluster in base pairs. Smaller cluster will not be reported. default=undef

--max_clusterlength

Maximum length of a cluster in base pairs. Larger clusters will not be reported. default=undef

--max_5_smear

Maximum 5 prime smear. The script calculates the standard deviation of the 5'-position for all reads assigned to one cluster. If the standard deviation of the 5'-position exceeds this threshold the cluster will not be reported. This feature might be usefull as known miRNAs frequently have a very pronounced 5'-position. default=undef

--max_3_smear

Maximum 3 prime smear. The script calculates the standard deviation of the 3'-position for all reads assigned to one cluster. If the standard deviation of the 3'-position exceeds this threshold the cluster will not be reported. default=undef

--help

Display the help pages.

DESCRIPTION

General

The script clusters overlapping reads into larger clusters and calculates specifics for each cluster such as the length, or the read counts. The special algorithm starts clustering as soon as the coverage of a nucleotide exceed a certain threshold --threshold and stops when the coverage drops below this threshold. Details for each cluster are finally provided in the output file.

Input

Mapping results of the script run_Mapping.pl or run_Multimapper.pl. Note that unambiguous and ambiguous mapping results may be provided.

For example:

 3578635||Count=61       ACAGATGATGAACTTATTGACGGGCGGGCAGA        chr22   1       1       R       38045032
 3578646||Count=1        AACTGTGGGAACAAAATGACAGCTTAGAGCAG        chr17   1       0       F       59023828
 3578668||Count=1        AAGTTCCGTAGGAGAGAGACTTTGTTTTTCTG        chr21   1       0       F       46797087
 3578670||Count=1        GTGCAAACTCGATCACTAGCTCTTCGTGATGT        chr1    1       1       F       93078864
 3578784||Count=1        TTACTTCGCTTGTCATCAAACCAACTCTCTGG        chr5    1       2       F       140085948
 3578819||Count=1        ACGCGGGAGACCGGGGTTCAATTCCCGGATGG        chr12   1       2       F       97421448
 3578830||Count=1        ACTCTCTCGGCTCTGCATAGTTGCACTTTGCT        chr16   1       1       R       1953077

Ambiguity

Ambiguity is an important concept in the MIRO-pipeline, it is therefore crucial that this concept is properly understood. In a nutshell, ambigutiy is the number of equal good mapping positions for a single Solexa-read. Equal good in this context refers to the number of mismatches. In the MIRO-pipeline all unambiguously mapped reads have a ambiguity of "1" and they are provided in a separate output-file. All ambiguously mapped reads, on the other hand, have a ambiguity of ">=2"

Examples:

A read which could be mapped to the H. sapiens genome only once having two mismatches, will have a ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times, always having only one mismatch, will have a ambiguity of "3".

A read which could be mapped to the H. sapiens genome three times having one mismatch and one time having zero mismatches, will have a ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times having one mismatch and two times having zero mismatches, will have a ambiguity of "2".

Output

A "Cluster" file which contains detailed information for each cluster. Following an example

 query_id    count  mean_length mean_ambiguity          reference_id                                  strand  start    end  mean_start  std_start   mean_end  std_end std_length mean_mismatches        sequence
 overl_0        6       22.2    1.0             hsa-mir-190b MI0005545 Homo sapiens miR-190b stem-loop  F       11      33      11.0    0.00            32.2    0.75    0.75            0.8     TGATATGTTTGATATTGGGTTG
 overl_1        189     24.5    1.0             hsa-mir-923 MI0005715 Homo sapiens miR-923 stem-loop    F       1       55      20.7    11.53           44.1    10.60   5.26            1.1     TATTAGTCAGCGGAGGA
 overl_2        5       17.8    1.0             hsa-mir-1298 MI0003938 Homo sapiens miR-1298 stem-loop  F       14      39      18.6    3.29            35.4    4.62    3.83            1.2     AGGGTTGATTCGGCT
 overl_5        3       20.0    1.0             hsa-mir-27a MI0000085 Homo sapiens miR-27a stem-loop    F       10      34      13.0    5.20            32.0    2.00    3.61            1.0     AGGGCTTAGCTGCTTGTGAGC

query_id: The ID of the cluster. IDs are assigned as successive numbers.
count: The total number of reads assigned to this cluster.
mean_length: The average length of the reads assigned to this cluster
mean_ambiguity: The average ambiguity of the reads assigned to this cluster. See also Ambiguity
reference_id: The reference sequence ID on which this cluster is located
strand: The strand on which this cluster is located, either F or R
start: The start position of the cluster. Actually the start position of the most 5' read
end: The end position of the cluster. Actually the end position of the most 3' read
mean_start: The mean start position for the cluster, i.e the average start position of all the reads within the cluster
std_start: The standard deviation of the start position
mean_end: The mean end position for the cluster, i.e. the average end position of all the reads within the cluster
std_end: The standard deviation of the end position
std_length: The standar deviation of the average read length
mean_mismatches: The average number of mismatches of the reads assigned to this cluster
sequence: A representative sequence for this cluster. From all reads mapping to mean_start the read with the fewest mismatches is chosen.

REQUIREMENTS

Perl 5.8 or higher

AUTHORS

Robert Kofler

Heinz Himmelbauer

CONTACT

robert.kofler at crg.es