cluster_overlapping_hits.pl - Clusters overlapping reads and calculates specifics for the cluster
# Minimal argument call specifying all required parameters. cluster_overlapping_hits.pl --input Mapping_day0_1_i_Eland_against_mature_unambiguous.txt --output clusterfile.txt
# Maximum argument call specifying all possible parameters; Several different input files may be specified # Note that also the file containing the ambiguous hits may be specified cluster_overlapping_hits.pl --output clusterfile.txt --min_length 15 --max_length 32 --max_mm 2 --strand RF --threshold 3 --min_count 10 --min_clusterlength 10 --max_clusterlength 30 --max_5_smear 1.5 --max_3_smear 4.5 --max_ambiguity 2 --tempdir "/tmp" --input Mapping_day0_1_i_Eland_against_mature_unambiguous.txt --input Mapping_day0_1_i_Eland_against_mature_ambiguous.txt
The input files; Several files may be specified, e.g.: --input file1 --input file2
.
The input files have to be output files of the script run_Mapping
or run_Multimapper
.
Note that unambiguously and ambiguously mapped reads may be provided for this script. Mandatory parameter.
The output file. Mandatory parameter
Only reads mapping to the specified strand will be used. Possible values: R (reverse strand), F (forward strand), RF (both strands); default=RF
The minimum length of reads. Shorter reads will be ignored. default=15
The maximum length of reads. Longer reads will be ignored. default=100
The maximum number of mismatches. Reads having more mismatches will not be used. default=2
The maximum ambiguity of the hits. Hits having a higher ambiguity will be ignored. The ambiguity is an integer value which relates how often a read could be mapped with an equal good score (number of mismatches) to the reference sequence. Examples:
A read which could be mapped to the H. sapiens genome only once having two mismatches, will have a ambiguity of "1".
A read which could be mapped to the H. sapiens genome three times, always having one mismatch, will have a ambiguity of "3".
A read which could be mapped to the H. sapiens genome three times having one mismatch and one time having zero mismatches, will have a ambiguity of "1".
A read which could be mapped to the H. sapiens genome three times having one mismatch and two times having zero mismatches, will have a ambiguity of "2".
default=5
Threshold for clustering of reads. Clustering of the reads commences as soon as the read-coverage of a certain nucleotide exceeds this threshold. Everthing below this threshold will be regarded as background noise. default=3
Minimum counts for a read-cluster. Clusters having less reads will not be reported. default=--threshold
Minimum length of a cluster in base pairs. Smaller cluster will not be reported. default=undef
Maximum length of a cluster in base pairs. Larger clusters will not be reported. default=undef
Maximum 5 prime smear. The script calculates the standard deviation of the 5'-position for all reads assigned to one cluster. If the standard deviation of the 5'-position exceeds this threshold the cluster will not be reported. This feature might be usefull as known miRNAs frequently have a very pronounced 5'-position. default=undef
Maximum 3 prime smear. The script calculates the standard deviation of the 3'-position for all reads assigned to one cluster. If the standard deviation of the 3'-position exceeds this threshold the cluster will not be reported. default=undef
Display the help pages.
The script clusters overlapping reads into larger clusters and calculates specifics for each cluster such as the length, or the read counts.
The special algorithm starts clustering as soon as the coverage of a nucleotide exceed a certain threshold --threshold
and stops when the coverage drops below this threshold.
Details for each cluster are finally provided in the output file.
Mapping results of the script run_Mapping.pl
or run_Multimapper.pl
.
Note that unambiguous and ambiguous mapping results may be provided.
For example:
3578635||Count=61 ACAGATGATGAACTTATTGACGGGCGGGCAGA chr22 1 1 R 38045032 3578646||Count=1 AACTGTGGGAACAAAATGACAGCTTAGAGCAG chr17 1 0 F 59023828 3578668||Count=1 AAGTTCCGTAGGAGAGAGACTTTGTTTTTCTG chr21 1 0 F 46797087 3578670||Count=1 GTGCAAACTCGATCACTAGCTCTTCGTGATGT chr1 1 1 F 93078864 3578784||Count=1 TTACTTCGCTTGTCATCAAACCAACTCTCTGG chr5 1 2 F 140085948 3578819||Count=1 ACGCGGGAGACCGGGGTTCAATTCCCGGATGG chr12 1 2 F 97421448 3578830||Count=1 ACTCTCTCGGCTCTGCATAGTTGCACTTTGCT chr16 1 1 R 1953077
Ambiguity is an important concept in the MIRO-pipeline, it is therefore crucial that this concept is properly understood. In a nutshell, ambigutiy is the number of equal good mapping positions for a single Solexa-read. Equal good in this context refers to the number of mismatches. In the MIRO-pipeline all unambiguously mapped reads have a ambiguity of "1" and they are provided in a separate output-file. All ambiguously mapped reads, on the other hand, have a ambiguity of ">=2"
Examples:
A read which could be mapped to the H. sapiens genome only once having two mismatches, will have a ambiguity of "1".
A read which could be mapped to the H. sapiens genome three times, always having only one mismatch, will have a ambiguity of "3".
A read which could be mapped to the H. sapiens genome three times having one mismatch and one time having zero mismatches, will have a ambiguity of "1".
A read which could be mapped to the H. sapiens genome three times having one mismatch and two times having zero mismatches, will have a ambiguity of "2".
A "Cluster" file which contains detailed information for each cluster. Following an example
query_id count mean_length mean_ambiguity reference_id strand start end mean_start std_start mean_end std_end std_length mean_mismatches sequence overl_0 6 22.2 1.0 hsa-mir-190b MI0005545 Homo sapiens miR-190b stem-loop F 11 33 11.0 0.00 32.2 0.75 0.75 0.8 TGATATGTTTGATATTGGGTTG overl_1 189 24.5 1.0 hsa-mir-923 MI0005715 Homo sapiens miR-923 stem-loop F 1 55 20.7 11.53 44.1 10.60 5.26 1.1 TATTAGTCAGCGGAGGA overl_2 5 17.8 1.0 hsa-mir-1298 MI0003938 Homo sapiens miR-1298 stem-loop F 14 39 18.6 3.29 35.4 4.62 3.83 1.2 AGGGTTGATTCGGCT overl_5 3 20.0 1.0 hsa-mir-27a MI0000085 Homo sapiens miR-27a stem-loop F 10 34 13.0 5.20 32.0 2.00 3.61 1.0 AGGGCTTAGCTGCTTGTGAGC
The ID of the cluster. IDs are assigned as successive numbers.
The total number of reads assigned to this cluster.
The average length of the reads assigned to this cluster
The average ambiguity of the reads assigned to this cluster. See also Ambiguity
The reference sequence ID on which this cluster is located
The strand on which this cluster is located, either F or R
The start position of the cluster. Actually the start position of the most 5' read
The end position of the cluster. Actually the end position of the most 3' read
The mean start position for the cluster, i.e the average start position of all the reads within the cluster
The standard deviation of the start position
The mean end position for the cluster, i.e. the average end position of all the reads within the cluster
The standard deviation of the end position
The standar deviation of the average read length
The average number of mismatches of the reads assigned to this cluster
A representative sequence for this cluster.
From all reads mapping to mean_start
the read with the fewest mismatches is chosen.
Perl 5.8 or higher
Robert Kofler
Heinz Himmelbauer
robert.kofler at crg.es