Cluster2Bed.pl - Converts a cluster file into a .bed file
    # Minimal argument call, specifying all required parameters.
    Cluster2Bed.pl --input cluster_file.txt --output genome_browsable.bed

    # Maximal argument call, specifying all possible parameters.
    # Note that several cluster files may be provided.
    Cluster2Bed.pl --range chr22:1000:50000 --trackname "day0" --output genome_browsable.bed --input cluster_file1.txt --input cluster_file2.txt
The input has to be a cluster file. Several files may be specified, e.g.: --input cluster_file1 --input cluster_file2. Cluster files may be created with the script cluster_overlapping_hits.pl or run_sliding_window.pl.
The output file. This parameter is mandatory.
If you are only interested in clusters from a certain chromosomal region, you may use this option. The range has to be provided in the format chromosome:start:end. For example, --range chr22:10000:50000 means that only clusters from chromosome 22 located between 10,000 and 50,000 bp will be used. default=undef
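The range format described above can be illustrated with a short Python sketch. This is not the script's own Perl code; the names Range, parse_range and in_range are hypothetical, and the sketch assumes that a cluster is kept only if it lies entirely inside the range, with both boundaries inclusive. The script itself may treat clusters that overlap a boundary differently.

```python
from typing import NamedTuple


class Range(NamedTuple):
    chromosome: str
    start: int
    end: int


def parse_range(text: str) -> Range:
    """Parse a 'chromosome:start:end' string into a Range."""
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected chromosome:start:end, got {text!r}")
    chromosome, start, end = parts[0], int(parts[1]), int(parts[2])
    if start > end:
        raise ValueError(f"start {start} is larger than end {end}")
    return Range(chromosome, start, end)


def in_range(rng: Range, chromosome: str, start: int, end: int) -> bool:
    """Return True if the cluster lies on the same chromosome and inside the range.

    Assumption: the whole cluster must fit inside the range (inclusive bounds).
    """
    return chromosome == rng.chromosome and rng.start <= start and end <= rng.end
```

With parse_range("chr22:10000:50000"), a cluster on chr22 spanning 12000 to 12030 would be kept, while a cluster on chr12 or one ending beyond 50000 would be skipped.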
A name for the track. This information will, for example, be displayed in the UCSC Genome Browser. A name may also easily be added afterwards by editing the ".bed" file. default="unknown"
Display the help pages
Converts a cluster file into a .bed file. Bed files are accepted by most genome browsers, such as the UCSC Genome Browser. The ".bed" format allows for a score. This script uses this option by assigning a number ranging from 1 to 1000 to each cluster, which reflects the ambiguity of the individual reads forming the cluster. Unambiguous clusters will receive a score of 1000, whereas the score will be lower for ambiguous clusters.
The input has to be a cluster file, which may be created with the scripts cluster_overlapping_hits.pl or run_sliding_window.pl.
For example:
    query_id  count  mean_length  mean_ambiguity  reference_id  strand  start  end  mean_start  std_start  mean_end  std_end  std_length  mean_mismatches  sequence
    overl_0   189    24.5         1.0             chr1          F       1      55   20.7        11.53      44.1      10.60    5.26        1.1              TATTAGTCAGCGGAGGA
    overl_1   25     18.9         5.0             chr12         F       24     50   30.7        1.59       48.6      2.34     1.29        0.6              AAAAGCTGGGTTGAGAGG
    overl_2   836    20.1         1.3             chr12         F       45     78   51.1        0.86       70.2      1.49     1.46        0.6              TTCACAGTGGCTAAGTTCCG
    overl_3   37     21.4         1.0             chr22         F       40     63   40.3        1.05       60.6      1.69     2.13        0.6              TTTGTTCGTTCGGCTCGCGTG
    overl_4   56     21.1         1.0             chr12         F       15     37   15.1        0.23       35.1      0.49     0.52        0.6              CATTATTACTTTTGGTACGCG
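As an illustration of the cluster file layout shown above, the following Python sketch reads such a file into a list of records. It is not part of Cluster2Bed.pl; the function name read_clusters is hypothetical, and the sketch assumes that the first non-empty line is the header and that columns are separated by whitespace (spaces or tabs).

```python
def read_clusters(lines):
    """Read cluster records from an iterable of text lines.

    Assumptions: the first non-empty line holds the column names and
    all fields are separated by whitespace.
    """
    header = None
    rows = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        if header is None:
            header = fields
            continue
        if len(fields) != len(header):
            raise ValueError(f"expected {len(header)} fields, got {len(fields)}")
        row = dict(zip(header, fields))
        # Convert the columns needed for a .bed file to numbers.
        for key in ("count", "start", "end"):
            row[key] = int(row[key])
        row["mean_ambiguity"] = float(row["mean_ambiguity"])
        rows.append(row)
    return rows
```

An open file handle can be passed directly, for example read_clusters(open("cluster_file.txt")), since iterating over a file yields its lines.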
A ".bed" file, which is accepted by most genome browsers.
For example:
    track name=region 1 description="unknown" useScore=1
    chr1 1000 1055 count:189 1000 +
    chr1 30023 30050 count:25 600 +
    chr1 44001 44078 count:836 970 +
    chr2 50039 50063 count:37 1000 +
    chr2 100044 100037 count:56 1000 -
The id of the reference sequence. In order to be accepted by a genome browser, this has to be something like chr1, chr12, etc.
Start position of the cluster
End position of the cluster
The counts of the cluster. This is actually the name field of the .bed file. Most genome browsers will display this information close to each cluster, which makes the number of reads assigned to a cluster immediately accessible.
The score. Some genome browsers assign different colors to features with different scores. For example, high-scoring clusters may have a very dark color, whereas low-scoring clusters will have a very light color. This script assigns a score which reflects the ambiguity of the individual reads forming a read cluster.
    Score = 1000 - (mean_ambiguity - 1) * 100
    Score = 100 if Score < 100
Strand; + forward strand; - reverse strand
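The score formula and the six output columns described above can be sketched in Python as follows. This is an illustration only, not the script's Perl implementation; the names bed_score and bed_line are hypothetical. The sketch assumes that the score is rounded to the nearest integer, that fields are separated by tabs, that a strand value of F means forward and any other value means reverse, and that coordinates are written unchanged.

```python
def bed_score(mean_ambiguity: float) -> int:
    """Score = 1000 - (mean_ambiguity - 1) * 100, with a lower limit of 100."""
    score = 1000 - (mean_ambiguity - 1) * 100
    if score < 100:
        score = 100
    # Assumption: the score is rounded to the nearest integer.
    return round(score)


def bed_line(reference_id, start, end, count, mean_ambiguity, strand):
    """Build one feature line with the six columns described above."""
    # Assumption: 'F' marks the forward strand, anything else the reverse strand.
    bed_strand = "+" if strand == "F" else "-"
    fields = [
        reference_id,
        str(start),
        str(end),
        f"count:{count}",
        str(bed_score(mean_ambiguity)),
        bed_strand,
    ]
    # Assumption: fields are tab-separated.
    return "\t".join(fields)
```

For the input cluster overl_1 (mean_ambiguity 5.0) the formula gives 1000 - 4 * 100 = 600, and for overl_2 (mean_ambiguity 1.3) it gives 970, in line with the scores shown in the example output.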
Perl 5.8 or higher
Robert Kofler
Heinz Himmelbauer
robert.kofler at crg.es