NAME

Cluster2Bed.pl - Converts a cluster file into a .bed file


SYNOPSIS

 # Minimal call, specifying only the required parameters.
 Cluster2Bed.pl --input cluster_file.txt \
                --output genome_browsable.bed
 
 # Maximal call, specifying all available parameters.
 # Note that several cluster files may be provided.
 Cluster2Bed.pl --range chr22:1000:50000 --trackname "day0" \
                --output genome_browsable.bed \
                --input cluster_file1.txt \
                --input cluster_file2.txt


OPTIONS

--input

The input has to be a cluster file. Several files may be specified, e.g. --input cluster_file1 --input cluster_file2. Cluster files may be created with the scripts cluster_overlapping_hits.pl or run_sliding_window.pl.

--output

The output file. Mandatory parameter.

--range

Use this option if you are only interested in clusters from a certain chromosomal region. The range has to be provided in the format chromosome:start:end. For example, --range chr22:10000:50000 means that only clusters on chromosome 22 located between 10,000 and 50,000 bp will be used. default=undef
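The chromosome:start:end format can be illustrated with a short Python sketch. It is not taken from Cluster2Bed.pl; the names parse_range and in_range are hypothetical, and the rule that a cluster must lie entirely inside the range is an assumption, since the text above does not say how clusters that only partly overlap the range are treated.

```python
from typing import NamedTuple, Optional


class Range(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_range(spec: Optional[str]) -> Optional[Range]:
    """Parse 'chromosome:start:end'. None (the default) means no filtering."""
    if spec is None:
        return None
    parts = spec.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected chromosome:start:end, got {spec!r}")
    chrom, start, end = parts[0], int(parts[1]), int(parts[2])
    if start > end:
        raise ValueError(f"range start {start} is larger than end {end}")
    return Range(chrom, start, end)


def in_range(chrom: str, start: int, end: int, wanted: Optional[Range]) -> bool:
    """Return True if the cluster should be kept.

    Assumption: a cluster is kept only if it lies entirely inside the range.
    """
    if wanted is None:
        return True
    return chrom == wanted.chrom and start >= wanted.start and end <= wanted.end


wanted = parse_range("chr22:10000:50000")
print(in_range("chr22", 12000, 12050, wanted))  # True
print(in_range("chr1", 12000, 12050, wanted))   # False
```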

--trackname

A name for the track. This information will, for example, be displayed in the UCSC Genome Browser. A name can also easily be added afterwards by editing the ".bed" file. default="unknown"

--help

Display the help pages


DESCRIPTION

General

Converts a cluster file into a .bed file. Bed files are accepted by most genome browsers, such as the UCSC Genome Browser. The ".bed" format provides a score field. This script uses it by assigning each cluster a score from 100 to 1000 that reflects the ambiguity of the individual reads forming the cluster. Unambiguous clusters receive a score of 1000, whereas ambiguous clusters receive a lower score.

Input

The input has to be a cluster file, which may be created with the scripts cluster_overlapping_hits.pl or run_sliding_window.pl.

For example:

 query_id       count   mean_length     mean_ambiguity  reference_id    strand  start   end     mean_start      std_start       mean_end        std_end std_length      mean_mismatches sequence
 overl_0        189     24.5    1.0     chr1    F       1       55      20.7    11.53   44.1    10.60   5.26    1.1     TATTAGTCAGCGGAGGA
 overl_1        25      18.9    5.0     chr12   F       24      50      30.7    1.59    48.6    2.34    1.29    0.6     AAAAGCTGGGTTGAGAGG
 overl_2        836     20.1    1.3     chr12   F       45      78      51.1    0.86    70.2    1.49    1.46    0.6     TTCACAGTGGCTAAGTTCCG
 overl_3        37      21.4    1.0     chr22   F       40      63      40.3    1.05    60.6    1.69    2.13    0.6     TTTGTTCGTTCGGCTCGCGTG
 overl_4        56      21.1    1.0     chr12   F       15      37      15.1    0.23    35.1    0.49    0.52    0.6     CATTATTACTTTTGGTACGCG
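A cluster file of this shape can be read as sketched below. This is an illustrative Python sketch, not the parser used by Cluster2Bed.pl. It assumes that the columns are separated by whitespace and that the first non-empty line is the header; the function name read_clusters and the CONVERTERS table are hypothetical.

```python
import io
from typing import Dict, Iterator, TextIO

# Columns needed for the .bed conversion, with their numeric types.
CONVERTERS = {"count": int, "start": int, "end": int, "mean_ambiguity": float}

SAMPLE = """\
query_id count mean_length mean_ambiguity reference_id strand start end mean_start std_start mean_end std_end std_length mean_mismatches sequence
overl_0 189 24.5 1.0 chr1 F 1 55 20.7 11.53 44.1 10.60 5.26 1.1 TATTAGTCAGCGGAGGA
overl_1 25 18.9 5.0 chr12 F 24 50 30.7 1.59 48.6 2.34 1.29 0.6 AAAAGCTGGGTTGAGAGG
"""


def read_clusters(handle: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one dict per cluster, keyed by the column names of the header."""
    header = None
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        if header is None:
            header = fields  # assumption: first non-empty line is the header
            continue
        record: Dict[str, object] = dict(zip(header, fields))
        for name, convert in CONVERTERS.items():
            record[name] = convert(record[name])
        yield record


for cluster in read_clusters(io.StringIO(SAMPLE)):
    print(cluster["reference_id"], cluster["start"], cluster["end"], cluster["count"])
```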

Output

A ".bed" file, which is accepted by most genome browsers.

For example:

 track name=region 1 description="unknown" useScore=1
 chr1   1000    1055    count:189       1000    +
 chr1   30023   30050   count:25        600     +
 chr1   44001   44078   count:836       970     +
 chr2   50039   50063   count:37        1000    +
 chr2   100044  100037  count:56        1000    -

column 1

The id of the reference sequence. In order to be accepted by a genome browser this has to be something like chr1, chr12, etc.

column 2

Start position of the cluster

column 3

End position of the cluster

column 4

The read count of the cluster. This is actually the name field of the .bed file. Most genome browsers display this information close to each cluster, which makes the number of reads assigned to a cluster immediately accessible.

column 5

The score. Some genome browsers assign different colors to features with different scores. For example, high-scoring clusters may have a very dark color whereas low-scoring clusters will have a very light color. This script assigns a score which reflects the ambiguity of the individual reads forming a read cluster.

 Score = 1000 - (mean_ambiguity - 1) * 100
 Score = 100 if Score < 100

column 6

Strand; + forward strand; - reverse strand
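The six columns described above, including the score formula of column 5, can be put together as in the following Python sketch. It is an illustration under stated assumptions, not the code of Cluster2Bed.pl: the function names are hypothetical, rounding the score to an integer is an assumption, "R" as the input code for the reverse strand is an assumption (the input example only shows "F"), and start and end are passed through unchanged because this document does not state whether the script shifts coordinates.

```python
# Assumption: "F" marks the forward strand (as in the input example)
# and "R" marks the reverse strand.
STRAND_SYMBOL = {"F": "+", "R": "-"}


def bed_score(mean_ambiguity: float) -> int:
    """Score = 1000 - (mean_ambiguity - 1) * 100, with a lower bound of 100."""
    score = 1000 - (mean_ambiguity - 1) * 100
    # Assumption: the score is rounded to the nearest integer.
    return max(100, round(score))


def bed_line(chrom: str, start: int, end: int, count: int,
             mean_ambiguity: float, strand: str) -> str:
    """Build one tab-separated .bed line with the six columns described above."""
    columns = [
        chrom,                           # column 1: reference id
        str(start),                      # column 2: start (passed through unchanged)
        str(end),                        # column 3: end (passed through unchanged)
        f"count:{count}",                # column 4: name field holding the read count
        str(bed_score(mean_ambiguity)),  # column 5: score
        STRAND_SYMBOL[strand],           # column 6: strand
    ]
    return "\t".join(columns)


print(bed_score(1.0))   # 1000
print(bed_score(1.3))   # 970
print(bed_score(5.0))   # 600
print(bed_score(12.0))  # 100
print(bed_line("chr12", 24, 50, 25, 5.0, "F"))
```

The values 970 and 600 agree with the scores shown in the output example for the clusters with a mean ambiguity of 1.3 and 5.0.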


REQUIREMENTS

Perl 5.8 or higher


AUTHORS

Robert Kofler

Heinz Himmelbauer


CONTACT

robert.kofler at crg.es