NAME

Hit2Bed.pl - Converts mapping results to a .bed file


SYNOPSIS

 # Minimal argument call, specifying all required parameters.
 Hit2Bed.pl --input Mapping_day0_1_i_Eland_against_mature_unambiguous.txt \
            --output genome_browsable.bed
 
 # Maximal argument call, specifying all possible parameters.
 # Note that the file containing the ambiguous hits may also be provided.
 Hit2Bed.pl  --range chr22:1000:50000 --trackname "day0" \
             --min_length 15 --max_length 32 --max_ambiguity 2 \
             --output genome_browsable.bed --max_mm 2 --strand "R" \
             --input Mapping_day0_1_i_Eland_against_mature_unambiguous.txt \
             --input Mapping_day0_1_i_Eland_against_mature_ambiguous.txt


OPTIONS

--input

The input files. Several files may be specified, e.g.: --input file1 --input file2. The input files have to be output files of the script run_Mapping.pl or run_Multimapper.pl. Note that both unambiguously and ambiguously mapped reads may be provided to this script. Mandatory parameter.

--output

The output file. Mandatory parameter.

--min_length

The minimum length of reads. Shorter reads will be ignored. default=15

--max_length

The maximum length of reads. Longer reads will be ignored. default=100

--max_mm

The maximum number of mismatches. Reads having more mismatches will not be used. default=2

--strand

Only reads mapping to the specified strand will be used. Possible values: R (reverse strand), F (forward strand), RF (both strands); default=RF

--max_ambiguity

The maximum ambiguity of the hits. Hits having a higher ambiguity will be ignored. The ambiguity is an integer value which indicates how often a read could be mapped to the reference sequence with an equally good score (number of mismatches). Examples:

A read which could be mapped to the H. sapiens genome only once, having two mismatches, will have an ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times, always having one mismatch, will have an ambiguity of "3".

A read which could be mapped to the H. sapiens genome three times having one mismatch and one time having zero mismatches will have an ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times having one mismatch and two times having zero mismatches will have an ambiguity of "2".

default=5
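
The ambiguity examples above can be reproduced with a small sketch. The helper name "ambiguity" and its input (a list with the mismatch count of every hit of one read) are assumptions made for illustration; they are not taken from the script itself.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper (not part of Hit2Bed.pl): given the mismatch counts
# of all hits of one read, return how many hits share the best (lowest)
# mismatch count.
sub ambiguity {
    my @mismatches = @_;
    return 0 unless @mismatches;            # a read without hits
    my $best = min(@mismatches);
    return scalar grep { $_ == $best } @mismatches;
}

# The four examples from the text above.
print ambiguity(2), "\n";                   # one hit, two mismatches
print ambiguity(1, 1, 1), "\n";             # three hits, one mismatch each
print ambiguity(1, 1, 1, 0), "\n";          # three hits with one, one with zero
print ambiguity(1, 1, 1, 0, 0), "\n";       # three hits with one, two with zero
```

Only the hits with the lowest number of mismatches are counted, which is why the third example yields "1" although the read has four hits in total.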

--range

If you are only interested in reads mapping to a certain chromosomal region you may use this option. The range has to be provided in the following format: chromosome:start:end. For example, --range chr22:10000:50000 means that only reads from chromosome 22 mapping to positions between 10,000 and 50,000 bp will be used. default=undef
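
As an illustration of the chromosome:start:end format, the following sketch parses a range and tests a hit position against it. The helper names "parse_range" and "in_range" are hypothetical, and treating both boundaries as inclusive is an assumption; the script may handle the boundaries differently.

```perl
use strict;
use warnings;

# Hypothetical helper: split "chromosome:start:end" into its three parts.
sub parse_range {
    my ($range) = @_;
    my ($chr, $start, $end) = $range =~ /^([^:]+):(\d+):(\d+)$/
        or die "Invalid range '$range', expected chromosome:start:end\n";
    die "Range start exceeds range end in '$range'\n" if $start > $end;
    return { chr => $chr, start => $start, end => $end };
}

# Hypothetical helper: true if a hit lies in the range.
# Assumption: both boundaries are inclusive.
sub in_range {
    my ($range, $chr, $pos) = @_;
    return 1 unless defined $range;         # no --range given: keep all reads
    return ($chr eq $range->{chr}
            && $pos >= $range->{start}
            && $pos <= $range->{end}) ? 1 : 0;
}

my $range = parse_range("chr22:10000:50000");
print in_range($range, "chr22", 20000), "\n";     # inside the range
print in_range($range, "chr22", 23901283), "\n";  # outside the range
```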

--trackname

A name for the track. This information will, for example, be displayed in the UCSC-Genome browser. A name may also be easily added afterwards by editing the ".bed" file. default="unknown"

--help

Display the help pages.


DESCRIPTION

General

Converts a hit file into a ".bed" file. Bed files are accepted by most genome browsers, such as the UCSC Genome Browser. The ".bed" format allows for a score. This script uses that field by assigning each read a number from 100 to 1000 that reflects its ambiguity. Unambiguously mapped reads receive a score of 1000, whereas ambiguously mapped reads receive a lower score.

Input

Mapping results of the script run_Mapping.pl or run_Multimapper.pl. Note that unambiguous and ambiguous mapping results may be provided.

For example:


 5031||Count=1   TCCCCGCCGGCGGAA chr4    1       0       F       170168086
 5217||Count=1   GACCGTCCAACGCAC chr20   1       0       R       59264663
 5560||Count=1   ATCGGGTGGTAGCAA chr3    1       0       F       16192245
 6184||Count=12  TCCGGGCTACTGCTG chr1    1       0       F       29388851
 6209||Count=1   GCAGCCATCGTTTTT chr10   1       0       F       61351707
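
A line of this format could be split into its fields roughly as follows. This is only a sketch: the helper name "parse_hit" is hypothetical, and the meaning of columns four and five (assumed here to be the ambiguity and the number of mismatches) is a guess based on the example, not something stated by the mapping scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one whitespace-separated hit line.
# Assumption: column 4 is the ambiguity and column 5 the number of
# mismatches; the example above does not name these columns.
sub parse_hit {
    my ($line) = @_;
    chomp $line;
    my ($name, $seq, $chr, $ambiguity, $mismatches, $strand, $pos)
        = split ' ', $line;
    die "Malformed hit line: $line\n" unless defined $pos;

    # The read name carries the number of observed reads, e.g. 6184||Count=12
    my ($id, $count) = $name =~ /^(.+)\|\|Count=(\d+)$/
        or die "Unexpected read name: $name\n";

    return {
        id         => $id,
        count      => $count,
        seq        => $seq,
        chr        => $chr,
        ambiguity  => $ambiguity,
        mismatches => $mismatches,
        strand     => $strand,
        pos        => $pos,
    };
}

my $hit = parse_hit(
    " 6184||Count=12  TCCGGGCTACTGCTG chr1    1       0       F       29388851"
);
print "$hit->{id} was observed $hit->{count} times on $hit->{chr}\n";
```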

Output

A ".bed" file which is accepted by most genome browsers.

For example:

 track name=region 1 description="unknown" useScore=1
 chr22  23901283        23901298        id:715          600     -
 chr22  41490201        41490216        id:9720         600     -
 chr22  48966829        48966844        id:9720         600     +
 chr22  36860963        36860978        id:24947        600     +
 chr22  16801453        16801468        id:43281        700     +
 chr22  16801453        16801468        id:43281        700     +

column 1

The id of the reference sequence. In order to be accepted by a genome browser this has to be a name such as chr1 or chr12.

column 2

Start position of the read.

column 3

End position of the read.

column 4

The name of the unique sequence. Several reads may share the same unique sequence, so this name need not be unique to a specific read. As shown in Input, each unique sequence mapped with MIRO has an associated Count property (for example: 24688||Count=3). The script therefore writes each unique sequence Count times to the ".bed" file, so that the reads displayed in the genome browser reflect the actual number of observed reads.

column 5

The score. Some genome browsers assign different colors to reads with different scores. For example, high-scoring reads may have a very dark color whereas low-scoring reads will have a very light color. To use this feature this script assigns a score which reflects the mapping ambiguity of the read.

 Score = 1000 - (ambiguity - 1) * 100
 Score = 100 if Score < 100

column 6

Strand. + forward strand; - reverse strand
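
The score formula of column 5 and the Count-fold repetition described for column 4 can be sketched as follows. The helper names "bed_score" and "bed_lines" are hypothetical. Computing the end position as start plus read length is an assumption that matches the 15 bp reads in the example output; the script itself may compute coordinates differently.

```perl
use strict;
use warnings;

# Score = 1000 - (ambiguity - 1) * 100, with a lower bound of 100.
sub bed_score {
    my ($ambiguity) = @_;
    my $score = 1000 - ($ambiguity - 1) * 100;
    $score = 100 if $score < 100;
    return $score;
}

# Hypothetical helper: build the tab-separated ".bed" lines for one unique
# sequence. Assumption: end = start + read length.
sub bed_lines {
    my (%hit) = @_;
    my $end    = $hit{start} + length($hit{seq});
    my $strand = $hit{strand} eq 'F' ? '+' : '-';
    my $line   = join "\t", $hit{chr}, $hit{start}, $end, "id:$hit{id}",
                            bed_score($hit{ambiguity}), $strand;
    return ($line) x $hit{count};           # one line per observed read
}

print "$_\n" for bed_lines(
    id        => 43281,
    seq       => 'A' x 15,                  # placeholder 15 bp sequence
    chr       => 'chr22',
    start     => 16801453,
    strand    => 'F',
    ambiguity => 4,
    count     => 2,
);
```

With an ambiguity of 4 the score is 700, and a Count of 2 yields two identical lines, which corresponds to the repeated id:43281 entry in the example output.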


REQUIREMENTS

Perl 5.8 or higher


AUTHORS

Robert Kofler

Heinz Himmelbauer


CONTACT

robert.kofler at crg.es