NAME

create_easy_screenable.pl - Creates two files which allow a fast manual screening of the hit positions and their clustering


SYNOPSIS

 # Minimal argument call specifying all required parameters.
 create_easy_screenable.pl --input Mapping_day0_1_i_Eland_against_mature_unambiguous.txt
                           --output screenable_day0
 # Extended call setting the optional parameters; several input files may be specified.
 # Note that the file containing the ambiguous hits may also be specified.
 create_easy_screenable.pl --output screenable_day0 --min_length 15 --max_length 32 --max_mm 2 --strand RF
                           --max_ambiguity 2 --tempdir "/tmp" 
                           --input Mapping_day0_1_i_Eland_against_mature_unambiguous.txt
                           --input Mapping_day0_1_i_Eland_against_mature_ambiguous.txt
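
Repeating --input as in the second call is the usual way to pass a list to a Perl script. The following is a minimal sketch of how such options could be collected with the core module Getopt::Long. Whether create_easy_screenable.pl itself uses Getopt::Long is an assumption, and the variable names and the simulated command line are illustrative only.

```perl
use strict;
use warnings;
use Getopt::Long;

# Simulated command line (illustrative values, not a real data set).
local @ARGV = ('--input', 'file1', '--input', 'file2',
               '--output', 'screenable_day0', '--max_mm', 1);

my @input;                                  # every --input is appended here
my %opt = (strand => 'RF', max_mm => 2);    # defaults as documented in OPTIONS

GetOptions(
    'input=s'  => \@input,          # an array reference collects repeated options
    'output=s' => \$opt{output},
    'max_mm=i' => \$opt{max_mm},
    'strand=s' => \$opt{strand},
) or die "Invalid options\n";

print "inputs: @input\n";
print "max_mm: $opt{max_mm}, strand: $opt{strand}\n";
```

Options that are not given on the command line keep their defaults, which is why --strand stays at RF in this sketch.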


OPTIONS

--input

The input files. Several files may be specified, e.g.: --input file1 --input file2. The input files have to be output files of the script run_Mapping.pl or run_Multimapper.pl. Note that both unambiguously and ambiguously mapped reads may be provided to this script. Mandatory parameter.

--output

The output prefix. This script creates two output files: one with the extension .pos and one with the extension .seq. Mandatory parameter.

--strand

Only reads mapping to the specified strand will be used. Possible values: R (reverse strand), F (forward strand), RF (both strands); default=RF

--min_length

The minimum length of reads. Shorter reads will be ignored. default=15

--max_length

The maximum length of reads. Longer reads will be ignored. default=100

--max_mm

The maximum number of mismatches. Reads having more mismatches will not be used. default=2

--min_count

The minimum number of reads mapping to a position. default=1

--max_ambiguity

The maximum ambiguity of the hits. Hits having a higher ambiguity will be ignored. The ambiguity is an integer stating how many times a read could be mapped to the reference sequence with an equally good score, that is, with the lowest number of mismatches found for that read. Examples:

A read which could be mapped to the H. sapiens genome only once, having two mismatches, will have an ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times, always having one mismatch, will have an ambiguity of "3".

A read which could be mapped to the H. sapiens genome three times having one mismatch and one time having zero mismatches will have an ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times having one mismatch and two times having zero mismatches will have an ambiguity of "2".

default=5

--help

Display the help pages.
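
The read-level options above can be pictured as a chain of simple tests applied to each hit. The following is a minimal sketch, not the script's actual implementation. The helper passes_filters and the hash keys (sequence, mismatches, strand, ambiguity) are hypothetical; the default values are the ones documented above. The option --min_count is left out because it applies to positions rather than to single reads.

```perl
use strict;
use warnings;

# Defaults as documented in the OPTIONS section.
my %opt = (
    min_length    => 15,
    max_length    => 100,
    max_mm        => 2,
    max_ambiguity => 5,
    strand        => 'RF',
);

# Hypothetical helper: returns 1 if a hit passes all read-level filters, else 0.
sub passes_filters {
    my ($hit, $opt) = @_;
    my $len = length $hit->{sequence};
    return 0 if $len < $opt->{min_length};                  # too short
    return 0 if $len > $opt->{max_length};                  # too long
    return 0 if $hit->{mismatches} > $opt->{max_mm};        # too many mismatches
    return 0 if $hit->{ambiguity}  > $opt->{max_ambiguity}; # too ambiguous
    # 'RF' accepts both strands; 'R' or 'F' accepts only that strand.
    return 0 if index($opt->{strand}, $hit->{strand}) < 0;
    return 1;
}

my %hit = (
    sequence   => 'TACCCTGTAGATCCGAATTTGT',
    mismatches => 0,
    strand     => 'F',
    ambiguity  => 1,
);
print passes_filters(\%hit, \%opt) ? "kept\n" : "ignored\n";
```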


DESCRIPTION

General

This script creates the .pos and .seq files, which are designed for easy manual screening. In the .pos file, reads having the same start position are aggregated and sorted. This allows an easy and fast manual identification of interesting features. The .seq file contains more detailed information, such as the actual sequences of the reads.

We therefore recommend screening the .pos file first. When an interesting feature has been found, it may be investigated further using the .seq file.

Input

Mapping results of the script run_Mapping.pl or run_Multimapper.pl. Note that unambiguous and ambiguous mapping results may be provided.

For example:


 24688||Count=3         TACCCTGTAGATCCGAATTTGT          hsa-miR-10a MIMAT0000253 Homo sapiens miR-10a   1       0       F       1
 128318||Count=2        TACCCTGTAGATCCGAATTTGTG         hsa-miR-10a MIMAT0000253 Homo sapiens miR-10a   1       0       F       1
 150952||Count=1        TACCCTGTAGATCCTAATTTGTGT        hsa-miR-10a MIMAT0000253 Homo sapiens miR-10a   1       2       R       1
 212857||Count=1        TACCCTGTAGATCCAAATTTGT          hsa-miR-10a MIMAT0000253 Homo sapiens miR-10a   1       1       F       1
 317801||Count=1        TACCTTGTAGATCCGAATTTGTG         hsa-miR-10a MIMAT0000253 Homo sapiens miR-10a   1       1       F       1
 389805||Count=1        TACCCTGTATATCCGAATTTGTGG        hsa-miR-10a MIMAT0000253 Homo sapiens miR-10a   1       2       F       1
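
A line of this kind can be split into its fields as sketched below. This is only an illustration: it assumes that the fields are tab-separated (the reference ID itself contains spaces) and that the "Count=N" suffix of the read ID is the number of identical reads. The helper parse_hit_line is hypothetical, and the trailing columns are returned without interpretation because their meaning is not documented here.

```perl
use strict;
use warnings;

# Hypothetical parser for one mapping-result line.
# ASSUMPTION: fields are tab-separated; only the first three are interpreted.
sub parse_hit_line {
    my ($line) = @_;
    chomp $line;
    my ($read_id, $sequence, $reference, @rest) = split /\t/, $line;
    # The read ID ends in "||Count=N", e.g. "24688||Count=3".
    my ($count) = $read_id =~ /\|\|Count=(\d+)$/;
    return {
        read_id   => $read_id,
        count     => $count,
        sequence  => $sequence,
        reference => $reference,
        rest      => \@rest,      # remaining columns, left uninterpreted
    };
}

my $hit = parse_hit_line(join "\t",
    '24688||Count=3', 'TACCCTGTAGATCCGAATTTGT',
    'hsa-miR-10a MIMAT0000253 Homo sapiens miR-10a', 1, 0, 'F', 1);
print "$hit->{count} x $hit->{sequence} on $hit->{reference}\n";
```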

Ambiguity

Ambiguity is an important concept in the MIRO-pipeline, so it is crucial that it is properly understood. In a nutshell, ambiguity is the number of equally good mapping positions for a single Solexa read. "Equally good" in this context refers to the number of mismatches: only the positions sharing the lowest number of mismatches are counted. In the MIRO-pipeline all unambiguously mapped reads have an ambiguity of "1" and are provided in a separate output file. All ambiguously mapped reads, on the other hand, have an ambiguity of ">=2".

Examples:

A read which could be mapped to the H. sapiens genome only once, having two mismatches, will have an ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times, always having only one mismatch, will have an ambiguity of "3".

A read which could be mapped to the H. sapiens genome three times having one mismatch and one time having zero mismatches will have an ambiguity of "1".

A read which could be mapped to the H. sapiens genome three times having one mismatch and two times having zero mismatches will have an ambiguity of "2".
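
The four examples can be reproduced with a few lines of Perl. This is a minimal sketch of the definition, not code taken from the MIRO-pipeline. The helper ambiguity is hypothetical; it receives the mismatch count of every mapping position found for one read and counts the positions that share the lowest value.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper: number of mapping positions sharing the lowest
# mismatch count among all positions found for one read.
sub ambiguity {
    my @mismatches = @_;
    return 0 unless @mismatches;        # read could not be mapped at all
    my $best = min @mismatches;
    return scalar grep { $_ == $best } @mismatches;
}

print ambiguity(2), "\n";               # one hit, two mismatches      -> 1
print ambiguity(1, 1, 1), "\n";         # three hits, one mismatch     -> 3
print ambiguity(1, 1, 1, 0), "\n";      # one perfect hit wins         -> 1
print ambiguity(1, 1, 1, 0, 0), "\n";   # two perfect hits             -> 2
```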

Output

The script will generate two easily screenable output files: a position (.pos) file and a sequence (.seq) file.

position file (.pos)

The position files allow a fast and convenient manual screening of the hit positions. Reads having the same start position are aggregated and sorted according to the start position. Different reference sequences are separated by three successive empty rows, whereas the forward and the reverse strand of the same reference sequence are separated by only a single empty row. This makes it possible to quickly estimate the position and shape of read clusters and, additionally, the amount of antisense transcription. If more details are required, use the .seq files.

The following is an example of reads mapping to H. sapiens hairpins:


 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      51      21
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       42      1       1
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       46      2       2
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      154     21
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       48      3       2

 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop R       3       1       1
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop R       13      1       1
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop R       40      2       2
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop R       41      1       1
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop R       62      1       1
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop R       64      1       1

 hsa-mir-555 MI0003561 Homo sapiens miR-555 stem-loop   F       49      1       1
 hsa-mir-555 MI0003561 Homo sapiens miR-555 stem-loop   F       55      1       1

 hsa-mir-555 MI0003561 Homo sapiens miR-555 stem-loop   R       5       1       1
 hsa-mir-555 MI0003561 Homo sapiens miR-555 stem-loop   R       82      1       1

column 1

The ID of the reference sequence to which the read could be mapped (unambiguously or ambiguously).

column 2

The strand, either F (forward) or R (reverse).

column 3

The start position within the reference sequence. In the .seq file, each unique sequence having this start position has its own row.

column 4

The number of reads having this start position. When screening the .pos file this column deserves the most attention.

column 5

The number of unique sequences having this start position.

sequence file (.seq)

The .seq files are very similar to the .pos files. The only difference is that each unique sequence occupies its own row and the actual sequence of the read is displayed.

The following is an example of a .seq file for the same reference sequences as above:

 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATATATATATGTACGTATTA
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATATATATATGTACGTATT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATATATATATGTACGT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATATATATATGTACTTAT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATAGATATATGTACGT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATATATATATGTACTTATGA
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATATATAGATGTACGTATG
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      5       TATATATATATATGTACGTAT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TCTATATATATATGTACGTATGT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      14      TATATATATATATGT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATATATATATGTACTAAT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATATATATATATACGTATT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATATATATATGTAC
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATATATATATGTACTTATG
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      3       TATATATATATATGTACGTATG
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATATATATATGTACGGGTG
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATAGATATATATGTACGTATGT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      2       TATATATATATATGTACGTATGT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATATATATATGTACGTATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      11      TATATATATATATGTACGTATGA
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       11      1       TATATATACATATTTAC
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       42      1       ATGTTTAGGTAGATAT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       46      1       ATACGTAGACATGTA
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       46      1       ATACGTAGATATATATGTATTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      2       TACTTAGATATATATTTATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      2       TACGGAGATATATATGTATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       TACGGGGATATATATGTATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       TACGCAGATATATATTTATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      123     TACGTAGATATATATGTATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      4       TACGTAGATATATATTTATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       TACGTAGCTATATATTTATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       TACGTAGATATATATGTATTTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       TACGTAGATATGTATGGATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      3       TACGTATATATATATGTATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       AACGGAGATATATATGTATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       TACGTAGATATATATTCATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       TACGCAGATATATATGGATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       TACGATGATATATATGTATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       GACGTAGATATATATGGATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       TACGTAGATATATATGTATTAT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       TACGTAGATATATATGCATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       TACGTAAATATATATGTATTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       TAAGAAGATATATATGTATTTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      1       TACGTATATATATATGTATTTTA
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       47      5       TACGTAGATATATATGTATTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       48      1       AAGTAGATATGTATG
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop F       48      2       ACGTAGATATATATGTATTTTA    
 
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop R       3       1       TATATATATGTGGGAC
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop R       13      1       TACGTACACATATATTTA
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop R       40      1       ATATATATGCACGTATACATTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop R       40      1       TATCTACGCATATATTT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop R       41      1       ATCTACGCATATATT
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop R       62      1       ACCCACCGAAAACAC
 hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop R       64      1       AAACCCTCCAAAAAA
 
 
 
 
 hsa-mir-555 MI0003561 Homo sapiens miR-555 stem-loop   F       49      1       AGCTCTGTGGACAGG
 hsa-mir-555 MI0003561 Homo sapiens miR-555 stem-loop   F       55      1       GTGGAAAGGGTAGGCT
 
 hsa-mir-555 MI0003561 Homo sapiens miR-555 stem-loop   R       5       1       ACCCATCTGAGTTCA
 hsa-mir-555 MI0003561 Homo sapiens miR-555 stem-loop   R       82      1       ATAGATCAGAGTTCG

column 1

The ID of the reference sequence to which the read could be mapped (unambiguously or ambiguously).

column 2

The strand, either F (forward) or R (reverse).

column 3

The start position within the reference sequence. Each unique sequence has its own row.

column 4

The number of reads having this start position (column 3) and this specific unique sequence (column 5).

column 5

The sequence of the Solexa reads. The count for this specific sequence is given in column 4.
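
The two files are related in a simple way: summing column 4 of all .seq rows that share reference, strand and start position gives column 4 of the .pos file, and counting those rows gives column 5. The following minimal sketch illustrates this with the rows for start positions 46 and 48 from the example above. The helper aggregate_positions and the in-memory row layout are hypothetical, and this is not the script's actual implementation.

```perl
use strict;
use warnings;

my $ref = 'hsa-mir-1277 MI0006419 Homo sapiens miR-1277 stem-loop';

# Rows as in a .seq file: [reference, strand, start, count, sequence].
my @seq_rows = (
    [$ref, 'F', 46, 1, 'ATACGTAGACATGTA'],
    [$ref, 'F', 46, 1, 'ATACGTAGATATATATGTATTT'],
    [$ref, 'F', 48, 1, 'AAGTAGATATGTATG'],
    [$ref, 'F', 48, 2, 'ACGTAGATATATATGTATTTTA'],
);

# Hypothetical helper: collapses .seq rows into tab-separated .pos rows.
# Assumes that every input row holds a distinct sequence.
sub aggregate_positions {
    my @rows = @_;
    my %pos;
    for my $row (@rows) {
        my ($id, $strand, $start, $count) = @$row;
        my $entry = $pos{$id}{$strand}{$start} ||= { reads => 0, unique => 0 };
        $entry->{reads}  += $count;   # total reads starting here
        $entry->{unique} += 1;        # one more unique sequence
    }
    my @out;
    for my $id (sort keys %pos) {
        for my $strand (sort keys %{ $pos{$id} }) {   # 'F' sorts before 'R'
            for my $start (sort { $a <=> $b } keys %{ $pos{$id}{$strand} }) {
                my $e = $pos{$id}{$strand}{$start};
                push @out, join "\t", $id, $strand, $start, $e->{reads}, $e->{unique};
            }
        }
    }
    return @out;
}

print "$_\n" for aggregate_positions(@seq_rows);
```

For start position 46 this yields 2 reads in 2 unique sequences, and for start position 48 it yields 3 reads in 2 unique sequences, which agrees with the .pos example.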


REQUIREMENTS

Perl 5.8 or higher


AUTHORS

Robert Kofler

Heinz Himmelbauer


CONTACT

robert.kofler at crg.es