NAME

Hit2Mirdeep.pl - Converts mapping results into a miRdeep blastparsed and into fasta file


SYNOPSIS

 # Minimal argument call, specifying all required parameters.
 Hit2Mirdeep.pl --input Mapping_day0_1_i_Eland_against_hsa-wg_unambiguous.txt
                --reference hsa_whole_genome.fa
                --output_blast mirdeep_blastparsed.txt
                --output_fasta mirdeep_fasta.txt


OPTIONS

--input

The input files; Several files may be specified, e.g.: --input file1 --input file2. The input files have to be output files of the script run_Mapping or run_Multimapper. Note that unambiguously and ambiguously mapped reads may be provided for this script. Mandatory parameter.

--reference

Path to the reference sequence which was used for mapping. Necessary to retrieve information about the length of the chromosomes. Mandatory parameter

--output_blast

The output file for the mirdeep blastparsed file. Mandatory parameter.

--output_fasta

The output file for the fasta formated sequences. Mandatory parameter

--help

Display the help pages


DESCRIPTION

General

Converts a hit-file into a MirDeep blastparsed file and addtionally writes the sequence of the reads into a fasta formated file. This allows to use mirDeep for de-novo miRNA discovery. The mirDeep pipeline may be entered at the step FILTERING UBIQUITOUS ALIGNMENTS using the miRdeep script filter_alignments.pl.

Input

Mapping results of the script run_Mapping.pl or run_Multimapper.pl. Note that unambiguous and ambiguous mapping results may be provided.

For example:


 5031||Count=1   TCCCCGCCGGCGGAA chr4    1       0       F       170168086
 5217||Count=1   GACCGTCCAACGCAC chr20   1       0       R       59264663
 5560||Count=1   ATCGGGTGGTAGCAA chr3    1       0       F       16192245
 6184||Count=12  TCCGGGCTACTGCTG chr1    1       0       F       29388851
 6209||Count=1   GCAGCCATCGTTTTT chr10   1       0       F       61351707

Output

Blastparsed

The blastparsed format of miRdeep. For example:

 id_2437445_x1  16      1..16   chr3    23470805        5193722..5193737        0.04    1.00    32.2    Plus / Plus
 id_2437560_x1  16      1..16   chr4    18585042        3923870..3923885        0.2     0.88    28.2    Plus / Minus
 id_2437621_x2  16      1..16   chr3    23470805        23050758..23050773      0.16    0.94    30.2    Plus / Minus
 id_2437630_x1  16      1..16   chr1    30432563        9024663..9024678        0.04    1.00    32.2    Plus / Minus
 id_2437645_x1  16      1..16   chr4    18585042        18021856..18021871      0.16    0.94    30.2    Plus / Minus
 id_2437682_x6  16      1..16   chr1    30432563        24623989..24624004      0.2     0.88    28.2    Plus / Minus
 id_2437922_x1  16      1..16   chr1    30432563        2847658..2847673        0.2     0.88    28.2    Plus / Plus
 id_2437937_x1  16      1..16   chr2    19705359        19587044..19587059      0.16    0.94    30.2    Plus / Plus

Fasta formated

For example:

 >id_1432505_x5
 GGGATTGTAGTTCAATTGGTCAGAGCACCGCC
 >id_1464713_x1
 CGGTGGGGGCTTCTGGGGCTACGGTCGCGTCT
 >id_1472249_x1
 AAGAATGAAGTAGTACGATCGGCGGGCGTTTC 
 >id_1495451_x1
 AGGGAATGTGACACGTTTAGGCATTCCTCTGC
 >id_1528343_x1
 CTCGATGAGTAGGAGGGCGCGGCGGTCGCTTT
 >id_1544022_x1
 CGGACTACCTGCACCTGGACAGAAAGACCCGC
 >id_1559551_x1
 ATTTATCTTTTTTTTTTTTTTTAAAAAAAAAA

This fasta file is required for the miRdeep script auto_blast.pl

E-value estimation

Short read mappers like Eland, Soap, Seqmap do not provide blast e-values. This is a main problem when converting the Hit format into the mirDeep blastparsed format. We therefore use the following table to obtain approximations for the e-value and the bit score:

 eff_length     e-value bit-score
 <15            0.2     28.2
 15             0.16    30.2
 16             0.040   32.2
 17             0.015   34.2
 18             0.005   36.2
 19             0.002   38.2
 20             4e-04   40.1
 21             1e-04   42.1
 22             4e-05   44.1
 23             1e-05   46.1
 24             3e-06   48.1
 25             9e-07   50.0
 26             2e-07   52.0
 27             7e-08   54.0
 28             2e-08   56.0
 29             5e-09   58.0
 30             1e-09   60.0
 >30            1e-10   64

The eff_length is calculated: eff_length = read_length - mismatches. The values for this table are derived from the mirDeep example files when mapping the reads to the C. elegans whole genome. They will therefore not reflect the true e-values but they may act as approximations.

Next steps using Mirdeep

The miRdeep pipeline requires several distinct steps:

 PREPROCESSING THE READS
 ALIGNING THE READS TO THE GENOME
 FILTERING UBIQUITOUS ALIGNMENTS
 FILTERING BY ANNOTATION
 PREDICTING MICRORNAS
 MAKING PERMUTED CONTROLS

You can enter the miRdeep pipeline at the step FILTERING UBIQUITOUS ALIGNMENTS using the script: filter_alignments.pl.

Testing

We tested this script for compatibility with miRdeep using one of our own arabidopsis datasets and predicted, amongst others, the example shown below. A subsequent mirbase search revealed that this is Arabidopsis thaliana miR825

 score_star     3.9
 score_mfe      1.5
 score_freq     -3.3
 score  2.2
 flank_first_end        22
 flank_first_seq        CTAGTTGATTCCATCGACTCGT
 flank_first_struct     ..(((((((...))))))).((
 flank_second_beg       105
 flank_second_seq       AAGTTG
 flank_second_struct    ......
 freq   5
 loop_beg       45
 loop_end       83
 loop_seq       TTAGCTAATTTATCTTAGAAAATAATGAAAAAGCTATGC
 loop_struct    .(((((..(((((..(.....)..)))))..))))).))
 mature_arm     second
 mature_beg     84
 mature_end     104
 mature_query   id_493716_x2
 mature_seq     TTCTCAAGAAGGTGCATGAAC
 mature_strand          +
 mature_struct          ))))).))..))))).)))))
 pre_seq        TCAAGCACCAGCTCGAAGAAGCTTAGCTAATTTATCTTAGAAAATAATGAAAAAGCTATGCTTCTCAAGAAGGTGCATGAAC
 pre_struct     (((.(((((..((.((.(((((.(((((..(((((..(.....)..)))))..))))).))))))).))..))))).)))))
 pri_beg        1
 pri_end        110
 pri_id         chr2_3461
 pri_mfe        -36.60
 pri_seq        CTAGTTGATTCCATCGACTCGTTCAAGCACCAGCTCGAAGAAGCTTAGCTAATTTATCTTAGAAAATAATGAAAAAGCTATGCTTCTCAAGAAGGTGCATGAACAAGTTG
 pri_struct     ..(((((((...))))))).(((((.(((((..((.((.(((((.(((((..(((((..(.....)..)))))..))))).))))))).))..))))).)))))......
 star_arm       first
 star_beg       23
 star_end       44
 star_read      1
 star_seq       TCAAGCACCAGCTCGAAGAAGC
 star_struct    (((.(((((..((.((.(((((
 id_2122663_x2  22      1..22   chr2_3461       110     23..44  2e-06   1.00    44.1    Plus / Plus
 id_493716_x2   21      1..21   chr2_3461       110     84..104 7e-06   1.00    42.1    Plus / Plus
 id_201319_x1   21      1..21   chr2_3461       110     85..105 7e-06   1.00    42.1    Plus / Plus


REQUIREMENTS

Perl 5.8 or higher

miRdeep


AUTHORS

Robert Kofler

Heinz Himmelbauer


CONTACT

robert.kofler at crg.es