Hit2Mirdeep.pl - Converts mapping results into a miRdeep blastparsed and into fasta file
# Minimal argument call, specifying all required parameters. Hit2Mirdeep.pl --input Mapping_day0_1_i_Eland_against_hsa-wg_unambiguous.txt --reference hsa_whole_genome.fa --output_blast mirdeep_blastparsed.txt --output_fasta mirdeep_fasta.txt
The input files; Several files may be specified, e.g.: --input file1 --input file2
.
The input files have to be output files of the script run_Mapping
or run_Multimapper
.
Note that unambiguously and ambiguously mapped reads may be provided for this script. Mandatory parameter.
Path to the reference sequence which was used for mapping. Necessary to retrieve information about the length of the chromosomes. Mandatory parameter
The output file for the mirdeep blastparsed file. Mandatory parameter.
The output file for the fasta formated sequences. Mandatory parameter
Display the help pages
Converts a hit-file into a MirDeep blastparsed file and addtionally writes the sequence of the reads into a fasta formated file.
This allows to use mirDeep for de-novo miRNA discovery.
The mirDeep pipeline may be entered at the step FILTERING UBIQUITOUS ALIGNMENTS
using the miRdeep script
filter_alignments.pl
.
Mapping results of the script run_Mapping.pl
or run_Multimapper.pl
.
Note that unambiguous and ambiguous mapping results may be provided.
For example:
5031||Count=1 TCCCCGCCGGCGGAA chr4 1 0 F 170168086 5217||Count=1 GACCGTCCAACGCAC chr20 1 0 R 59264663 5560||Count=1 ATCGGGTGGTAGCAA chr3 1 0 F 16192245 6184||Count=12 TCCGGGCTACTGCTG chr1 1 0 F 29388851 6209||Count=1 GCAGCCATCGTTTTT chr10 1 0 F 61351707
The blastparsed format of miRdeep. For example:
id_2437445_x1 16 1..16 chr3 23470805 5193722..5193737 0.04 1.00 32.2 Plus / Plus id_2437560_x1 16 1..16 chr4 18585042 3923870..3923885 0.2 0.88 28.2 Plus / Minus id_2437621_x2 16 1..16 chr3 23470805 23050758..23050773 0.16 0.94 30.2 Plus / Minus id_2437630_x1 16 1..16 chr1 30432563 9024663..9024678 0.04 1.00 32.2 Plus / Minus id_2437645_x1 16 1..16 chr4 18585042 18021856..18021871 0.16 0.94 30.2 Plus / Minus id_2437682_x6 16 1..16 chr1 30432563 24623989..24624004 0.2 0.88 28.2 Plus / Minus id_2437922_x1 16 1..16 chr1 30432563 2847658..2847673 0.2 0.88 28.2 Plus / Plus id_2437937_x1 16 1..16 chr2 19705359 19587044..19587059 0.16 0.94 30.2 Plus / Plus
For example:
>id_1432505_x5 GGGATTGTAGTTCAATTGGTCAGAGCACCGCC >id_1464713_x1 CGGTGGGGGCTTCTGGGGCTACGGTCGCGTCT >id_1472249_x1 AAGAATGAAGTAGTACGATCGGCGGGCGTTTC >id_1495451_x1 AGGGAATGTGACACGTTTAGGCATTCCTCTGC >id_1528343_x1 CTCGATGAGTAGGAGGGCGCGGCGGTCGCTTT >id_1544022_x1 CGGACTACCTGCACCTGGACAGAAAGACCCGC >id_1559551_x1 ATTTATCTTTTTTTTTTTTTTTAAAAAAAAAA
This fasta file is required for the miRdeep script auto_blast.pl
Short read mappers like Eland, Soap, Seqmap do not provide blast e-values. This is a main problem when converting the Hit format into the mirDeep blastparsed format. We therefore use the following table to obtain approximations for the e-value and the bit score:
eff_length e-value bit-score <15 0.2 28.2 15 0.16 30.2 16 0.040 32.2 17 0.015 34.2 18 0.005 36.2 19 0.002 38.2 20 4e-04 40.1 21 1e-04 42.1 22 4e-05 44.1 23 1e-05 46.1 24 3e-06 48.1 25 9e-07 50.0 26 2e-07 52.0 27 7e-08 54.0 28 2e-08 56.0 29 5e-09 58.0 30 1e-09 60.0 >30 1e-10 64
The eff_length
is calculated: eff_length = read_length - mismatches
.
The values for this table are derived from the mirDeep example files when mapping the reads to the C. elegans whole genome.
They will therefore not reflect the true e-values but they may act as approximations.
The miRdeep pipeline requires several distinct steps:
PREPROCESSING THE READS ALIGNING THE READS TO THE GENOME FILTERING UBIQUITOUS ALIGNMENTS FILTERING BY ANNOTATION PREDICTING MICRORNAS MAKING PERMUTED CONTROLS
You can enter the miRdeep pipeline at the step FILTERING UBIQUITOUS ALIGNMENTS
using the script: filter_alignments.pl
.
We tested this script for compatibility with miRdeep using one of our own arabidopsis datasets and predicted, amongst others, the example shown below.
A subsequent mirbase search revealed that this is Arabidopsis thaliana miR825
score_star 3.9 score_mfe 1.5 score_freq -3.3 score 2.2 flank_first_end 22 flank_first_seq CTAGTTGATTCCATCGACTCGT flank_first_struct ..(((((((...))))))).(( flank_second_beg 105 flank_second_seq AAGTTG flank_second_struct ...... freq 5 loop_beg 45 loop_end 83 loop_seq TTAGCTAATTTATCTTAGAAAATAATGAAAAAGCTATGC loop_struct .(((((..(((((..(.....)..)))))..))))).)) mature_arm second mature_beg 84 mature_end 104 mature_query id_493716_x2 mature_seq TTCTCAAGAAGGTGCATGAAC mature_strand + mature_struct ))))).))..))))).))))) pre_seq TCAAGCACCAGCTCGAAGAAGCTTAGCTAATTTATCTTAGAAAATAATGAAAAAGCTATGCTTCTCAAGAAGGTGCATGAAC pre_struct (((.(((((..((.((.(((((.(((((..(((((..(.....)..)))))..))))).))))))).))..))))).))))) pri_beg 1 pri_end 110 pri_id chr2_3461 pri_mfe -36.60 pri_seq CTAGTTGATTCCATCGACTCGTTCAAGCACCAGCTCGAAGAAGCTTAGCTAATTTATCTTAGAAAATAATGAAAAAGCTATGCTTCTCAAGAAGGTGCATGAACAAGTTG pri_struct ..(((((((...))))))).(((((.(((((..((.((.(((((.(((((..(((((..(.....)..)))))..))))).))))))).))..))))).)))))...... star_arm first star_beg 23 star_end 44 star_read 1 star_seq TCAAGCACCAGCTCGAAGAAGC star_struct (((.(((((..((.((.((((( id_2122663_x2 22 1..22 chr2_3461 110 23..44 2e-06 1.00 44.1 Plus / Plus id_493716_x2 21 1..21 chr2_3461 110 84..104 7e-06 1.00 42.1 Plus / Plus id_201319_x1 21 1..21 chr2_3461 110 85..105 7e-06 1.00 42.1 Plus / Plus
Perl 5.8 or higher
miRdeep
Robert Kofler
Heinz Himmelbauer
robert.kofler at crg.es