CalculateComplexityProfile.pl - Calculates the complexity of each nucleotide sequence in one or more multiple fasta files
    # Minimal argument call specifying all required parameters.
    CalculateComplexityProfile.pl file1

    # More advanced call. Several input files may be specified.
    CalculateComplexityProfile.pl file1 file2 file3
The only required parameter is a list of file names.
Display the help pages
The script calculates the complexity for each nucleotide sequence in a list of multiple fasta files. The complexity of a nucleotide sequence is calculated according to the equation
    c = 1 - fA^2 - fT^2 - fC^2 - fG^2

where fA, fT, fC and fG are the frequencies of the corresponding nucleotides in the read.

Examples:

    AAAAAAAAAAAAAAAA c = 0.00 # this is the lowest possible complexity
    GAAAAAAAAAAAAAAG c = 0.21
    AGAGAGAGAGAGAGAG c = 0.50
    ACGTACGTACGTACGT c = 0.75 # this is the highest possible complexity
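The equation can be illustrated with a minimal Python sketch. This is not the script's Perl code; the function name `complexity` and the handling of characters other than A, T, C and G (they count towards the read length but not towards the sum) are assumptions made for illustration.

```python
from collections import Counter


def complexity(seq):
    """Return c = 1 - fA^2 - fT^2 - fC^2 - fG^2 for a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    n = len(seq)
    # Assumption: frequencies are relative to the full read length, and
    # characters other than A, T, C, G do not contribute to the sum.
    return 1.0 - sum((counts[base] / n) ** 2 for base in "ATCG")


for read in ("AAAAAAAAAAAAAAAA", "GAAAAAAAAAAAAAAG",
             "AGAGAGAGAGAGAGAG", "ACGTACGTACGTACGT"):
    print(read, complexity(read))
```

For GAAAAAAAAAAAAAAG the exact value is 0.21875. The value 0.21 listed in the examples would be consistent with truncating to two decimals rather than rounding, but that is an inference from the example, not something the documentation states.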
This script may be useful to estimate the effect of the parameter --min_complexity in the MIRO main script run_Preprocessing.pl.
Multiple fasta files. For example:
    >cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop
    TACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAGTTTGGAATATTACC
    ACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTTCGA
    >cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop
    ATGCTTCCGGCCTGTTCCCTGAGACCTCAAGTGTGAGTGTACTATTGATG
    CTTCACACCTGGGCTCTCCGGGTACCAGGACGGTTTGAGCAGAT
    >cel-mir-1 MI0000003 Caenorhabditis elegans miR-1 stem-loop
    AAAGTGACCGTACCGAGCTGCATACTTCCTTACATGCCCATACTATATCA
    TAAATGGATATGGAATGTAAAGAAGTATGTAGAACGGGGTGGTAGT
    >cel-mir-2 MI0000004 Caenorhabditis elegans miR-2 stem-loop
    TAAACAGTATACAGAAAGCCATCAAAGCGGTGGTTGATGTGTTGCAAATT
    ATGACTTTCATATCACAGCCAGCTTTGATGTGCTGCCTGTTGCACTGT
    >cel-mir-34 MI0000005 Caenorhabditis elegans miR-34 stem-loop
    CGGACAATGCTCGAGAGGCAGTGTGGTTAGCTGGTTGCATATTTCCTTGA
    CAACGGCTACCTTCACTGCCACCCCGAACATGTCGTCCATCTTTGAA
A complexity profile. The following example shows the complexity profile calculated for miRBase.
    Complexity profile for file: /home/usr/mirbase/mature.fa
    0.75 482
    0.74 1563
    0.73 1446
    0.72 865
    0.71 1101
    0.70 492
    0.69 624
    0.68 474
    0.67 286
    0.66 170
    0.65 216
    0.64 204
    0.63 69
    0.62 63
    0.61 67
    0.60 39
    0.59 23
    0.58 17
    0.57 13
    0.56 8
    0.55 15
    0.54 5
    0.53 5
    0.52 3
    0.51 4
    0.50 2
    0.49 5
    0.48 5
    0.46 2
    0.45 2
    0.44 2
    0.36 1
The first column is the complexity in steps of 0.01, calculated according to the equation shown in General.
The second column is the number of sequences having the given complexity.
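How such a two-column profile could be produced from fasta input is sketched below in Python. This is an illustration under stated assumptions, not the script's Perl implementation: the helper names (`read_fasta`, `complexity`, `complexity_profile`, `format_profile`) are hypothetical, complexities are binned by truncating to two decimals, and characters other than A, T, C and G are ignored in the frequency sum.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def complexity(seq):
    """c = 1 - fA^2 - fT^2 - fC^2 - fG^2 for a non-empty sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    return 1.0 - sum((counts[b] / len(seq)) ** 2 for b in "ATCG")


def complexity_profile(lines):
    """Count sequences per complexity bin, keyed by integer hundredths."""
    profile = Counter()
    for _header, seq in read_fasta(lines):
        if not seq:
            continue  # skip records without sequence data
        # Assumption: bins are formed by truncation; the small epsilon
        # guards against floating-point values just below a bin edge.
        profile[int(complexity(seq) * 100 + 1e-9)] += 1
    return profile


def format_profile(profile):
    """Return 'complexity count' lines, highest complexity first."""
    return [f"{k / 100:.2f} {profile[k]}"
            for k in sorted(profile, reverse=True)]


example = [">a", "AAAA", "AAAA", ">b", "ACGT", ">c", "ACGTACGT"]
print("\n".join(format_profile(complexity_profile(example))))
```

Keying the bins by integer hundredths avoids using floating-point numbers as dictionary keys; the formatted value is only derived when the profile is printed.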
Perl 5.8 or higher
Robert Kofler
Heinz Himmelbauer
robert.kofler at crg.es