NAME

CalculateComplexityProfile.pl - Calculates the complexity of each nucleotide sequence in one or more multi-FASTA files

SYNOPSIS

 # Minimal argument call specifying all required parameters.
 CalculateComplexityProfile.pl file1
 
 # More advanced call. Several input files may be specified.
 CalculateComplexityProfile.pl file1 file2 file3

OPTIONS

[file1,file2,...]: The only required parameter is a list of one or more FASTA file names.
--help: Display the help pages.

DESCRIPTION

General

The script calculates the complexity of each nucleotide sequence in a list of multi-FASTA files. The complexity of a nucleotide sequence is calculated according to the equation:

 c=1-fA^2-fT^2-fC^2-fG^2
 where fA, fT, fC, and fG are the frequencies of the corresponding nucleotides in the read
 
 Examples:
 AAAAAAAAAAAAAAAA c = 0.00  # this is the lowest possible complexity
 GAAAAAAAAAAAAAAG c = 0.21
 AGAGAGAGAGAGAGAG c = 0.50
 ACGTACGTACGTACGT c = 0.75 # this is the highest possible complexity
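
The equation can be sketched in a few lines of Python. This is an illustration only, not the Perl code of CalculateComplexityProfile.pl. The function name complexity is hypothetical, and the sketch assumes that frequencies are taken relative to the full sequence length, that case is ignored, and that characters other than A, T, C and G add to the length but not to the sum.

```python
from collections import Counter


def complexity(seq: str) -> float:
    """Return c = 1 - fA^2 - fT^2 - fC^2 - fG^2 for one nucleotide sequence.

    Assumption: frequencies are counts divided by the full sequence length.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("cannot compute the complexity of an empty sequence")
    counts = Counter(seq)
    n = len(seq)
    # Counter returns 0 for bases that do not occur in the sequence.
    return 1.0 - sum((counts[base] / n) ** 2 for base in "ATCG")


if __name__ == "__main__":
    for read in ("AAAAAAAAAAAAAAAA", "AGAGAGAGAGAGAGAG", "ACGTACGTACGTACGT"):
        print(read, complexity(read))
```

A read with all four nucleotides at equal frequency gives 1 - 4 * 0.25^2 = 0.75, which is why 0.75 is the upper bound.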

This script may be useful for estimating the effect of the parameter --min_complexity in the MIRO main script run_Preprocessing.pl.

Input

One or more multi-FASTA files. For example:

 >cel-let-7 MI0000001 Caenorhabditis elegans let-7 stem-loop
 TACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAGTTTGGAATATTACC
 ACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTTCGA
 >cel-lin-4 MI0000002 Caenorhabditis elegans lin-4 stem-loop
 ATGCTTCCGGCCTGTTCCCTGAGACCTCAAGTGTGAGTGTACTATTGATG
 CTTCACACCTGGGCTCTCCGGGTACCAGGACGGTTTGAGCAGAT
 >cel-mir-1 MI0000003 Caenorhabditis elegans miR-1 stem-loop
 AAAGTGACCGTACCGAGCTGCATACTTCCTTACATGCCCATACTATATCA
 TAAATGGATATGGAATGTAAAGAAGTATGTAGAACGGGGTGGTAGT
 >cel-mir-2 MI0000004 Caenorhabditis elegans miR-2 stem-loop
 TAAACAGTATACAGAAAGCCATCAAAGCGGTGGTTGATGTGTTGCAAATT
 ATGACTTTCATATCACAGCCAGCTTTGATGTGCTGCCTGTTGCACTGT
 >cel-mir-34 MI0000005 Caenorhabditis elegans miR-34 stem-loop
 CGGACAATGCTCGAGAGGCAGTGTGGTTAGCTGGTTGCATATTTCCTTGA
 CAACGGCTACCTTCACTGCCACCCCGAACATGTCGTCCATCTTTGAA
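
In the example above each sequence is wrapped over several lines, so the lines of one record have to be joined before a complexity can be calculated. A minimal Python sketch of such a reader follows. The name read_fasta is hypothetical, and the behavior (skipping blank lines, ignoring text before the first header) is an assumption rather than a description of how the script itself parses its input.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a multi-FASTA file."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines of one record are concatenated.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Because the function accepts any iterable of lines, it can be fed an open file handle directly, for example `with open(path) as fh: records = list(read_fasta(fh))`.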

Output

A complexity profile. The following is an example of the complexity profile calculated for miRBase.

 Complexity profile for file: /home/usr/mirbase/mature.fa
 0.75   482
 0.74   1563
 0.73   1446
 0.72   865
 0.71   1101
 0.70   492
 0.69   624
 0.68   474
 0.67   286
 0.66   170
 0.65   216
 0.64   204
 0.63   69
 0.62   63
 0.61   67
 0.60   39
 0.59   23
 0.58   17
 0.57   13
 0.56   8
 0.55   15
 0.54   5
 0.53   5
 0.52   3
 0.51   4
 0.50   2
 0.49   5
 0.48   5
 0.46   2
 0.45   2
 0.44   2
 0.36   1

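The first column is the complexity in steps of 0.01, calculated according to the equation shown in General. The second column is the number of sequences having the given complexity. Complexity values that no sequence has (for example 0.47 in the listing above) are not listed.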
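
How such a profile can be built from a set of sequences is sketched below in Python. It is an illustration under stated assumptions, not the script's actual implementation: the names complexity and complexity_profile are hypothetical, complexities are binned by truncating to two decimals (the script may round instead), and empty sequences are skipped.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def complexity(seq: str) -> float:
    """Return 1 - fA^2 - fT^2 - fC^2 - fG^2 (frequencies over the full length)."""
    seq = seq.upper()
    counts = Counter(seq)
    n = len(seq)
    return 1.0 - sum((counts[base] / n) ** 2 for base in "ATCG")


def complexity_profile(seqs: Iterable[str]) -> List[Tuple[str, int]]:
    """Count sequences per 0.01 complexity bin, highest complexity first."""
    bins: Counter = Counter()
    for seq in seqs:
        if not seq:
            continue  # assumption: empty records are ignored
        # Assumption: truncate to two decimals; the small offset guards
        # against floating-point results such as 0.7399999999.
        bins[int(complexity(seq) * 100 + 1e-9)] += 1
    return [(f"{b / 100:.2f}", n) for b, n in sorted(bins.items(), reverse=True)]


if __name__ == "__main__":
    reads = ["ACGTACGTACGTACGT", "ACGT", "AGAGAGAGAGAGAGAG", "AAAAAAAAAAAAAAAA"]
    for value, count in complexity_profile(reads):
        print(f"{value}   {count}")
```

Only bins that were actually observed are stored, so values without sequences do not appear in the result.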

REQUIREMENTS

Perl 5.8 or higher

AUTHORS

Robert Kofler

Heinz Himmelbauer

CONTACT

robert.kofler at crg.es