NAME

run_DGE.pl - Create a differential gene (miRNA) expression profile


SYNOPSIS

 # Create a differential gene expression profile (DGE) for the mapping results
 # contained in the folder map_results. Default settings will be used
 run_DGE.pl /home/usr/map_results
 
 # Create a DGE for the folder map_results, specifying all parameters.
 run_DGE.pl --bayes_table --bayes_pairwise --idorder "day0 day6 day12 day24"
            --min_length 20 --max_length 25 --max_mm 1 --strand RF
            --normalisation "off scalelinear quantile housekeep5 housekeep20"
            /home/usr/map_results


OPTIONS

--idorder

Sometimes it is important that the samples are sorted in a particular way, for example to display the time-course of miRNA expression during Drosophila development. MIRO thus allows to specify the order of the samples. Each sample is identified by its ID. default=undef (random). Examples:


 --idorder "day0 day3 day6 day12 day21"
 --idorder "background protein1 protein2 protein3"
--normalisation

MIRO allows to specify a flexible number of normalisation methods which will be used to create the differential gene (miRNA) expression profiles. default="off scalelinear quantil". At the moment the following normalisation methods are supported:

off

samples are not normalised, the actual observed read counts will be displayed

scalelinear

The total expression levels will be linearly scaled to constant level. The individual read counts will be adjusted accordingly. This is the most straight-forward normalisation method

quantile

The quantile-normalisation method

housekeep..

Examples: housekeep5, housekeep10, housekeep20;

This normalisation methods is a derivate of the scalelinear method. Instead of using all genes (miRNAs) for calculating the normalisation factor, only the genes having a medium expression levels will be used. Therefore the genes having the highest and the lowest expression levels will be ignored (of course only for calculating the normalisation factor, not for normalisation itself). The housekeep normalisation has to be called with the exact percentage of genes to be skipped. E.g.: housekeep20 ignores the 20% highest and the 20% lowest expressed genes. The genes (miRNAs) are weighted by the log2 of the expression level.

--strand

Only reads mapping to the specified strand will be used for creating a DGE-profile. Possible values: R (reverse strand), F (forward strand), RF (both strands); default=RF

--min_length

The minimum length of reads. Shorter reads will not enter the DGE-profile. default=15

--max_length

The maximum length of reads. Shorter reads will not enter the DGE-profile. default=100

--max_mm

The maximum number of mismatches. Reads having more mismatches will not enter the DGE-profile. default=2

--bayes_table

Flag; Indicating whether a bayesian significance table should be created. Note that this table will only be created for normalisation "off". default=not set;

--bayes_pairwise

Flag; Indicating whether pairwise significant changes in gene (miRNA) expression should be calculated. Note that this table will only be created for normalisation "off". A minimum significance of 0.001 is used; default=not set;

--logfile

May be used to specify a file to which the log of run_DGE.pl will be written. default=no file

--logconsole

Should the log be written to the console; 0..no 1..yes; default=1

--help

Display the help pages.


DESCRIPTION

General

run_DGE.pl is the heart of the MIRO-pipeline. It is a powerful script which creates differential gene (miRNA) expression (DGE) profiles, using several different normalisation methods at the same time. Only the unambiguously mapped reads will be used for creating a DGE-profile. The results for each applied normalisation method will be written into different subfolders. The script furthermore calculate this DGE for each reference sequence to which the reads have been mapped. For example if the three different normalisations are used (e.g.: off, scalelinear, quantile) and the reads have been mapped to four reference sequences (e.g.: mature-miRNA, hairpin, reference-mRNA, whole-genome) than the script creates 12 subfolders where each contains a DGE-profile. A DGE-profile (a subfolder) contains a heatmap, several scatterplots, a lineplot, several tables which contain expression levels and tables containing significant differentially expressed genes (miRNAs); run_DGE.pl can be regarded as a wrapper for several scripts which can be used individually, for example to create only the heatmap. We therefore also refer to the documentation of these scripts:

DGE_BayesSign.pl

DGE_Heatmap.pl

DGE_Lineplots.pl

DGE_Scatterplots.pl

DGE_Table.pl

Input

Input for run_DGE.pl has to be a folder which contains the mapping results for at least two different samples (usually corresponding to Solexa-lanes). Example for the minimum content of a folder:

 Mapping_day0_1_i_Gem_against_mat-miRNA_ambiguous.txt    
 Mapping_day0_1_i_Gem_against_mat-miRNA_matches.fa
 Mapping_day0_1_i_Gem_against_mat-miRNA_nomatches.fa
 Mapping_day0_1_i_Gem_against_mat-miRNA_statistics.txt
 Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt
 Mapping_day2_1_i_Gem_against_mat-miRNA_ambiguous.txt
 Mapping_day2_1_i_Gem_against_mat-miRNA_matches.fa
 Mapping_day2_1_i_Gem_against_mat-miRNA_nomatches.fa
 Mapping_day2_1_i_Gem_against_mat-miRNA_statistics.txt   
 Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt

In this example the two samples "day0" and "day2" have been mapped to mature miRNA (mat-miRNA) using the mapper Gem. In this example the user could user could specify an alternative sorting with --idorder "day2 day0". For this example run_DGE.pl will create three subfolders in the directory (assuming the default normalisations).

A more advanced example:

 Mapping_day0_1_i_Gem_against_mat-miRNA_ambiguous.txt    
 Mapping_day0_1_i_Gem_against_mat-miRNA_matches.fa
 Mapping_day0_1_i_Gem_against_mat-miRNA_nomatches.fa
 Mapping_day0_1_i_Gem_against_mat-miRNA_statistics.txt
 Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt
 Mapping_day2_1_i_Gem_against_mat-miRNA_ambiguous.txt
 Mapping_day2_1_i_Gem_against_mat-miRNA_matches.fa
 Mapping_day2_1_i_Gem_against_mat-miRNA_nomatches.fa
 Mapping_day2_1_i_Gem_against_mat-miRNA_statistics.txt   
 Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt
 Mapping_day0_2_pnm_Gem_against_hairpin_ambiguous.txt    
 Mapping_day0_2_pnm_Gem_against_hairpin_matches.fa
 Mapping_day0_2_pnm_Gem_against_hairpin_nomatches.fa
 Mapping_day0_2_pnm_Gem_against_hairpin_statistics.txt
 Mapping_day0_2_pnm_Gem_against_hairpin_unambiguous.txt
 Mapping_day2_2_pnm_Gem_against_hairpin_ambiguous.txt
 Mapping_day2_2_pnm_Gem_against_hairpin_matches.fa
 Mapping_day2_2_pnm_Gem_against_hairpin_nomatches.fa
 Mapping_day2_2_pnm_Gem_against_hairpin_statistics.txt   
 Mapping_day2_2_pnm_Gem_against_hairpin_unambiguous.txt

In this example the two samples "day0" and "day2" have been mapped to mature miRNA and the no-matches subsequently to the hairpins. For this example run_DGE.pl will create 6 subfolderes in the directory. run_DGE.pl will create three subfolders (assuming default normalisations) for each reference sequence.

Output

As mentioned in Input run_DGE.pl will create a subfolder for each normalisation method and for each reference sequence to which the reads have been mapped.

For example:

 Mapping_day0_1_i_Gem_against_mat-miRNA_ambiguous.txt    
 Mapping_day0_1_i_Gem_against_mat-miRNA_matches.fa
 Mapping_day0_1_i_Gem_against_mat-miRNA_nomatches.fa
 Mapping_day0_1_i_Gem_against_mat-miRNA_statistics.txt
 Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt
 Mapping_day2_1_i_Gem_against_mat-miRNA_ambiguous.txt
 Mapping_day2_1_i_Gem_against_mat-miRNA_matches.fa
 Mapping_day2_1_i_Gem_against_mat-miRNA_nomatches.fa
 Mapping_day2_1_i_Gem_against_mat-miRNA_statistics.txt   
 Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt

the following subfolders will be created:

 mat-miRNA_off_RF_15-36_mm2
 mat-miRNA_quantil_RF_15-36_mm2
 mat-miRNA_scalelinear_RF_15-36_mm2

As can be seen from this example the subfolder name contains information about the used settings for creating the DGE-profile. This allows to create several DGE-Profiles using alternaive settings without overriding the previous DGE-profiles. The components of the subfolder name are:

  1. The ID of the reference sequence; In the example above mature miRNA (mat-miRNA).

  2. The normalisation method;

  3. The strand which was used to create the DGE-profile In the example above both strands have been used (RF);

  4. The length range of the reads; In the example above only reads larger than 14 and smaller than 37 (15-36) were used for creating the DGE-profile.

  5. The number of maximum allowed mismatches with the reference sequence. In the example above only reads with two or less mismatches (2mm) were used for creating the DGE-profile.

DGE-profile

As mentioned in the previous section the run_DGE.pl script will generate a subfolder for each normalisation methode and reference sequence. This subfolders contain a lot of files illustrating differences in gene (miRNA) expression.

expr_table_count.txt

This file contains the normalised expression levels in counts. This values are based on the reads unambiguously mapped to a certain reference sequence. Only in the folder with normalisation "off" this values represent the number of reads actually observed mapping to a certain reference gene.

Example:

                                                day0    day3    day6    day12   day18   day24    
 mmu-miR-720 MIMAT0003484 Mus musculus miR-720  284.0   268.0   170.0   13.0    186.0   204.0
 mmu-miR-30c MIMAT0000514 Mus musculus miR-30c  1.0     537.0   5.0     -       16.0    541.0
 mmu-miR-19b MIMAT0000513 Mus musculus miR-19b  2.0     560.0   1.0     1.0     27.0    465.0
 mmu-miR-99b MIMAT0000132 Mus musculus miR-99b  992.0   9.0     21.0    6.0     3.0     9.0
expr_table_log2.txt

This file is very similar to the previous file. It contains the normalised expression levels in log2 of the counts. This values are based on the reads unambiguously mapped to a certain reference sequence. For example:

                                                        day0    day3    day6    day12   day18   day24  
 mmu-miR-148b MIMAT0000580 Mus musculus miR-148b        4.5     11.8    5.6     3.3     8.4     13.0
 mmu-miR-7a MIMAT0000677 Mus musculus miR-7a            6.0     12.9    5.9     5.0     7.8     12.2
 mmu-miR-363 MIMAT0000708 Mus musculus miR-363          5.6     13.2    5.2     4.8     6.7     11.8
 mmu-miR-221 MIMAT0000669 Mus musculus miR-221          4.8     11.1    5.9     4.7     8.4     13.0
 mmu-miR-181b MIMAT0000673 Mus musculus miR-181b        9.1     12.8    7.8     6.7     8.4     11.8
heatmap.pdf and heatmap_legend.txt

This file contain a heatmap to visualise and cluster similarily expressed genes and samples and the legend for the heatmap. An uniqe number is assigned to each gene (miRNA) in the heatmap.pdf which can be resolved using heatmap_legend.txt.

lineplot.pdf

A lineplot depicting the time-course and development of gene expression levels for the 20 highest expressed miRNAs. For this graph the --idorder is an important feature, as for example the time course "day0 day12 day24 day3 day6", which represents the default sorting, is only of limited biological value.

scatterplot

This is actually a subfolder which contains n-1 scatterplots where n is the number of samples (eg. Solexa-lanes). The gene expression levels of each sample will be compared to the first sample using the active --idorder. This is another instance where --idorder is an important feature as it might, for example, be useful to compare the expression levels of several samples with a single background-sample.

significant_dge_pairwise.txt

This file will only be created if the --bayes_pairwise flag is set. The file can only be found in the subfolder for the normalisation level "off". The method used for calculating the bayesian significance was published by Audic and Claverie (1997): "The Significance of Digital Gene Expression Profiles"


 mmu-miR-709 MIMAT0003499 Mus musculus miR-709  5.83612643834978e-08    day0    day12   2.0     58.0
 mmu-miR-425 MIMAT0004750 Mus musculus miR-425  6.48859011835964e-08    day24   day36   62.0    156.0
 mmu-miR-375 MIMAT0000739 Mus musculus miR-375  7.50493843738679e-08    day0    day3    6.0     40.0
 mmu-miR-484 MIMAT0003127 Mus musculus miR-484  7.68847709460979e-08    day3    day24   67.0    1.0
  1. The first column represents the reference sequence for which significant differences in gene (miRNA) expression have been observed

  2. The second column represents the bayesian significance of the observed differences in expression levels. The whole file is sorted by significance.

  3. The third column represents the ID of the first sample

  4. The fourth column represents the ID of the second sample

  5. The fifth column represents the read counts for the first sample. (not normalised)

  6. The sixth column represents the read counts of the second sample. (not normalised)

It is important to note the the significance is not only calculated from the individual read counts, also the total read counts enter the equation. For further details we refer to the paper of Audic and Claverie.

significant_dge_table.txt

This file will only be created if the --bayes_table flag is set. The file can only be found in the subfolder for the normalisation level "off". The method used for calculating the bayesian significance was published by Audic and Claverie (1997): "The Significance of Digital Gene Expression Profiles"


                                                         day0   day3        day6        day12       day24      day36
  mmu-miR-99b MIMAT0000132 Mus musculus miR-99b         -       1.2e-305    7.4e-231    3.1e-123    0          3.8e-250
  mmu-miR-139-3p MIMAT0004662 Mus musculus miR-139-3p   -       0           6.0e-311    0           0          2.0e-154
  mmu-miR-378 MIMAT0003151 Mus musculus miR-378         -       0           0           3.6e-305    0          0

This is a table which contains the pairwise significance between gene (miRNA) expression levels. Each significance value refers to differences with the first column which should usually be a background level or the first time point. This is another example where the feature --idorder is important. Using a concrete example the value 7.4e-231 in the fourth column and second row means that the differences in expression level between day6 and day0 is highly significant: 7.4e-231. The table is sorted descencding by overal significance.

Workflow

This section contains the workflow of the script run_DGE.pl. Since this script is very powerful and creates a lot of directories and files it might be interesting to the users how the script actually proceeds. If you are not interested in the intrinsic mechanism of the script feel free to skip the section.

Considering the following content of the directory "mydir":

 Mapping_day0_1_i_Gem_against_mat-miRNA_ambiguous.txt    
 Mapping_day0_1_i_Gem_against_mat-miRNA_matches.fa
 Mapping_day0_1_i_Gem_against_mat-miRNA_nomatches.fa
 Mapping_day0_1_i_Gem_against_mat-miRNA_statistics.txt
 Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt
 Mapping_day2_1_i_Gem_against_mat-miRNA_ambiguous.txt
 Mapping_day2_1_i_Gem_against_mat-miRNA_matches.fa
 Mapping_day2_1_i_Gem_against_mat-miRNA_nomatches.fa
 Mapping_day2_1_i_Gem_against_mat-miRNA_statistics.txt   
 Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt
 Mapping_day0_2_pnm_Gem_against_hairpin_ambiguous.txt    
 Mapping_day0_2_pnm_Gem_against_hairpin_matches.fa
 Mapping_day0_2_pnm_Gem_against_hairpin_nomatches.fa
 Mapping_day0_2_pnm_Gem_against_hairpin_statistics.txt
 Mapping_day0_2_pnm_Gem_against_hairpin_unambiguous.txt
 Mapping_day2_2_pnm_Gem_against_hairpin_ambiguous.txt
 Mapping_day2_2_pnm_Gem_against_hairpin_matches.fa
 Mapping_day2_2_pnm_Gem_against_hairpin_nomatches.fa
 Mapping_day2_2_pnm_Gem_against_hairpin_statistics.txt   
 Mapping_day2_2_pnm_Gem_against_hairpin_unambiguous.txt
 ppday2
 ppday0
 ppday3
 phone_numbers.txt
 subfolder1
 subfolder2
 etc

If the user starts run_DGE.pl mydir the script first extracts all sequences which start with the name "Mapping*".

Than the reference sequence counter is set to "1". The script extracts all unambiguous hit file which have a reference sequence counter of 1. For example, only the following two files will be kept for further processing.

 Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt 
 Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt

The script, than, loads the content of this two files into a hash-hash which contains the unambiguous read counts for each gene (miRNA) and sample. In this example the hash will contain many hundred rows (one for each mature-miRNA) and two columns (one for each sample).

Subsequently the script iterates over each requested normalisation method. The hash-hash is passed to each normalisation methods and the following files are created with each normalisation:


 heatmap.pdf
 heatmap_legend.txt
 scaterplot (folder and files)
 expr_table_count.txt
 expr_table_log2.txt
 lineplot.pdf
 (significant_dge_table.txt)
 (significant_dge_pairwise)

After succesfully iterating each normalisation method, the reference sequence counter is set to "2" and the next charge is processed:


 Mapping_day0_2_pnm_Gem_against_hairpin_unambiguous.txt
 Mapping_day2_2_pnm_Gem_against_hairpin_unambiguous.txt

and so on.....

until no further files can be found having the number of the reference sequence counter. In our example the script would stop after setting the reference sequence counter to "3" as no files exist which have this number.

Logging

Independent of the console log --logconsole and the file-log --logfile the log of preprocessing is always sent to the MIRO logger.

See also the MIRO-logger

Troubleshooting

The following will not work:

 Mapping_day0_1_i_Gem_against_mat-miRNA_ambiguous.txt    
 Mapping_day0_1_i_Gem_against_mat-miRNA_matches.fa
 Mapping_day0_1_i_Gem_against_mat-miRNA_nomatches.fa
 Mapping_day0_1_i_Gem_against_mat-miRNA_statistics.txt
 Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt
 Mapping_day2_1_i_Gem_against_mat-miRNA_ambiguous.txt
 Mapping_day2_1_i_Gem_against_mat-miRNA_matches.fa
 Mapping_day2_1_i_Gem_against_mat-miRNA_nomatches.fa
 Mapping_day2_1_i_Gem_against_mat-miRNA_statistics.txt   
 Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt
 Mapping_day0_1_i_Gem_against_hairpin_ambiguous.txt    
 Mapping_day0_1_i_Gem_against_hairpin_matches.fa
 Mapping_day0_1_i_Gem_against_hairpin_nomatches.fa
 Mapping_day0_1_i_Gem_against_hairpin_statistics.txt
 Mapping_day0_1_i_Gem_against_hairpin_unambiguous.txt
 Mapping_day2_1_i_Gem_against_hairpin_ambiguous.txt
 Mapping_day2_1_i_Gem_against_hairpin_matches.fa
 Mapping_day2_1_i_Gem_against_hairpin_nomatches.fa
 Mapping_day2_1_i_Gem_against_hairpin_statistics.txt   
 Mapping_day2_1_i_Gem_against_hairpin_unambiguous.txt

as explained in the previous section the script retrieves all unambiguous hit files which have the same reference sequence counter. In this example the following files would be retrieved in the first step:

 Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt  
 Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt
 Mapping_day0_1_i_Gem_against_hairpin_unambiguous.txt  
 Mapping_day2_1_i_Gem_against_hairpin_unambiguous.txt

The script would therefore assume that four different samples exist instead of the correct number two.

If you are using run_Mapping.pl the reference sequence counter is automatically increased and this problem should not occur.

This problem may arise if you start run_Mapping.pl several times for the same sample while using the same output directory. To prevent this you may either choose different output directories or increment the counter of the file manually, thus for example renaming Mapping_day2_1_i_Gem_against_hairpin_unambiguous.txt to Mapping_day2_2_i_Gem_against_hairpin_unambiguous.txt


REQUIREMENTS

Perl 5.8 or higher

R 2.7.0 or higher

R-library geneplotter (for heatmap)

R-library gplots (for heatmap)


AUTHORS

Robert Kofler

Manuela Hummel

Lauro Sumoy

Heinz Himmelbauer


CONTACT

robert.kofler at crg.es