run_DGE.pl - Create a differential gene (miRNA) expression profile
# Create a differential gene expression profile (DGE) for the mapping results # contained in the folder map_results. Default settings will be used run_DGE.pl /home/usr/map_results # Create a DGE for the folder map_results, specifying all parameters. run_DGE.pl --bayes_table --bayes_pairwise --idorder "day0 day6 day12 day24" --min_length 20 --max_length 25 --max_mm 1 --strand RF --normalisation "off scalelinear quantile housekeep5 housekeep20" /home/usr/map_results
Sometimes it is important that the samples are sorted in a particular way, for example to display the time-course of miRNA expression during Drosophila development. MIRO thus allows to specify the order of the samples. Each sample is identified by its ID. default=undef (random). Examples:
--idorder "day0 day3 day6 day12 day21" --idorder "background protein1 protein2 protein3"
MIRO allows to specify a flexible number of normalisation methods which will be used to create the differential gene (miRNA) expression profiles.
default="off scalelinear quantil"
. At the moment the following normalisation methods are supported:
samples are not normalised, the actual observed read counts will be displayed
The total expression levels will be linearly scaled to constant level. The individual read counts will be adjusted accordingly. This is the most straight-forward normalisation method
The quantile-normalisation method
Examples: housekeep5
, housekeep10
, housekeep20
;
This normalisation methods is a derivate of the scalelinear method. Instead of using all genes (miRNAs) for calculating the normalisation factor, only the genes having a medium expression levels will be used. Therefore the genes having the highest and the lowest expression levels will be ignored (of course only for calculating the normalisation factor, not for normalisation itself). The housekeep normalisation has to be called with the exact percentage of genes to be skipped. E.g.: housekeep20 ignores the 20% highest and the 20% lowest expressed genes. The genes (miRNAs) are weighted by the log2 of the expression level.
Only reads mapping to the specified strand will be used for creating a DGE-profile. Possible values: R (reverse strand), F (forward strand), RF (both strands); default=RF
The minimum length of reads. Shorter reads will not enter the DGE-profile. default=15
The maximum length of reads. Shorter reads will not enter the DGE-profile. default=100
The maximum number of mismatches. Reads having more mismatches will not enter the DGE-profile. default=2
Flag; Indicating whether a bayesian significance table should be created. Note that this table will only be created for normalisation "off". default=not set;
Flag; Indicating whether pairwise significant changes in gene (miRNA) expression should be calculated. Note that this table will only be created for normalisation "off". A minimum significance of 0.001 is used; default=not set;
May be used to specify a file to which the log of run_DGE.pl
will be written. default=no file
Should the log be written to the console; 0..no 1..yes; default=1
Display the help pages.
run_DGE.pl
is the heart of the MIRO-pipeline. It is a powerful script which creates differential gene (miRNA) expression (DGE) profiles,
using several different normalisation methods at the same time. Only the unambiguously mapped reads will be used for creating a DGE-profile.
The results for each applied normalisation method will be written into different subfolders. The script furthermore calculate this DGE
for each reference sequence to which the reads have been mapped.
For example if the three different normalisations are used (e.g.: off, scalelinear, quantile) and the reads have been mapped to four
reference sequences (e.g.: mature-miRNA, hairpin, reference-mRNA, whole-genome) than the script creates 12 subfolders where each contains
a DGE-profile.
A DGE-profile (a subfolder) contains a heatmap, several scatterplots, a lineplot, several tables which contain expression levels and tables containing significant differentially expressed genes (miRNAs);
run_DGE.pl
can be regarded as a wrapper for several scripts which can be used individually, for example to create only the heatmap.
We therefore also refer to the documentation of these scripts:
Input for run_DGE.pl
has to be a folder which contains the mapping results for at least two different samples (usually corresponding to Solexa-lanes).
Example for the minimum content of a folder:
Mapping_day0_1_i_Gem_against_mat-miRNA_ambiguous.txt Mapping_day0_1_i_Gem_against_mat-miRNA_matches.fa Mapping_day0_1_i_Gem_against_mat-miRNA_nomatches.fa Mapping_day0_1_i_Gem_against_mat-miRNA_statistics.txt Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt Mapping_day2_1_i_Gem_against_mat-miRNA_ambiguous.txt Mapping_day2_1_i_Gem_against_mat-miRNA_matches.fa Mapping_day2_1_i_Gem_against_mat-miRNA_nomatches.fa Mapping_day2_1_i_Gem_against_mat-miRNA_statistics.txt Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt
In this example the two samples "day0" and "day2" have been mapped to mature miRNA (mat-miRNA) using the mapper Gem.
In this example the user could user could specify an alternative sorting with --idorder "day2 day0"
.
For this example run_DGE.pl
will create three subfolders in the directory (assuming the default normalisations).
A more advanced example:
Mapping_day0_1_i_Gem_against_mat-miRNA_ambiguous.txt Mapping_day0_1_i_Gem_against_mat-miRNA_matches.fa Mapping_day0_1_i_Gem_against_mat-miRNA_nomatches.fa Mapping_day0_1_i_Gem_against_mat-miRNA_statistics.txt Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt Mapping_day2_1_i_Gem_against_mat-miRNA_ambiguous.txt Mapping_day2_1_i_Gem_against_mat-miRNA_matches.fa Mapping_day2_1_i_Gem_against_mat-miRNA_nomatches.fa Mapping_day2_1_i_Gem_against_mat-miRNA_statistics.txt Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt Mapping_day0_2_pnm_Gem_against_hairpin_ambiguous.txt Mapping_day0_2_pnm_Gem_against_hairpin_matches.fa Mapping_day0_2_pnm_Gem_against_hairpin_nomatches.fa Mapping_day0_2_pnm_Gem_against_hairpin_statistics.txt Mapping_day0_2_pnm_Gem_against_hairpin_unambiguous.txt Mapping_day2_2_pnm_Gem_against_hairpin_ambiguous.txt Mapping_day2_2_pnm_Gem_against_hairpin_matches.fa Mapping_day2_2_pnm_Gem_against_hairpin_nomatches.fa Mapping_day2_2_pnm_Gem_against_hairpin_statistics.txt Mapping_day2_2_pnm_Gem_against_hairpin_unambiguous.txt
In this example the two samples "day0" and "day2" have been mapped to mature miRNA and the no-matches subsequently to the hairpins.
For this example run_DGE.pl
will create 6 subfolderes in the directory. run_DGE.pl
will create three subfolders (assuming default normalisations) for each reference sequence.
As mentioned in Input
run_DGE.pl
will create a subfolder for each normalisation method and for each reference sequence to which the reads have been mapped.
For example:
Mapping_day0_1_i_Gem_against_mat-miRNA_ambiguous.txt Mapping_day0_1_i_Gem_against_mat-miRNA_matches.fa Mapping_day0_1_i_Gem_against_mat-miRNA_nomatches.fa Mapping_day0_1_i_Gem_against_mat-miRNA_statistics.txt Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt Mapping_day2_1_i_Gem_against_mat-miRNA_ambiguous.txt Mapping_day2_1_i_Gem_against_mat-miRNA_matches.fa Mapping_day2_1_i_Gem_against_mat-miRNA_nomatches.fa Mapping_day2_1_i_Gem_against_mat-miRNA_statistics.txt Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt
the following subfolders will be created:
mat-miRNA_off_RF_15-36_mm2 mat-miRNA_quantil_RF_15-36_mm2 mat-miRNA_scalelinear_RF_15-36_mm2
As can be seen from this example the subfolder name contains information about the used settings for creating the DGE-profile. This allows to create several DGE-Profiles using alternaive settings without overriding the previous DGE-profiles. The components of the subfolder name are:
The ID of the reference sequence; In the example above mature miRNA (mat-miRNA).
The normalisation method;
The strand which was used to create the DGE-profile In the example above both strands have been used (RF);
The length range of the reads; In the example above only reads larger than 14 and smaller than 37 (15-36) were used for creating the DGE-profile.
The number of maximum allowed mismatches with the reference sequence. In the example above only reads with two or less mismatches (2mm) were used for creating the DGE-profile.
As mentioned in the previous section the run_DGE.pl
script will generate a subfolder for each normalisation methode and reference sequence.
This subfolders contain a lot of files illustrating differences in gene (miRNA) expression.
This file contains the normalised expression levels in counts. This values are based on the reads unambiguously mapped to a certain reference sequence. Only in the folder with normalisation "off" this values represent the number of reads actually observed mapping to a certain reference gene.
Example:
day0 day3 day6 day12 day18 day24 mmu-miR-720 MIMAT0003484 Mus musculus miR-720 284.0 268.0 170.0 13.0 186.0 204.0 mmu-miR-30c MIMAT0000514 Mus musculus miR-30c 1.0 537.0 5.0 - 16.0 541.0 mmu-miR-19b MIMAT0000513 Mus musculus miR-19b 2.0 560.0 1.0 1.0 27.0 465.0 mmu-miR-99b MIMAT0000132 Mus musculus miR-99b 992.0 9.0 21.0 6.0 3.0 9.0
This file is very similar to the previous file. It contains the normalised expression levels in log2 of the counts. This values are based on the reads unambiguously mapped to a certain reference sequence. For example:
day0 day3 day6 day12 day18 day24 mmu-miR-148b MIMAT0000580 Mus musculus miR-148b 4.5 11.8 5.6 3.3 8.4 13.0 mmu-miR-7a MIMAT0000677 Mus musculus miR-7a 6.0 12.9 5.9 5.0 7.8 12.2 mmu-miR-363 MIMAT0000708 Mus musculus miR-363 5.6 13.2 5.2 4.8 6.7 11.8 mmu-miR-221 MIMAT0000669 Mus musculus miR-221 4.8 11.1 5.9 4.7 8.4 13.0 mmu-miR-181b MIMAT0000673 Mus musculus miR-181b 9.1 12.8 7.8 6.7 8.4 11.8
This file contain a heatmap to visualise and cluster similarily expressed genes and samples and the legend for the heatmap. An uniqe number is assigned to each gene (miRNA) in the heatmap.pdf which can be resolved using heatmap_legend.txt.
A lineplot depicting the time-course and development of gene expression levels for the 20 highest expressed miRNAs.
For this graph the --idorder
is an important feature, as for example the time course "day0 day12 day24 day3 day6", which represents the
default sorting, is only of limited biological value.
This is actually a subfolder which contains n-1
scatterplots where n
is the number of samples (eg. Solexa-lanes).
The gene expression levels of each sample will be compared to the first sample using the active --idorder
.
This is another instance where --idorder
is an important feature as it might, for example,
be useful to compare the expression levels of several samples with a single background-sample.
This file will only be created if the --bayes_pairwise
flag is set.
The file can only be found in the subfolder for the normalisation level "off".
The method used for calculating the bayesian significance was published by Audic and Claverie (1997):
"The Significance of Digital Gene Expression Profiles"
mmu-miR-709 MIMAT0003499 Mus musculus miR-709 5.83612643834978e-08 day0 day12 2.0 58.0 mmu-miR-425 MIMAT0004750 Mus musculus miR-425 6.48859011835964e-08 day24 day36 62.0 156.0 mmu-miR-375 MIMAT0000739 Mus musculus miR-375 7.50493843738679e-08 day0 day3 6.0 40.0 mmu-miR-484 MIMAT0003127 Mus musculus miR-484 7.68847709460979e-08 day3 day24 67.0 1.0
The first column represents the reference sequence for which significant differences in gene (miRNA) expression have been observed
The second column represents the bayesian significance of the observed differences in expression levels. The whole file is sorted by significance.
The third column represents the ID of the first sample
The fourth column represents the ID of the second sample
The fifth column represents the read counts for the first sample. (not normalised)
The sixth column represents the read counts of the second sample. (not normalised)
It is important to note the the significance is not only calculated from the individual read counts, also the total read counts enter the equation. For further details we refer to the paper of Audic and Claverie.
This file will only be created if the --bayes_table
flag is set.
The file can only be found in the subfolder for the normalisation level "off".
The method used for calculating the bayesian significance was published by Audic and Claverie (1997):
"The Significance of Digital Gene Expression Profiles"
day0 day3 day6 day12 day24 day36 mmu-miR-99b MIMAT0000132 Mus musculus miR-99b - 1.2e-305 7.4e-231 3.1e-123 0 3.8e-250 mmu-miR-139-3p MIMAT0004662 Mus musculus miR-139-3p - 0 6.0e-311 0 0 2.0e-154 mmu-miR-378 MIMAT0003151 Mus musculus miR-378 - 0 0 3.6e-305 0 0
This is a table which contains the pairwise significance between gene (miRNA) expression levels.
Each significance value refers to differences with the first column which should usually be a background level or the first time point.
This is another example where the feature --idorder
is important.
Using a concrete example the value 7.4e-231
in the fourth column and second row means that the differences in expression level
between day6 and day0 is highly significant: 7.4e-231. The table is sorted descencding by overal significance.
This section contains the workflow of the script run_DGE.pl
.
Since this script is very powerful and creates a lot of directories and files
it might be interesting to the users how the script actually proceeds.
If you are not interested in the intrinsic mechanism of the script feel free to skip the section.
Considering the following content of the directory "mydir":
Mapping_day0_1_i_Gem_against_mat-miRNA_ambiguous.txt Mapping_day0_1_i_Gem_against_mat-miRNA_matches.fa Mapping_day0_1_i_Gem_against_mat-miRNA_nomatches.fa Mapping_day0_1_i_Gem_against_mat-miRNA_statistics.txt Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt Mapping_day2_1_i_Gem_against_mat-miRNA_ambiguous.txt Mapping_day2_1_i_Gem_against_mat-miRNA_matches.fa Mapping_day2_1_i_Gem_against_mat-miRNA_nomatches.fa Mapping_day2_1_i_Gem_against_mat-miRNA_statistics.txt Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt Mapping_day0_2_pnm_Gem_against_hairpin_ambiguous.txt Mapping_day0_2_pnm_Gem_against_hairpin_matches.fa Mapping_day0_2_pnm_Gem_against_hairpin_nomatches.fa Mapping_day0_2_pnm_Gem_against_hairpin_statistics.txt Mapping_day0_2_pnm_Gem_against_hairpin_unambiguous.txt Mapping_day2_2_pnm_Gem_against_hairpin_ambiguous.txt Mapping_day2_2_pnm_Gem_against_hairpin_matches.fa Mapping_day2_2_pnm_Gem_against_hairpin_nomatches.fa Mapping_day2_2_pnm_Gem_against_hairpin_statistics.txt Mapping_day2_2_pnm_Gem_against_hairpin_unambiguous.txt ppday2 ppday0 ppday3 phone_numbers.txt subfolder1 subfolder2 etc
If the user starts run_DGE.pl mydir
the script first extracts all sequences which start with the name "Mapping*".
Than the reference sequence counter is set to "1". The script extracts all unambiguous hit file which have a reference sequence counter of 1. For example, only the following two files will be kept for further processing.
Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt
The script, than, loads the content of this two files into a hash-hash which contains the unambiguous read counts for each gene (miRNA) and sample. In this example the hash will contain many hundred rows (one for each mature-miRNA) and two columns (one for each sample).
Subsequently the script iterates over each requested normalisation method. The hash-hash is passed to each normalisation methods and the following files are created with each normalisation:
heatmap.pdf heatmap_legend.txt scaterplot (folder and files) expr_table_count.txt expr_table_log2.txt lineplot.pdf (significant_dge_table.txt) (significant_dge_pairwise)
After succesfully iterating each normalisation method, the reference sequence counter is set to "2" and the next charge is processed:
Mapping_day0_2_pnm_Gem_against_hairpin_unambiguous.txt Mapping_day2_2_pnm_Gem_against_hairpin_unambiguous.txt
and so on.....
until no further files can be found having the number of the reference sequence counter. In our example the script would stop after setting the reference sequence counter to "3" as no files exist which have this number.
Independent of the console log --logconsole
and the file-log --logfile
the log of preprocessing is always sent to the MIRO logger.
The following will not work:
Mapping_day0_1_i_Gem_against_mat-miRNA_ambiguous.txt Mapping_day0_1_i_Gem_against_mat-miRNA_matches.fa Mapping_day0_1_i_Gem_against_mat-miRNA_nomatches.fa Mapping_day0_1_i_Gem_against_mat-miRNA_statistics.txt Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt Mapping_day2_1_i_Gem_against_mat-miRNA_ambiguous.txt Mapping_day2_1_i_Gem_against_mat-miRNA_matches.fa Mapping_day2_1_i_Gem_against_mat-miRNA_nomatches.fa Mapping_day2_1_i_Gem_against_mat-miRNA_statistics.txt Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt Mapping_day0_1_i_Gem_against_hairpin_ambiguous.txt Mapping_day0_1_i_Gem_against_hairpin_matches.fa Mapping_day0_1_i_Gem_against_hairpin_nomatches.fa Mapping_day0_1_i_Gem_against_hairpin_statistics.txt Mapping_day0_1_i_Gem_against_hairpin_unambiguous.txt Mapping_day2_1_i_Gem_against_hairpin_ambiguous.txt Mapping_day2_1_i_Gem_against_hairpin_matches.fa Mapping_day2_1_i_Gem_against_hairpin_nomatches.fa Mapping_day2_1_i_Gem_against_hairpin_statistics.txt Mapping_day2_1_i_Gem_against_hairpin_unambiguous.txt
as explained in the previous section the script retrieves all unambiguous hit files which have the same reference sequence counter. In this example the following files would be retrieved in the first step:
Mapping_day0_1_i_Gem_against_mat-miRNA_unambiguous.txt Mapping_day2_1_i_Gem_against_mat-miRNA_unambiguous.txt Mapping_day0_1_i_Gem_against_hairpin_unambiguous.txt Mapping_day2_1_i_Gem_against_hairpin_unambiguous.txt
The script would therefore assume that four different samples exist instead of the correct number two.
If you are using run_Mapping.pl
the reference sequence counter is automatically increased and this problem should not occur.
This problem may arise if you start run_Mapping.pl
several times for the same sample while using the same output directory.
To prevent this you may either choose different output directories or increment the counter of the file manually, thus for example
renaming Mapping_day2_1_i_Gem_against_hairpin_unambiguous.txt to Mapping_day2_2_i_Gem_against_hairpin_unambiguous.txt
Perl 5.8 or higher
R 2.7.0 or higher
R-library geneplotter (for heatmap)
R-library gplots (for heatmap)
Robert Kofler
Manuela Hummel
Lauro Sumoy
Heinz Himmelbauer
robert.kofler at crg.es