Hello!
we are using the exactTest in edgeR with TMM normalization to compute differential expression for smallRNA-Seq data. We have 3 experimental conditions with 3 replicate samples for each condition. The main focus of the study is to identify deferentially expressed micro RNAs.
The study generates around 94,300 loci with significant read counts. Of these, 373 mapped to known micro RNAs and a further 759 mapped to what we characterize as novel miRNAs.
We performed differential expression analysis separately in edgeR for the 373 known miRNAs, the 1132 (373+759) known and novel micro RNAs and the 94,300 loci which contains everything.
What concerns us is that the differential expression results (Fold change and p-value) for individual loci (miRNA) differ significantly depending on the filtered data set used. For instance (considering Sample-3 vs Sample-1), when only the 373 known miRNAs are used we get 227 down regulated and 140 up-regulated miRNAs (not considering significance p-value). However, none of the 373 loci of known miRNAs are up-regulated when they are analysed with the rest of the loci i.e (94,300 loci) considering only the fold change. All the 140 loci up-regulated in the miRNA only data set come as down-regulated in the large data set. Only one of the 373 loci was up-regulated when they were analysed with the novel microRNAs (the set of 1132 loci).
The reason for this is that the normalized counts differ drastically depending on the size of the data set. The question is whether it is alright to run edgeR with a selected (small) subset of data in a larger data set?
Here are the values for one locus as an example of how the normalized counts vary between the different data sets for the same locus:
Locus | chr18:34957617-34957681 | |||||
Normalized Counts "TMM" (CPM) | Raw Counts | |||||
1132 known and novel miRNA | 373 known miRNA | All 94,300 Loci | 1132 known and novel miRNA | 373 known miRNA | All 94,300 Loci | |
Sample1-Rep1 | 186.9079706 | 55.78897073 | 16.57274643 | 151 | 151 | 151 |
Sample1-Rep2 | 49.43554074 | 23.25649782 | 3.314549286 | 22 | 22 | 22 |
Sample1-Rep3 | 202.4490182 | 61.33157451 | 26.51639428 | 131 | 131 | 131 |
Sample2-Rep1 | 63.40622621 | 153.5625048 | 3.602770963 | 68 | 68 | 68 |
Sample2-Rep2 | 44.08028772 | 100.0256627 | 1.546789667 | 39 | 39 | 39 |
Sample2-Rep3 | 160.6770084 | 141.2183772 | 16.57274643 | 136 | 136 | 136 |
Sample3-Rep1 | 63.64396141 | 350.7079945 | 2.998877925 | 52 | 52 | 52 |
Sample3-Rep2 | 44.37166547 | 351.3239658 | 1.964177354 | 44 | 44 | 44 |
Sample3-Rep3 | 273.4929859 | 559.9950755 | 15.08119925 | 252 | 252 | 252 |
FC (Sample-2 vs Sample-1) | Pvalue (Sample-2 vs Sample-1) | FC (Sample-3 vs Sample-1) | Pvalue (Sample-3 vs Sample-1) | FC (Sample-3 vs Sample-2) | Pvalue (Sample-3 vs Sample-2) | |
1132 known and novel miRNA | -1.642028249 | 0.343610187 | -1.153540907 | 0.788568529 | 1.42357752 | 0.49624928 |
373 known miRNA | 2.780493259 | 0.003097653 | 8.930405915 | 2.26E-09 | 3.20944502 | 0.00083244 |
All 94,300 Loci | -2.170906688 | 0.221123891 | -2.305731197 | 0.18170132 | -1.062028778 | 0.91918073 |
Thank you very much in advance.
To be clear, you're re-running
calcNormFactors
after filtering?Yes. that is correct! After filtering we run:
y <- calcNormFactors(y,method="TMM")
y <- estimateCommonDisp(y)
y <- estimateTagwiseDisp(y)