Filtering tags in edgeR
2
0
Entering edit mode
Sam • 0
@sam-6849
Last seen 8.6 years ago

Hello!

we are using the exactTest in edgeR with TMM normalization to compute differential expression for smallRNA-Seq data. We have 3 experimental conditions with 3 replicate samples for each condition. The main focus of the study is to identify deferentially expressed micro RNAs. 

The study generates around 94,300 loci with significant read counts. Of these, 373 mapped to known micro RNAs and a further 759 mapped to what we characterize as novel miRNAs.

We performed differential expression analysis separately in edgeR  for the 373 known miRNAs, the 1132 (373+759) known and novel micro RNAs and the 94,300 loci which contains everything. 

 

What concerns us is that the differential expression results (Fold change and p-value) for individual loci (miRNA) differ significantly depending on the filtered  data set used. For instance (considering Sample-3 vs Sample-1), when only the 373 known miRNAs are used we get 227 down regulated and 140 up-regulated miRNAs (not considering significance p-value). However, none of the 373 loci of known miRNAs are up-regulated when they are analysed with the rest of the loci i.e (94,300 loci) considering only the fold change. All the 140 loci up-regulated in the miRNA only data set come as down-regulated in the large data set. Only one of the 373 loci was up-regulated when they were analysed with the novel microRNAs (the set of 1132 loci). 

The reason for this is that the normalized counts differ drastically depending on the size of the data set. The question is whether it is alright to run edgeR with a selected (small) subset of data in a larger data set?  

 Here are the values for one locus as an example of how the normalized counts vary between the different data sets for the same locus:

Locus  chr18:34957617-34957681          
  Normalized Counts "TMM" (CPM)     Raw Counts    
  1132 known and novel miRNA 373 known miRNA All 94,300 Loci 1132 known and novel miRNA 373 known miRNA All 94,300 Loci
Sample1-Rep1 186.9079706 55.78897073 16.57274643 151 151 151
Sample1-Rep2 49.43554074 23.25649782 3.314549286 22 22 22
Sample1-Rep3 202.4490182 61.33157451 26.51639428 131 131 131
Sample2-Rep1 63.40622621 153.5625048 3.602770963 68 68 68
Sample2-Rep2 44.08028772 100.0256627 1.546789667 39 39 39
Sample2-Rep3 160.6770084 141.2183772 16.57274643 136 136 136
Sample3-Rep1 63.64396141 350.7079945 2.998877925 52 52 52
Sample3-Rep2 44.37166547 351.3239658 1.964177354 44 44 44
Sample3-Rep3 273.4929859 559.9950755 15.08119925 252 252 252
             
             
  FC (Sample-2 vs Sample-1) Pvalue (Sample-2 vs Sample-1) FC (Sample-3 vs Sample-1) Pvalue (Sample-3 vs Sample-1) FC (Sample-3 vs Sample-2) Pvalue (Sample-3 vs Sample-2)
1132 known and novel miRNA -1.642028249 0.343610187 -1.153540907 0.788568529 1.42357752 0.49624928
373 known miRNA 2.780493259 0.003097653 8.930405915 2.26E-09 3.20944502 0.00083244
All 94,300 Loci -2.170906688 0.221123891 -2.305731197 0.18170132 -1.062028778 0.91918073
               

 

Thank you very much in advance.

edger • 1.0k views
ADD COMMENT
0
Entering edit mode

To be clear, you're re-running calcNormFactors after filtering?

ADD REPLY
0
Entering edit mode

Yes. that is correct! After filtering we run:

y <- calcNormFactors(y,method="TMM")

y <- estimateCommonDisp(y)

y <- estimateTagwiseDisp(y)

 

 

ADD REPLY
0
Entering edit mode

 

 

ADD REPLY
1
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 11 hours ago
The city by the bay

TMM normalization in calcNormFactors assumes that most genes are not DE. If you apply the function on the subset of known miRNAs, then you're assuming that most of the miRNAs are not DE. Normalization will subsequently remove any systematic difference between libraries across those miRNAs. This effectively centers the bulk of miRNAs to have a median log-fold change of zero, such that there willbe comparable numbers of miRNAs that are changing in each direction (i.e., positive and negative log-fold changes).

Of course, the above assumption may not be true. For example, consider a situation where all your miRNAs are downregulated in the comparison between two of your groups. Normalization on the filtered data would eliminate this difference between libraries, effectively removing the differential expression of interest. In contrast, if you normalize using all loci, the assumption is simply that most of those loci are not DE. This is generally a weaker assumption, and will preserve the systematic DE in the miRNAs if most of the other loci are non-DE. You can see this in your results whereby the vast majority of miRNAs are downregulated upon normalization with more features.

In general, filtering at the start of the analysis should be restricted to the removal of low-abundance genes that could interfere with the various statistical methods. More aggressive filtering should be delayed until the end of the analysis, e.g., during interpretation or just before the BH correction (to reduce its severity by reducing the number of tests). The exception is if you expect the known miRNAs to be subject to different biases from the other features, in which case you may need to filter before normalization.

ADD COMMENT
0
Entering edit mode
Sam • 0
@sam-6849
Last seen 8.6 years ago

Thank you Aaron for your reply.

I tend to agree with your reasoning that initial filtering should be restricted to just the low abundance loci. However, since we specifically looked for them, we found ~94,300 mapped loci in the smallRNA-Seq samples. Only around 2% of these loci fall into the category of micor RNA (known and novel). The rest consists of a lot of uncharacterized regions but with high read counts. Since the primary focus of a small RNA-Seq study is to characterize microRNAs, my general understanding is that people only concentrate on the miRNAs and ignore counts from other regions. Output from programs such as miRDeep2 also only gives read counts mapped to the micro RNAs. 

My intuition says that it is wrong to ignore the wast majority of reads (~98%) even though they are uncharacterized and it is what I understand from your answer!

In summary, is it correct to say that when using edgeR to analyze small RNA-Seq samples, you need to use the data set with all loci (characterized and uncharacterized)  with mapped reads (filtering out low-abundance loci) and doing any filtering at the end?

Thanks,

Sam 

ADD COMMENT
1
Entering edit mode

Well, yes, that is correct; but it's important to understand why. In your case, it's a specific problem to do with the assumptions of normalization. The fact that you're ignoring the vast majority of reads is less of an issue. For example, in ChIP-seq, most of the reads are not used as they lie in background regions of non-specific enrichment, and this doesn't really cause problems.

ADD REPLY

Login before adding your answer.

Traffic: 722 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6