Question

Filtering tags in edgeR

0

Entering edit mode

Sam • 0

@sam-6849

Last seen 8.6 years ago

Hello!

we are using the exactTest in edgeR with TMM normalization to compute differential expression for smallRNA-Seq data. We have 3 experimental conditions with 3 replicate samples for each condition. The main focus of the study is to identify deferentially expressed micro RNAs.

The study generates around 94,300 loci with significant read counts. Of these, 373 mapped to known micro RNAs and a further 759 mapped to what we characterize as novel miRNAs.

We performed differential expression analysis separately in edgeR for the 373 known miRNAs, the 1132 (373+759) known and novel micro RNAs and the 94,300 loci which contains everything.

What concerns us is that the differential expression results (Fold change and p-value) for individual loci (miRNA) differ significantly depending on the filtered data set used. For instance (considering Sample-3 vs Sample-1), when only the 373 known miRNAs are used we get 227 down regulated and 140 up-regulated miRNAs (not considering significance p-value). However, none of the 373 loci of known miRNAs are up-regulated when they are analysed with the rest of the loci i.e (94,300 loci) considering only the fold change. All the 140 loci up-regulated in the miRNA only data set come as down-regulated in the large data set. Only one of the 373 loci was up-regulated when they were analysed with the novel microRNAs (the set of 1132 loci).

The reason for this is that the normalized counts differ drastically depending on the size of the data set. The question is whether it is alright to run edgeR with a selected (small) subset of data in a larger data set?

Here are the values for one locus as an example of how the normalized counts vary between the different data sets for the same locus:

Locus	chr18:34957617-34957681
	Normalized Counts "TMM" (CPM)			Raw Counts
	1132 known and novel miRNA	373 known miRNA	All 94,300 Loci	1132 known and novel miRNA	373 known miRNA	All 94,300 Loci
Sample1-Rep1	186.9079706	55.78897073	16.57274643	151	151	151
Sample1-Rep2	49.43554074	23.25649782	3.314549286	22	22	22
Sample1-Rep3	202.4490182	61.33157451	26.51639428	131	131	131
Sample2-Rep1	63.40622621	153.5625048	3.602770963	68	68	68
Sample2-Rep2	44.08028772	100.0256627	1.546789667	39	39	39
Sample2-Rep3	160.6770084	141.2183772	16.57274643	136	136	136
Sample3-Rep1	63.64396141	350.7079945	2.998877925	52	52	52
Sample3-Rep2	44.37166547	351.3239658	1.964177354	44	44	44
Sample3-Rep3	273.4929859	559.9950755	15.08119925	252	252	252


	FC (Sample-2 vs Sample-1)	Pvalue (Sample-2 vs Sample-1)	FC (Sample-3 vs Sample-1)	Pvalue (Sample-3 vs Sample-1)	FC (Sample-3 vs Sample-2)	Pvalue (Sample-3 vs Sample-2)
1132 known and novel miRNA	-1.642028249	0.343610187	-1.153540907	0.788568529	1.42357752	0.49624928
373 known miRNA	2.780493259	0.003097653	8.930405915	2.26E-09	3.20944502	0.00083244
All 94,300 Loci	-2.170906688	0.221123891	-2.305731197	0.18170132	-1.062028778	0.91918073

Thank you very much in advance.

edger • 1.0k views

ADD COMMENT • link 8.6 years ago Sam • 0

0

Entering edit mode

To be clear, you're re-running calcNormFactors after filtering?

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

Yes. that is correct! After filtering we run:

y <- calcNormFactors(y,method="TMM")

y <- estimateCommonDisp(y)

y <- estimateTagwiseDisp(y)

ADD REPLY • link 8.6 years ago Sam • 0

0

Entering edit mode

ADD REPLY • link 8.6 years ago Sam • 0

score 1 · Answer 1 · 2015-09-25

TMM normalization in calcNormFactors assumes that most genes are not DE. If you apply the function on the subset of known miRNAs, then you're assuming that most of the miRNAs are not DE. Normalization will subsequently remove any systematic difference between libraries across those miRNAs. This effectively centers the bulk of miRNAs to have a median log-fold change of zero, such that there willbe comparable numbers of miRNAs that are changing in each direction (i.e., positive and negative log-fold changes).

Of course, the above assumption may not be true. For example, consider a situation where all your miRNAs are downregulated in the comparison between two of your groups. Normalization on the filtered data would eliminate this difference between libraries, effectively removing the differential expression of interest. In contrast, if you normalize using all loci, the assumption is simply that most of those loci are not DE. This is generally a weaker assumption, and will preserve the systematic DE in the miRNAs if most of the other loci are non-DE. You can see this in your results whereby the vast majority of miRNAs are downregulated upon normalization with more features.

In general, filtering at the start of the analysis should be restricted to the removal of low-abundance genes that could interfere with the various statistical methods. More aggressive filtering should be delayed until the end of the analysis, e.g., during interpretation or just before the BH correction (to reduce its severity by reducing the number of tests). The exception is if you expect the known miRNAs to be subject to different biases from the other features, in which case you may need to filter before normalization.

score 0 · Answer 2 · 2015-09-25

Thank you Aaron for your reply.

I tend to agree with your reasoning that initial filtering should be restricted to just the low abundance loci. However, since we specifically looked for them, we found ~94,300 mapped loci in the smallRNA-Seq samples. Only around 2% of these loci fall into the category of micor RNA (known and novel). The rest consists of a lot of uncharacterized regions but with high read counts. Since the primary focus of a small RNA-Seq study is to characterize microRNAs, my general understanding is that people only concentrate on the miRNAs and ignore counts from other regions. Output from programs such as miRDeep2 also only gives read counts mapped to the micro RNAs.

My intuition says that it is wrong to ignore the wast majority of reads (~98%) even though they are uncharacterized and it is what I understand from your answer!

In summary, is it correct to say that when using edgeR to analyze small RNA-Seq samples, you need to use the data set with all loci (characterized and uncharacterized) with mapped reads (filtering out low-abundance loci) and doing any filtering at the end?

Thanks,

Sam