Question

Question about decideTests function and adj.p.value in "limma" R Package

0

Entering edit mode

Biologist ▴ 110

@biologist-9801

Last seen 4.1 years ago

Hello,

For the Differential analysis I'm using "limma" R package. I'm using counts data. A matrix "U" with genes as rows and samples as columns. Initially I have done some filtering steps like following:

mycpm <- cpm(U)
thresh <- mycpm > 2
keep <- rowSums(thresh) >= 3

table(keep)

keep
FALSE TRUE
41720 18768
counts.keep <- U[keep,]

dim(counts.keep)
[1] 18768 366

y <- DGEList(counts.keep)
y <- calcNormFactors(y, method = "TMM")

design2 <- model.matrix(~type -1)

head(design2)
GP1n3 GP2
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0

contrast.matrix <- makeContrasts(GP1n3-GP2, levels=design2)

contrast.matrix
Contrasts
Levels GP1n3 - GP2
GP1n3 1
GP2 -1

v <- voom(y, design2, plot=TRUE)

fit <- lmFit(v,design2)

fitC <- contrasts.fit(fit,contrast.matrix)
fitC <- eBayes(fitC,robust=TRUE,trend=TRUE)
tab <- topTable(fitC,adjust="BH",n=Inf)

summary(decideTests(fitC))

GP1n3 - GP2
-1 0
0 18768
1 0

There are no upregulated and downregulated genes between GP1n3 and GP2. But when I looked for significant differential expressed genes:

sig.deg2 = subset(tab, abs(logFC)>0.5 & tab$P.Value < 0.01)

head(sig.deg2)
logFC AveExpr t P.Value adj.P.Val B
ENSG00000092621.10 0.6884128 4.7057954 3.367474 0.0008388354 0.9999277 -1.308562
ENSG00000204610.11 0.6203649 2.5270859 3.050272 0.0024522414 0.9999277 -2.310364
ENSG00000143195.11 0.8738004 0.5791553 3.080063 0.0022253302 0.9999277 -2.828140
ENSG00000204613.9 0.5080240 0.5260141 2.756266 0.0061375019 0.9999277 -3.245375
ENSG00000254521.5 -0.6332099 -1.9619206 -3.072489 0.0022808742 0.9999277 -3.488171
ENSG00000100625.8 0.5970696 -1.3266248 2.713731 0.0069658381 0.9999277 -3.650128

I don't know for every comparison I get adj.P.Val same for all the genes and decideTests give "-1" and "1" as zeros. Am I doing wrong some where? Can anyone help me in this. Thank you !!

limma r differential gene expression • 2.7k views

ADD COMMENT • link updated 6.8 years ago by Gordon Smyth 50k • written 6.8 years ago by Biologist ▴ 110

score 1 · Answer 1 · 2017-07-18

1

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 2 minutes ago

WEHI, Melbourne, Australia

You should not use voom() together with eBayes(trend=TRUE). You should use one or the other, but not both (as explained in Chapter 15 of the limma User's Guide).

Apart from that, I don't see any clear problems. You just don't have any differential expression. There's no contradiction in the results.

ADD COMMENT • link 6.8 years ago Gordon Smyth 50k

0

Entering edit mode

Dear Gordon,

Thanks for the reply. As you said I used only eBayes(trend=TRUE). I have done another comparison.

MB1 - MB2
-1 1
0 16904
1 1

sig.deg = subset(tab, abs(logFC)>0.5 & tab$P.Value < 0.01)

head(sig.deg)
logFC AveExpr t P.Value adj.P.Val B
PRR18 0.7536122 5.163854 3.396209 0.0007577785 0.3583424 -0.7996981
DACH2 -0.6016372 0.845996 -3.375142 0.0008163282 0.3583424 -0.8561809
HELLS -0.5986486 5.800163 -3.353323 0.0008813889 0.3583424 -0.9143339
OSBPL6 -0.6416259 5.887980 -3.226050 0.0013674819 0.4128330 -1.2465415
POLQ -0.5932946 5.515293 -3.212282 0.0014328446 0.4131078 -1.2817619
MYH1 0.6999817 1.053430 3.211477 0.0014367504 0.4131078 -1.2838148

Does this mean that this comparison shows only one upregulated and one downregulated genes? May I know one what cutoff this gives up and down regulated info.

And with adj.P.val is same for three genes and two genes. What does this mean?

I must admit that my statistical knowledge is pretty low. If adj.P.val less than 0.05 then those genes are differentially expressed?

ADD REPLY • link 6.8 years ago Biologist ▴ 110

0

Entering edit mode

There are a few things wrong here.

Don't do post hoc filtering on the log-fold change. This will invalidate the p-values. If you have a change threshold below which the DE is uninteresting, use it in treat instead.
Don't select genes based on the p-value, use the adjusted p-value to correct for multiple testing.
Yes, you only have 2 genes that are detected at a FDR of 5% (the default for decideTests).
The adjusted p-value can be the same for several genes due to enforced monotonicity in the BH correction, i.e., to ensure that lower p-values result in lower adjusted p-values. This is expected.
The idea is to define all genes with adjusted p-values below X% as being DE. Within that set of of DE genes, the expected proportion of false positives should be less than X%.

ADD REPLY • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

Thanks for the reply. I didn't understand it. Could you please tell me clearly. You mean I need to select siginificant differential genes like following?

sig.deg = subset(tab, abs(logFC)>0.5 & tab$adj.P.Val < 0.05)

What is post hoc filtering on log FC?

ADD REPLY • link 6.8 years ago Biologist ▴ 110

0

Entering edit mode

No. Don't filter in the log-fold change manually. Use treat.

ADD REPLY • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

I'm really not aware about this. Could you please tell me how to give that?

ADD REPLY • link 6.8 years ago Biologist ▴ 110

0

Entering edit mode

Read the documentation in ?treat.

ADD REPLY • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

You mean instead of eBayes and topTable like mentioned above in the code, I should use treat?

fitC <- treat(fitC,lfc=log2(1.1),trend=TRUE, robust=TRUE) top_treat <- topTreat(fitC,number=nrow(fitC),adjust.method="fdr")

Do you think this is right?

ADD REPLY • link 6.8 years ago Biologist ▴ 110

0

Entering edit mode

Yes, that's good.

ADD REPLY • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

OK, now you have 2 DE genes. The adjusted p-value represents false discovery rate FDR, and you have 2 genes that are significantly DE at FDR < 0.05. To see which genes these are, please use head(tab). Don't use subset -- there's just no point in that because topTable has already ranked the most important genes for you.

ADD REPLY • link 6.8 years ago Gordon Smyth 50k

0

Entering edit mode

Dear Gordon,

You mean 2 DE genes are the one showing "1" and "-1" with summary(decideTests(fitC)) ? May I know on what cutoff this is decided?

Ok. I'm not using subset now. So, as you said To see the two differential expressed genes I gave

head(tab)

logFC AveExpr t P.Value adj.P.Val B
GINS3 0.4986436 6.588254 4.708722 3.536275e-06 0.03412768 3.342665
C17orf103 -0.4577375 8.099347 -4.680004 4.037345e-06 0.03412768 3.239338
POLE3 0.2610686 10.143267 4.325623 1.961349e-05 0.10254673 2.009803
MNT -0.2703702 8.353955 -4.223649 3.033044e-05 0.10254673 1.671833
SUDS3 0.2435457 9.462002 4.167922 3.835098e-05 0.10254673 1.490172
NHSL1 0.4976037 9.729718 4.165704 3.870899e-05 0.10254673 1.482981

But what does this mean? From this which are differential expressed genes? Here for the first two genes adj.p.values are less than 0.05. could you please tell me clearly. my statistical knowledge is pretty low

ADD REPLY • link 6.8 years ago Biologist ▴ 110

1

Entering edit mode

For your first question, the documentation is your friend. If you read ?decideTests, it will tell you that the threshold on the adjusted p-value is 0.05.

For your second question, look at the all the rows with adjusted p-values below 0.05. This will give you a set of DE genes at a FDR of 5%. In this case, your set of DE genes would contain 2 genes, GINS3 and C17orf103.

ADD REPLY • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you very much !! But I see some people selecting DEG's based on logFC and P.value like I mentioned above

sig.deg = subset(tab, abs(logFC)>0.5 & tab$P.Value < 0.05)

So, I'm confused with this. But Gordon said not to use subset.

ADD REPLY • link 6.8 years ago Biologist ▴ 110

2

Entering edit mode

Those people are misguided. Your subset approach is incorrect; for starters, there is no correction for multiple testing, but even if there was, filtering on a minimum log-fold change can inflate the actual false discovery rate. This is because the log-fold change statistic does not consider the variability of the data. In the worst case, your log-fold change filter will enrich for false positive genes that have high variances and thus are more likely to achieve large log-fold changes just by chance. This is why we recommend treat.

ADD REPLY • link 6.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

In this post (C: Possible ways of performing differential gene expression and analysis of RNA-Seq) Gordon gave a code for Differential analysis with RSEM values. In that he used a filtering step keeping genes that have about 10 counts or more in atleast 14 samples. Which means there are 14 samples in the smallest "experimental group".

In my case:

table(targets$Sample.type)

MB1 MB2
286 80

So, on what sample number should I filter now?

Do I need to filter like this?

keep <- rowSums(y > log2(11)) >= 80

ADD REPLY • link 6.7 years ago Biologist ▴ 110

0

Entering edit mode

Your question is unrelated to the current post. Post a new question.

ADD REPLY • link 6.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Yes I posted a new question (Filtering step in Differential analysis with RSEM values)

ADD REPLY • link 6.7 years ago Biologist ▴ 110