Question about decideTests function and adj.p.value in "limma" R Package
1
0
Entering edit mode
Biologist ▴ 120
@biologist-9801
Last seen 4.8 years ago

Hello,

For the Differential analysis I'm using "limma" R package. I'm using counts data. A matrix "U" with genes as rows and samples as columns. Initially I have done some filtering steps like following:

mycpm <- cpm(U)
thresh <- mycpm > 2
keep <- rowSums(thresh) >= 3

table(keep)

keep
FALSE  TRUE 
41720 18768 
counts.keep <- U[keep,]

dim(counts.keep)
[1] 18768   366

y <- DGEList(counts.keep)
y <- calcNormFactors(y, method = "TMM")

design2 <- model.matrix(~type -1)

head(design2)
  GP1n3 GP2
1     1   0
2     1   0
3     1   0
4     1   0
5     1   0

contrast.matrix <- makeContrasts(GP1n3-GP2, levels=design2)

contrast.matrix
       Contrasts
Levels  GP1n3 - GP2
  GP1n3           1
  GP2              -1

v <- voom(y, design2, plot=TRUE)

fit <- lmFit(v,design2)

fitC <- contrasts.fit(fit,contrast.matrix)
fitC <- eBayes(fitC,robust=TRUE,trend=TRUE)
tab <- topTable(fitC,adjust="BH",n=Inf)

summary(decideTests(fitC))

   GP1n3 - GP2
-1           0
0        18768
1            0

There are no upregulated and downregulated genes between GP1n3 and GP2. But when I looked for significant differential expressed genes:

sig.deg2 = subset(tab, abs(logFC)>0.5 & tab$P.Value < 0.01)

head(sig.deg2)
                                           logFC        AveExpr         t               P.Value        adj.P.Val         B
ENSG00000092621.10  0.6884128  4.7057954  3.367474 0.0008388354 0.9999277 -1.308562
ENSG00000204610.11  0.6203649  2.5270859  3.050272 0.0024522414 0.9999277 -2.310364
ENSG00000143195.11  0.8738004  0.5791553  3.080063 0.0022253302 0.9999277 -2.828140
ENSG00000204613.9   0.5080240  0.5260141  2.756266 0.0061375019 0.9999277 -3.245375
ENSG00000254521.5  -0.6332099 -1.9619206 -3.072489 0.0022808742 0.9999277 -3.488171
ENSG00000100625.8   0.5970696 -1.3266248  2.713731 0.0069658381 0.9999277 -3.650128

I don't know for every comparison I get adj.P.Val same for all the genes and decideTests give "-1" and "1" as zeros. Am I doing wrong some where? Can anyone help me in this. Thank you !!

limma r differential gene expression • 3.3k views
ADD COMMENT
1
Entering edit mode
@gordon-smyth
Last seen 8 hours ago
WEHI, Melbourne, Australia

You should not use voom() together with eBayes(trend=TRUE). You should use one or the other, but not both (as explained in Chapter 15 of the limma User's Guide).

Apart from that, I don't see any clear problems. You just don't have any differential expression. There's no contradiction in the results.

 

ADD COMMENT
0
Entering edit mode

Dear Gordon,

Thanks for the reply. As you said I used only eBayes(trend=TRUE). I have done another comparison.

       MB1 - MB2
-1         1
0      16904
1          1

sig.deg = subset(tab, abs(logFC)>0.5 & tab$P.Value < 0.01)

head(sig.deg)
                     logFC  AveExpr         t             P.Value          adj.P.Val          B
PRR18   0.7536122 5.163854  3.396209 0.0007577785 0.3583424 -0.7996981
DACH2  -0.6016372 0.845996 -3.375142 0.0008163282 0.3583424 -0.8561809
HELLS  -0.5986486 5.800163 -3.353323 0.0008813889 0.3583424 -0.9143339
OSBPL6 -0.6416259 5.887980 -3.226050 0.0013674819 0.4128330 -1.2465415
POLQ   -0.5932946 5.515293 -3.212282 0.0014328446 0.4131078 -1.2817619
MYH1    0.6999817 1.053430  3.211477 0.0014367504 0.4131078 -1.2838148

Does this mean that this comparison shows only one upregulated and one downregulated genes? May I know one what cutoff this gives up and down regulated info. 

And with adj.P.val is same for three genes and two genes. What does this mean?

I must admit that my statistical knowledge is pretty low. If adj.P.val less than 0.05 then those genes are differentially expressed?

ADD REPLY
0
Entering edit mode

There are a few things wrong here.

  • Don't do post hoc filtering on the log-fold change. This will invalidate the p-values. If you have a  change threshold below which the DE is uninteresting, use it in treat instead.
  • Don't select genes based on the p-value, use the adjusted p-value to correct for multiple testing.
  • Yes, you only have 2 genes that are detected at a FDR of 5% (the default for decideTests).
  • The adjusted p-value can be the same for several genes due to enforced monotonicity in the BH correction, i.e., to ensure that lower p-values result in lower adjusted p-values. This is expected.
  • The idea is to define all genes with adjusted p-values below X% as being DE. Within that set of of DE genes, the expected proportion of false positives should be less than X%.
ADD REPLY
0
Entering edit mode

Dear Aaron,

Thanks for the reply. I didn't understand it. Could you please tell me clearly. You mean I need to select siginificant differential genes like following?

sig.deg = subset(tab, abs(logFC)>0.5 & tab$adj.P.Val < 0.05)

What is post hoc filtering on log FC?

ADD REPLY
0
Entering edit mode

No. Don't filter in the log-fold change manually. Use treat.

ADD REPLY
0
Entering edit mode

I'm really not aware about this. Could you please tell me how to give that? 

ADD REPLY
0
Entering edit mode

Read the documentation in ?treat.

ADD REPLY
0
Entering edit mode

You mean instead of eBayes and topTable like mentioned above in the code, I should use treat?

fitC <- treat(fitC,lfc=log2(1.1),trend=TRUE, robust=TRUE)
top_treat <- topTreat(fitC,number=nrow(fitC),adjust.method="fdr")

Do you think this is right?

 

ADD REPLY
0
Entering edit mode

Yes, that's good.

ADD REPLY
0
Entering edit mode

OK, now you have 2 DE genes. The adjusted p-value represents false discovery rate FDR, and you have 2 genes that are significantly DE at FDR < 0.05. To see which genes these are, please use head(tab). Don't use subset -- there's just no point in that because topTable has already ranked the most important genes for you.

ADD REPLY
0
Entering edit mode

Dear Gordon,

You mean 2 DE genes are the one showing "1" and "-1" with summary(decideTests(fitC)) ? May I know on what cutoff this is decided? 

Ok. I'm not using subset now. So, as you said To see the two differential expressed genes I gave

head(tab)

                       logFC   AveExpr         t              P.Value           adj.P.Val        B
GINS3      0.4986436  6.588254  4.708722 3.536275e-06 0.03412768 3.342665
C17orf103 -0.4577375  8.099347 -4.680004 4.037345e-06 0.03412768 3.239338
POLE3      0.2610686 10.143267  4.325623 1.961349e-05 0.10254673 2.009803
MNT       -0.2703702  8.353955 -4.223649 3.033044e-05 0.10254673 1.671833
SUDS3      0.2435457  9.462002  4.167922 3.835098e-05 0.10254673 1.490172
NHSL1      0.4976037  9.729718  4.165704 3.870899e-05 0.10254673 1.482981

But what does this mean? From this which are differential expressed genes? Here for the first two genes adj.p.values are less than 0.05. could you please tell me clearly. my statistical knowledge is pretty low

ADD REPLY
1
Entering edit mode

For your first question, the documentation is your friend. If you read ?decideTests, it will tell you that the threshold on the adjusted p-value is 0.05.

For your second question, look at the all the rows with adjusted p-values below 0.05. This will give you a set of DE genes at a FDR of 5%. In this case, your set of DE genes would contain 2 genes, GINS3 and C17orf103.

ADD REPLY
0
Entering edit mode

Thank you very much !! But I see some people selecting DEG's based on logFC and P.value like I mentioned above

sig.deg = subset(tab, abs(logFC)>0.5 & tab$P.Value < 0.05)

So, I'm confused with this. But Gordon said not to use subset.

ADD REPLY
2
Entering edit mode

Those people are misguided. Your subset approach is incorrect; for starters, there is no correction for multiple testing, but even if there was, filtering on a minimum log-fold change can inflate the actual false discovery rate. This is because the log-fold change statistic does not consider the variability of the data. In the worst case, your log-fold change filter will enrich for false positive genes that have high variances and thus are more likely to achieve large log-fold changes just by chance. This is why we recommend treat.

ADD REPLY
0
Entering edit mode

Dear Aaron,

In this post (C: Possible ways of performing differential gene expression and analysis of RNA-Seq) Gordon gave a code for Differential analysis with RSEM values. In that he used a filtering step keeping genes that have about 10 counts or more in atleast 14 samples. Which means there are 14 samples in the smallest "experimental group".

In my case:

table(targets$Sample.type)

MB1   MB2 
286     80 

So, on what sample number should I filter now? 

Do I need to filter like this? 

keep <- rowSums(y > log2(11)) >= 80

ADD REPLY
0
Entering edit mode

Your question is unrelated to the current post. Post a new question.

ADD REPLY
0
Entering edit mode
ADD REPLY

Login before adding your answer.

Traffic: 791 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6