Hello,
For the Differential analysis I'm using "limma" R package. I'm using counts data. A matrix "U" with genes as rows and samples as columns. Initially I have done some filtering steps like following:
mycpm <- cpm(U)
thresh <- mycpm > 2
keep <- rowSums(thresh) >= 3
table(keep)
keep
FALSE TRUE
41720 18768
counts.keep <- U[keep,]
dim(counts.keep)
[1] 18768 366
y <- DGEList(counts.keep)
y <- calcNormFactors(y, method = "TMM")
design2 <- model.matrix(~type -1)
head(design2)
GP1n3 GP2
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
contrast.matrix <- makeContrasts(GP1n3-GP2, levels=design2)
contrast.matrix
Contrasts
Levels GP1n3 - GP2
GP1n3 1
GP2 -1
v <- voom(y, design2, plot=TRUE)
fit <- lmFit(v,design2)
fitC <- contrasts.fit(fit,contrast.matrix)
fitC <- eBayes(fitC,robust=TRUE,trend=TRUE)
tab <- topTable(fitC,adjust="BH",n=Inf)
summary(decideTests(fitC))
GP1n3 - GP2
-1 0
0 18768
1 0
There are no upregulated and downregulated genes between GP1n3 and GP2. But when I looked for significant differential expressed genes:
sig.deg2 = subset(tab, abs(logFC)>0.5 & tab$P.Value < 0.01)
head(sig.deg2)
logFC AveExpr t P.Value adj.P.Val B
ENSG00000092621.10 0.6884128 4.7057954 3.367474 0.0008388354 0.9999277 -1.308562
ENSG00000204610.11 0.6203649 2.5270859 3.050272 0.0024522414 0.9999277 -2.310364
ENSG00000143195.11 0.8738004 0.5791553 3.080063 0.0022253302 0.9999277 -2.828140
ENSG00000204613.9 0.5080240 0.5260141 2.756266 0.0061375019 0.9999277 -3.245375
ENSG00000254521.5 -0.6332099 -1.9619206 -3.072489 0.0022808742 0.9999277 -3.488171
ENSG00000100625.8 0.5970696 -1.3266248 2.713731 0.0069658381 0.9999277 -3.650128
I don't know for every comparison I get adj.P.Val same for all the genes and decideTests give "-1" and "1" as zeros. Am I doing wrong some where? Can anyone help me in this. Thank you !!
Dear Gordon,
Thanks for the reply. As you said I used only eBayes(trend=TRUE). I have done another comparison.
MB1 - MB2
-1 1
0 16904
1 1
sig.deg = subset(tab, abs(logFC)>0.5 & tab$P.Value < 0.01)
head(sig.deg)
logFC AveExpr t P.Value adj.P.Val B
PRR18 0.7536122 5.163854 3.396209 0.0007577785 0.3583424 -0.7996981
DACH2 -0.6016372 0.845996 -3.375142 0.0008163282 0.3583424 -0.8561809
HELLS -0.5986486 5.800163 -3.353323 0.0008813889 0.3583424 -0.9143339
OSBPL6 -0.6416259 5.887980 -3.226050 0.0013674819 0.4128330 -1.2465415
POLQ -0.5932946 5.515293 -3.212282 0.0014328446 0.4131078 -1.2817619
MYH1 0.6999817 1.053430 3.211477 0.0014367504 0.4131078 -1.2838148
Does this mean that this comparison shows only one upregulated and one downregulated genes? May I know one what cutoff this gives up and down regulated info.
And with adj.P.val is same for three genes and two genes. What does this mean?
I must admit that my statistical knowledge is pretty low. If adj.P.val less than 0.05 then those genes are differentially expressed?
There are a few things wrong here.
treat
instead.decideTests
).Dear Aaron,
Thanks for the reply. I didn't understand it. Could you please tell me clearly. You mean I need to select siginificant differential genes like following?
sig.deg = subset(tab, abs(logFC)>0.5 & tab$adj.P.Val < 0.05)
What is post hoc filtering on log FC?
No. Don't filter in the log-fold change manually. Use
treat
.I'm really not aware about this. Could you please tell me how to give that?
Read the documentation in
?treat
.You mean instead of eBayes and topTable like mentioned above in the code, I should use treat?
fitC <- treat(fitC,lfc=log2(1.1),trend=TRUE, robust=TRUE)
top_treat <- topTreat(fitC,number=nrow(fitC),adjust.method="fdr")
Do you think this is right?
Yes, that's good.
OK, now you have 2 DE genes. The adjusted p-value represents false discovery rate FDR, and you have 2 genes that are significantly DE at FDR < 0.05. To see which genes these are, please use
head(tab)
. Don't usesubset
-- there's just no point in that because topTable has already ranked the most important genes for you.Dear Gordon,
You mean 2 DE genes are the one showing "1" and "-1" with summary(decideTests(fitC)) ? May I know on what cutoff this is decided?
Ok. I'm not using subset now. So, as you said To see the two differential expressed genes I gave
head(tab)
logFC AveExpr t P.Value adj.P.Val B
GINS3 0.4986436 6.588254 4.708722 3.536275e-06 0.03412768 3.342665
C17orf103 -0.4577375 8.099347 -4.680004 4.037345e-06 0.03412768 3.239338
POLE3 0.2610686 10.143267 4.325623 1.961349e-05 0.10254673 2.009803
MNT -0.2703702 8.353955 -4.223649 3.033044e-05 0.10254673 1.671833
SUDS3 0.2435457 9.462002 4.167922 3.835098e-05 0.10254673 1.490172
NHSL1 0.4976037 9.729718 4.165704 3.870899e-05 0.10254673 1.482981
But what does this mean? From this which are differential expressed genes? Here for the first two genes adj.p.values are less than 0.05. could you please tell me clearly. my statistical knowledge is pretty low
For your first question, the documentation is your friend. If you read
?decideTests
, it will tell you that the threshold on the adjusted p-value is 0.05.For your second question, look at the all the rows with adjusted p-values below 0.05. This will give you a set of DE genes at a FDR of 5%. In this case, your set of DE genes would contain 2 genes, GINS3 and C17orf103.
Thank you very much !! But I see some people selecting DEG's based on logFC and P.value like I mentioned above
sig.deg = subset(tab, abs(logFC)>0.5 & tab$P.Value < 0.05)
So, I'm confused with this. But Gordon said not to use subset.
Those people are misguided. Your
subset
approach is incorrect; for starters, there is no correction for multiple testing, but even if there was, filtering on a minimum log-fold change can inflate the actual false discovery rate. This is because the log-fold change statistic does not consider the variability of the data. In the worst case, your log-fold change filter will enrich for false positive genes that have high variances and thus are more likely to achieve large log-fold changes just by chance. This is why we recommendtreat
.Dear Aaron,
In this post (C: Possible ways of performing differential gene expression and analysis of RNA-Seq) Gordon gave a code for Differential analysis with RSEM values. In that he used a filtering step keeping genes that have about 10 counts or more in atleast 14 samples. Which means there are 14 samples in the smallest "experimental group".
In my case:
table(targets$Sample.type)
MB1 MB2
286 80
So, on what sample number should I filter now?
Do I need to filter like this?
keep <- rowSums(y > log2(11)) >= 80
Your question is unrelated to the current post. Post a new question.
Yes I posted a new question (Filtering step in Differential analysis with RSEM values)