Hello,

For the Differential analysis I'm using "limma" R package. I'm using counts data. A matrix "U" with genes as rows and samples as columns. Initially I have done some filtering steps like following:

mycpm <- cpm(U)

thresh <- mycpm > 2

keep <- rowSums(thresh) >= 3

table(keep)

keep

FALSE TRUE

41720 18768

counts.keep <- U[keep,]

dim(counts.keep)

[1] 18768 366

y <- DGEList(counts.keep)

y <- calcNormFactors(y, method = "TMM")

design2 <- model.matrix(~type -1)

head(design2)

GP1n3 GP2

1 1 0

2 1 0

3 1 0

4 1 0

5 1 0

contrast.matrix <- makeContrasts(GP1n3-GP2, levels=design2)

contrast.matrix

Contrasts

Levels GP1n3 - GP2

GP1n3 1

GP2 -1

v <- voom(y, design2, plot=TRUE)

fit <- lmFit(v,design2)

fitC <- contrasts.fit(fit,contrast.matrix)

fitC <- eBayes(fitC,robust=TRUE,trend=TRUE)

tab <- topTable(fitC,adjust="BH",n=Inf)

summary(decideTests(fitC))

GP1n3 - GP2

-1 0

0 18768

1 0

There are no upregulated and downregulated genes between GP1n3 and GP2. But when I looked for significant differential expressed genes:

sig.deg2 = subset(tab, abs(logFC)>0.5 & tab$P.Value < 0.01)

head(sig.deg2)

logFC AveExpr t P.Value adj.P.Val B

ENSG00000092621.10 0.6884128 4.7057954 3.367474 0.0008388354 0.9999277 -1.308562

ENSG00000204610.11 0.6203649 2.5270859 3.050272 0.0024522414 0.9999277 -2.310364

ENSG00000143195.11 0.8738004 0.5791553 3.080063 0.0022253302 0.9999277 -2.828140

ENSG00000204613.9 0.5080240 0.5260141 2.756266 0.0061375019 0.9999277 -3.245375

ENSG00000254521.5 -0.6332099 -1.9619206 -3.072489 0.0022808742 0.9999277 -3.488171

ENSG00000100625.8 0.5970696 -1.3266248 2.713731 0.0069658381 0.9999277 -3.650128

I don't know for every comparison I get adj.P.Val same for all the genes and decideTests give "-1" and "1" as zeros. Am I doing wrong some where? Can anyone help me in this. Thank you !!

Dear Gordon,

Thanks for the reply. As you said I used only eBayes(trend=TRUE). I have done another comparison.

MB1 - MB2

-1 1

0 16904

1 1

sig.deg = subset(tab, abs(logFC)>0.5 & tab$P.Value < 0.01)

head(sig.deg)

logFC AveExpr t P.Value adj.P.Val B

PRR18 0.7536122 5.163854 3.396209 0.0007577785 0.3583424 -0.7996981

DACH2 -0.6016372 0.845996 -3.375142 0.0008163282 0.3583424 -0.8561809

HELLS -0.5986486 5.800163 -3.353323 0.0008813889 0.3583424 -0.9143339

OSBPL6 -0.6416259 5.887980 -3.226050 0.0013674819 0.4128330 -1.2465415

POLQ -0.5932946 5.515293 -3.212282 0.0014328446 0.4131078 -1.2817619

MYH1 0.6999817 1.053430 3.211477 0.0014367504 0.4131078 -1.2838148

Does this mean that this comparison shows only one upregulated and one downregulated genes? May I know one what cutoff this gives up and down regulated info.

And with adj.P.val is same for three genes and two genes. What does this mean?

I must admit that my statistical knowledge is pretty low. If adj.P.val less than 0.05 then those genes are differentially expressed?

There are a few things wrong here.

post hocfiltering on the log-fold change. This will invalidate the p-values. If you have a change threshold below which the DE is uninteresting, use it in`treat`

instead.`decideTests`

).Dear Aaron,

Thanks for the reply. I didn't understand it. Could you please tell me clearly. You mean I need to select siginificant differential genes like following?

sig.deg = subset(tab, abs(logFC)>0.5 & tab$adj.P.Val < 0.05)

What is post hoc filtering on log FC?

No. Don't filter in the log-fold change manually. Use

`treat`

.I'm really not aware about this. Could you please tell me how to give that?

Read the documentation in

`?treat`

.You mean instead of eBayes and topTable like mentioned above in the code, I should use treat?

`fitC <- treat(fitC,lfc=log2(1.1),trend=TRUE, robust=TRUE)`

top_treat <- topTreat(fitC,number=nrow(fitC),adjust.method="fdr")

`Do you think this is right?`

Yes, that's good.

OK, now you have 2 DE genes. The adjusted p-value represents false discovery rate FDR, and you have 2 genes that are significantly DE at FDR < 0.05. To see which genes these are, please use

`head(tab)`

. Don't use`subset`

-- there's just no point in that because topTable has already ranked the most important genes for you.Dear Gordon,

You mean 2 DE genes are the one showing "1" and "-1" with summary(decideTests(fitC)) ? May I know on what cutoff this is decided?

Ok. I'm not using subset now. So, as you said To see the two differential expressed genes I gave

head(tab)

logFC AveExpr t P.Value adj.P.Val B

GINS3 0.4986436 6.588254 4.708722 3.536275e-06 0.03412768 3.342665

C17orf103 -0.4577375 8.099347 -4.680004 4.037345e-06 0.03412768 3.239338

POLE3 0.2610686 10.143267 4.325623 1.961349e-05 0.10254673 2.009803

MNT -0.2703702 8.353955 -4.223649 3.033044e-05 0.10254673 1.671833

SUDS3 0.2435457 9.462002 4.167922 3.835098e-05 0.10254673 1.490172

NHSL1 0.4976037 9.729718 4.165704 3.870899e-05 0.10254673 1.482981

But what does this mean? From this which are differential expressed genes? Here for the first two genes adj.p.values are less than 0.05. could you please tell me clearly. my statistical knowledge is pretty low

For your first question, the documentation is your friend. If you read

`?decideTests`

, it will tell you that the threshold on the adjusted p-value is 0.05.For your second question, look at the all the rows with adjusted p-values below 0.05. This will give you a set of DE genes at a FDR of 5%. In this case, your set of DE genes would contain 2 genes,

GINS3andC17orf103.Thank you very much !! But I see some people selecting DEG's based on logFC and P.value like I mentioned above

sig.deg = subset(tab, abs(logFC)>0.5 & tab$P.Value < 0.05)

So, I'm confused with this. But Gordon said not to use subset.

Those people are misguided. Your

`subset`

approach is incorrect; for starters, there is no correction for multiple testing, but even if there was, filtering on a minimum log-fold change can inflate the actual false discovery rate. This is because the log-fold change statistic does not consider the variability of the data. In the worst case, your log-fold change filter will enrich for false positive genes that have high variances and thus are more likely to achieve large log-fold changes just by chance. This is why we recommend`treat`

.Dear Aaron,

In this post (C: Possible ways of performing differential gene expression and analysis of RNA-Seq) Gordon gave a code for Differential analysis with RSEM values. In that he used a filtering step keeping genes that have about 10 counts or more in atleast 14 samples. Which means there are 14 samples in the smallest "experimental group".

In my case:

table(targets$Sample.type)

MB1 MB2

286 80

So, on what sample number should I filter now?

Do I need to filter like this?

keep <- rowSums(y > log2(11)) >= 80

Your question is unrelated to the current post. Post a new question.

Yes I posted a new question (Filtering step in Differential analysis with RSEM values)