Hi all, I'm working with my RNA-Seq raw count data right now. I'm in the process of determining differentially expressed genes among the file, and I used DESeq2 package for R to do that.
#run the DESeq pipeline
dds <- DESeq(dds)
#delete all the data that have total of 0 counts
dds<-dds[rowSums(counts(dds))>1,]
#get differential expression results
res<-results(dds, pAdjustMethod="BH")
If I use this code above, it gives me bunch of same numbers for the adjusted p-value, which I don't think is correct. And I think since this is a multiple comparison, I need to use adjusted p-values instead of just p-values. I'm trying to find the genes that have p-values<0.05.
My question is,
1) In this case, what cutoff value should I use instead of 0.05? Many articles used 0.1, but is there a specific reason to use that number?
2) As for the reason why the above code have me bunch of same numbers, is it because of the data size? I have more than 30,000 genes in the file. If this is so, then what is the alternative way to identify at differentially expressed genes?
Thanks a lot!
To follow up on Mike's answer, what proportion of false positives are you willing to tolerate? How damaging would a false positive be to your follow-up studies? If every follow-up study costs a fortune, then perhaps you would set a very low cut-off, because you don't want to waste time and money on a wild goose chase. On the other hand, if it's cheap to do follow-up work, then you could accept a higher FDR threshold because you don't mind a bit of wasted effort. Most people and studies use a threshold of 0.05 because being wrong 5% of the time seems to be fairly acceptable. Are you willing to be wrong 10% (or more) of the time? That's up to you to judge, as it depends on the situation - but don't try to pull a fast one and pretend that the set of DE genes detected at a FDR threshold of 10% is as reliable as that detected at a threshold of 5%.
Hi Mike, thanks for the reply!
I was simply asked to get a DEG for p-value < 0.05 or < 0.01. Without further explanation. Since I have so many genes (30,000), it makes sense that adjusted p-value would have repeated values. Is it not okay to use p-values? My understanding was that depending on how many genes we have, the adjusted p-values will be different (because it's "adjusted"). I understand that it is more reliable when we have multiple comparisons, but what if I reduce the p-value cutoff? So instead of using 0.05 as my p-value cutoff, could I use 0.01?
FYI, here is the data and my code: dat is RNA-seq raw count
>head(dat)
KO1 KO2 KO3 KO4 KO5 KO6 Control1 Control2
ENSG00000000003 770 520 650 364 1165 440 1116 1002
ENSG00000000005 23 17 9 15 30 18 31 16
ENSG00000000419 1086 473 800 425 1091 659 1187 1192
ENSG00000000457 401 231 374 156 583 312 457 485
ENSG00000000460 619 364 604 268 795 428 883 684
ENSG00000000938 0 1 0 2 0 3 0 0
>(condition <- factor(c(rep("KO1", 1), rep("KO2", 1), rep("KO3", 1), rep("KO4", 1), rep("KO5", 1), rep("KO6", 1), rep("Control1", 1), rep("Control2", 1))))
>(coldata <- data.frame(row.names=colnames(dat), condition))
>dds <- DESeqDataSetFromMatrix(countData=dat, colData=coldata, design=~condition)
>dds <- DESeq(dds)
>dds<-dds[rowSums(counts(dds))>1,]
>res<-results(dds, pAdjustMethod="fdr")
>res<-res[order(res$pvalue),]
>head(res)
log2 fold change (MAP): condition KO6 vs Control1
Wald test p-value: condition KO6 vs Control1
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000167244 27383.1058 0.7608001 0.2543051 2.991683 0.002774445 0.902056
ENSG00000165805 428.0581 0.8474591 0.2862597 2.960455 0.003071847 0.902056
ENSG00000111676 522.7859 0.8061529 0.2726130 2.957133 0.003105145 0.902056
ENSG00000008441 286.6628 0.8216742 0.2852824 2.880214 0.003974053 0.902056
ENSG00000117450 2178.8648 -0.7492058 0.2633030 -2.845413 0.004435389 0.902056
ENSG00000159216 1038.0489 0.8008571 0.2839419 2.820496 0.004794941 0.902056
I think I have all the information I need to determine DEG, but I just can't process further with those adjusted p-values..
Why are you testing only two samples: KO6 vs Control1?
log2 fold change (MAP): condition KO6 vs Control1
Wald test p-value: condition KO6 vs Control1
Remember, software can be smart, but it can't do total guesswork. You need to tell the software what the groups are and which you want to compare. Which groups of samples do you want to compare here?
I want to compare all samples, but when I run results() only KO6 and control1 are being compared. I'm not sure what makes them compare only those two. Do I need to add parameters when calling results()?
Take a look at our documentation. For example, in the vignette:
... we use condition in the design formula. This tells DESeq2 which samples belong to which group. And what does condition look like in colData(dds)? Have you told the software which samples belong to which group? How would you change your code to make colData(dds) look like the example in the documentation?
Hi Mike, I've actually used that documentation when I used DESeq2 and plotting PCA plot.
My condition is: (condition <- factor(c(rep("KO1", 1), rep("KO2", 1), rep("KO3", 1), rep("KO4", 1), rep("KO5", 1), rep("KO6", 1), rep("Control1", 1), rep("Control2", 1))))
which are the samples and controls. When I put colData(dds), I get:
DataFrame with 8 rows and 2 columns
condition sizeFactor
<factor> <numeric>
KO1 KO1 1.1721342
KO2 KO2 0.6549836
KO3 KO3 1.0515301
KO4 KO4 0.5569159
KO5 KO5 1.5390775
KO6 KO6 0.8438359
Control1 Control1 1.2993950
Control2 Control2 1.3763345
I'm not really sure if this is something I am supposed to get.
To a human, KO1 looks like KO2, etc. and Control1 looks like Control2.
But to a computer KO1 != KO2 != ... != Control1 != Control2. You haven't told DESeq2 which samples belong to which group. You've specified that each sample is its own group. You should make your 'condition' variable look like how we have them in the vignette:
Then in this case would
(condition <- factor(c(rep("sample", 6), rep("control", 2))))
This work?
Yes exactly. In this way you have specified that the first six samples are in one group and the next two are in another group. You need to make sure this is saved to the dds object before running DESeq().