Problems with p-values and adjusted p-values
fromhj304 ▴ 10
Last seen 5.0 years ago

Hi all, I'm working with RNA-seq raw count data. I'm in the process of determining differentially expressed genes, and I'm using the DESeq2 package in R to do that.

# run the DESeq pipeline
dds <- DESeq(dds)

# drop genes with a total of 0 counts across all samples
dds <- dds[rowSums(counts(dds)) > 0, ]

# get differential expression results
res <- results(dds, pAdjustMethod="BH")

When I run the code above, it gives me a bunch of identical adjusted p-values, which I don't think is correct. Since this involves multiple comparisons, I think I need to use adjusted p-values instead of raw p-values. I'm trying to find the genes with p-values < 0.05.

My question is,

1) In this case, what cutoff value should I use instead of 0.05? Many articles use 0.1, but is there a specific reason for that number?

2) As for why the code above gives me a bunch of identical numbers: is it because of the data size? I have more than 30,000 genes in the file. If so, what is an alternative way to identify differentially expressed genes?


Thanks a lot!



See this post about having repeated values for adjusted p-value:

A: All same padj of contrast results in the DESeq2

You need to filter on adjusted p-values, not p-values, to obtain FDR control.
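As a minimal sketch of what that filtering looks like (using a plain data frame with made-up values; in practice you would coerce the real results with `as.data.frame(res)`):

```r
# Toy results table with made-up values, standing in for as.data.frame(res)
res_df <- data.frame(
  gene   = c("geneA", "geneB", "geneC"),
  pvalue = c(0.001, 0.020, 0.600),
  padj   = c(0.030, 0.120, 0.800)
)

# Filter on the adjusted p-value, not the raw p-value; padj can be NA for
# genes removed by DESeq2's independent filtering, so drop NAs explicitly
sig <- res_df[!is.na(res_df$padj) & res_df$padj < 0.1, ]
```

Here only geneA survives the 10% FDR cutoff, even though geneB's raw p-value of 0.02 would have passed a naive 0.05 filter.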

To answer your other questions: you can choose whatever FDR threshold you want. A 10% FDR is common because RNA-seq experiments are often exploratory, and having 90% true positives in the gene set is acceptable if it means we can find hundreds more true positives. Picking 10% for the FDR is still arbitrary, though.


To follow up on Mike's answer, what proportion of false positives are you willing to tolerate? How damaging would a false positive be to your follow-up studies? If every follow-up study costs a fortune, then perhaps you would set a very low cut-off, because you don't want to waste time and money on a wild goose chase. On the other hand, if it's cheap to do follow-up work, then you could accept a higher FDR threshold because you don't mind a bit of wasted effort. Most people and studies use a threshold of 0.05 because being wrong 5% of the time seems to be fairly acceptable. Are you willing to be wrong 10% (or more) of the time? That's up to you to judge, as it depends on the situation - but don't try to pull a fast one and pretend that the set of DE genes detected at a FDR threshold of 10% is as reliable as that detected at a threshold of 5%.
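Incidentally, the runs of identical adjusted values mentioned above are not a bug but a property of the BH procedure itself, which scales each p-value by n/rank and then takes a running minimum from the largest rank downward. A toy example with base R's `p.adjust` (six made-up p-values, no DESeq2 required):

```r
# Six made-up p-values, already sorted for clarity
p <- c(0.0028, 0.0031, 0.0031, 0.0040, 0.0044, 0.0048)

# BH: p * n / rank, then the cumulative minimum from the largest rank
# down -- that running minimum is what creates long runs of ties
padj <- p.adjust(p, method = "BH")
print(padj)   # all six collapse to the same value, 0.0048
```

With 30,000 tests instead of six, the same step-up mechanics can flatten large stretches of the padj column to a single value, exactly as in the output posted below.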


Hi Mike, thanks for the reply!

I was simply asked to get the DE genes at p-value < 0.05 or < 0.01, without further explanation. Since I have so many genes (30,000), it makes sense that the adjusted p-values would have repeated values. Is it not okay to use raw p-values? My understanding is that the adjusted p-values depend on how many genes we have (because they're "adjusted"). I understand adjustment is more reliable with multiple comparisons, but what if I lower the p-value cutoff? Instead of using 0.05 as my cutoff, could I use 0.01?

FYI, here are the data and my code; dat is the RNA-seq raw count matrix:


                 KO1 KO2 KO3 KO4  KO5 KO6 Control1 Control2
ENSG00000000003  770 520 650 364 1165 440     1116     1002
ENSG00000000005   23  17   9  15   30  18       31       16
ENSG00000000419 1086 473 800 425 1091 659     1187     1192
ENSG00000000457  401 231 374 156  583 312      457      485
ENSG00000000460  619 364 604 268  795 428      883      684
ENSG00000000938    0   1   0   2    0   3        0        0

(condition <- factor(c(rep("KO1", 1), rep("KO2", 1), rep("KO3", 1), rep("KO4", 1), rep("KO5", 1), rep("KO6", 1), rep("Control1", 1), rep("Control2", 1))))

(coldata <- data.frame(row.names=colnames(dat), condition))

dds <- DESeqDataSetFromMatrix(countData=dat, colData=coldata, design=~condition)

dds <- DESeq(dds)

res <- results(dds, pAdjustMethod="fdr")



log2 fold change (MAP): condition KO6 vs Control1
Wald test p-value: condition KO6 vs Control1
DataFrame with 6 rows and 6 columns
                  baseMean log2FoldChange     lfcSE      stat      pvalue      padj
                 <numeric>      <numeric> <numeric> <numeric>   <numeric> <numeric>
ENSG00000167244 27383.1058      0.7608001 0.2543051  2.991683 0.002774445  0.902056
ENSG00000165805   428.0581      0.8474591 0.2862597  2.960455 0.003071847  0.902056
ENSG00000111676   522.7859      0.8061529 0.2726130  2.957133 0.003105145  0.902056
ENSG00000008441   286.6628      0.8216742 0.2852824  2.880214 0.003974053  0.902056
ENSG00000117450  2178.8648     -0.7492058 0.2633030 -2.845413 0.004435389  0.902056
ENSG00000159216  1038.0489      0.8008571 0.2839419  2.820496 0.004794941  0.902056


I think I have all the information I need to determine the DE genes, but I just can't proceed with those adjusted p-values.


Why are you testing only two samples: KO6 vs Control1?

log2 fold change (MAP): condition KO6 vs Control1
Wald test p-value: condition KO6 vs Control1

Remember, software can be smart, but it can't do total guesswork. You need to tell the software what the groups are and which you want to compare. Which groups of samples do you want to compare here?
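For instance, a base R sketch mirroring the setup posted above: tabulating the condition factor shows the problem directly, since every sample sits alone in its own group.

```r
# The condition factor as constructed in the post above:
# one distinct level per sample
condition <- factor(c("KO1", "KO2", "KO3", "KO4", "KO5", "KO6",
                      "Control1", "Control2"))

# Eight levels, each containing exactly one sample -- no replicates
# within any group, so there is nothing meaningful to compare
print(table(condition))
```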


I want to compare all samples, but when I run results(), only KO6 and Control1 are compared. I'm not sure what makes it compare only those two. Do I need to add parameters when calling results()?


Take a look at our documentation. For example, in the vignette:


... we use condition in the design formula. This tells DESeq2 which samples belong to which group. And what does condition look like in colData(dds)? Have you told the software which samples belong to which group? How would you change your code to make colData(dds) look like the example in the documentation?


Hi Mike, I actually used that documentation when I ran DESeq2 and plotted the PCA.

My condition is: (condition <- factor(c(rep("KO1", 1), rep("KO2", 1), rep("KO3", 1), rep("KO4", 1), rep("KO5", 1), rep("KO6", 1), rep("Control1", 1), rep("Control2", 1))))

which are the samples and controls. When I put colData(dds), I get:

DataFrame with 8 rows and 2 columns
         condition sizeFactor
          <factor>  <numeric>
KO1            KO1  1.1721342
KO2            KO2  0.6549836
KO3            KO3  1.0515301
KO4            KO4  0.5569159
KO5            KO5  1.5390775
KO6            KO6  0.8438359
Control1  Control1  1.2993950
Control2  Control2  1.3763345

I'm not really sure if this is what I'm supposed to get.


To a human, KO1 looks like KO2, etc. and Control1 looks like Control2.

But to a computer KO1 != KO2 != ... != Control1 != Control2. You haven't told DESeq2 which samples belong to which group. You've specified that each sample is its own group. You should make your 'condition' variable look like how we have them in the vignette:

##              condition type
## treated1fb   treated   single-read
## treated2fb   treated   paired-end
## treated3fb   treated   paired-end
## untreated1fb untreated single-read
## untreated2fb untreated single-read
## untreated3fb untreated paired-end

Then in this case, would this work?

(condition <- factor(c(rep("sample", 6), rep("control", 2))))


Yes, exactly. This way you have specified that the first six samples are in one group and the last two are in another. You need to make sure this is saved to the dds object before running DESeq():

colData(dds)$condition <- condition
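Putting it together, here is a sketch of the corrected setup (assuming the six knockout columns come before the two control columns, as in the count table posted above; the DESeq2 calls are shown as comments since they need the actual count matrix):

```r
# Two factor levels, one per biological group; control is listed first
# because DESeq2 uses the first level as the reference for fold changes
condition <- factor(c(rep("sample", 6), rep("control", 2)),
                    levels = c("control", "sample"))

coldata <- data.frame(
  row.names = c(paste0("KO", 1:6), "Control1", "Control2"),
  condition = condition
)

# Then rebuild the dataset and rerun (DESeq2 calls, not executed here):
# dds <- DESeqDataSetFromMatrix(countData = dat, colData = coldata,
#                               design = ~condition)
# dds <- DESeq(dds)
# res <- results(dds)   # now tests sample vs control
```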
