Question: adjusted p-values and p-values problems!
2.6 years ago by fromhj30410 wrote:

Hi all, I'm working with my RNA-seq raw count data right now. I'm in the process of determining differentially expressed genes in the file, and I used the DESeq2 package in R to do that.

#run the DESeq pipeline
dds <- DESeq(dds)

#remove genes with a total count of 0 or 1 across all samples
dds <- dds[rowSums(counts(dds)) > 1, ]

#get differential expression results

If I use the code above, it gives me a bunch of identical numbers for the adjusted p-value, which I don't think is correct. And since this involves multiple comparisons, I think I need to use adjusted p-values instead of raw p-values. I'm trying to find the genes that have p-values < 0.05.

My question is,

1) In this case, what cutoff value should I use instead of 0.05? Many articles use 0.1, but is there a specific reason for that number?

2) As for why the code above gives me a bunch of identical numbers, is it because of the data size? I have more than 30,000 genes in the file. If so, what is an alternative way to identify differentially expressed genes?

Thanks a lot!

modified 2.6 years ago by Michael Love23k • written 2.6 years ago by fromhj30410
2.6 years ago by Michael Love23k (United States) wrote:

A: All padj values the same in DESeq2 contrast results

You need to filter on adjusted p-values, not p-values, to obtain FDR control.

To answer your other questions, you can choose whatever FDR limit you want. 10% FDR is common because RNA-seq experiments are often exploratory and having 90% true positives in the gene set is ok, if that means we can find hundreds more true positives. Picking 10% for FDR is arbitrary though.
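As a minimal base-R illustration of how the choice of FDR threshold plays out (toy p-values, not this dataset; `p.adjust` with method "BH" is the same Benjamini-Hochberg adjustment DESeq2 uses by default):

```r
# Toy p-values: three clearly small, one borderline, four near-null
p <- c(0.0001, 0.0005, 0.004, 0.03, 0.2, 0.5, 0.8, 0.9)
padj <- p.adjust(p, method = "BH")   # Benjamini-Hochberg adjustment

sum(padj < 0.05)   # genes kept at a 5% FDR threshold
sum(padj < 0.10)   # genes kept at 10% FDR: picks up the borderline one too
```

Relaxing the threshold from 5% to 10% admits the borderline gene at the cost of a higher expected fraction of false positives among the calls.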

To follow up on Mike's answer, what proportion of false positives are you willing to tolerate? How damaging would a false positive be to your follow-up studies? If every follow-up study costs a fortune, then perhaps you would set a very low cut-off, because you don't want to waste time and money on a wild goose chase. On the other hand, if it's cheap to do follow-up work, then you could accept a higher FDR threshold because you don't mind a bit of wasted effort. Most people and studies use a threshold of 0.05 because being wrong 5% of the time seems to be fairly acceptable. Are you willing to be wrong 10% (or more) of the time? That's up to you to judge, as it depends on the situation - but don't try to pull a fast one and pretend that the set of DE genes detected at a FDR threshold of 10% is as reliable as that detected at a threshold of 5%.

Hi Mike, thanks for the reply!

I was simply asked to get DEGs at p-value < 0.05 or < 0.01, without further explanation. Since I have so many genes (30,000), it makes sense that the adjusted p-values would have repeated values. Is it not okay to use raw p-values? My understanding was that depending on how many genes we have, the adjusted p-values will be different (because they're "adjusted"). I understand that it is more reliable when we have multiple comparisons, but what if I reduce the p-value cutoff? So instead of using 0.05 as my p-value cutoff, could I use 0.01?
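On the repeated values: runs of identical adjusted p-values are a normal feature of the Benjamini-Hochberg procedure (its cumulative-minimum step collapses neighbouring values), not a bug by themselves. A tiny base-R sketch with toy numbers:

```r
# Three distinct raw p-values collapse to a single adjusted value under BH:
p <- c(0.001, 0.002, 0.003)
p.adjust(p, method = "BH")   # all three become 0.003
```

So ties alone are expected; it is the *size* of the adjusted values (all ~0.9 in the results shown below) that signals something else is wrong with the comparison.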

FYI, here are the data and my code; dat is the RNA-seq raw count matrix:

KO1 KO2 KO3 KO4  KO5 KO6 Control1 Control2
ENSG00000000003  770 520 650 364 1165 440     1116     1002
ENSG00000000005   23  17   9  15   30  18       31       16
ENSG00000000419 1086 473 800 425 1091 659     1187     1192
ENSG00000000457  401 231 374 156  583 312      457      485
ENSG00000000460  619 364 604 268  795 428      883      684
ENSG00000000938    0   1   0   2    0   3        0        0

>(condition <- factor(c(rep("KO1", 1), rep("KO2", 1), rep("KO3", 1), rep("KO4", 1), rep("KO5", 1), rep("KO6", 1), rep("Control1", 1), rep("Control2", 1))))

>(coldata <- data.frame(row.names=colnames(dat), condition))

>dds <- DESeqDataSetFromMatrix(countData=dat, colData=coldata, design=~condition)

>dds <- DESeq(dds)

>dds<-dds[rowSums(counts(dds))>1,]

> res <- res[order(res$pvalue), ]
> head(res)
log2 fold change (MAP): condition KO6 vs Control1
Wald test p-value: condition KO6 vs Control1
DataFrame with 6 rows and 6 columns
                  baseMean log2FoldChange     lfcSE      stat      pvalue      padj
                 <numeric>      <numeric> <numeric> <numeric>   <numeric> <numeric>
ENSG00000167244 27383.1058      0.7608001 0.2543051  2.991683 0.002774445  0.902056
ENSG00000165805   428.0581      0.8474591 0.2862597  2.960455 0.003071847  0.902056
ENSG00000111676   522.7859      0.8061529 0.2726130  2.957133 0.003105145  0.902056
ENSG00000008441   286.6628      0.8216742 0.2852824  2.880214 0.003974053  0.902056
ENSG00000117450  2178.8648     -0.7492058 0.2633030 -2.845413 0.004435389  0.902056
ENSG00000159216  1038.0489      0.8008571 0.2839419  2.820496 0.004794941  0.902056

I think I have all the information I need to determine DEGs, but I just can't proceed further with those adjusted p-values.

written 2.6 years ago by fromhj30410

Why are you testing only two samples, KO6 vs Control1?

log2 fold change (MAP): condition KO6 vs Control1
Wald test p-value: condition KO6 vs Control1

Remember, software can be smart, but it can't do total guesswork. You need to tell the software what the groups are and which you want to compare. Which groups of samples do you want to compare here?

written 2.6 years ago by Michael Love23k

I want to compare all samples, but when I run results(), only KO6 and Control1 are being compared. I'm not sure what makes it compare only those two. Do I need to add parameters when calling results()?

written 2.6 years ago by fromhj30410

Take a look at our documentation. For example, in the vignette:

vignette("DESeq2")

... we use condition in the design formula. This tells DESeq2 which samples belong to which group. And what does condition look like in colData(dds)? Have you told the software which samples belong to which group? How would you change your code to make colData(dds) look like the example in the documentation?
written 2.6 years ago by Michael Love23k

Hi Mike, I've actually used that documentation when I ran DESeq2 and plotted the PCA plot. My condition is:

(condition <- factor(c(rep("KO1", 1), rep("KO2", 1), rep("KO3", 1), rep("KO4", 1), rep("KO5", 1), rep("KO6", 1), rep("Control1", 1), rep("Control2", 1))))

which are the samples and controls. When I print colData(dds), I get:

DataFrame with 8 rows and 2 columns
         condition sizeFactor
          <factor>  <numeric>
KO1            KO1  1.1721342
KO2            KO2  0.6549836
KO3            KO3  1.0515301
KO4            KO4  0.5569159
KO5            KO5  1.5390775
KO6            KO6  0.8438359
Control1  Control1  1.2993950
Control2  Control2  1.3763345

I'm not really sure if this is what I am supposed to get.

written 2.6 years ago by fromhj30410

To a human, KO1 looks like KO2, etc., and Control1 looks like Control2. But to a computer, KO1 != KO2 != ... != Control1 != Control2. You haven't told DESeq2 which samples belong to which group; you've specified that each sample is its own group. You should make your 'condition' variable look like how we have it in the vignette:

head(colData)
##              condition        type
## treated1fb     treated single-read
## treated2fb     treated  paired-end
## treated3fb     treated  paired-end
## untreated1fb untreated single-read
## untreated2fb untreated single-read
## untreated3fb untreated  paired-end

written 2.6 years ago by Michael Love23k

Then in this case, would this work?

(condition <- factor(c(rep("sample", 6), rep("control", 2))))

written 2.6 years ago by fromhj30410

Yes, exactly. In this way you have specified that the first six samples are in one group and the next two are in another group. You need to make sure this is saved to the dds object before running DESeq():

colData(dds)$condition <- condition
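Putting the fix together, here is a sketch of the corrected setup. The factor construction is base R; the DESeq2 steps are shown as comments because they need the dat count matrix posted above, and the group labels ("KO"/"Control") are illustrative:

```r
# Two biological groups: six knockout replicates vs two controls.
# Putting "Control" first makes it the reference level, so the
# reported contrast reads KO vs Control.
condition <- factor(c(rep("KO", 6), rep("Control", 2)),
                    levels = c("Control", "KO"))

nlevels(condition)   # two groups, as DESeq2 needs for this comparison

# DESeq2 steps (sketch, assuming dat is the raw count matrix above):
# coldata <- data.frame(row.names = colnames(dat), condition)
# dds <- DESeqDataSetFromMatrix(countData = dat, colData = coldata,
#                               design = ~ condition)
# dds <- dds[rowSums(counts(dds)) > 1, ]   # filter before running DESeq()
# dds <- DESeq(dds)
# res <- results(dds)                      # now tests KO vs Control
```

With only one two-level factor, results() no longer has to pick an arbitrary pair of sample-level "groups" to compare.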