10 multiple comparisons in DESeq2
2
0
Entering edit mode
kissmatee • 0
@kissmatee-10895
Last seen 5.9 years ago

Dear All,

I have an RNA seq experiment with 5 different cell types (=my 5 conditions are the 5 cell types), and I would like to identify those genes that show differential expression between at least 2 cell types (so basically like performing an ANOVA). I am a medical doctor and new to R with very basic knowledge in statistics so I would like to be sure that my code is correct from a statistical perspective.

Between 5 cell types there are 10 possible comparisons, so my idea is to run these 10 comparisons, combine the gene lists from the comparisons, and then delete the duplicates to get the list of genes that show statistical difference between at least 2 cell types. Using this gene list, I can create an expression profile containing the expression values of these genes in the different cell types that I can use for clustering the genes based on their expression profile in the cell types etc. I will also have the adjusted p-values and fold changes for every comparison in separate tables. Here is an example:

dds <- DESeqDataSetFromMatrix(countData = countData,

    colData = colData,     design = ~ condition)

colData(dds)

dds$condition <- relevel(dds$condition, "cell_type_A")

Question 1: Is "relevel" required here? I do not have a reference or control cell type/condition. What difference does it make if I do not perform "relevel"?

Then the multiple comparisons:

dds <- DESeq(dds)

res <- results(dds, contrast=list("conditioncell_type_A","conditioncell_type_B")) resOrdered <- res[order(res$padj),] resSigTmp <- subset(resOrdered, padj < 0.01) resSig <- subset(resSigTmp, log2FoldChange < -1 | log2FoldChange > 1) DEtable_1<-as.data.frame(resSig[order(resSig$log2FoldChange,decreasing=TRUE),])

res <- results(dds, contrast=list("conditioncell_type_B","conditioncell_type_C")) resOrdered <- res[order(res$padj),] resSigTmp <- subset(resOrdered, padj < 0.01) resSig <- subset(resSigTmp, log2FoldChange < -1 | log2FoldChange > 1) DEtable_2<-as.data.frame(resSig[order(resSig$log2FoldChange,decreasing=TRUE),])

...and so on, until creating DEtable_10.

Collecting rownames:

allDEgenes <- c(rownames(DEtable_1), rownames(DEtable_2), ..... , rownames(DEtable_10))'

Removing duplicates:

allDEgenes_unique<-unique(allDEgenes)'

Question 2: Is this method statistically correct? Will my adjusted p-values be correct? Do they require further correction considering the 10 comparisons I perform? (I am sorry, these must be silly questions.)

Question 3: Is there a better way of performing this analysis using DEseq2 or some other package?

Question 4: My "coldata" table must have 2 variables, but I only have one: cell type. Since I will only use this variable in my contrast argument, I can put anything as a second variable just to have one, it does not really matter, right?

Best,

Mate

deseq2 rnaseq statistics • 4.1k views
0
Entering edit mode
@ben-mansfeld-10812
Last seen 5.9 years ago
Michigan State University

Q1: I think Relevel-ing is good so that you are at least certain what your base level is. Since you are using specific contrast that might not be an issue. If you were to just ask for results(dds), then making sure you knew what your base level was would be important. I may be wrong about this though.

Q2: the pvals should be correct for that specific contrast.

Q3: maybe some sort of k-means clustering?

Q4: why does coldata have to have 2 variables?

from DESeq help:

colData
for matrix input: a DataFrame or data.frame with at least a single column. Rows of colData correspond to columns of countData

Hope this helps a little....

-Ben

0
Entering edit mode
@mikelove
Last seen 15 hours ago
United States

You can perform a likelihood ratio test (as is done in ANOVA). See the DESeq2 vignette section on "likelihood ratio tests" and the section in ?results about likelihood ratio tests as well.

dds <- DESeq(dds, test="LRT", reduced=~1)
res <- results(dds)

Other Qs:

No you don't have to relevel. This is only if you plan to use results() with no arguments.

colData does not have to have 2 variables

0
Entering edit mode

Hi Michael,

Thank you for taking time to answer my questions.

Does that mean that the method described above is not correct and LRT should be used in this case? Will combining 10 comparisons yield too many false positives?

Thanks,

Mate

0
Entering edit mode
Yes. It's opportunistic you could say. You're not correcting for doing all these extra comparisons. And the LRT gives you the answer - - any differences among the five - - using a single test which is a cleaner approach.