Dear All,
I have an RNA seq experiment with 5 different cell types (=my 5 conditions are the 5 cell types), and I would like to identify those genes that show differential expression between at least 2 cell types (so basically like performing an ANOVA). I am a medical doctor and new to R with very basic knowledge in statistics so I would like to be sure that my code is correct from a statistical perspective.
Between 5 cell types there are 10 possible comparisons, so my idea is to run these 10 comparisons, combine the gene lists from the comparisons, and then delete the duplicates to get the list of genes that show statistical difference between at least 2 cell types. Using this gene list, I can create an expression profile containing the expression values of these genes in the different cell types that I can use for clustering the genes based on their expression profile in the cell types etc. I will also have the adjusted p-values and fold changes for every comparison in separate tables. Here is an example:
dds <- DESeqDataSetFromMatrix(countData = countData,
colData = colData,
design = ~ condition)
colData(dds)
dds$condition <- relevel(dds$condition, "cell_type_A")
Question 1: Is "relevel" required here? I do not have a reference or control cell type/condition. What difference does it make if I do not perform "relevel"?
Then the multiple comparisons:
dds <- DESeq(dds)
res <- results(dds, contrast=list("conditioncell_type_A","conditioncell_type_B"))
resOrdered <- res[order(res$padj),]
resSigTmp <- subset(resOrdered, padj < 0.01)
resSig <- subset(resSigTmp, log2FoldChange < -1 | log2FoldChange > 1)
DEtable_1<-as.data.frame(resSig[order(resSig$log2FoldChange,decreasing=TRUE),])
res <- results(dds, contrast=list("conditioncell_type_B","conditioncell_type_C"))
resOrdered <- res[order(res$padj),]
resSigTmp <- subset(resOrdered, padj < 0.01)
resSig <- subset(resSigTmp, log2FoldChange < -1 | log2FoldChange > 1)
DEtable_2<-as.data.frame(resSig[order(resSig$log2FoldChange,decreasing=TRUE),])
...and so on, until creating DEtable_10.
Collecting rownames:
allDEgenes <- c(rownames(DEtable_1), rownames(DEtable_2), ..... , rownames(DEtable_10))'
Removing duplicates:
allDEgenes_unique<-unique(allDEgenes)'
Question 2: Is this method statistically correct? Will my adjusted p-values be correct? Do they require further correction considering the 10 comparisons I perform? (I am sorry, these must be silly questions.)
Question 3: Is there a better way of performing this analysis using DEseq2 or some other package?
Question 4: My "coldata" table must have 2 variables, but I only have one: cell type. Since I will only use this variable in my contrast argument, I can put anything as a second variable just to have one, it does not really matter, right?
Thank you for your help!
Best,
Mate
Hi Michael,
Thank you for taking time to answer my questions.
Does that mean that the method described above is not correct and LRT should be used in this case? Will combining 10 comparisons yield too many false positives?
Thanks,
Mate