Hi, I'm using DESeq2 to analyse bulk RNAseq data (new at this, so apologies if this is a naive question).
I have 3 genotypes and 2 conditions (neurons/sorted, or all cells), so 6 groups in total. Per genotype, I have 3 biological replicates, so in total there are 18 samples (9 per condition).
I am interested in comparing in 2 ways: (1) different genotypes within one condition (see what's differentially expressed across genotypes, per condition) (2) same genotypes across different conditions (check to see whether neuronal genes are generally enriched in the "neurons" condition, more of a quality check step)
I used HTSeq to get my counts; and my code looks like this
dir <- "D:/alignments/htseq_counts"
sampleFiles <- grep("_counts",list.files(dir),value=TRUE)
condition <- c('all_cell','all_cell','all_cell','all_cell','all_cell','all_cell','all_cell','all_cell','all_cell', 'sorted', 'sorted', 'sorted ,'sorted' ,'sorted' ,'sorted' ,'sorted' ,'sorted', 'sorted')
genotype <- c('gf', 'gf', 'gf', 'lf', 'lf', 'lf','WT','WT','WT')
group <- c('gf_all','gf_all','gf_all','lf_all','lf_all','lf_all','WT_all','WT_all','WT_all', 'gf_sorted', 'gf_sorted', 'gf_sorted', 'lf_sorted', 'lf_sorted', 'lf_sorted', 'WT_sorted', 'WT_sorted', 'WT_sorted')
number <- c('5','5','5','3','3','3','1','1','1', '6','6','6','4','4','4','2','2','2')
sampleTable <- data.frame(sampleName = sampleFiles,
fileName = sampleFiles,
condition = condition,
genotype = genotype,
group = group,
number = number,
stringsAsFactors = TRUE)
dds.HTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = dir,
design= ~group)
#Filter low counts
keep <- rowSums(counts(dds.HTSeq)>=10)>=3
dds.HTSeq <- dds.HTSeq[keep,]
dds.DESeq2.HTSeq <- DESeq(dds.HTSeq)
I set up the first set of contrasts like this:
#Contrasts condition 1
#1
res_1vs3 <- results(dds.DESeq2.HTSeq, contrast = c("group","WT_all","lf_all"))
#2
res_1vs5 <- results(dds.DESeq2.HTSeq, contrast = c("group","WT_all","gf_all"))
#3
res_3vs5 <- results(dds.DESeq2.HTSeq, contrast = c("group","lf_all","gf_all"))
#Contrasts condition 2
#4
res_2vs4 <- results(dds.DESeq2.HTSeq, contrast = c("group","WT_sorted","lf_sorted"))
#5
res_2vs6 <- results(dds.DESeq2.HTSeq, contrast = c("group","WT_sorted","gf_sorted"))
#6
res_4vs6 <- results(dds.DESeq2.HTSeq, contrast = c("group","lf_sorted", "gf_sorted"))
#Setting criteria for up/down-regulated
#1
res_1vs3.up <- subset(res_1vs3, log2FoldChange > 0 & padj < 0.05)
#2
res_1vs5.up <- subset(res_1vs5, log2FoldChange > 0 & padj < 0.05)
#......etc for all of the contrasts set up above
#Ordering and annotating
res_1vs3.up.anno <- res_1vs3.up
res_1vs3.up.anno$symbol <- mapIds(org.Ce.eg.db,
keys=rownames(res_1vs3.up),
column="SYMBOL",
keytype="WORMBASE",
multiVals="first")
res_1vs3.up.anno$geneID <- mapIds(org.Ce.eg.db,
keys=rownames(res_1vs3.up),
column="GENENAME",
keytype="WORMBASE",
multiVals="first")
res_1vs3.up.anno.ordered <- res_1vs3.up.anno[order(res_1vs3.up.anno$padj),]
res_1vs3.up.anno.ordered.df <- as.data.frame(res_1vs3.up.anno.ordered)
write.csv(res_1vs3.up.anno.ordered.df, file = "res_1v3_up.csv")
#...and so on, for the rest of the comparisons.
The code runs fine, but I looked into the list of differentially expressed genes for the first 3 contrasts I set up - but when I looked at the raw counts for the particular genes that showed up, the numbers very low across the genotypes (in the single digits, or sometimes even zero...). I understand I can be more stringent with the filtering but I am confused as to how these made it to the top of the list in the first place.
I did not face this issue for the last 3 sets of contrasts, I could generally eyeball the counts and see some kind of difference at least and the counts were always "decent" (more than 50+ counts per replicate).
Should I be running each of the conditions separately? I'm not sure whether creating the dds with all my samples has somehow interfered with the way the differential expression is calculated. Or am I oversimplifying this by looking at raw counts?
I'm wondering if someone can shed some light on this - perhaps I'm lacking in some understanding of how the statistical models work, so if anybody can provide some clarification I would be very grateful.
Thanks in advance!
Edit: I separated both conditions and ran the analysis again. For one of the contrasts, I initially, I had 28 candidates which seem to have passed the tests for multiple comparison, now I have 10 (there is high overlap between the two lists, so maybe not too much to worry, but there are also some differences). For the other contrast, I only had one candidate before, and I also have one now, but they are different genes (the rest of the padj values are 0.99996). If anybody has any insight on what these differences might be due to, that would be great. Thanks!
Thanks very much! I'll give that a try.