Question

Setting up contrasts within a dataset within conditions (across different genotype) and across conditions (for the same genotype)

0

Entering edit mode

nrama • 0

@5e32505d

Last seen 12 months ago

Canada

Hi, I'm using DESeq2 to analyse bulk RNAseq data (new at this, so apologies if this is a naive question).

I have 3 genotypes and 2 conditions (neurons/sorted, or all cells), so 6 groups in total. Per genotype, I have 3 biological replicates, so in total there are 18 samples (9 per condition).

I am interested in comparing in 2 ways: (1) different genotypes within one condition (see what's differentially expressed across genotypes, per condition) (2) same genotypes across different conditions (check to see whether neuronal genes are generally enriched in the "neurons" condition, more of a quality check step)

I used HTSeq to get my counts; and my code looks like this

dir <- "D:/alignments/htseq_counts"
sampleFiles <- grep("_counts",list.files(dir),value=TRUE)
condition <- c('all_cell','all_cell','all_cell','all_cell','all_cell','all_cell','all_cell','all_cell','all_cell', 'sorted', 'sorted', 'sorted ,'sorted' ,'sorted' ,'sorted' ,'sorted' ,'sorted', 'sorted')
genotype <- c('gf', 'gf', 'gf', 'lf', 'lf', 'lf','WT','WT','WT')
group <- c('gf_all','gf_all','gf_all','lf_all','lf_all','lf_all','WT_all','WT_all','WT_all', 'gf_sorted', 'gf_sorted', 'gf_sorted', 'lf_sorted', 'lf_sorted', 'lf_sorted', 'WT_sorted', 'WT_sorted', 'WT_sorted')
number <- c('5','5','5','3','3','3','1','1','1', '6','6','6','4','4','4','2','2','2')

sampleTable <- data.frame(sampleName = sampleFiles,
                      fileName = sampleFiles,
                      condition = condition,
                      genotype = genotype,
                      group = group,
                      number = number,
                      stringsAsFactors = TRUE)


dds.HTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                   directory = dir,
                                   design= ~group)

#Filter low counts 
keep <- rowSums(counts(dds.HTSeq)>=10)>=3 
dds.HTSeq <- dds.HTSeq[keep,]

dds.DESeq2.HTSeq <- DESeq(dds.HTSeq)

I set up the first set of contrasts like this:

#Contrasts condition 1
#1
res_1vs3 <- results(dds.DESeq2.HTSeq, contrast = c("group","WT_all","lf_all"))
#2
res_1vs5 <- results(dds.DESeq2.HTSeq, contrast = c("group","WT_all","gf_all"))
#3
res_3vs5 <- results(dds.DESeq2.HTSeq, contrast = c("group","lf_all","gf_all"))

#Contrasts condition 2
#4
res_2vs4 <- results(dds.DESeq2.HTSeq, contrast = c("group","WT_sorted","lf_sorted"))
#5
res_2vs6 <- results(dds.DESeq2.HTSeq, contrast = c("group","WT_sorted","gf_sorted"))
#6
res_4vs6 <- results(dds.DESeq2.HTSeq, contrast = c("group","lf_sorted", "gf_sorted"))


#Setting criteria for up/down-regulated
#1
res_1vs3.up <- subset(res_1vs3, log2FoldChange > 0 & padj < 0.05)
#2
res_1vs5.up <- subset(res_1vs5, log2FoldChange > 0 & padj < 0.05) 

#......etc for all of the contrasts set up above

#Ordering and annotating 
res_1vs3.up.anno <- res_1vs3.up
res_1vs3.up.anno$symbol <- mapIds(org.Ce.eg.db,
                                  keys=rownames(res_1vs3.up),
                                  column="SYMBOL",
                                  keytype="WORMBASE",
                                  multiVals="first")
res_1vs3.up.anno$geneID <- mapIds(org.Ce.eg.db,
                                  keys=rownames(res_1vs3.up),
                                  column="GENENAME",
                                  keytype="WORMBASE",
                                  multiVals="first")
res_1vs3.up.anno.ordered <- res_1vs3.up.anno[order(res_1vs3.up.anno$padj),]
res_1vs3.up.anno.ordered.df <- as.data.frame(res_1vs3.up.anno.ordered)
write.csv(res_1vs3.up.anno.ordered.df, file = "res_1v3_up.csv")

#...and so on, for the rest of the comparisons.

The code runs fine, but I looked into the list of differentially expressed genes for the first 3 contrasts I set up - but when I looked at the raw counts for the particular genes that showed up, the numbers very low across the genotypes (in the single digits, or sometimes even zero...). I understand I can be more stringent with the filtering but I am confused as to how these made it to the top of the list in the first place.

I did not face this issue for the last 3 sets of contrasts, I could generally eyeball the counts and see some kind of difference at least and the counts were always "decent" (more than 50+ counts per replicate).

Should I be running each of the conditions separately? I'm not sure whether creating the dds with all my samples has somehow interfered with the way the differential expression is calculated. Or am I oversimplifying this by looking at raw counts?

I'm wondering if someone can shed some light on this - perhaps I'm lacking in some understanding of how the statistical models work, so if anybody can provide some clarification I would be very grateful.

Thanks in advance!

Edit: I separated both conditions and ran the analysis again. For one of the contrasts, I initially, I had 28 candidates which seem to have passed the tests for multiple comparison, now I have 10 (there is high overlap between the two lists, so maybe not too much to worry, but there are also some differences). For the other contrast, I only had one candidate before, and I also have one now, but they are different genes (the rest of the padj values are 0.99996). If anybody has any insight on what these differences might be due to, that would be great. Thanks!

DESeq2 • 519 views

ADD COMMENT • link 17 months ago nrama • 0

score 0 · Answer 1 · 2022-12-02

Re: the differences between altogether vs running separate, that is answered in the FAQ in the vignette.

I'd recommend this design:

~sorted + sorted:genotype

And then you can use the following type of code to perform LRT on the sorted-specific genotype effects:

m <- model.matrix(~sorted + sorted:genotype)
dds <- DESeq(dds, full=m, reduced=m[,-idx], test="LRT")

You can use idx to specify the coefficients to test, e.g. the columns representing the lf vs WT, and gf vs WT effects. You can combine them to ask if there is any genotype effect within a sorted group. The LRT tends to be less likely to call positive counts vs 0. Finally, you can use lfcShrink() to robustly estimate these differences, you can set coef to the column number of the model matrix to shrink.