Hi,
I am using DESeq2 v1.22.2 to test for changes in tumors over time, pre- and post-treatment. However, when I use the results function, I get log2FC values on the order of +/- 30, which is of course unreasonably large. If I follow up with lfcShrink and ashr shrinkage, these barely change, while with lfcShrink and normal shrinkage they are shrunken to nearly 0. I understand that "ashr" generally performs better than "normal" and that it preserves large LFCs, but these LFCs are unrealistically large even with ashr. I have two questions:
- Is this expected behavior, or does a log2FC on this scale imply there is something fundamentally wrong with my design or samples?
- If this is expected behavior, should I go with "ashr" or "normal"? I am not using apeglm because I need the contrast argument.
More background:
Not all DEGs have such massive LFCs - the distribution looks almost bimodal, with significant genes having either log2FC > 25 or < 5.
There is considerable variability in tumor type (breast, colorectal, melanoma, etc.), but I am hoping this effect is captured by a paired design. Samples generally cluster on PCA by PatientID, not by time. I have at minimum 2 replicates per timepoint, but usually 3 or more. There is also considerable variability in RNA quality, which I have binned into "high" or "low" (the rbin variable below). My design is thus:
design = ~ PatientID + rbin + time
I then test the effect of time using the contrast argument. I have noticed that the number of these massive-LFC genes is slightly affected by the RIN threshold I use, but nothing seems to eliminate the effect entirely. I have tried removing rbin, modeling RIN as a continuous variable, removing all tumors with low RIN, SVA (with 1, 2, and 3 SVs; a sketch of that attempt is below), filtering samples based on tumor type, and removing the PatientID variable, all with no luck.
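For reference, the SVA attempts followed roughly the pattern in the DESeq2 vignette; a minimal sketch, assuming dds is the DESeqDataSet built in the code further down (svaseq is from the sva package, and n.sv = 2 stands in for the 1-3 SVs I tried):

library(sva)
dat <- counts(dds, normalized = TRUE)
dat <- dat[rowMeans(dat) > 1, ]  # drop near-empty rows before svaseq
mod  <- model.matrix(~ PatientID + rbin + time, colData(dds))  # full model
mod0 <- model.matrix(~ PatientID + rbin, colData(dds))         # null model, without time
svseq <- svaseq(dat, mod, mod0, n.sv = 2)
dds$SV1 <- svseq$sv[, 1]
dds$SV2 <- svseq$sv[, 2]
design(dds) <- ~ SV1 + SV2 + PatientID + rbin + time
dds <- DESeq(dds)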
I apologize if this question has been asked before, and I am happy to provide more info as needed. Here is the Biostars link to my question. Thanks!
-Alex
Edit - more info
colData:
PatientID  disease  time  rbin
PatientA   A        d1    low
PatientA   A        d100  high
PatientA   A        d200  low
PatientB   A        d1    high
PatientB   A        d100  low
PatientC   A        d1    high
PatientC   A        d100  low
PatientC   A        d200  low
PatientD   B        d1    high
PatientD   B        d100  high
PatientD   B        d200  low
PatientE   A        d1    low
PatientE   A        d100  high
PatientE   A        d200  low
PatientF   C        d1    low
PatientF   C        d200  low
PatientG   D        d1    high
PatientG   D        d100  low
PatientG   D        d200  high
Code:
# Load libraries
library(tximport)
library(readr)
library(DESeq2)

# Read in metadata
meta <- read.csv("meta.csv")

# Import salmon quantifications at the gene level
# (quantfiles and tx2gene are built upstream of this snippet)
txi <- tximport(quantfiles,
                type = "salmon",
                tx2gene = tx2gene,
                importer = read_tsv,
                ignoreTxVersion = TRUE,
                countsFromAbundance = "lengthScaledTPM")

# Create DESeqDataSet with a paired design
dds <- DESeqDataSetFromTximport(txi = txi,
                                colData = meta,
                                design = ~ PatientID + rbin + time)

# Run DESeq
dds <- DESeq(dds)

# DGE: d100 vs d1, unshrunken and with two shrinkage estimators
res <- list()
res$d100vd1_noShrink     <- results(dds, contrast = c("time", "d100", "d1"))
res$d100vd1_normalShrink <- lfcShrink(dds, contrast = c("time", "d100", "d1"), type = "normal")
res$d100vd1_ashrShrink   <- lfcShrink(dds, contrast = c("time", "d100", "d1"), type = "ashr")
PCA (color = disease, label = patient):
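Something along these lines reproduces that PCA (plotPCA with returnData = TRUE plus ggplot2 is one way to color by disease and label by patient; a sketch, not my exact plotting code):

library(ggplot2)
vsd <- vst(dds, blind = FALSE)
pcaData <- plotPCA(vsd, intgroup = c("disease", "PatientID"), returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
ggplot(pcaData, aes(PC1, PC2, color = disease, label = PatientID)) +
  geom_text() +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance"))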
MA plots with various shrinkages:
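These can be drawn side by side with DESeq2's plotMA; a sketch, where the ylim is just a guess chosen to accommodate the extreme LFCs:

par(mfrow = c(1, 3))
plotMA(res$d100vd1_noShrink, ylim = c(-30, 30), main = "no shrinkage")
plotMA(res$d100vd1_normalShrink, ylim = c(-30, 30), main = "normal")
plotMA(res$d100vd1_ashrShrink, ylim = c(-30, 30), main = "ashr")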
Hi Michael Love, thanks a ton for your rapid response and for DESeq2 in general - awesome tool!
I should have clarified my questions - I'm aware that normal is not recommended, but it seems to be behaving more realistically here by shrinking the ridiculously large fold changes (on the order of billions on the linear scale), while ashr preserves them. Is that reason enough to use normal over ashr? Alternatively, I could just filter out transcripts with low or zero read counts, which I assume are the source of these massive LFCs.
Thanks again, Alex
Can you try apeglm? Or if you want to use ashr, then yes, filtering out genes that do not have X or more samples with a count of 10 or more should get rid of these genes.
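For example, along these lines (X = 3 is an arbitrary choice here; it should be guided by your smallest group size):

X <- 3  # minimal number of samples required to have a count of 10 or more
keep <- rowSums(counts(dds) >= 10) >= X
dds <- dds[keep, ]
dds <- DESeq(dds)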
I'd like to stick with ashr for now because it works with the contrast argument. However, I will try apeglm and update the post if it gives a significantly different result. Thanks a ton for the advice and code!
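For reference, my plan for the apeglm comparison looks roughly like this: apeglm takes a coef rather than a contrast, so the trick is to relevel time and re-run the Wald test so the d100-vs-d1 coefficient is directly available (a sketch; the coefficient name should be checked with resultsNames):

dds$time <- relevel(dds$time, ref = "d1")
dds <- nbinomWaldTest(dds)
resultsNames(dds)  # expect something like "time_d100_vs_d1"
res$d100vd1_apeglmShrink <- lfcShrink(dds, coef = "time_d100_vs_d1", type = "apeglm")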