Question

DESeq2: multiple factors, comparison of a main factor between samples with inconsistent magnitudes of gene expression changes

0

Entering edit mode

jropa ▴ 10

@jropa-22911

Last seen 22 months ago

United States

Hello,

I am working on RNA-seq data that comes from different patient derived samples and are then treated with four different possible treatments:

directory <- "human_count_files"
sampleFiles <- grep("counts", list.files(directory), value=TRUE)
sampleTreatment <- factor(rep(c(rep("control",3), rep("A",3), rep("B",3), rep("A_B",3)),3), levels=c("control","A", "B", "A_B"))
samplePatient <- factor(c(rep("p1",12), rep("p2",12), rep("p3",12)), levels=c("c1","c2","c3"))
sampleName <- 1:12
sample_df <- data.frame(sampleName, sampleFiles, sampleTreatment, sampleCord)
sample_df
   sampleName                            sampleFiles sampleTreatment samplePatient
3           1 lib11.Aligned.sortedByCoord.out.counts         control            p1
6           2 lib14.Aligned.sortedByCoord.out.counts               A            p1
9           3 lib17.Aligned.sortedByCoord.out.counts               B            p1
12          4  lib2.Aligned.sortedByCoord.out.counts             A_B            p1
15          5 lib22.Aligned.sortedByCoord.out.counts         control            p2
18          6 lib25.Aligned.sortedByCoord.out.counts               A            p2
21          7 lib28.Aligned.sortedByCoord.out.counts               B            p2
24          8 lib30.Aligned.sortedByCoord.out.counts             A_B            p2
27          9 lib33.Aligned.sortedByCoord.out.counts         control            p3
30         10 lib36.Aligned.sortedByCoord.out.counts               A            p3
33         11  lib6.Aligned.sortedByCoord.out.counts               B            p3
36         12  lib9.Aligned.sortedByCoord.out.counts             A_B            p3

Because of the variability involved in patient samples, I expect that some gene expression changes after a given treatment, while they may be changing in the same direction (i.e. they are up-regulated in all treatments vs controls within a patient sample) are not going to be consistent in the magnitude of change from patient sample to patient sample. Indeed, when I look at my normalized counts data I see something like the following:

dds <- DESeqDataSetFromHTSeqCount(sampleTable=sample_df, directory=directory, design= ~samplePatient + sampleTreatment)
dds <- estimateSizeFactors(dds)
counts <- data.frame(counts(dds, normalized=TRUE))
colnames(counts) <- paste0(dds$samplePatient,"_",dds$sampleTreatment)
counts[5613,]
counts[5613,]
                   p1_control     p1_A     p1_B   p1_A_B p2_control     p2_A     p2_B   p2_A_B p3_control    p3_A     p3_B   p3_A_B
ENSG00000125124.12   398.8299 949.9343 575.5421 1573.744    514.532 5442.532 930.1038 1558.219   1117.352 1952.98 1701.638 1966.304

Now looking at treatment A vs control in the context of each patient separately, you can see that there are approximately 2.5-fold, 10-fold, and 2-fold up regulations from control to A if you look at the patients individually. However, because of the inconsistency in the increase, this gene is not considered significantly changed with an alpha value of 0.1 when I perform the DESeq analysis using design= ~samplePatient + sampleTreatment and results(dds, contrast = c("sampleTreatment", "A", "control"). I know that this particular design is meant to model batch effect for the patients, and that to fit this model well, the changes in genes should be consistent. However, how might I find genes such as the one I've shown above, where the change might not be to the same magnitude consistently, but they are always changing in the same direction and seem to be "big enough" changes to warrant follow-up (I know that "big" isn't very specific, but hopefully it gets the point across).

I'm looking for something similar to This previous question. However, I attempted to do as the poster here did, which was to combine the patient and treatment factors and then look at the average of the effect of A vs control in each patient. When I do this, I receive an error:

sample_df$combine <- factor(paste0(sample_df$samplePatient, "_", sample_df$sampleTreatment))
dds <- DESeqDataSetFromHTSeqCount(sampleTable=sample_df, directory=directory, design= ~combine)
dds<- DESeq(dds)
estimating size factors
estimating dispersions
Error in checkForExperimentalReplicates(object, modelMatrix) : 

  The design matrix has the same number of samples and coefficients to fit,
  so estimation of dispersion is not possible. Treating samples
  as replicates was deprecated in v1.20 and no longer supported since v1.22.

Did I misunderstand how the previous poster set up their analysis, or is this an approach that isn't possible with DESeq2 since the release of v1.22? If this doesn't work anymore, is there another approach you would recommend?

Thanks, Jim

deseq2 • 401 views

ADD COMMENT • link updated 4.9 years ago by Michael Love 43k • written 4.9 years ago by jropa ▴ 10

score 0 · Answer 1 · 2020-02-14

I'm not sure I have a good answer for you. If you had more samples, I'd recommend to use swish which has a paired DE mode that looks for consistent sign of LFC while taking into account the uncertainty in the count data. It works really well for such a case, where it doesn't care about the inconsistency in the within sample LFC itself as long as it is all positive or negative. But you only have 3 patients, so you have very little power. With so few replicates, you're left with parametric modeling, where you need the effects to be consistent, and here they are not. Maybe instead of relying on p-values you could use LFC shrinkage and look for genes where the shrunken LFC is likely not of the wrong sign (s-values). See the vignette for more details on this.