How to account for solvent controls in DESeq2 when comparing follicular vs luteal phase?
1
1
Entering edit mode
snijesh ▴ 200
@snijesh-20358
Last seen 1 day ago
India

I have an RNA-seq dataset with 8 samples (genes as rows, samples as columns):

AA - FP SC T1   (Follicular phase, solvent control, replicate 1)
AB - FP T1      (Follicular phase, treated, replicate 1)
AC - FP SC T2   (Follicular phase, solvent control, replicate 2)
AD - FP T2      (Follicular phase, treated, replicate 2)
AG - LP SC T1   (Luteal phase, solvent control, replicate 1)
AE - LP T1      (Luteal phase, treated, replicate 1)
AF - LP SC T2   (Luteal phase, solvent control, replicate 2)
AH - LP T2      (Luteal phase, treated, replicate 2)

My goal is to perform differential expression analysis between FP and LP, but I want to remove the effect of solvent controls so that I capture only the biological differences between the two phases.

What would be the correct way to handle the solvent controls in DESeq2 for this type of design?

Possible approaches I considered

  1. Subtract SC from treated counts within each replicate before running DESeq2 (e.g., AB-AA = FP_T1_corrected).: This reduces to 4 corrected samples: FP_T1, FP_T2, LP_T1, LP_T2. But here I am concerned about whether the subtraction can create negative counts, and only 2 replicates per group remain.

  2. Model SC directly in DESeq2 using an interaction term: Metadata includes phase (FP/LP), condition (SC/Treated), and replicate (T1/T2).

design ~ replicate + phase + condition + phase:condition

Then use the interaction term (phaseLP.conditionTreated) to test whether LP vs FP differences hold after adjusting for SC.

What is the best practice in this situation?

DESeq2 • 407 views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 9 hours ago
United States

Your approach 1 is a bad idea. Subtracting the gene counts won't do what you want to do.

I would personally check to see if there is any evidence that the solvent has an effect on gene expression in the two phases. I sort of doubt it does, but if there are no genes that are differentially expressed between the controls at the two phases, then you can safely ignore the solvent effect. The interaction term is in general much lower powered than a direct contrast, so ideally you would assume that the solvent effects are consistent, and a direct comparison of the treated in each phase will provide those genes that are affected differently in the follicular vs luteal phase.

0
Entering edit mode

I have checked DEGs between Solvent Control (SC) of FP and LP and there are noticebale differences in terms of DEGs.

# Checking SC in FP and LP
# Read featureCounts table
counts = read.table("counts.txt", header=TRUE, sep="\t", row.names=1)

# Metadata
targets = data.frame(ID = c("YHDAA","YHDAB","YHDAC","YHDAD","YHDAE","YHDAF","YHDAG","YHDAH"),
                     Name = c("FP_SC_T1","FP_T1","FP_SC_T2","FP_T2", "LP_SC_T1","LP_T1","LP_SC_T2","LP_T2"),
                     phase = c("FP","FP","FP","FP","LP","LP","LP","LP"),
                     Group = c("FP_SC","FP","FP_SC","FP","LP_SC","LP","LP_SC","LP"),
                     condition = c("SC","Treated","SC","Treated","SC","Treated","SC","Treated"))

# converting 'NA' values to 0
counts[is.na(counts)] <- 0
# make sample_info and counts in same order
counts = counts[, targets$ID]
#generate dds object
dds <- DESeqDataSetFromMatrix(countData = counts, colData = targets, design = ~ Group)

# Remove genes with 0 counts
keep_genes <- rowSums(counts(dds)) > 0
dds <- dds[ keep_genes, ]
# genes to have more than a sum total of 10 reads of support in all samples
dds = dds[rowSums(counts(dds)) > 10, ]

dds = DESeq(dds, betaPrior=FALSE, parallel=TRUE, BPPARAM=SnowParam(5))
res = results(dds, contrast = c("Group", "FP_SC", "LP_SC"), parallel = TRUE)

This has identified quite a good number of DEGs.

Next approach,

#DESeq2 object with interaction design
dds <- DESeqDataSetFromMatrix(countData = counts, colData = targets, design = ~ phase * condition)
# Equivalent to: ~ phase + condition + phase:condition

How do I calculate the res here by negating the SC effect?

res <- results(dds, name = "phaseLP.conditionTreated")

ADD REPLY
0
Entering edit mode

Not sure I understand. You ask how to do something and then show how one would do so.

0
Entering edit mode

I would like to clarify whether the command res <- results(dds, name = "phaseLP.conditionTreated") is the correct approach to generate differentially expressed genes (DEGs) when aiming to account for or negate the effect of the solvent control.

ADD REPLY
0
Entering edit mode

Yes. You can infer that by figuring out what each coefficient measures. Here is your design matrix:

> targets
     ID     Name phase Group condition
1 YHDAA FP_SC_T1    FP FP_SC        SC
2 YHDAB    FP_T1    FP    FP   Treated
3 YHDAC FP_SC_T2    FP FP_SC        SC
4 YHDAD    FP_T2    FP    FP   Treated
5 YHDAE LP_SC_T1    LP LP_SC        SC
6 YHDAF    LP_T1    LP    LP   Treated
7 YHDAG LP_SC_T2    LP LP_SC        SC
8 YHDAH    LP_T2    LP    LP   Treated
> model.matrix(~phase*condition, targets)
  (Intercept) phaseLP conditionTreated phaseLP:conditionTreated
1           1       0                0                        0
2           1       0                1                        0
3           1       0                0                        0
4           1       0                1                        0
5           1       1                0                        0
6           1       1                1                        1
7           1       1                0                        0
8           1       1                1                        1

The easiest (for me) way to identify the coefficients is to start with the coefficient with just one 1 in the row (note that the design matrix defines samples in the rows and coefficients in the columns). Remember that we are solving for y = X_1B_1 + X_2_B_2 + X_3B_3 + X_4B_4, where the 1 and 0 in a given row of the design matrix represent each of the X's in that formula. So all of the coefficients but the first one in the first row get zeroed out, so we know that in row one, it's y = X_1B_1, and from that we know that we are estimating the mean of the FP SC samples.

The second row has two 1's. We know that y in this case is FP treated, and that the first coefficient is FP SC, so we have FP treated = FP SC + B_3, and you can solve the algebra to get B_3 = FP treated - FP SC

The fifth row is an LP SC sample, and the only coefficients are the baseline and phaseLP, so we can again do the algebra, solving LC SC = FP SC + B_2, and getting B_2 = LC SC - FP SC.

The sixth sample is LP Treated, and that row has all four coefficients. So LP Treated = FP SC + (LC SC - FP SC) + (FP treated - FP SC) + B_4 which you can easily solve to get the coefficient.

ADD REPLY

Login before adding your answer.

Traffic: 1096 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6