I have an RNA-seq dataset with 8 samples (genes as rows, samples as columns):
AA - FP SC T1 (Follicular phase, solvent control, replicate 1)
AB - FP T1 (Follicular phase, treated, replicate 1)
AC - FP SC T2 (Follicular phase, solvent control, replicate 2)
AD - FP T2 (Follicular phase, treated, replicate 2)
AG - LP SC T1 (Luteal phase, solvent control, replicate 1)
AE - LP T1 (Luteal phase, treated, replicate 1)
AF - LP SC T2 (Luteal phase, solvent control, replicate 2)
AH - LP T2 (Luteal phase, treated, replicate 2)
My goal is to perform differential expression analysis between FP and LP, but I want to remove the effect of solvent controls so that I capture only the biological differences between the two phases.
What would be the correct way to handle the solvent controls in DESeq2 for this type of design?
Possible approaches I considered
Subtract SC from treated counts within each replicate before running DESeq2 (e.g., AB-AA = FP_T1_corrected).: This reduces to 4 corrected samples: FP_T1, FP_T2, LP_T1, LP_T2. But here I am concerned about whether the subtraction can create negative counts, and only 2 replicates per group remain.
Model SC directly in DESeq2 using an interaction term: Metadata includes phase (FP/LP), condition (SC/Treated), and replicate (T1/T2).
design ~ replicate + phase + condition + phase:condition
Then use the interaction term (phaseLP.conditionTreated) to test whether LP vs FP differences hold after adjusting for SC.
What is the best practice in this situation?
I have checked DEGs between Solvent Control (SC) of FP and LP and there are noticebale differences in terms of DEGs.
This has identified quite a good number of DEGs.
Next approach,
How do I calculate the res here by negating the SC effect?
res <- results(dds, name = "phaseLP.conditionTreated")
Not sure I understand. You ask how to do something and then show how one would do so.
I would like to clarify whether the command
res <- results(dds, name = "phaseLP.conditionTreated")
is the correct approach to generate differentially expressed genes (DEGs) when aiming to account for or negate the effect of the solvent control.Yes. You can infer that by figuring out what each coefficient measures. Here is your design matrix:
The easiest (for me) way to identify the coefficients is to start with the coefficient with just one 1 in the row (note that the design matrix defines samples in the rows and coefficients in the columns). Remember that we are solving for
y = X_1B_1 + X_2_B_2 + X_3B_3 + X_4B_4
, where the 1 and 0 in a given row of the design matrix represent each of the X's in that formula. So all of the coefficients but the first one in the first row get zeroed out, so we know that in row one, it'sy = X_1B_1
, and from that we know that we are estimating the mean of the FP SC samples.The second row has two 1's. We know that
y
in this case is FP treated, and that the first coefficient is FP SC, so we haveFP treated = FP SC + B_3
, and you can solve the algebra to getB_3 = FP treated - FP SC
The fifth row is an LP SC sample, and the only coefficients are the baseline and
phaseLP
, so we can again do the algebra, solvingLC SC = FP SC + B_2
, and gettingB_2 = LC SC - FP SC
.The sixth sample is LP Treated, and that row has all four coefficients. So
LP Treated = FP SC + (LC SC - FP SC) + (FP treated - FP SC) + B_4
which you can easily solve to get the coefficient.