Question

Interactions in batches of varying size with different number of replicates using limma

0

Entering edit mode

mperalc • 0

@martapalcantara-22311

Last seen 4 months ago

United Kingdom

Dear all,

I'm trying to do differential expression analysis using limma with a somewhat complicated design. I have an experiment with three levels of treatment (untreated, and treatment I and II) in over 200 donors (A, B...ZZ) of different genotype each. Due to experimental limitations, the samples have been treated in the same way but in 15 different batches. Each batch includes 8-40 different donors each, and 2 donors are present in every batch (donor A and B), while others are present in a variety of batches (some in 2, some in 3 or 4). So the majority of donors are present only in one batch, and there is not a clear baseline. Each donor has all three levels of treatment in each batch in which it is present. So the metadata looks like this:

                        donor     treatment   batch                       
untreated_A_batch13        A       untreated  batch13
untreated_B_batch17        B       untreated  batch17 
untreated_B_batch2         B       untreated  batch2  
untreated_C_batch3         C       untreated  batch3 
untreated_D_batch6         D       untreated  batch6 
....
II_ZZ_batch15      ZZ        II        batch15
II_ZZ_batch16      ZZ        II        batch16

I wanted to answer two questions:

What is the effect of each treatment, controlling for the donor and batch effects?
Is there an interaction of donor x treatment?

To answer question 1 I am controlling for batch using the blocking method explained on [p. 127][1] of the manual (duplicateCorrelation(), under "18.1.9 Linear modeling") and following the usual voom() -> lmFit() -> eBayes(), and fitting the design excluding batch:

mat = model.matrix(~ 0 + treatment + donor, metadata)

This works fine for the contrasts treatmentI-treatmentuntreated and treatmentII-treatmentuntreated. My only issue is that the adj.P.Val are tiny (and <0.05 in 91% of genes). Is this just due to the amount of replicates (donors) that I have, or should I be worried? There are ~10% with abs(logFC)>1, which is in the line of what I've seen in other experiments.

Now the problem is question 2. I create a new design matrix like this:

mat = model.matrix(~ 0 + donor*treatment, metadata)

The column names of the design matrix look like this if I set donorA and treatment I as reference:

"donorA" "donorB" "donorC" (....) "treatmentII" "treatmentuntreated" "donorB:treatmentII"
"donorC:treatmentII" (...) "donorB:treatmentuntreated" "donorC:treatmentuntreated" (...) "donorZZ:treatmentuntreated"

Because the batch and donor effects are correlated, I expect correcting using duplicateCorrelation() will remove any signal I may have in the donors that are present in only one batch. So because of this and to prevent the matrix not being full rank, I manually remove all columns from the metadata matrix that pertain to donors that are not shared across all batches (so, I remove "donorC" to "donor ZZ", but leave their interaction columns: so for example "donorZZ:treatmentuntreated" would remain.


v = voom(dge, design = mat, plot=TRUE)

cor = duplicateCorrelation(v, design=mat, block = metadata$batch)
#voom weights may have changed:
v = voom(dge, design = mat, plot=TRUE, block = metadata$pool, correlation = cor$consensus)
cor = duplicateCorrelation(v, design = mat, block =  metadata$pool) # extract cor again
cor$consensus

# fit 
fit = lmFit(v, design=mat,block = metadata$pool, correlation  = cor$consensus)

Now I fit the contrast matrix. For every donor in mat above (I abbreviated the code for clarity):

cont = c( paste0("(donor",donor,".treatmentII-donor", donor,".treatmentuntreated)-(treatmentII-treatmentuntreated)-((donorA + donorB)/2)"))
colnames(contrast.matrix) = c(paste0(donor,"_II_interaction"))

contrast.matrix = makeContrasts(contrasts = cont,
                                levels=mat)

The contrast matrix looks like this:


                                   Contrasts
Levels                                        B_II_interaction     C_II_interaction     D_II_interaction
  donorA                                                  -0.5                   -0.5                      -0.5
  donorB                                                  -0.5                   -0.5                      -0.5 
 treatmentII                                              -1                     -1                     -1
  treatmentuntreated                                       1                      1                      1
  donorB.treatmentII                                       1                      0                      0
(...)
donorB.treatmentuntreated                                 -1                      0                      0

Am I correct in my interpretation that treatmentII - treatmentuntreated is the effect of treatment II vs untreated for the average of all donors, and not the 2 donors that I'm retaining in the matrix to estimate the donor effects? What would be then the caveat of including only two donors vs including the other donors that are only partially shared across batches?
Is N_II_interaction = (donorN:treatmentII - donorN:treatmentuntreated)-(treatmentII - treatmentuntreated)-((donorA + donorB)/2) interpretable as the interaction for donor N with respect of the average of donor A and B? I noticed that calculating this for donor B gives more differentially expressed genes than for the other donors, why is this?
Can you suggest a better way of fitting these interactions in an interpretable way for all donors?

I hope I have included all the information needed, let me know if not. Thanks in advance for your advice.

Clarification: This is pseudobulk data from scRNA-seq, so I could potentially create replicates for some donors within each batch and treatment, provided I have enough cells for that donor (some are much more abundant than others).

limma RNASeq DifferentialExpression • 463 views

ADD COMMENT • link 5 months ago • updated 4 months ago mperalc • 0

0

Entering edit mode

Do you observe all three treatments for every donor?

If a donor appears in more than one batch, are all three treatments repeated for that donor in each of the batches?

Are there any repeats of the same donor with the same treatment in the same batch?

ADD REPLY • link 4 months ago Gordon Smyth 51k

0

Entering edit mode

Yes, all three treatments are present for every donor, in all the batches in which that donor appears. There are no repeats of the same donor with the same treatment in the same batch (but I could create some from different cells of the same donor - these is pseudobulk data from scRNA-seq)

ADD REPLY • link 4 months ago mperalc • 0

0

Entering edit mode

I am controlling for batch using the blocking method explained on [p.126] of the manual

I guess you mean using duplicateCorrelation as is done for cell line in Section 18.1.19 of the limma User's Guide. (That's page 127 of the latest User's Guide.)