Question

between and within subjects design with batch correction

0

Entering edit mode

demisa • 0

@demisa-22970

Last seen 4.6 years ago

United States

Hello,

I have been searching a looot about it, apologies if I have missed a solution.

I am trying to do a paired analysis in DESeq2, meaning having a paired samples design.

I have a dataset with two groups (which is my batch), each group has uneven number of subjects, and each subject can only be in one of the groups and have one or two samples (so one or both conditions). Note that my counts are estimated with kallisto. Just a toy example of how my colData look like:

  sample batch condition subject
1   sample1     1         A      S1
2   sample2     1         B      S1
3   sample3     1         A      S2
4   sample4     1         B      S2
5   sample5     1         A      S3
6   sample6     1         B      S3
7   sample7     1         A      S4
8   sample8     1         B      S4
9   sample9     1         A      S5
10 sample10     1         B      S5
11 sample11     2         A      S6
12 sample12     2         B      S6
13 sample13     2         A      S7
14 sample14     2         B      S7
15 sample15     2         A      S8

In this example, batch == 1 has 5 subjects with both conditions per subject, while batch == 2 has 3 subjects, one of which has only one condition. I simplified with keeping balanced paired samples with respect to the condition, so I filtered out sample15.

My goal is to test the condition effect while controlling for subject effects.

So initially I thought my model would be ~ batch + subject + condition. And the resultName I would look at to see the condition effect (while having controlled for batch and subject effects) is the 'condition_B_vs_A'. This model design leads to the "Model matrix not full rank" error.

dds = DESeqDataSetFromMatrix(countData = counts.mat, colData = sample.summary.balanced, design = ~ batch + subject + condition)
dds <- DESeq(dds)

Error in checkFullRank(modelMatrix) :
  the model matrix is not full rank, so the model cannot be fit as specified.
  One or more variables or interaction terms in the design formula are linear
  combinations of the others and must be removed.

The problem is the linearity of batch and subject. The examples of linearity in the vignette do not exactly match the case here to my understanding.

While I have tried a few thoughts motivated of some conversations here (like my last resort was applying batch correction first, converting to positive integers and then run DESeq2 with ~ subject + condition, which gave zero DEGs), nothing I have tried works.

By the way, if I don't do a paired design and just have ~batch+condition the model works nicely. But I would like to take advantage of the fact that I have both conditions per subject.

Any insight would be greatly appreciated!

RNASeqData BatchEffect paired DESeq2 • 3.8k views

ADD COMMENT • link 4.6 years ago demisa • 0

score 0 · Answer 1 · 2021-04-14

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 4 days ago

United States

This part:

each subject can only be in one of the groups and have one or two samples (so one or both conditions)

Has been discussed before in previous posts. In DESeq2 we use fixed effects, so with a fixed effect for pair and for condition, but limma's duplicateCorrelation approach is more flexible, so you can look into that. It would use all paired or unpaired samples and model the within subject correlations.

Within DESeq2, I would recommend to use the paired samples, and you can use the strategy in the vignette for paired samples nested within groups.

ADD COMMENT • link 4.6 years ago Michael Love 43k

0

Entering edit mode

Thank you Michael for your response.

While I have seen the discussions that recommend limma with duplicateCorrelation, I was wondering if I could do it with DESeq2.

My confusion was in exactly that part of the vignette "Group-specific condition effects, individuals nested within groups" because I thought subject.n was kind of grouping two different subjects - but I have realized now what exactly it does so that's ok. But I am not looking for group-specific (or batch-specific in my case) condition effects but just condition effects while adjusting for group/batch and subject differences. So I tried the following model after I added subject.n variable:

~ batch + batch:subject.n + condition

But let me go one step before, where I got a convergence issue in the simple model, for which I tried two things that didn't resolve the issue. First to split DESeq function into steps and increase maxit, and second to filter even more low expressed genes.

# simplified model for object construction:

dds = DESeqDataSetFromMatrix(countData = counts.mat, colData = sample.summary.balanced, design = ~ subject + condition)
dds <- DESeq(dds)

*estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
21 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest*

# So I repeat by spliting into steps & filter more low expressed genes:

dds = DESeqDataSetFromMatrix(countData = counts.mat, colData = sample.summary.balanced, design = ~ subject + condition)
dds <- estimateSizeFactors(dds)
nc <- counts(dds, normalized=TRUE)
filter <- rowSums(nc >= 10) >= 4
dds <- dds[filter,]
dds <- estimateDispersions(dds)

*gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates*

dds <- nbinomWaldTest(dds, maxit=2000)

*23 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest*


# Without resolving the convergence issue, I tried going forward.

# Added combinatory variable to dds ColData:

dds$subject.n = ifelse(dds$batch == 1, factor(rep(rep(1:batch1.subjects,each=2),2)), factor(rep(rep(1:batch2.subjects,each=2),2)))
dds$subject.n = as.factor(dds$subject.n)

     sample batch condition subject subject.n
1   sample1     1         A      S1         1
2   sample2     1         B      S1         1
3   sample3     1         A      S2         2
4   sample4     1         B      S2         2
5   sample5     1         A      S3         3
6   sample6     1         B      S3         3
7   sample7     1         A      S4         4
8   sample8     1         B      S4         4
9   sample9     1         A      S5         5
10 sample10     1         B      S5         5
11 sample11     2         A      S6         1
12 sample12     2         B      S6         1
13 sample13     2         A      S7         2
14 sample14     2         B      S7         2

# Feed the new colData to DESeqDataSet and edit the design:

dds = DESeqDataSetFromMatrix(countData = counts.mat, colData = colData(dds), design = ~ batch + batch*subject.n + condition)

*converting counts to integer mode
Error in checkFullRank(modelMatrix) :
  the model matrix is not full rank, so the model cannot be fit as specified.
  One or more variables or interaction terms in the design formula are linear
  combinations of the others and must be removed.*

# This is because I have unbalanced data in respect to batch: uneven subjects in the two batches. 
# So I removed the levels without samples, like for example phase2:subject.n3:

m = model.matrix(~batch + batch:subject.n + condition, colData(dds))
all.zero <- apply(m, 2, function(x) all(x==0))
length(all.zero[all.zero == TRUE])
idx <- which(all.zero)
m <- m[,-idx]

# Feed clean model matrix to DESeq:

dds = DESeq(dds, full = m)

*using supplied model matrix
using pre-existing size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
121 rows did not converge in beta, labelled in mcols(object)$betaConv. Use larger maxit argument with nbinomWaldTest*

resultsNames(dds)
 [1] "Intercept"          "batch2"             "conditionB"
 [4] "batch1.subject.n2"  "batch2.subject.n2"  "batch1.subject.n3"
 [7] "batch1.subject.n4"  "batch1.subject.n5"

How can I resolve the convergence issue? I even repeated the separate steps and added the useOptim option:

  dds <- estimateSizeFactors(dds)
  dds <- estimateDispersions(dds)
  dds <- nbinomWaldTest(dds, maxit=2000, useOptim = T)

Also, to get the condition effect correcting for differences in batch and subjects, is it res1 or res2?

res1 = results(dds, contrast = c("condition", "B", "A")) 

res2 = results(dds, contrast=list( c("condition_B",
                              "batch1.subject.n2",
                              "batch2.subject.n2",
                              "batch1.subject.n3",
                              "batch1.subject.n4",
                              "batch1.subject.n5",
                              "batch2") ))

ADD REPLY • link 4.6 years ago demisa • 0

0

Entering edit mode

Jumping to the top, if you don’t want group specific condition effects then just use ~sample + condition. You don’t (and cannot) control at a higher resolution than sample.

ADD REPLY • link 4.6 years ago Michael Love 43k

0

Entering edit mode

If by group-specific we mean batch-specific (in this case), then I surely don't. I just need to correct for the batch.

When you say "You don’t (and cannot) control at a higher resolution than sample", you mean in DESeq2 or in general? If my model is ~sample + condition then how do I take into account that sample1 and sample2 comes from the same subject? Or that sample1 and sample14 are from different batches?

ADD REPLY • link 4.6 years ago demisa • 0

0

Entering edit mode

Sorry, above you have subject. Use ~subject + condition. You don't need to additionally control for anything for within which subject is nested.

ADD REPLY • link 4.6 years ago Michael Love 43k

0

Entering edit mode

mmm I had this thought too, but isn't it between and within subjects? So for the between part, wouldn't I need to correct for batch? What do you mean when you say "within which subject is nested"? Is it nested now that batch:subject.n is not in this model?

Also, still with the ~subject + condition (which is the model I start with above) there are genes that do not converge.

ADD REPLY • link 4.6 years ago demisa • 0

1

Entering edit mode

Subject is within batch -- perhaps discuss with a statistician how this is the case.

Re: convergence, I'd recommend the following:

keep <- rowSums(counts(dds) >= 10) >= x
dds <- dds[keep,]
# then DESeq() below

For x chosen to avoid genes where you have a lot of zeros, maybe 7.

ADD REPLY • link 4.6 years ago Michael Love 43k

0

Entering edit mode

Thank you Michael, everything is clear now.

Indeed the convergence was resolved by raising x = 2/3 of my samples. Thanks a lot!

ADD REPLY • link 4.6 years ago demisa • 0

0

Entering edit mode

If I may ask one more thing, why even though I reach convergence, I find no DEGs?

dds = DESeqDataSetFromTximport(counts, sample.summary.balanced, ~ subject + condition)
min.cutoff = round(nrow(sample.summary.balanced)/1.2)
keep <- rowSums(counts(dds) >= min.cutoff) >= min.cutoff
dds <- dds[keep,]
dds <- DESeq(dds) 
*estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing*

res = results(dds, contrast = c("condition", "B", "A"), pAdjustMethod = "BH", alpha = 0.05)
summary(res)
*out of 10735 with nonzero total read count
adjusted p-value < 0.05
LFC > 0 (up)       : 1, 0.0093%
LFC < 0 (down)     : 0, 0%
outliers [1]       : 0, 0%
low counts [2]     : 0, 0%
(mean count < 145)*

The only way I get some DEGs is if I filter by p-value alone. Not even FC. Here are the ranges of the log2FC, padj and p-values, and a volcano with p-values and log2FC distributions.

summary(res$log2FoldChange)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-0.654707 -0.110581 -0.004492 -0.007384  0.097655  0.668882

summary(res$padj)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.02358 0.69443 0.77172 0.79575 0.89926 0.99999

summary(res$pvalue)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.0000022 0.1736602 0.3860016 0.4302314 0.6745380 0.9999893

The volcano plot with p-values

ADD REPLY • link 4.6 years ago demisa • 0

0

Entering edit mode

There are not large changes in your dataset. You cannot reject the null for these genes.

You may want to discuss with a statistician about powering future datasets.

ADD REPLY • link 4.6 years ago Michael Love 43k

0

Entering edit mode

I totally agree. But when I don't do paired analysis, and I use a simple ~batch+condition model, the power is very good. I filter by FDR and 4FC and get hundreds of DEGs. I wonder why the paired is so poor.

ADD REPLY • link 4.6 years ago demisa • 0

0

Entering edit mode

Sorry, I missed that part in your initial post.

Well I guess you could be on the verge of significance with the 100s, and by adding all the additional covariates for subject terms, you lose degrees of freedom.

I would note that, with accounting for patient baseline, these LFC are pretty small. So I wouldn't push to hard on the data when it's telling you the changes are small.

ADD REPLY • link 4.6 years ago Michael Love 43k

0

Entering edit mode

I understand your point and agree with you.

But when I use 1/10 of my samples for the paired analysis then I get like 1000 DEGs with p0.01 and 4FC thresholds. Still not good enough FDRs though. But I did expect having more samples would increase the power, not the opposite. It feels weird to me. But maybe you are right, maybe adding all these subject terms is minimizing any effect.

Would you think limma would be more appropriate?

I certainly don't want to push too hard the data, just want to make sure I'm doing it the right way..