Handling biological replicates across batches in DESeq2
1
1
Entering edit mode
kmuench ▴ 40
@kmuench-9243
Last seen 6.1 years ago
United States

Cross-post w/ Biostars based on commenter suggestion: https://www.biostars.org/p/345751/#345776

Hello,

I am trying to figure out how to handle biological replicates across batches in DESeq2's design matrix. They're not quite technical replicates, but trying to include the sample ID as a factor in the design matrix produces a 'Model matrix not full rank' error. What do you think I should do?

 

# Experiment:

We have three variables of interest in our analysis: Timepoint (A and B), Sex (M or F), and Condition (cond1 or cond2). 

Note that a given sample can only have one timepoint, one sex, and one condition - so it's not like we can split the same tissue and look at it at Timepoint A and B. Each possible Timepoint x Sex x Condition has its own unique samples.

In our first sequencing batch, we collected samples for each possible combination of conditions.  

In our second batch, we took some of the same RNA samples from the first sequencing batch PLUS some new RNA samples, re-generated libraries from all of these, and then sequenced.

In the end, we have a sample table that looks like this:

    SAMPLE | BATCH | CONDITION | SEX | TIMEPOINT
    ------  -------  ---------   ----  ---------
    samp1  |  1    |   cond1   |  F  |    A
    samp1  |  2    |   cond1   |  F  |    A
    samp2  |  1    |   cond1   |  M  |    A
    samp2  |  2    |   cond1   |  M  |    A
    samp3  |  1    |   cond1   |  F  |    A
    samp4  |  2    |   cond1   |  F  |    A

...and so forth.

 

# Problem

I tried to run DESeq with this:

    design(dds) <- ~Genotype + Sex + Condition + SequencingBatch + Sample

and also with this:

    design(dds) <- ~Genotype + Sex + Condition + SequencingBatch_Sample

...where SequencingBatch_Sample is a combination of SequencingBatch and Sample, like "samp1_1", samp1_2" for sample 1/batch1 and sample1/batch2.
 

These designs, of course, throw a 'Matrix not full rank' error, presumably because it is not possible to have the same sample have every possible combination of Condition/Sex/Timepoint, since each sample can only have one quality each of Condition, Sex, and Timepoint.

However, I'm not quite sure what to do instead.

Using collapseReplicates() doesn't feel appropriate because these aren't really technical replicates - a whole new library was produced for the second batch, just using the same RNA.

Doing nothing seems problematic because then the 'duplicated' signal from a subset of samples threatens to overwhelm the differential expression results.

All suggestions are very welcome - thank you!

 

EDIT:

Ideally, I would like to avoid collapsing Condition, Sex, and Timepoint into a single variable (e.g. Condition_Sex_Timepoint with 2 * 2 * 2 = 8 levels) because I am interested in looking at main effects and interaction effects for these variables - i.e., I'd like to understand the main effect of Sex (M vs. F) as well as the effect of Sex:Condition, and it seems harder to tease this apart if I have to do pairwise comparisons between 8 groups, e,g. Condition1+SexF+TimepointA vs. Condition1+SexF+TimepointB, Condition1+SexF+TimepointA vs. Condition1+SexM+TimepointA... etc)

 

deseq2 • 4.0k views
ADD COMMENT
0
Entering edit mode

I’ll take a look, but in the meantime can you post this link on the other thread to complete the cross-link.

ADD REPLY
0
Entering edit mode

Ah thank you for the reminder - cross-link complete.

ADD REPLY
2
Entering edit mode
@mikelove
Last seen 3 hours ago
United States

Can you say more about this:

"Using collapseReplicates() doesn't feel appropriate because these aren't really technical replicates - a whole new library was produced for the second batch, just using the same RNA."

It sounds like a technical replicate to me. The pool of RNA stays the same, you just draw more from it. There should be some variation, from PCR fluctuations, but no biological variation. It is very different than the dispersion you estimate across samples. Can you make a PCA plot and see how far apart these are. If they are very close to each other, I'd suggest collapsing them.

ADD COMMENT
0
Entering edit mode

Sure. My understanding is that technical replicates had to be drawn from the same library preparation/cDNA, not just the same RNA. In my case, a whole new library preparation was prepared in the second batch on the same RNA from the first batch (plus a few extra/new samples thrown in). I imagine that there could be a number of variables affected between the first and second library preps, e.g., we suspect the amplification step was a little less efficient in the second iteration of library prep than the first iteration of library prep, so perhaps the second library cDNA doesn't capture exactly the same composition of mRNAs/number of copies of mRNAs per gene.

There is definitely a batch effect between our first and our second Illumina sequencing runs. To explain this figure I'll add a bit of detail to the example I gave in my original post. We had three different library prep batches (A, B, and D), but only two different Illumina sequencing runs. The first sequencing run sequenced library preps A and B, and the second sequencing run sequenced library prep D. The samples in library prep D consists of a subset of the samples from A and B  - same RNA, new library prep - plus a few extra samples.

As you can see in the above figure, library preps A and B (Illumina Run 1) don't cluster separately, but D (Illumina Run 2) does. I'm guessing either something was special about Library Prep D, or something was weird about Run 2. That said, in the past we haven't seen much variability when re-sequencing the same libraries in subsequent Illumina runs, so I suspect there may be something different about library prep D.

 

As for whether samples are clustering together - it's a little difficult to visualize on the PCA due to the number of points, but below I posted what I see when I make a heatmap of cor( assay(my_vst_normalized_data) ). The IDs on the right hand side represent "[SequencingBatch]_[Sample_ID]", so 1_13_2 represents Sample 13_2 from Illumina Run #1. The summary is that some of the samples do cluster together, despite SequencingBatch (e.g. sample 13_2, in the upper right corner there), but the vast majority do not, and the strong Sequencing Batch seems to prevent samples from clustering together.

 

 

 

Lot of text, lot of pictures - 

TL;DR I think there is a batch effect of Sequencing Run, possibly motivated by slightly different library preparation outcome even though we used the exact same library preparation protocol on the exact same samples. This batch effect means that the repeat samples don't cluster together, and so I'm not sure if collapseReplicates() is appropriate in that case.

ADD REPLY
1
Entering edit mode

I see, yes it seems like there is lots of technical variation across the replicates, even though there is no biological variation that you would expect, is its from the identical sample.

The only approach I can think of, to inform the model that you have technical replicates nested within condition which are not biological replicates, and yet also not similar enough to justify collapsing them, is to use the duplicateCorrelation() approach with limma-voom. So I'd take a look at that framework for approaching this experiment. 

It's not possible to use fixed effects to account for the duplicate nature of technical replicates within a condition, and this is the way that DESeq2 controls for variables. You can do within-individual comparisons for various complex designs, but you can't "control" for the correlations among certain related samples within a condition and then compare directly across condition with our implementation. 

ADD REPLY
0
Entering edit mode

Ah, thank you very much for your reply! I'll propagate this back to Biostars and take a look at limma-voom. I appreciate the insight.

ADD REPLY
0
Entering edit mode

That said: I would be so happy if collapseReplicates() were appropriate in this situation, because the only other alternative I've found is to try to create a custom design matrix with a separate factor for every set of duplicated samples and feed that into the full= flag of DESeq(), which is very large and so far presents a "missing levels" error that is difficult to troubleshoot: https://www.biostars.org/p/205263/

ADD REPLY

Login before adding your answer.

Traffic: 893 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6