Question: Batch effects in limma
0
3.5 years ago by
smt8n0
smt8n0 wrote:

Dear all,

I realize most facets of my question have been discussed here, but I am having hard time pulling all the answers together and generalizing.

I have 36 iPSC, 9 primary CM, and 18 biopsy samples. My quality control plots showed batch effects in the data and I added 3 batches to the design matrix, using, e.g. this:

A: Limma, blocking batch effect

discussion (and reviewing others on the topic)

After I added batches to the design matrix and got the "Coefficients not estimable" message, I found an answer from Gordon Smyth: "You can't expect to be able to estimate more things than your experiment contains information about.  In this case, your batches are entirely confounded with patients+treatment, and hence you can't estimate coefficients for them." in one of the posts (and the differential expression was the same with and without the "+batches" in the matrix).

There are three questions I have here:

1) Does it mean that if I have ALL (36+9+18) samples in the design matrix, adding "batches" explicitly is introducing too many variables and there is no way to explicitly include batch effects?

2) In this case, does the regression on the samples implicitly take care of the batch effect?

3a) (if the answer to (2) is  "Yes" ) Since there are a lot of discussions on batch effect inclusion, it should be needed for something.  In what cases is it needed?

3b) (if the answer to (2) is "No") What can I do?

Thank you. Any help is greatly appreciated.

Slava

Thank you for the comment. Added upon request:

the sample table is the following:

<caption>Sample table</caption>
 iPSC0 iPSC3 iPSC7 iPSC10 iPSC14 iPSC20 iPSC28 iPSC35 iPSC45 iPSC60 iPSC90 iPSC120 Prim Fetal LV Heart1

Sorry, I lumped biological replicates together, but the table still did not fit. There should be 2 more columns: "Heart2" and "Heart diseased"

There are 3 biological replicates for all iPSC entries, 9 replicates for "Prim" entry, 6 replicates for "Fetal" and 3 replicates for "LV", "Heart1", "Heart2", and "Heart diseased".

The code that I used to generate the original (before attempting a batch correction) design matrix:

    RNAs <- factor(c("iPSC0","iPSC0","iPSC0",
"iPSC3","iPSC3","iPSC3",
"iPSC7","iPSC7","iPSC7",
"iPSC10","iPSC10","iPSC10",
"iPSC14","iPSC14","iPSC14",
"iPSC20","iPSC20","iPSC20",
"iPSC28","iPSC28","iPSC28",
"iPSC35","iPSC35","iPSC35",
"iPSC45","iPSC45","iPSC45",
"iPSC60","iPSC60","iPSC60",
"iPSC90","iPSC90","iPSC90",
"iPSC120","iPSC120","iPSC120",
"Prim","Prim","Prim","Prim","Prim","Prim","Prim","Prim","Prim",
"Fetal","Fetal","Fetal","Fetal","Fetal","Fetal",
"Heart1","Heart1","Heart1",
"LV","LV","LV",
"HeartDis","HeartDis","HeartDis",
"Heart2","Heart2","Heart2")
)
###    print(RNAs)
design <- model.matrix(~0 + RNAs)
colnames(design) <- levels(RNAs)

My attempt to account for batches (gse is an eset read from the series_matrix file with getGEO()) :

    colstring <- substring(pData(gse)[,"source_name_ch1"],1,4)

batches <- factor(colstring, levels=unique(colstring))
#    print(batches)
designBC <- model.matrix(~0 + RNAs + batches)
colnames(designBC)[1:nlevels(RNAs)] <- levels(RNAs)

Thank you!

Sorry about misunderstanding. I think the table below is what you requested:

RNAs                                                    Batches

 iPSC0 iPSC iPSC0 iPSC iPSC0 iPSC iPSC3 iPSC iPSC3 iPSC iPSC3 iPSC iPSC7 iPSC iPSC7 iPSC iPSC7 iPSC iPSC10 iPSC iPSC10 iPSC iPSC10 iPSC iPSC14 iPSC iPSC14 iPSC iPSC14 iPSC iPSC20 iPSC iPSC20 iPSC iPSC20 iPSC iPSC28 iPSC iPSC28 iPSC iPSC28 iPSC iPSC35 iPSC iPSC35 iPSC iPSC35 iPSC iPSC45 iPSC iPSC45 iPSC iPSC45 iPSC iPSC60 iPSC iPSC60 iPSC iPSC60 iPSC iPSC90 iPSC iPSC90 iPSC iPSC90 iPSC iPSC120 iPSC iPSC120 iPSC iPSC120 iPSC Prim Prim Prim Prim Prim Prim Prim Prim Prim Prim Prim Prim Prim Prim Prim Prim Prim Prim Fetal Comm Fetal Comm Fetal Comm Fetal Comm Fetal Comm Fetal Comm Heart1 Comm Heart1 Comm Heart1 Comm LV Comm LV Comm LV Comm HeartDis Comm HeartDis Comm HeartDis Comm Heart2 Comm Heart2 Comm Heart2 Comm

Thank you!

limma batch effect • 1.2k views
modified 3.5 years ago by Aaron Lun25k • written 3.5 years ago by smt8n0

The answers to this kind of question depend on what your sample table looks like. Can you show the full table and the code that you're using to generate your design matrix? (Note that there is an "insert table" button in the edit bar when editing your question that you can use to add your sample table.)

Can you make a data frame with one row per sample, with a column for "RNAs" and a second column for "batches"? This is the table I'm asking you to include.

1
3.5 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

Your groups are fully nested within your batch. It is impossible to account for the batch effect in the model, as you cannot distinguish (uninteresting) differences between batches from (interesting) differences between groups. You should instead use one-way layout with only the groupings, as you've done to generate design in your code. Any batch effect will be absorbed into the group-specific coefficients, so it will be implicitly modelled. All you have to do is to remember to only compare groups in the same batch. Comparisons between groups in different batches can but should not be formulated when you're running makeContrasts.

More generally, I wonder whether it is sensible to include all of these groups in the same analysis. Some of these samples seem to be very different - for example, iPSCs are probably cultured and of low variability, while the biopsy samples are likely to be more variable. If you're not going to test for DE between iPSCs and the biopsies, then it's unnecessary to analyze them together. In fact, mashing everything into a single analysis is unlikely to be appropriate, as you'll make normalization and variance modelling more difficult, which would outweigh any increase in residual d.f. from more samples.

Finally, as to your question 3, you need to include batch effects when you want to block on a non-confounding batch factor. I think you'll find plenty of examples in the limma and edgeR user's guides, but here's one:

batch <- factor(rep(1:3, 3))
condition <- factor(rep(LETTERS[1:3], each=3))
design <- model.matrix(~batch + condition)

Here, we have three batches, and each batch contains the same three groups. Blocking on batch allows us to test for DE between groups, accounting for the batch effect. Unlike your example, this is possible as the groups are not nested within batches.

Dear Aaron,

Thank you for the answer. Unfortunately, it brings up a problem.

First a quick question: do I understand correctly that you recommend the  design <- model.matrix(~0 + RNAs) approach that will implicitly take care of the batches? Just making sure since I do not know the "one-way layout" term.

The big problem is that the whole analysis was started to compare some of the iPSC batch samples to samples from the other batches. Is there any way I could do this?

Thank you

Slava

1

If there is any batch effect, then it will be absorbed into the estimates of the group-specific coefficients in design. It doesn't "take care of them" in the sense that the batch effects won't get removed (which is impossible for your experimental design, for reasons discussed above). However, the batch effects won't compromise the fit of the model either, which ensures that your dispersion estimates and various other statistics for comparisons within batches are valid.

If you want to compare between groups in different batches... well, that's simply not possible when your groups are nested in the batches. However, before we get too agitated, I would ask how you decided that batch effects were present in the first place. Did you do a MDS plot and see that the three sets of samples clustered separately? If so, that doesn't sound like a batch effect to me; I would expect separation between clusters just due to biological differences. On the other hand, if your batch effects are defined based on some technical reason, e.g., you prepared them on separate days/at sequencing centres, then you're out of luck. In such cases you'd have to go back to the drawing board and get the experimental design right.

I looked at box plots of intensities and at MDS plots. Just in case I am crazy, I am posting them below. Unfortunately, the batch effects seem too pronounced to be figments of my imagination.

1

The clustering in the MDS plot is to be expected and may represent genuine biological differences. Now, I don't deal with microarray data much, but the intensity plots look strange to me. Why are all the boxplots exactly the same within each putative batch? This should only be possible if you performed, say, quantile normalization; but in that case, why are the boxplots different between samples in different batches? This suggests that you normalized within each batch, which is not a sensible strategy if you want to compare samples between different batches.