Question: Batch effects in limma
0
gravatar for smt8n
3.5 years ago by
smt8n0
smt8n0 wrote:

Dear all,

I realize most facets of my question have been discussed here, but I am having hard time pulling all the answers together and generalizing.

I have 36 iPSC, 9 primary CM, and 18 biopsy samples. My quality control plots showed batch effects in the data and I added 3 batches to the design matrix, using, e.g. this:

A: Limma, blocking batch effect

discussion (and reviewing others on the topic)

 

After I added batches to the design matrix and got the "Coefficients not estimable" message, I found an answer from Gordon Smyth: "You can't expect to be able to estimate more things than your experiment contains information about.  In this case, your batches are entirely confounded with patients+treatment, and hence you can't estimate coefficients for them." in one of the posts (and the differential expression was the same with and without the "+batches" in the matrix).

There are three questions I have here:

1) Does it mean that if I have ALL (36+9+18) samples in the design matrix, adding "batches" explicitly is introducing too many variables and there is no way to explicitly include batch effects?

2) In this case, does the regression on the samples implicitly take care of the batch effect?

3a) (if the answer to (2) is  "Yes" ) Since there are a lot of discussions on batch effect inclusion, it should be needed for something.  In what cases is it needed?

3b) (if the answer to (2) is "No") What can I do?

Thank you. Any help is greatly appreciated.

Slava

Thank you for the comment. Added upon request:

the sample table is the following:

<caption>Sample table</caption>
iPSC0 iPSC3 iPSC7 iPSC10 iPSC14 iPSC20 iPSC28 iPSC35 iPSC45 iPSC60 iPSC90 iPSC120 Prim Fetal LV Heart1    

 

Sorry, I lumped biological replicates together, but the table still did not fit. There should be 2 more columns: "Heart2" and "Heart diseased"

There are 3 biological replicates for all iPSC entries, 9 replicates for "Prim" entry, 6 replicates for "Fetal" and 3 replicates for "LV", "Heart1", "Heart2", and "Heart diseased".

The code that I used to generate the original (before attempting a batch correction) design matrix:

    RNAs <- factor(c("iPSC0","iPSC0","iPSC0",
                     "iPSC3","iPSC3","iPSC3",
                     "iPSC7","iPSC7","iPSC7",
                     "iPSC10","iPSC10","iPSC10",
                     "iPSC14","iPSC14","iPSC14",
                     "iPSC20","iPSC20","iPSC20",
                     "iPSC28","iPSC28","iPSC28",
                     "iPSC35","iPSC35","iPSC35",
                     "iPSC45","iPSC45","iPSC45",
                     "iPSC60","iPSC60","iPSC60",
                     "iPSC90","iPSC90","iPSC90",
                     "iPSC120","iPSC120","iPSC120",
                     "Prim","Prim","Prim","Prim","Prim","Prim","Prim","Prim","Prim",
                     "Fetal","Fetal","Fetal","Fetal","Fetal","Fetal",
                     "Heart1","Heart1","Heart1",
                     "LV","LV","LV",
                     "HeartDis","HeartDis","HeartDis",
                     "Heart2","Heart2","Heart2")
                   )
###    print(RNAs)
    design <- model.matrix(~0 + RNAs)
    colnames(design) <- levels(RNAs)

My attempt to account for batches (gse is an eset read from the series_matrix file with getGEO()) :

    colstring <- substring(pData(gse)[,"source_name_ch1"],1,4)

batches <- factor(colstring, levels=unique(colstring))
#    print(batches)
    designBC <- model.matrix(~0 + RNAs + batches)
    colnames(designBC)[1:nlevels(RNAs)] <- levels(RNAs)

Thank you!

Addition #2

Sorry about misunderstanding. I think the table below is what you requested:

                              RNAs                                                    Batches             

iPSC0 iPSC
iPSC0 iPSC
iPSC0 iPSC
iPSC3 iPSC
iPSC3 iPSC
iPSC3 iPSC
iPSC7 iPSC
iPSC7 iPSC
iPSC7 iPSC
iPSC10 iPSC
iPSC10 iPSC
iPSC10 iPSC
iPSC14 iPSC
iPSC14 iPSC
iPSC14 iPSC
iPSC20 iPSC
iPSC20 iPSC
iPSC20 iPSC
iPSC28 iPSC
iPSC28 iPSC
iPSC28 iPSC
iPSC35 iPSC
iPSC35 iPSC
iPSC35 iPSC
iPSC45 iPSC
iPSC45 iPSC
iPSC45 iPSC
iPSC60 iPSC
iPSC60 iPSC
iPSC60 iPSC
iPSC90 iPSC
iPSC90 iPSC
iPSC90 iPSC
iPSC120 iPSC
iPSC120 iPSC
iPSC120 iPSC
Prim Prim
Prim Prim
Prim Prim
Prim Prim
Prim Prim
Prim Prim
Prim Prim
Prim Prim
Prim Prim
Fetal Comm
Fetal Comm
Fetal Comm
Fetal Comm
Fetal Comm
Fetal Comm
Heart1 Comm
Heart1 Comm
Heart1 Comm
LV Comm
LV Comm
LV Comm
HeartDis Comm
HeartDis Comm
HeartDis Comm
Heart2 Comm
Heart2 Comm
Heart2 Comm

Thank you!

limma batch effect • 1.2k views
ADD COMMENTlink modified 3.5 years ago by Aaron Lun25k • written 3.5 years ago by smt8n0

The answers to this kind of question depend on what your sample table looks like. Can you show the full table and the code that you're using to generate your design matrix? (Note that there is an "insert table" button in the edit bar when editing your question that you can use to add your sample table.)

ADD REPLYlink written 3.5 years ago by Ryan C. Thompson7.4k

Can you make a data frame with one row per sample, with a column for "RNAs" and a second column for "batches"? This is the table I'm asking you to include.

ADD REPLYlink written 3.5 years ago by Ryan C. Thompson7.4k
Answer: Batch effects in limma
1
gravatar for Aaron Lun
3.5 years ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

Your groups are fully nested within your batch. It is impossible to account for the batch effect in the model, as you cannot distinguish (uninteresting) differences between batches from (interesting) differences between groups. You should instead use one-way layout with only the groupings, as you've done to generate design in your code. Any batch effect will be absorbed into the group-specific coefficients, so it will be implicitly modelled. All you have to do is to remember to only compare groups in the same batch. Comparisons between groups in different batches can but should not be formulated when you're running makeContrasts.

More generally, I wonder whether it is sensible to include all of these groups in the same analysis. Some of these samples seem to be very different - for example, iPSCs are probably cultured and of low variability, while the biopsy samples are likely to be more variable. If you're not going to test for DE between iPSCs and the biopsies, then it's unnecessary to analyze them together. In fact, mashing everything into a single analysis is unlikely to be appropriate, as you'll make normalization and variance modelling more difficult, which would outweigh any increase in residual d.f. from more samples.

Finally, as to your question 3, you need to include batch effects when you want to block on a non-confounding batch factor. I think you'll find plenty of examples in the limma and edgeR user's guides, but here's one:

batch <- factor(rep(1:3, 3))
condition <- factor(rep(LETTERS[1:3], each=3))
design <- model.matrix(~batch + condition)

Here, we have three batches, and each batch contains the same three groups. Blocking on batch allows us to test for DE between groups, accounting for the batch effect. Unlike your example, this is possible as the groups are not nested within batches.

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Aaron Lun25k

Dear Aaron,

 

Thank you for the answer. Unfortunately, it brings up a problem.

First a quick question: do I understand correctly that you recommend the  design <- model.matrix(~0 + RNAs) approach that will implicitly take care of the batches? Just making sure since I do not know the "one-way layout" term. 

The big problem is that the whole analysis was started to compare some of the iPSC batch samples to samples from the other batches. Is there any way I could do this?

 

Thank you

Slava

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by smt8n0
1

If there is any batch effect, then it will be absorbed into the estimates of the group-specific coefficients in design. It doesn't "take care of them" in the sense that the batch effects won't get removed (which is impossible for your experimental design, for reasons discussed above). However, the batch effects won't compromise the fit of the model either, which ensures that your dispersion estimates and various other statistics for comparisons within batches are valid.

If you want to compare between groups in different batches... well, that's simply not possible when your groups are nested in the batches. However, before we get too agitated, I would ask how you decided that batch effects were present in the first place. Did you do a MDS plot and see that the three sets of samples clustered separately? If so, that doesn't sound like a batch effect to me; I would expect separation between clusters just due to biological differences. On the other hand, if your batch effects are defined based on some technical reason, e.g., you prepared them on separate days/at sequencing centres, then you're out of luck. In such cases you'd have to go back to the drawing board and get the experimental design right.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Aaron Lun25k

I looked at box plots of intensities and at MDS plots. Just in case I am crazy, I am posting them below. Unfortunately, the batch effects seem too pronounced to be figments of my imagination.

 

ADD REPLYlink written 3.5 years ago by smt8n0
1

The clustering in the MDS plot is to be expected and may represent genuine biological differences. Now, I don't deal with microarray data much, but the intensity plots look strange to me. Why are all the boxplots exactly the same within each putative batch? This should only be possible if you performed, say, quantile normalization; but in that case, why are the boxplots different between samples in different batches? This suggests that you normalized within each batch, which is not a sensible strategy if you want to compare samples between different batches.

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Aaron Lun25k

Thank you for noticing the normalization issue. This is a data set from GEO that did not have regular raw files posted. Most of what was posted was normalized. That is what I used and what you saw. After your post I grabbed the only file with non-normalized data that there was, reshaped it using perl. I am not quite sure whether I treated it correctly (no negative controls, only p-values and neqc could not be used because of the issue discussed here: https://support.bioconductor.org/p/49954/), but even the raw data did not look like having batches. After running the through backgroundCorrect and normalizeBetweenArrays I saw graphs without any hint of batch effects. Thanks again. 

ADD REPLYlink written 3.5 years ago by smt8n0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 185 users visited in the last hour