Question

Taking into account strong batch effect in microarray differential expression analysis with limma

0

Entering edit mode

svlachavas ▴ 840

@svlachavas-7225

Last seen 7 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Community,

I'm currently try to analyze one microarray dataset, essentially comprised of 3 different cell lines/batches, each one having the same experimental design: the deletion (CRISPR/Cas9) of a specific gene, and includes WT (wild type) samples, versus knock-out samples. After import, normalization and filtering of all samples together (oligo R package-rma), the experimental design looks like the following:

eset.2
ExpressionSet (storageMode: lockedEnvironment)
assayData: 36451 features, 17 samples 
  element names: exprs 
protocolData
  rowNames: 01-1_(HuGene-2_0-st)_CEM-FasKO_FasWT_1.CEL
    02-3_(HuGene-2_0-st)_CEM-FasKO_FasWT_3.CEL ...
    6-H9_FasKO-36_(HuGene-2_0-st).CEL (17 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: 01-1_(HuGene-2_0-st)_CEM-FasKO_FasWT_1.CEL
    02-3_(HuGene-2_0-st)_CEM-FasKO_FasWT_3.CEL ...
    6-H9_FasKO-36_(HuGene-2_0-st).CEL (17 total)
  varLabels: index Condition_detailed Cell_line Condition_Fas
  varMetadata: labelDescription channel
featureData
  featureNames: 16657436 16657440 ... 17118478 (36451 total)
  fvarLabels: PROBEID ENTREZID SYMBOL GENENAME
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: pd.hugene.2.0.st 

 head(pData(eset.2))
                                             index Condition_detailed
01-1_(HuGene-2_0-st)_CEM-FasKO_FasWT_1.CEL       1   WT_reconstituted
02-3_(HuGene-2_0-st)_CEM-FasKO_FasWT_3.CEL       2   WT_reconstituted
03-4_(HuGene-2_0-st)_CEM-FasKO_FasWT_4.CEL       3   WT_reconstituted
08-10_(HuGene-2_0-st)_CEM-FasKO_pcDNA_10.CEL     4           KO_clone
09-11_(HuGene-2_0-st)_CEM-FasKO_pcDNA_11.CEL     5           KO_clone
1-CEM_WT_(HuGene-2_0-st).CEL                     6  WT_parental_tech1
                                             Cell_line Condition_Fas
01-1_(HuGene-2_0-st)_CEM-FasKO_FasWT_1.CEL         CEM            WT
02-3_(HuGene-2_0-st)_CEM-FasKO_FasWT_3.CEL         CEM            WT
03-4_(HuGene-2_0-st)_CEM-FasKO_FasWT_4.CEL         CEM            WT
08-10_(HuGene-2_0-st)_CEM-FasKO_pcDNA_10.CEL       CEM      KO_clone
09-11_(HuGene-2_0-st)_CEM-FasKO_pcDNA_11.CEL       CEM      KO_clone
1-CEM_WT_(HuGene-2_0-st).CEL                       CEM            WT

 table(pData(eset.2)$Cell_line)

       CEM         H9 MDA_MB_231 
        11          3          3 


table(pData(eset.2)$Condition_Fas,pData(eset.2)$Cell_line)

           CEM H9 MDA_MB_231
  KO_clone   6  2          2
  WT         5  1          1

However, the major issue-batch effect, is clear on the relative MDS plots-as you can see(attached links below), both 3 cell lines cluster clearly in distinct parts, whereas the individual biological conditions, are not clearly distinguished. As i acknowledge the putative bottlenecks in the aformentioned experimental design, how should i proceed to take into account this issue ?

https://www.dropbox.com/s/cng4b0q7djtshbm/MDSplot.FasExp.NormFiltered.CellLine.tiff?dl=0 https://www.dropbox.com/s/5vsztfun72auzh7/MDSplot.FasExperiment.Filtered.Norm.Condition.tiff?dl=0

In your opinion, i should use the general condition WT vs KO, and block on the Cell_line variable ? or this would be biased, as each cell line perhaps would have a "different biological behaviour" regarding the targeted genome editing ? and a general DEG list would not represent-or reflect the differences between each cell line phenotype ?

alternatively, i should perform pairwise comparisons within each cell line for WT vs KO ? My additional concern here, is that in two of the 3 cell lines (H9 & MDAMB231), there is only one biological replicate/sample of the wild type...

Overall, my goal is to identify any DEGs related to immunity, based on the effect of the deleted gene-that is, the comparison of WT vs KO samples-

any suggestions or ideas for this challenging scenario would be grateful !!

Kind Regards,

Efstathios

limma microarray hugeneST batch effect DE • 2.8k views

ADD COMMENT • link updated 7.0 years ago by Ryan C. Thompson ★ 7.9k • written 7.0 years ago by svlachavas ▴ 840

score 1 · Answer 1 · 2019-01-23

1

Entering edit mode

James W. MacDonald 68k

@james-w-macdonald-5106

Last seen 1 day ago

United States

The questions you ask are those that should be answered by the analyst, which in this case is you. If you are going to be doing your own analyses, then you have to be able to weigh the tradeoffs and be able to defend your choices. Otherwise, you need to find somebody local who can do that for you.

ADD COMMENT • link 7.0 years ago James W. MacDonald 68k

0

Entering edit mode

Dear James,

thank you for your comment-in fact, I'm fully aware of the tradeoffs and putative solutions-however, even that i have analyzed similar data in the past, i had never encountered thus far this scenario with such a high batch effect presense-that is why, i also created a post here, as much more experienced people and specialists on the field like you could just provide an extra opinion on this matter-

Overall, if i could re-formulate my question: from the relative MDS plots, you think that this batch effect could be adressed by blocking with the confounder variable in the design matrix ? or a more "strigent" solution, like batch effect correction, and then perhaps something like weighted least squares test could be the alternative ?

Thank you in advance,

Efstathios

ADD REPLY • link 7.0 years ago svlachavas ▴ 840

1

Entering edit mode

My point is that you aren't asking about how to use the software, and you apparently understand that there are tradeoffs. So the only question left is to decide what tradeoffs are acceptable and what you should do. This is an analysis decision, and as the analyst it is up to you to make the decision. Peter and Ryan can tell you what they think are reasonable things to do or try, but you know what? They haven't seen your data, so they are just making guesses. If you want to base your analysis on the guesses of people who haven't actually dealt firsthand with your data, then that is one approach, but I am not sure that's the optimal way to proceed. YMMV.

ADD REPLY • link 7.0 years ago James W. MacDonald 68k

0

Entering edit mode

Dear James, thanks again for bringing up this point. Indeed, my main notion was to take into account some different ideas/suggestions on this matter, and of course then implement my analysis.

ADD REPLY • link 7.0 years ago svlachavas ▴ 840

score 1 · Answer 2 · 2019-01-23

1

Entering edit mode

Peter Langfelder ★ 3.0k

@peter-langfelder-4469

Last seen 15 months ago

United States

The MDS/PCA plots just tell you that the effect of cell line is much greater than the effect of the perturbation. They don't tell you whether the effects of the perturbation are similar across the cell lines. I'd run limma on each cell line separately, then create scatterplots of the fold changes (coefficients) and t statistics. If there's a decent correlation between the fold changes or t statistics (I'd say 0.3 or higher, although I have never compared such low sample numbers), I'd then use all samples and add a covariate factor(Cell_line). If not, just use the CEM line.

This experiment reminded me of Ronald Fisher's observation To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of. Use a reasonable number of samples that are not confounded by uninteresting variation, otherwise you'll have uninterpretable results.

ADD COMMENT • link 7.0 years ago Peter Langfelder ★ 3.0k

1

Entering edit mode

There is also limma::genas to test the correlation between two contrasts and give a p-value for it.

ADD REPLY • link 7.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Thank you Peter both for your explanations and clarifications !! Regarding your suggestion on "I'd run limma on each cell line separately, then create scatterplots of the fold changes (coefficients) and t statistics" :

you meant, from the raw data-"from scratch" run each batch separately and relative statistics ? and then create the scatterplots ?

or even from the merged normalized dataset that i have processed already, create something like the approach i answered below to Ryan ? with a combined factor ?

[Unfortunately, regarding the experimental design, i had no any implication or involvement- i just got the data from external collaborators...]

ADD REPLY • link 7.0 years ago svlachavas ▴ 840

1

Entering edit mode

You can either run the data separately for each cell line or use the combined factor and contrasts approach. They should give you very similar scatterplots. The combined factor and contrast approach is perhaps somewhat preferable because the empirical Bayes moderation should have more samples to work with (I'm not an expert on the inner workings of this aspect), although it assumes that the samples from different cell lines, although different, at least have similar variance characteristics. In any case, I'd use this analysis only as a suggestion as to how to proceed, so minor differences should not matter.

ADD REPLY • link 7.0 years ago Peter Langfelder ★ 3.0k

score 1 · Answer 3 · 2019-01-23

1

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 15 months ago

Icahn School of Medicine at Mount Sinai…

My experience in cases like this, where a batch effect is much larger than the effect of interest, is that you tend to have reduced (or even nonexistent) power to detect DE on the effect of interest, because even small amounts of noise in the batch effect will drown out the signal in the effect of interest. However, you might get lucky and find that the batch effect is mostly orthogonal to the effect of interest. You can determine this by looking at other principal coordinates in the MDS plot beyond the first 2 to see if your effect of interest shows up in any of them, or by making a new MDS plot after subtracting the batch effect from the data (using e.g. limma::removeBatchEffect or ComBat). If you do find this, and the effect looks consistent across cell lines, then you can reasonably justify the typical additive model of ~ batch + condition. Otherwise, the batch effect may make your desired analysis impossible. Regardless, keep in mind that even if you find part of the effect of interest is orthogonal to the batch effect and therefore amenable to analysis, they may be another another part that is parallel to the batch effect and therefore irrecoverable, which means that whatever gene list you do get might be incomplete.

ADD COMMENT • link 7.0 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Thank you Ryan for your considerations on this matter !! indeed, my next actions are to check further componentns/dimensions, as also to test with removeBatchEffect to test the relative changes after-and perhaps i will provide feedback afterwards-

concerning the genas function-as i have never used it before: as it has the argument coef that represent perhaps two different contrasts-you meant that i should utilize makeContrasts and make prior a combined Factor to implement in the design matrix ? that is Cellline & ConditionFas ? and then create the relative subcomparisons for each cell line, correct ?

For example, something like the following ?

comb.fac <- paste0(pData(eset.2)$Cell_line, "_", pData(eset.2)$Condition_Fas))
d <- model.matrix(~0 + comb.fac)..

ADD REPLY • link 7.0 years ago svlachavas ▴ 840

1

Entering edit mode

I believe the best way to use genas is to use makeContrasts and contrasts.fit with only the two contrasts you want to compare, and then call genas with coef=c(1,2).

ADD REPLY • link 7.0 years ago Ryan C. Thompson ★ 7.9k