Search
Question: Limma: fit a model with duplicateCorrelation()
0
2.7 years ago by
Singapore
Eleni Christodoulou150 wrote:

Dear Bioconductor community,

I am currently struggling with fitting a linear model to Illumina BeadChp HT12.v4 data. I have different cell generations (G1, G2, G3 etc) taken form the same clone. I assume these as biological replicates. For each generation I have different measurements. I assume these as technical replicates. Microarray data were measured in two different batches. Finally, I have a specific mutation that I am looking into: T790M: this is present in some of the samples and not present in others. Part of my targets file looks like this:

 Sample BiolRep T790M Batch CellClone CL75_G1.2 1 pos 1 CL75 CL75_G2 2 pos 1 CL75 CL75_G2 2 pos 1 CL75 CL75_G2 2 pos 1 CL75 CL75_G3 3 neg 1 CL75 CL75_G3 3 neg 1 CL75 CL75_G3 3 neg 1 CL75 CL86_G4 4 neg 0 CL86 CL86_G4 4 neg 0 CL86 CL86_G4 4 neg 0 CL86 CL86_G5 5 neg 0 CL86 CL86_G5 5 neg 0 CL86

I want to find differentially expressed genes between T790M positive (pos) and T790M negative (neg) samples. I am following limma's user guide for analysis of BeadArrays (section 17.3). I have reached the state where I need to factorize by my replicates. The guide says:"The study involves multiple cell types for the same patient..."

ct <- factor(targets$CellType) and then the DE genes between these cell types are detected. In my case it is different. I tried to factorize based on the Sample and on the CellClone and include the batch in my design matrix: ct<-factor(targets$Sample)
design <- model.matrix(~0+ct+targets$Batch) but I realized that if I follow the example given by limma, I then have to detect the differences of the variables that rcorrespond to the colnames of my design matrix, which would be CL75_G.2, CL86_G5 etc..This is not the comparison that I want. I then tried ct<-factor(targets$T790M)
design <- model.matrix(~0+ct+targets$Batch) dupcor <- duplicateCorrelation(gefR, design, block=targets$Sample) #cor between samples of the same clone
dupcor$consensus.correlation [1] 0.0723 then give the pos, neg and batch as column names of my design matrix and try to detect the DE genes between T790M pos and T790M neg, using the suggested procedure. fit <- lmFit(myMatrix, design, block=targets$Sample, correlation=dupcor$consensus.correlation) # the block indicates the technical replicates. contrasts <- makeContrasts('pos-neg',levels=design) fit2 <- contrasts.fit(fit, contrasts) fit2 <- eBayes(fit2) summary(decideTests(fit2, method="global")) The outcome was 0 DE genes (4597 are the total genes that are truly expressed on my array, according to their Detection p-value)  pos-neg -1 0 0 4597 1 0 I cannot imagine this result is true. I am sure I am missing something but I cannot understand what... The level of complexity of my experiment is grater than the complexity of the limma's example...I would really appreciate any comments/answers/insights from you. Thank you very much! Eleni ADD COMMENTlink modified 2.7 years ago by Aaron Lun21k • written 2.7 years ago by Eleni Christodoulou150 2 2.7 years ago by Aaron Lun21k Cambridge, United Kingdom Aaron Lun21k wrote: Your second approach seems to be correct. The only thing that's odd is that you run duplicateCorrelation on the gefR matrix, but lmFit on the myMatrix object. These had better hold the same data. Anyway, there's a couple of reasons for why you don't have any DE genes. The first is that, in the subset of the targets file you've shown, there are only 4 positive samples and all of these belong in one batch. If there are no positive samples anywhere else, this means that the DE comparison is only being performed using the positive/negative samples in the first batch (as the Batch blocking factor will absorb the rest). This will reduce power even if you have lots of negative samples in the other batches. It's worth checking on a MDS plot whether the batch effect really exists; if it doesn't, you could improve power by dropping it from the model. Another possible cause is outlier effects, either in terms of samples or gene-specific variances. The former you can identify by looking at a MDS plot (probably after running removeBatchEffect to get rid of any Batch effect, if one is present), and can be dealt with using arrayWeights. The latter you can protect against with robust=TRUE in eBayes. If none of these suggestions result in DE genes, then either the mutation doesn't result in any DE, or your data set simply does not have enough statistical power to detect it. In such cases, it's often nice to have some positive controls (i.e., genes you know to be DE upon perturbation) and see how they're behaving in order to diagnose potential problems. ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by Aaron Lun21k Dear Aaron, Thank you very much for taking time and thoroughly replying to me. gefR and myMatrix are the same, so no issue there. I also have pos and neg samples in both batches. But your third point really solved the problem! I am writing the steps that I followed in case I have misunderstood your suggestion...and also for the community, in case they run into a similar struggle. 1) apply removeBatchEffect{limma} on my matrix (gefR) 2) create a design matrix without the batch effect component design <- model.matrix(~0+ct) 3) apply arrayWeights function on my matrix 4) apply lmFit, including the outut of arrayWeights in the arguments 5) apply eBayes using robust=TRUE (yes, I protected against both sample and gene outliers) Now my output after running summary(decideTests(fit2, method="global")) is pos-neg -1 25 0 4556 1 13 Thank you very much! ADD REPLYlink written 2.7 years ago by Eleni Christodoulou150 2 There seems to be some misunderstanding here; I've only recommended using removeBatchEffect for constructing the MDS plot, not for the DE analysis itself. If there is a batch effect, then the design matrix should contain Batch. It's not appropriate to use removeBatchEffect to pre-process your data before the DE analysis, as you don't account for the uncertainty in estimating the batch effect. This will result in a loss of type I error control. ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Aaron Lun21k 1 Dear Eleni, one small addition(Aaron again has provided all beneficial solutions on this matter)---you can actually create a plot of arrayWeights, to see if indeed there is a difference between quality in your samples: AW <- arrayWeights(eSet, design) # eSet your expressionSet barplot(AW, xlab="Array", ylab="Weight", col="white", las=2) Then you can see if your samples vary significantly from 1--consider it as the equality line (of comparison)--- in an "ideal" scenario, all samples would roughfly have a weight of 1 if the variability was the same---so after including arrayWeights you have a "significant" increase in your number of DE genes, then it will surely be beneficial(as in most cases if the above plot suggests variations in quality) Best, Efstathios ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by svlachavas610 Thank you both Efstathie and Aaron. Aaron's comment was very important. By MDS plotting my data, with and without the batch effect, I saw that I do have a batch effect. I then added it in my fit. I also used duplicateCorrelation with arrayWeights (by plotting what Efstahios suggested, it turned out that not all arrays are of the same quality). As final result I don't see any differentially expressed genes between T790M pos and T790M neg cells after using decideTests and topTable with adjust.method='fdr' and p.value=0.05. If I loosen a bit the fdr criteria to 0.08 for example, I do get around 20 genes. Do you think they are unreliable? Personally, I would not think that the fdr threshold of 0.05 should be followed too religiously...Any feedback would be highly appreciated. Thank you very much again, Eleni ADD REPLYlink written 2.7 years ago by Eleni Christodoulou150 2 Well, it depends on what you're going to do with those genes. I would say that there's no compelling evidence for strong DE based on this microarray data. However, these experiments are often just the starting point for follow-up studies; if you have collaborators who are willing to investigate weak DE, then it's okay if you loosen the threshold a bit to give them something to play with. If those genes end up being validated experimentally, then it doesn't matter so much how you got there. (Of course, validation would be easier if you had stronger changes.) ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by Aaron Lun21k 1 Dear Eleni, just a small verification in order to be certain that i catch up the complete process--you mentioned that regardless of using arrayWeights, the number of DE genes you finally got was negligible ? But just to be certain, except from the arrayWeights plot, you included like below your computed arrayWeights both in the duplicateCorrelation and lmFit functions, right ?: aw <- arrayWeights(y, design) w <- asMatrixWeights(aw, dim(y)) dupcor <- duplicateCorrelation(y, design, block=block, weights=w) fit <- lmFit(y, design, block=block, correlation=dupcor$consensus, weights=w)..

like Mr Gordon Smyth has valuably proposed in one previous post ??

Thank you Aaron for the comment. I was thinking in a similar way...It always depends on what one wants to do with these genes. I now only have to ask my collaborators.

Efstathie, yes, this is what I did. And had no significant effect after use of arrayWeights. But the p-values were slightly improved :)

Best,

Elenii