Dear Bioconductor community,
I am currently struggling with fitting a linear model to Illumina BeadChp HT12.v4 data. I have different cell generations (G1, G2, G3 etc) taken form the same clone. I assume these as biological replicates. For each generation I have different measurements. I assume these as technical replicates. Microarray data were measured in two different batches. Finally, I have a specific mutation that I am looking into: T790M: this is present in some of the samples and not present in others. Part of my targets file looks like this:
Sample | BiolRep | T790M | Batch | CellClone |
CL75_G1.2 | 1 | pos | 1 | CL75 |
CL75_G2 | 2 | pos | 1 | CL75 |
CL75_G2 | 2 | pos | 1 | CL75 |
CL75_G2 | 2 | pos | 1 | CL75 |
CL75_G3 | 3 | neg | 1 | CL75 |
CL75_G3 | 3 | neg | 1 | CL75 |
CL75_G3 | 3 | neg | 1 | CL75 |
CL86_G4 | 4 | neg | 0 | CL86 |
CL86_G4 | 4 | neg | 0 | CL86 |
CL86_G4 | 4 | neg | 0 | CL86 |
CL86_G5 | 5 | neg | 0 | CL86 |
CL86_G5 | 5 | neg | 0 |
CL86 |
I want to find differentially expressed genes between T790M positive (pos) and T790M negative (neg) samples. I am following limma's user guide for analysis of BeadArrays (section 17.3). I have reached the state where I need to factorize by my replicates. The guide says:"The study involves multiple cell types for the same patient..."
ct <- factor(targets$CellType)
and then the DE genes between these cell types are detected.
In my case it is different. I tried to factorize based on the Sample and on the CellClone and include the batch in my design matrix:
ct<-factor(targets$Sample) design <- model.matrix(~0+ct+targets$Batch)
but I realized that if I follow the example given by limma, I then have to detect the differences of the variables that rcorrespond to the colnames of my design matrix, which would be CL75_G.2, CL86_G5 etc..This is not the comparison that I want.
I then tried
ct<-factor(targets$T790M) design <- model.matrix(~0+ct+targets$Batch) dupcor <- duplicateCorrelation(gefR, design, block=targets$Sample) #cor between samples of the same clone dupcor$consensus.correlation [1] 0.0723
then give the pos, neg and batch as column names of my design matrix and try to detect the DE genes between T790M pos and T790M neg, using the suggested procedure.
fit <- lmFit(myMatrix, design, block=targets$Sample, correlation=dupcor$consensus.correlation) # the block indicates the technical replicates. contrasts <- makeContrasts('pos-neg',levels=design) fit2 <- contrasts.fit(fit, contrasts) fit2 <- eBayes(fit2) summary(decideTests(fit2, method="global"))
The outcome was 0 DE genes (4597 are the total genes that are truly expressed on my array, according to their Detection p-value)
pos-neg -1 0 0 4597 1 0
I cannot imagine this result is true. I am sure I am missing something but I cannot understand what... The level of complexity of my experiment is grater than the complexity of the limma's example...I would really appreciate any comments/answers/insights from you.
Thank you very much!
Eleni
Dear Aaron,
Thank you very much for taking time and thoroughly replying to me. gefR and myMatrix are the same, so no issue there. I also have pos and neg samples in both batches. But your third point really solved the problem! I am writing the steps that I followed in case I have misunderstood your suggestion...and also for the community, in case they run into a similar struggle.
1) apply removeBatchEffect{limma} on my matrix (gefR)
2) create a design matrix without the batch effect component design <- model.matrix(~0+ct)
3) apply arrayWeights function on my matrix
4) apply lmFit, including the outut of arrayWeights in the arguments
5) apply eBayes using robust=TRUE (yes, I protected against both sample and gene outliers)
Now my output after running
Thank you very much!
There seems to be some misunderstanding here; I've only recommended using
removeBatchEffect
for constructing the MDS plot, not for the DE analysis itself. If there is a batch effect, then the design matrix should containBatch
. It's not appropriate to useremoveBatchEffect
to pre-process your data before the DE analysis, as you don't account for the uncertainty in estimating the batch effect. This will result in a loss of type I error control.Dear Eleni,
one small addition(Aaron again has provided all beneficial solutions on this matter)---you can actually create a plot of arrayWeights, to see if indeed there is a difference between quality in your samples:
AW <- arrayWeights(eSet, design) # eSet your expressionSet
barplot(AW, xlab="Array", ylab="Weight", col="white", las=2)
Then you can see if your samples vary significantly from 1--consider it as the equality line (of comparison)--- in an "ideal" scenario, all samples would roughfly have a weight of 1 if the variability was the same---so after including arrayWeights you have a "significant" increase in your number of DE genes, then it will surely be beneficial(as in most cases if the above plot suggests variations in quality)
Best,
Efstathios
Thank you both Efstathie and Aaron.
Aaron's comment was very important. By MDS plotting my data, with and without the batch effect, I saw that I do have a batch effect. I then added it in my fit. I also used duplicateCorrelation with arrayWeights (by plotting what Efstahios suggested, it turned out that not all arrays are of the same quality). As final result I don't see any differentially expressed genes between T790M pos and T790M neg cells after using decideTests and topTable with adjust.method='fdr' and p.value=0.05. If I loosen a bit the fdr criteria to 0.08 for example, I do get around 20 genes. Do you think they are unreliable? Personally, I would not think that the fdr threshold of 0.05 should be followed too religiously...Any feedback would be highly appreciated.
Thank you very much again,
Eleni
Well, it depends on what you're going to do with those genes. I would say that there's no compelling evidence for strong DE based on this microarray data. However, these experiments are often just the starting point for follow-up studies; if you have collaborators who are willing to investigate weak DE, then it's okay if you loosen the threshold a bit to give them something to play with. If those genes end up being validated experimentally, then it doesn't matter so much how you got there. (Of course, validation would be easier if you had stronger changes.)
Dear Eleni,
just a small verification in order to be certain that i catch up the complete process--you mentioned that regardless of using arrayWeights, the number of DE genes you finally got was negligible ? But just to be certain, except from the arrayWeights plot, you included like below your computed arrayWeights both in the duplicateCorrelation and lmFit functions, right ?:
aw <- arrayWeights(y, design)
w <- asMatrixWeights(aw, dim(y))
dupcor <- duplicateCorrelation(y, design, block=block, weights=w)
fit <- lmFit(y, design, block=block, correlation=dupcor$consensus, weights=w)..
like Mr Gordon Smyth has valuably proposed in one previous post ??
Thank you Aaron for the comment. I was thinking in a similar way...It always depends on what one wants to do with these genes. I now only have to ask my collaborators.
Efstathie, yes, this is what I did. And had no significant effect after use of arrayWeights. But the p-values were slightly improved :)
Best,
Elenii