Dear Bioconductor people,
I am working on the analysis of lung cancer cell line gene expression data, produced by illumina HT12 v4 array.
After initial data pre-processing using the lumi package, I am running a differential expression analysis with limma. My dataset includes related sample pre and post drug treatment. I have three replicates for each sample (technical replicates). I am using the treatment (yes or no) as a blocking variable.
This is the code that I am running: (sel.dataset is a gene expression matrix having ~20,000 genes as rows and 24 samples as columns. I concentrate on the first 6 columns. Samples 1:3 are untreated and samples 4:6 are drug-treated)
design.mat <- model.matrix(~colnames(sel.dataset[,1:6))
biolrep<-factor(colnames(sel.dataset[,1:6))
corfit<-duplicateCorrelation(dataset.log, design.mat, ndups=1, block=biolrep)
fit75<-lmFit(sel.dataset[,1:6], design.mat, block=biolrep, cor=corfit$consensus.correlation)
Unfortunately, when running lmFit with the correlation factor I get this error:
Error in chol.default(V) :
the leading minor of order 2 is not positive definite
My corfit values are 1, 1 and a matrix of NaNs repsectively. So the corfit$consensus is 1...
I understand that this depends on my data structure, but do you have any suggestion on how to overcome the problem?
I also tried another suggestion for paired samples that I found in limma user guide, which worked.
SibShip <- factor(c(1,2,3,1,2,3))
Treat <- factor(colnames(sel.dataset)[1:6], levels=c("CL75_parental","CL75_GefR"))
design <- model.matrix(~SibShip+Treat)
fit <- lmFit(sel.dataset[,1:6], design)
fit <- eBayes(fit)
Is it similar to using the previous code with biolrep? Would you suggest any of them as best suited for my data structure?
Thank you very much for any help.
Eleni
You might as well average the technical replicates together with
avearrays
. The resulting averaged samples will then directly represent biological replicates from each clone. This will simplify the ensuing analysis without losing information about the features of interest, i.e., the biological variability and the DE between groups. In fact, your current analysis with technical replicates is not appropriate, as your variance estimate represents the technical variability of the Illumina assay. This is not relevant when you are trying to look for results that are reproducible across independent biological samples.Now, the reason that your model coefficients are not estimable is because
colnames(sel.dataset)
is nested withinSibShip
. All samples withCL86_parental
andCL86_GefR
have aSibShip
term of 2. It is impossible to estimate the coefficients forCL86_*
andSibShip2
; any increase in the latter can be offset by a decrease in the former, such that an infinite number of solutions exist for the same fitted value. This is equally applicable to all yourCL*
groups. You might as well not putSibShip
in the design matrix at all, as this information is already present incolnames(sel.dataset)
via theCL
identifier.Finally, if you do average your arrays, then you won't be able to use a one-way layout because you'll only have one biological replicate per group. This requires a different parametrization, one example of which is shown below:
This is an additive model where the first few coefficients represent blocking factors for each clone, and the last coefficient represents the treatment effect averaged across all clones. Dropping the last coefficient will then test for DE. This approach will have enough residual d.f. to estimate the variance, even though you only have one sample per treatment/clone combination.
Dear Aaron,
Thank you very much for the time you spent on my question and your thorough explanation. I really understood much better my data design.
However, I have not used limma much in the past and some parts are not so clear to me.
3. I don't quite understand the whole explanation with SibShip 2 and CL86...Do you have any suggestions on what I should read in order to learn more about how this coefficient estimation works?
Actually, in the past, I was treating my replicates as biological (I did not know details about the experiment...now I do). So I was using the following code:
The genes that I was retrieving as DE in each per clone comparison (i.e CL75 parental with CL75 gefr) are very close to with what I retrieve with the model that you suggest, with coef=5. The rest coefficients (coef=1,2,3,4 in topTable) give totally different genes.This worries me...If topTable with coef=5 retrieves the average across samples, why is it similar to my previous per clone comparisons (i.e CL75 parental with CL75 gefr)?
Thank you very much for your feedback!
A one-way layout is a design where each group of samples has its own independent term in the design matrix. So, the simplest of designs, where you have one factor with several samples for each level of the factor, and the design only contains that factor (i.e., no additive or interaction terms). The design I suggested above is not a one-way layout, because you have
clone
added totreatment
.As for the contrast matrix, I presume you want to test for the treatment effect, in which case you can just drop the last coefficient of
design
. This treatment effect represents the average across all clones - you can't get the treatment effect for specific clones, because you only have one biological replicate per treatment in each cell line.As for your lack of understanding, look at the design matrix you got in your original post. If you take the
rowSum
of theCL86_parental
andCL86_GefR
columns, you'll end up with the same values as theSibship2
column. This means that any change in theSibship2
coefficient can be offset by an equal and opposite change in the values of both of theCL86_parental
andCL86_GefR
coefficients. That means that you can't estimate these coefficients, because there's no unique set of values that will minimize the sum of squares. It's as if I told you to solve for x, y and z in x = y + z. There's an infinite number of solutions for that equation.Finally, you will get different results because the interpretation of the coefficients are all different. Coefficients 1 to 4 in my model are blocking factors that represent the log-expression in each clone. Dropping these make no sense, as this is equivalent to asking whether log-expression is non-zero (which is obviously silly, because most of your genes will have non-zero expression). As I said before, you can no longer get the treatment effect for specific clones. I don't see this as being a problem, as the clone-specific treatment effect doesn't seem that interesting; it would only be relevant to that clone and cannot be generalized to future clones, which defeats the purpose of getting reproducible DE genes.
Dear Aaron,
Thank you very much for your explanations, I got to understand better how limma works. I have some more points I would like to ask you if possible...
1. To my understanding, I can detect the contrasts of interest either by dropping a coefficient in the design matrix, or by specifying the contrast matrix. I am thinking of what the respective contrast matrix be, if I want to check for the average effect of treatment across all clones. I cannot see a way of doing that with my current design matrix...Is it maybe not possible in this case?
2. Moreover, my lab is now preparing biological replicates for each of the clones. In this case, I will have more same-named columns in my sel.dataset and thus in my design matrix, it would be something like:
with as many rows as arrays and as many columns as samples (I will average again the technical replicates using avearrays). We are interested in subclonal differences, so we really need to compare each gefR clone with its parental. Would this be possible with the new data that I am expecting? I am thinking of using a contrast matrix contrast.matrix <- makeContrasts(CL75GefR-CL75Par, CL86GefR-CL86Par....., levels = factor(colnames(sel.dataset)), and then use
contr.fit_parental.gefr<-eBayes(contrasts.fit(fit, contrast.matrix)), where fit would be
fit <- lmFit(sel.dataset, design). For selecting the comparison of interest, I would drop the respective coefficient at the topTable command. For example, for detecting the DE genes between the CL75Par and CL75GefR I would use coef=1.
And, when we retrieve the rest samples, there will obviously be a batch effect, as they are hybridized on a different day than our initial batch. Do you suggest adding the batch as an additive effect in th above model like:
or is it better to use the ComBat package from bioconductor?
3. Regarding the intercept, this is my understanding and please correct me if I am wrong: With 0 intercept, ie. (i) model.matrix (~0+treatment1+treatment2) and contr.mat<-makeContrasts(treatment2-treatment1, levels=c("treatment1","treatment2")) we compare the treatments with each other and with (ii) model.matrix (~treatment1+treatment2) we compare each treatment to a common reference. If the reference is treatment1, then (ii) is similar to (i) when we use drop=3 (the last coefficient).
Thank you so much for your time!
Best regards,
Eleni
If by "current design matrix", you mean design.mat, then you could do it with something like this:
Obviously, if you're using the design matrix I suggested before, then you just drop the last coefficient.
As for your new "biological replicates", I think you need to clarify the nature of your clones. If they are randomly selected clones from a larger population, then true biological replication would involve assaying new clones, rather than re-assaying the same set of clones (which would be closer to technical replication). This is because you need more samples of the population to get more information about how it behaves. It would only make sense to examine behaviour of the individual clones if they are interesting in and of themselves, e.g., xenograft samples where you want to find patient-specific responses and don't care about how it generalizes to other patients. If this is the case, then your replication strategy makes more sense, but I'd still hesitate to call them biological replicates.
In any case, with the additive model, you cannot detect clone-specific treatment effects. This is intentional, because such effects are uninteresting when the clones are randomly selected from a larger population. If you want to detect these, you'll have to use a one-way layout (e.g., with the column names as a factor). Whether this makes any sense at all depends on the interpretation of the clones. If they are indeed randomly selected, then using a one-way layout would be crazy, as you'll end up with lots of DE that cannot be generalized to the population from whence they came.
Finally, I have no idea where
treatment1
andtreatment2
come from, so I don't know what you're referring to.Dear Aaron,
Thank you very much for your insightful explanations.
Regarding the contrast matrix: Thank you! It was so obvious but for some reason I could not think about it.
Regarding the new samples: They are from the same clones as before, which were not randomly selected; they exhibit extreme response phenotypes to certain treatment. So, we want to check the genes that are over or under-expressed in them so as to understand why they have this extreme behavior. They are not true biological replicates, you are right, so I'd better not call them as that.
So you say that an additive effect is better used if one wants to generalize his results? That is interesting, I had not realized it. So, in my case, I suspect I should have something like the following:
and correct for the batch effect after building the model, right (include the model as a variable in the ComBat function)?
Regarding the treatment variable: I am sorry, this was a general question on when and how to use the intercept and has nothing to do with the previous task. I am reading on limma now and just found that I am not certain about this point.
Thank you very much once again!
Best regards
So you say that an additive effect is better used if one wants to generalize his results?
Well, that depends. In this case, if you want to generalize across clones, then yes, an additive matrix is better. This assumes that the treatment effects in different clones are replicates of an underlying common treatment effect; if the underlying treatment effect is different for each clone (i.e., there's some clone/treatment interaction effect), then obviously, there's not much point in generalizing at all.
and correct for the batch effect after building the model, right...?
For the batch effect, you can put it in as an additive term in the design matrix. If you have new samples for every clone/treatment combination in the second batch, then all coefficients should still be estimable after blocking on the batch in the design. I find this a bit more convenient than trying to remove the batch effect separately; also, including the batch in the design matrix means that
limma
can account for the uncertainty of estimation of the blocking factors, which is more statistically rigorous.Thank you very much Aaron! So I can test for batch effects by just creating a design matrix as follows:
Please correct me if I am wrong and thanks again for your invaluable input.
You need to add it:
Oh, yes, that's what I meant...sorry :(...and thanks a lot again!
Dear Aaron,
I have performed the limma analysis and found the DE genes in the way that you suggested; adding the batch as a factor in the model. Now, I actually want to look into the fold change of specific genes. Can I extract the batch-corrected values somehow in limma? Or do I need to do the batch correction separately?
Thank you vert much,
Eleni
That's what
topTable
is for. Run it withn=Inf
, and subset by the gene name.Great! Thank you so much for the clear and quick response.
Best regards,
Eleni
Hi Aaron,
..I actually meant gene expressions, not fold changes. I think i can just correct for batch effect using RemoveBatchEffect prior to any linear model fitting and use the returned values for clustering and gene expression value extraction. Do you agree?
Thank you very much!
Eleni
Yes, as long as you don't use the batch-corrected values for linear model fitting.
No, I don't. Great! Thanks so much!
Regarding the treatment 1 and treatment 2 you are referring to the terms "pre and post drug treatment " you have mentioned in your description ? If so, you might consider the interaction term : in your design matrix---however, your design matrix is already complex and without knowing for certain what treatment 1 & 2 represent i would not like to infer any inappropriate coding, as already Aaron & Gordon have provided valuable and detailed explanations.
Thank you Stathis for the comment. It was not clear, I know. Look above :) "I am sorry, this was a general question on when and how to use the intercept and has nothing to do with the previous task. I am reading on limma now and just found that I am not certain about this point."
Best regards