Question

Error when fitting linear model with technical replicates and blocking variable

0

Entering edit mode

Eleni Christodoulou ▴ 150

@eleni-christodoulou-2653

Last seen 5.6 years ago

Singapore

Dear Bioconductor people,

I am working on the analysis of lung cancer cell line gene expression data, produced by illumina HT12 v4 array.
After initial data pre-processing using the lumi package, I am running a differential expression analysis with limma. My dataset includes related sample pre and post drug treatment. I have three replicates for each sample (technical replicates). I am using the treatment (yes or no) as a blocking variable.

This is the code that I am running: (sel.dataset is a gene expression matrix having ~20,000 genes as rows and 24 samples as columns. I concentrate on the first 6 columns. Samples 1:3 are untreated and samples 4:6 are drug-treated)

design.mat <- model.matrix(~colnames(sel.dataset[,1:6))
biolrep<-factor(colnames(sel.dataset[,1:6))
corfit<-duplicateCorrelation(dataset.log, design.mat, ndups=1, block=biolrep)
fit75<-lmFit(sel.dataset[,1:6], design.mat, block=biolrep, cor=corfit$consensus.correlation)

Unfortunately, when running lmFit with the correlation factor I get this error:

Error in chol.default(V) :
the leading minor of order 2 is not positive definite

My corfit values are 1, 1 and a matrix of NaNs repsectively. So the corfit$consensus is 1...
I understand that this depends on my data structure, but do you have any suggestion on how to overcome the problem?

I also tried another suggestion for paired samples that I found in limma user guide, which worked.

SibShip <- factor(c(1,2,3,1,2,3))
Treat <- factor(colnames(sel.dataset)[1:6], levels=c("CL75_parental","CL75_GefR"))
design <- model.matrix(~SibShip+Treat)
fit <- lmFit(sel.dataset[,1:6], design)
fit <- eBayes(fit)

Is it similar to using the previous code with biolrep? Would you suggest any of them as best suited for my data structure?

Thank you very much for any help.

Eleni

limma technical replicates • 3.3k views

ADD COMMENT • link 8.4 years ago Eleni Christodoulou ▴ 150

score 3 · Answer 1 · 2015-12-09

First a couple of clarifications. You say that you have technical replicates but that can't be right. If the three pairs (values of SibShip) were technical replicates, then you would have no true biological replicates, so you wouldn't be able to do any statistical analysis. So I will assume that the three pairs are actually biological replicates. Second, blocking variables and treatments are not the same thing. For this data, the blocking variable is SibShip rather than treatment.

Anyway, if you just use the first 6 samples, then you have a simple paired design, so the paired analysis is correct.

The reason why the first (duplicateCorrelation) approach gave you an error was because you used the same factor for biolrep as you used to form the design matrix. These have to be different factors. You would have got a lot of warnings from duplicateCorrelation when you ran it. I assume that you actually meant to set biolrep <- SibShip.

More generally though, it is not usually the best idea to analyze a small group of samples separately. You have 24 samples in total. One of the purposes of the limma software is to get benefits from analyzing all the samples at once. So I would encourage you to take a larger view of the data and the analysis.

score 0 · Answer 2 · 2015-12-09

Dear Prof Smyth,

Thank you for your reply. I tried biolrep similar to SibShip and it worked (now both methods give me the same results). Yes, in my previous implementation, I had many warnings after running duplicateCorrelation.

However, I am too concerned with the design of the experiment and the design matrix and my analysis.

For my whole dataset (24 samples), I have combined two Illumina HT12 v4 one-color chips. Each of them contains 4 paired samples: 2 pre and 2 post-drug treatment, in triplicates. Each sample corresponds to a different clone, coming from the same parental cell line. Thus, each chip contains 12 arrays and the chip's design is given in the following table for your convenience:

Clone 1, replicate 1, pre-drug

Clone 1, replicate 2, pre-drug

Clone 1, replicate 3, pre-drug

Clone 1, replicate 1, post-drug

Clone 1, replicate 2, post-drug

Clone 1, replicate 3, post-drug

Clone 2, replicate 1, pre-drug

Clone 2, replicate 2, pre-drug

Clone 2, replicate 3, pre-drug

Clone 2, replicate 1, post-drug

Clone 2, replicate 2, post-drug

Clone 2, replicate 3, post-drug

The person who performed the experiment told me that, regarding the replicates, she extracted the same sample from each clone and hybridized it on the three different arrays of the chip...thus I think these are technical replicates, no?

In my script above, I am fitting a linear model for the differencial gene expression between clone 1 pre- and post-treatment samples. Our aim is to find the differentially expressed genes in this clone. Do you think that it does not make sense statistically?

Thank you very much!

Eleni

score 0 · Answer 3 · 2015-12-10

0

Entering edit mode

Eleni Christodoulou ▴ 150

@eleni-christodoulou-2653

Last seen 5.6 years ago

Singapore

Dear Bioconductor people,

Regarding the experiment above, how can I include all my samples (24 samples, 12 pre and 12 post treatment, consisting of 4 biological replicates x 3 technical replicates each) in a paired analysis? I would like to calculate the differences between the pre and post- treatment gene expressions. I am using one color Illumina arrays. This is the head of my dataset, which is a matrix of 24 cloumns (=samples) and 20829 rows (=probes).

head(sel.dataset)
       CL75_parental CL75_parental CL75_parental CL75_GefR CL75_GefR CL75_GefR CL86_parental CL86_parental
A1BG        6.401588      6.606306      6.789451  6.702613  6.776846  6.500754      6.483652      6.405271
A2LD1       7.502173      7.907765      7.591849  7.727033  7.719438  7.610230      7.584259      7.424716
A4GALT      6.783810      6.724379      6.728013  6.697062  6.638398  6.469095      6.396735      6.461457
AAA1        6.760574      6.659293      6.634600  6.834031  6.626767  6.601760      6.388439      6.650179
AAAS        8.454223      8.344867      8.571720  7.969051  8.316882  8.207657      7.929736      8.022445
AACS        8.147424      8.280784      7.908773  7.978336  7.976440  8.156404      7.689166      7.554935
       CL86_parental CL86_GefR CL86_GefR CL86_GefR CL131_parental CL131_parental CL131_parental CL131_GefR
A1BG        6.470172  6.492782  6.527619  6.099269       6.734018       6.606441       6.416072   6.721640
A2LD1       7.370863  7.168888  7.141221  7.303593       7.329300       7.455667       7.221101   7.037138
A4GALT      6.400155  5.973673  6.110442  6.048572       6.396769       6.222972       6.495962   5.918933
AAA1        6.759579  6.784353  6.719203  6.635560       6.520499       6.499157       6.445014   6.328983
AAAS        8.164899  7.942730  7.802210  7.822633       8.303862       8.215421       8.209198   7.791404
AACS        7.555452  8.122289  8.134646  8.004010       7.591696       7.960951       7.996682   7.710887
       CL131_GefR CL131_GefR CL230_parental CL230_parental CL230_parental CL230_GefR CL230_GefR CL230_GefR
A1BG     6.464390   6.750134       6.401766       6.519603       6.739735   6.298140   6.474740   6.432751
A2LD1    6.977045   6.931082       7.816609       7.620243       7.634343   7.703414   7.715810   7.729909
A4GALT   6.052917   5.933933       6.631942       6.749567       6.737948   5.837037   6.056628   5.896360
AAA1     6.438146   6.287846       6.243877       6.270896       6.370740   6.324773   6.445498   6.494862
AAAS     7.731876   7.538460       8.302801       8.134375       8.074086   7.819765   7.867493   7.728546
AACS     7.727346   7.801935       7.678773       7.699589       7.578517   8.044944   7.987840   8.106606

And here is what I have coded:

SibShip <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)  #Each sample has three technical replicates pre and three technical replicates post treatment 

design <- model.matrix(~0+SibShip+colnames(sel.dataset))
fit <- lmFit(sel.dataset, design)

And R returns:

Coefficients not estimable: colnames(sel.dataset) CL86_parental Warning message: Partial NA coefficients for 20929 probe(s)

It is a warning, I know, but I really need to calculate these coefficients and this does not seem to work. Can you help me sort this out?

Thank you very much!

ADD COMMENT • link 8.4 years ago Eleni Christodoulou ▴ 150

1

Entering edit mode

You might as well average the technical replicates together with avearrays. The resulting averaged samples will then directly represent biological replicates from each clone. This will simplify the ensuing analysis without losing information about the features of interest, i.e., the biological variability and the DE between groups. In fact, your current analysis with technical replicates is not appropriate, as your variance estimate represents the technical variability of the Illumina assay. This is not relevant when you are trying to look for results that are reproducible across independent biological samples.

Now, the reason that your model coefficients are not estimable is because colnames(sel.dataset) is nested within SibShip. All samples with CL86_parental and CL86_GefR have a SibShip term of 2. It is impossible to estimate the coefficients for CL86_* and SibShip2; any increase in the latter can be offset by a decrease in the former, such that an infinite number of solutions exist for the same fitted value. This is equally applicable to all your CL* groups. You might as well not put SibShip in the design matrix at all, as this information is already present in colnames(sel.dataset) via the CL identifier.

Finally, if you do average your arrays, then you won't be able to use a one-way layout because you'll only have one biological replicate per group. This requires a different parametrization, one example of which is shown below:

# Assuming you have averaged sel.dataset:
clone <- factor(sub("_.*", "", colnames(ave.dataset)))
treatment <- factor(sub(".*_", "", colnames(ave.dataset)))
design <- model.matrix(~0 + clone + treatment)

This is an additive model where the first few coefficients represent blocking factors for each clone, and the last coefficient represents the treatment effect averaged across all clones. Dropping the last coefficient will then test for DE. This approach will have enough residual d.f. to estimate the variance, even though you only have one sample per treatment/clone combination.

ADD REPLY • link 8.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

Thank you very much for the time you spent on my question and your thorough explanation. I really understood much better my data design.

However, I have not used limma much in the past and some parts are not so clear to me.

What do you mean by 'one-way layout'?
What should the contrast matrix be in this case? Using only the design matrix to build the model is not so intuitive to me...

3. I don't quite understand the whole explanation with SibShip 2 and CL86...Do you have any suggestions on what I should read in order to learn more about how this coefficient estimation works?

Actually, in the past, I was treating my replicates as biological (I did not know details about the experiment...now I do). So I was using the following code:

# # # Fit a linear model (limma)
design.mat <- model.matrix(~colnames(sel.dataset))
colnames(design.mat) = levels(factor(colnames(sel.dataset)))
fit <- lmFit(sel.dataset, design.mat)
fit$genes = rownames(sel.dataset)

##  ##   ##  Parental vs Gef_R clone comparisons
contr.mat_parental.gefr <- makeContrasts(CL75_GefR-CL75_parental, CL86_GefR-CL86_parental, CL131_GefR-CL131_parental, CL230_GefR-CL230_parental, levels=factor(colnames(sel.dataset)))
contr.fit_parental.gefr<-eBayes(contrasts.fit(fit, contr.mat_parental.gefr))
coef.names = colnames(coef(contr.fit_parental.gefr))

#shows the results of CL75 comparison
print(topTable(contr.fit_parental.gefr, coef=coef.names[1], adjust='fdr', number=10))

#shows the result of clone 86 comparison
print(topTable(contr.fit_parental.gefr, coef=coef.names[2], adjust='fdr', number=10))

volcanoplot(contr.fit_parental.gefr, highlight=10, coef=coef.names[1], names=fit$genes)

The genes that I was retrieving as DE in each per clone comparison (i.e CL75 parental with CL75 gefr) are very close to with what I retrieve with the model that you suggest, with coef=5. The rest coefficients (coef=1,2,3,4 in topTable) give totally different genes.This worries me...If topTable with coef=5 retrieves the average across samples, why is it similar to my previous per clone comparisons (i.e CL75 parental with CL75 gefr)?

Thank you very much for your feedback!

ADD REPLY • link 8.4 years ago Eleni Christodoulou ▴ 150

1

Entering edit mode

A one-way layout is a design where each group of samples has its own independent term in the design matrix. So, the simplest of designs, where you have one factor with several samples for each level of the factor, and the design only contains that factor (i.e., no additive or interaction terms). The design I suggested above is not a one-way layout, because you have clone added to treatment.

As for the contrast matrix, I presume you want to test for the treatment effect, in which case you can just drop the last coefficient of design. This treatment effect represents the average across all clones - you can't get the treatment effect for specific clones, because you only have one biological replicate per treatment in each cell line.

As for your lack of understanding, look at the design matrix you got in your original post. If you take the rowSum of the CL86_parental and CL86_GefR columns, you'll end up with the same values as the Sibship2 column. This means that any change in the Sibship2 coefficient can be offset by an equal and opposite change in the values of both of the CL86_parental and CL86_GefR coefficients. That means that you can't estimate these coefficients, because there's no unique set of values that will minimize the sum of squares. It's as if I told you to solve for x, y and z in x = y + z. There's an infinite number of solutions for that equation.

Finally, you will get different results because the interpretation of the coefficients are all different. Coefficients 1 to 4 in my model are blocking factors that represent the log-expression in each clone. Dropping these make no sense, as this is equivalent to asking whether log-expression is non-zero (which is obviously silly, because most of your genes will have non-zero expression). As I said before, you can no longer get the treatment effect for specific clones. I don't see this as being a problem, as the clone-specific treatment effect doesn't seem that interesting; it would only be relevant to that clone and cannot be generalized to future clones, which defeats the purpose of getting reproducible DE genes.

ADD REPLY • link 8.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

Thank you very much for your explanations, I got to understand better how limma works. I have some more points I would like to ask you if possible...

1. To my understanding, I can detect the contrasts of interest either by dropping a coefficient in the design matrix, or by specifying the contrast matrix. I am thinking of what the respective contrast matrix be, if I want to check for the average effect of treatment across all clones. I cannot see a way of doing that with my current design matrix...Is it maybe not possible in this case?

2. Moreover, my lab is now preparing biological replicates for each of the clones. In this case, I will have more same-named columns in my sel.dataset and thus in my design matrix, it would be something like:

CL75_Par	CL75_Par	CL75_GefR	CL75_GefR	CL75_GefR	.	.	.	.
1
1
	1
	1
		1
		1
			1
			1
...	...	...	...	...

with as many rows as arrays and as many columns as samples (I will average again the technical replicates using avearrays). We are interested in subclonal differences, so we really need to compare each gefR clone with its parental. Would this be possible with the new data that I am expecting? I am thinking of using a contrast matrix contrast.matrix <- makeContrasts(CL75GefR-CL75Par, CL86GefR-CL86Par....., levels = factor(colnames(sel.dataset)), and then use

contr.fit_parental.gefr<-eBayes(contrasts.fit(fit, contrast.matrix)), where fit would be

fit <- lmFit(sel.dataset, design). For selecting the comparison of interest, I would drop the respective coefficient at the topTable command. For example, for detecting the DE genes between the CL75Par and CL75GefR I would use coef=1.

And, when we retrieve the rest samples, there will obviously be a batch effect, as they are hybridized on a different day than our initial batch. Do you suggest adding the batch as an additive effect in th above model like:

design <- model.matrix(~0 + clone + treatment + batch)

or is it better to use the ComBat package from bioconductor?

3. Regarding the intercept, this is my understanding and please correct me if I am wrong: With 0 intercept, ie. (i) model.matrix (~0+treatment1+treatment2) and contr.mat<-makeContrasts(treatment2-treatment1, levels=c("treatment1","treatment2")) we compare the treatments with each other and with (ii) model.matrix (~treatment1+treatment2) we compare each treatment to a common reference. If the reference is treatment1, then (ii) is similar to (i) when we use drop=3 (the last coefficient).

Thank you so much for your time!

Best regards,

Eleni

ADD REPLY • link 8.3 years ago Eleni Christodoulou ▴ 150

2

Entering edit mode

If by "current design matrix", you mean design.mat, then you could do it with something like this:

con <- makeContrasts(
    (CL75_parental + CL86_parental + CL131_parental + CL230_parental)/4 - 
    (CL75_GefR + CL86_GefR + CL131_GefR + CL230_GefR)/4, 
levels=design.mat)

Obviously, if you're using the design matrix I suggested before, then you just drop the last coefficient.

As for your new "biological replicates", I think you need to clarify the nature of your clones. If they are randomly selected clones from a larger population, then true biological replication would involve assaying new clones, rather than re-assaying the same set of clones (which would be closer to technical replication). This is because you need more samples of the population to get more information about how it behaves. It would only make sense to examine behaviour of the individual clones if they are interesting in and of themselves, e.g., xenograft samples where you want to find patient-specific responses and don't care about how it generalizes to other patients. If this is the case, then your replication strategy makes more sense, but I'd still hesitate to call them biological replicates.

In any case, with the additive model, you cannot detect clone-specific treatment effects. This is intentional, because such effects are uninteresting when the clones are randomly selected from a larger population. If you want to detect these, you'll have to use a one-way layout (e.g., with the column names as a factor). Whether this makes any sense at all depends on the interpretation of the clones. If they are indeed randomly selected, then using a one-way layout would be crazy, as you'll end up with lots of DE that cannot be generalized to the population from whence they came.

Finally, I have no idea where treatment1 and treatment2 come from, so I don't know what you're referring to.

ADD REPLY • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

Thank you very much for your insightful explanations.

Regarding the contrast matrix: Thank you! It was so obvious but for some reason I could not think about it.

Regarding the new samples: They are from the same clones as before, which were not randomly selected; they exhibit extreme response phenotypes to certain treatment. So, we want to check the genes that are over or under-expressed in them so as to understand why they have this extreme behavior. They are not true biological replicates, you are right, so I'd better not call them as that.

So you say that an additive effect is better used if one wants to generalize his results? That is interesting, I had not realized it. So, in my case, I suspect I should have something like the following:

design.mat <- model.matrix(~colnames(sel.dataset))  #here my sel.dataset contains more columns with the same name, as I have more than one 'biological' replicates (I know, not correct, but I don't know how to indicate them here so as to make clear the difference form the rest replicates)

colnames(design.mat) = levels(factor(colnames(sel.dataset))) #better than putting it manually because I want to have it connected with the way I created my desgn matrix

contr.mat_parental.gefr <- makeContrasts(CL75_GefR-CL75_parental, CL86_GefR-CL86_parental, CL131_GefR-CL131_parental, CL230_GefR-CL230_parental, levels=factor(colnames(sel.dataset))) #contrasts between all parental and all GefR clones

contr.fit_parental.gefr<-eBayes(contrasts.fit(fit, contr.mat_parental.gefr))
coef.names = colnames(coef(contr.fit_parental.gefr))
print(topTable(contr.fit_parental.gefr, coef=coef.names[1], adjust='fdr', number=10))  #i.e. for the first pairwise comparison, clone 75

and correct for the batch effect after building the model, right (include the model as a variable in the ComBat function)?

Regarding the treatment variable: I am sorry, this was a general question on when and how to use the intercept and has nothing to do with the previous task. I am reading on limma now and just found that I am not certain about this point.

Thank you very much once again!

Best regards

ADD REPLY • link 8.3 years ago Eleni Christodoulou ▴ 150

0

Entering edit mode

So you say that an additive effect is better used if one wants to generalize his results?

Well, that depends. In this case, if you want to generalize across clones, then yes, an additive matrix is better. This assumes that the treatment effects in different clones are replicates of an underlying common treatment effect; if the underlying treatment effect is different for each clone (i.e., there's some clone/treatment interaction effect), then obviously, there's not much point in generalizing at all.

and correct for the batch effect after building the model, right...?

For the batch effect, you can put it in as an additive term in the design matrix. If you have new samples for every clone/treatment combination in the second batch, then all coefficients should still be estimable after blocking on the batch in the design. I find this a bit more convenient than trying to remove the batch effect separately; also, including the batch in the design matrix means that limma can account for the uncertainty of estimation of the blocking factors, which is more statistically rigorous.

ADD REPLY • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank you very much Aaron! So I can test for batch effects by just creating a design matrix as follows:

design.mat <- model.matrix(~colnames(sel.dataset), batch=c(1,1,1,0,0,1,.....) #1 if the sample is from the second batch, 0 if it is from the first batch

Please correct me if I am wrong and thanks again for your invaluable input.

ADD REPLY • link 8.3 years ago Eleni Christodoulou ▴ 150

1

Entering edit mode

You need to add it:

design.mat <- model.matrix(~colnames(sel.dataset) + batch)

ADD REPLY • link 8.3 years ago Aaron Lun ★ 28k

0

Entering edit mode

Oh, yes, that's what I meant...sorry :(...and thanks a lot again!

ADD REPLY • link 8.3 years ago Eleni Christodoulou ▴ 150

0

Entering edit mode

Dear Aaron,

I have performed the limma analysis and found the DE genes in the way that you suggested; adding the batch as a factor in the model. Now, I actually want to look into the fold change of specific genes. Can I extract the batch-corrected values somehow in limma? Or do I need to do the batch correction separately?

Thank you vert much,

Eleni

ADD REPLY • link 8.1 years ago Eleni Christodoulou ▴ 150

0

Entering edit mode

That's what topTable is for. Run it with n=Inf, and subset by the gene name.

ADD REPLY • link 8.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

Great! Thank you so much for the clear and quick response.

Best regards,

Eleni

ADD REPLY • link 8.1 years ago Eleni Christodoulou ▴ 150

0

Entering edit mode

Hi Aaron,

..I actually meant gene expressions, not fold changes. I think i can just correct for batch effect using RemoveBatchEffect prior to any linear model fitting and use the returned values for clustering and gene expression value extraction. Do you agree?

Thank you very much!

Eleni

ADD REPLY • link 8.1 years ago Eleni Christodoulou ▴ 150

0

Entering edit mode

Yes, as long as you don't use the batch-corrected values for linear model fitting.

ADD REPLY • link 8.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

No, I don't. Great! Thanks so much!

ADD REPLY • link 8.1 years ago Eleni Christodoulou ▴ 150

1

Entering edit mode

Regarding the treatment 1 and treatment 2 you are referring to the terms "pre and post drug treatment " you have mentioned in your description ? If so, you might consider the interaction term : in your design matrix---however, your design matrix is already complex and without knowing for certain what treatment 1 & 2 represent i would not like to infer any inappropriate coding, as already Aaron & Gordon have provided valuable and detailed explanations.

ADD REPLY • link 8.3 years ago svlachavas ▴ 830

0

Entering edit mode

Thank you Stathis for the comment. It was not clear, I know. Look above :) "I am sorry, this was a general question on when and how to use the intercept and has nothing to do with the previous task. I am reading on limma now and just found that I am not certain about this point."

Best regards

ADD REPLY • link 8.3 years ago Eleni Christodoulou ▴ 150