Question

Possible explanations and suggestions corresponding the interprentation of results of limma multifactorial design

1

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 6 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor Community,

i have returned with a previous problem i have faced from a previous post (Possible implementation of multifactorial contrasts in limma regarding a microarray dataset). To provide a small summary from the link, with the great assistance and crusial suggestions of Aaron Lun i implemented in my merged colorectal affymetrix dataset, comprised of 30 different patients with two samples each(cancer and adjucent control), a multifactorial design in limma, incoprorating except from the "paired comparison" the specific tumor location of each patient:

head(pData(eset.2))
Disease Location Meta_factor Study
St_1_WL57.CEL Normal sigmoid_colon 0 hgu133plus2
St_2_WL57.CEL Cancer sigmoid_colon 0 hgu133plus2
St_N_EC59.CEL Normal sigmoid_colon 0 hgu133plus2
St_T_EC59.CEL Cancer sigmoid_colon 0 hgu133plus2
St_N_EJ58.CEL Normal cecum 0 hgu133plus2
St_T_EJ58.CEL Cancer cecum 0 hgu133plus2

with the following code:

pairs <- factor(rep(1:30, each=2))
design <- model.matrix(~ 0 + pairs + Disease:Location, data=eset.2)
design <- design[,-grep("Normal", colnames(design))] # get full rank
colnames(design)
fit <- lmFit(eset.2, design)
fit2 <- eBayes(fit, trend=TRUE)

Moreover, as Aaron mentioned, except from the first 30 coefficients which represent to the patient-specific blocking factors, the coefficients 31:39 correspond to the location-specific log-fold change of cancer vs control samples.

Thus, for istanse top2 <- topTable(fit2, coef="DiseaseCancer:Locationascending_colon", number=nrow(fit2), adjust.method="fdr", sort.by="none") represents the log-fold change of the 8 ascending colon tumor samples versus their adjucent controls

So, to adress and highlight my problem, although there are 9 coefficients which represent 9 different anatomic locations, only in the above(ascending colon) i get from topTable DE genes with an FDR < 0.05.

table(pData(eset.2)$Location)

ascending_colon cecum descending_colon left_colic_flexure
10 12 6 2
rectosigmoid_colon rectum right_colic_flexure sigmoid_colon
4 8 6 10
transverse_colon
2

As you can see from the above, although some comparisons include a very low number of samples, in the other comparisons it is very weird (like cecum and sigmoid colon) not to return any DEG genes. I acknoledge that each study has its specific details and experimental design regarding the biology of each system, as also other various reasons related with the analysis itself, but as it is a remaining issue i also discussed with my lab members and remains under discussion, i would like to adress some more questions(as im mainly a molecular biologist and perform data analysis about a year) to get also a statistical view or suggestions on the matter:

1. As Aaron also suggested various possible explanations, i would like first to highlight one of them(and please correct me for any misunderstanding or misconception): how it is possible(from a statistical view) ,a high variability in a specific location(although patients are different) to mask the DE genes expression in the other locations ? Even if generally cancer vs control samples generally yield a lot of different gene expression patterns ? And furthermore, as this model Aaron suggested models the variability in the cancer/normal logFC between the patients ?

2. One other suggestion was to construct an MDS plot but based on the log-fold changes of each of the above nine coefficients. Thus, i have to use something like this:

q <- fit2$coefficients[,31:39] and then run: plotMDS(x=q, top=500, gene.selection="pairwise") ?

And if the above implementation is correct, i should inspect the plot for samples that "behave" as outliers or gather far from the others ?

3. To continue from the question 3, then i should remove the specific patient samples from my expression set, and then re-run the analysis ?

I know also the function removeBatchEffect() but i have never use it and i dont know if an how fits for my current problem ?

4. Finally, there are any other diagnostic plots or ways to adress the problem, or alternative solutions ?

5. To mention, before performing limma, from my EDA plots like clustering the samples and PCA, one specific control sample seems to be a common outlier. Also, i have perfomed paired analysis with limma without the location information

(*Please excuse me for any naive questions and my long message, but as im generally new in R and i dont have many experience in these specific issues and statistical models, i would like to cover any important part of my analysis and also to gain as much as possible feedback !!)

limma multifactorial design diagnostic plots affymetrix microarrays • 1.8k views

ADD COMMENT • link updated 8.6 years ago by James W. MacDonald 65k • written 8.6 years ago by svlachavas ▴ 830

2

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 13 hours ago

United States

When you fit an ANOVA model and compute contrasts, the denominator of your statistic is based on (essentially) the average within group variability, using all groups. In one sense this is better, as you are using more data to estimate variance, and all else equal, more data is better than less data for estimating variance. However, if you have one group that has an unexpectedly high within-group variance, it may inflate the variance estimate 'too much'.
If you have highly variable samples, you might consider using a weighted model fit. Sometimes you have enough samples that you can afford to remove outliers, but that is pretty ad hoc, and usually if you remove one sample, then another starts looking like an outlier, so you remove that one too, but then there is another outlier...
Like I said, using sample weights is often a better idea than seeking and removing outliers. removeBatchEffect() is only intended to remove known batch effects, and in that case only for plotting purposes. If you have known batches you should either fit as part of the linear model or use sva.

ADD COMMENT • link 8.6 years ago James W. MacDonald 65k

0

Entering edit mode

Dear James,

thanks so much for your explanations. Because im still a "noob" in stastistics, i naively thought that modeling the tissue-specific log fold change of each of the nine above comparisons, will control of this variance in each location. Thus, if i have understand properly your explanation vol.1, although the above flexible model Aaron suggests models the variability in each tumor location, you pinpoint that if a group of a specific anatomic location(each comprises of different patients) presents an unusual high variance between its samples(i.e between the different patients which correspond to the same location), because it is included when the model is fitted and the variance is estimated from all the samples(naively said, that is for each group the statistical test stills computes the variance from all the samples, not only the ones compared for this specific location right ?) , it will possibly "change" the variance estimates, correctly ?(or im still "out of topic"). I insist on this, because i was struggling to explain to my biologist lab members why an unexpected variance from one group-location, could "hide" the results-and thus any results of deg expression in the other categories, even the ones with the largest number of samples !!

2. Regarding the weighted model fit you have proposed, because i have not perform in the past anything similar, i should refer to the chapter 14 of limma users guide(page 64) ?. In detail, i have to use the arrayWeights() function and then perform again all the above steps of my analysis ??

3. Furthermore, i have noticed that in some vignettes of various packages(like WGCNA), when EDA is performed before any downstream statistical methodology, if any outlier samples are characterized, they are totally removed from any further analysis. Or from your perspective and also regarding my analysis(i have one sample which i mention above) i should not remove in that way any sample, and consider the other options ?

4. Finally, i followed Aaron instructions, and i create a matrix with the 9 coefficients representing the log-fold changes of each anatomic location[q <- fit2$coefficients[,31:39]; plotMDS(x=q)].

The dropbox link for the above plot is : https://www.dropbox.com/s/ixa337ytuwemgvh/plotMDS_coefficients_9.jpeg?dl=0

So, according to the inspection of the plot, i could consider the coefficient "DiseaseCancer:LocationLeftColicFlexure" and thus the specific anatomic location-group as a possible outlier ?? Or the interprentation of this specific MDS plot with the log-fold changes as an input is different ?

5. Finally, considering the batch effect removal or sva, i have already performed COMBAT when merging the two datasets, and then using various plots showed a generally good separation and the two studies nicely mixed together.

Thank you in advance for your time and help !!

ADD REPLY • link 8.6 years ago svlachavas ▴ 830

1

Entering edit mode

Plotting the coefficients in the MDS plot won't really tell you that much. What I meant was to make a MDS plot from the cancer/normal log-fold changes within each patient (i.e., run plotMDS on a matrix with one log-fold change for each patient, such that there are 30 values in total for each gene); or, even better, run removeBatchEffect to remove the patient-specific batch effects, and then make a MDS plot from the resulting batch-free expression matrix. The latter is a bit better as it will show changes between cancer and normal samples, whereas the former focuses on differences in the cancer/normal changes between anatomical locations (the actual value of the cancer/normal log-fold changes is not visualized). In both cases, the idea is to have a look at the spread of the samples for each anatomical location. Ideally, patients/samples from the same location should form a reasonably tight cluster that is distinct from other clusters, indicating that there is some differential expression to be found between cancer/normal samples and/or between locations.

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

please excuse me if i misunderstand your suggestions and created something irrelevant. So, you propose of making an MDS plot with 30 values-log FCs for each gene. But how can i isolate these coefficients ? From the first 30 coefficients which are the patient-blocking factors ? Or i have to create another design matrix ?

Secondly, regarding your proposal of removeBatchEffect() function-because i couldnt find it from the limma users guide and only from ?removeBatchEffect :

removeBatchEffect(x, batch=NULL, batch2=NULL, covariates=NULL, design=matrix(1,ncol(x),1), ...)

So, if im correct, x represents my initial expression matrix(exprs(eset.2)).

But regarding the argument : "batch/batch2 :factor or vector indicating batches" what should i define ? Probably something like : batch <- pairs # where pairs is the factor defining the 30 paired samples ( pairs <- factor(rep(1:30, each=2)) ?

Im asking this because i couldnt find an example in the users guide, and from the documentation of the function it states that it is used prior to EDA plots, as also it is not intended to use with linear modelling.

Thus, the only way to deal with this problem after the above "diagnostic procedure", is to use arrayWeights() prior using the same code as above ?

Thank you for your consideration on this matter !!

ADD REPLY • link 8.6 years ago svlachavas ▴ 830

1

Entering edit mode

Forget about the log-fold changes. Just use the second method with removeBatchEffect to make your MDS plot, it's probably more relevant for your application. As for the batch argument, setting it to the pairs object would be correct in order to get rid of patient-specific differences in expression. And, yes, the best remedy to any problematic samples would be to use arrayWeights to downweight them. If there are obvious outliers, then these could be directly removed, but see James' comment above.

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron, thanks for the verification of the above steps-but as i searched chapter 14(and especially the example 1-14.2 section/page 64, as also 14.4. i have two important and quick questions:

1. Because i can't reproduce the code from the example which is not similar to my study: i have to feed in the arrayWeights() my above created design matrix ? i.e.,

design <- model.matrix(~ 0 + pairs + Disease:Location, data=eset.2) and then

array_weighted<- arrayWeights(design); fit <- lmFit( design, weights=array_weighted)...??

or because of the specific complexity of my design matrix(not all individual samples represented here), i have to implement something different ?

For instance : arrayw <- arrayWeights(eset.2)..but if this is the case how can i move to lmFit ?

2. Secondly, for the interprentation of the effect of arrayWeights(except from the final outcome of genes), , i could create a barplot(similar as the one in the above example) ? To see which possible array(s) get the maximum weight and could be considered as outliers (except from the other diagnostic plots you have mentioned)?

ADD REPLY • link 8.6 years ago svlachavas ▴ 830

0

Entering edit mode

Read ?arrayWeights.

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

I had already checked it along with the vignette above(however, although in the description the object argument accepts a kind of matrix/ExpressionSet(..), the example in the tutorial uses the same object: "MAlms" both for the arrayWeights and for the lmFit, so thats why i was confused of which of the above i should use)

So , to conclude : arrayw <- arrayWeights(eset.2); fit <- lmFit(eset.2, design, weights=arrayw); and then ebayes() and the following is the right choise ? (#with the above design matrix)

ADD REPLY • link 8.6 years ago svlachavas ▴ 830

1

Entering edit mode

You should also pass design to arrayWeights.

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

I see. I think i got it now :

design1 <- model.matrix(~ 0 + pairs + Disease:Location, data=eset.2)
design1 <- design1[,-grep("Normal", colnames(design1))] # to get the full rank you have mentioned

arrayw <- arrayWeights(eset.2, design=design1); fit <- lmFit(eset.2, design1, weights=arrayw).... (and continue with eBayes and inspect the results again for each of the nine comparisons-locations)

ADD REPLY • link 8.6 years ago svlachavas ▴ 830

0

Entering edit mode

Dear Aaron,

i returned with some final questions and feedback about the above implementations, because with the arrayWeights implementation (with the exception of one specific location) all the other ones returned some hundreds of DE genes, and with very interesting functional enrichment analysis results !! So, i would like to post some final comments/questions for evaluation:

[Assuming my below code is correct]

pairs <- factor(rep(1:30, each=2))

design <- model.matrix(~ 0 + pairs + Disease:Location, data=eset.2)

design1 <- design[,-grep("Normal", colnames(design))]

arrayw <- arrayWeights(eset.2, design=design1)

fit <- lmFit(eset.2, design1, weights=arrayw)

fit2 <- eBayes(fit, trend=TRUE)....:

1. Regarding the object arrayw, i used the code from the users guide:

barplot(arrayw, xlab="Array", ylab="Weight", col="white", las=2) ; abline(h=1, lwd=1, lty=2)

Thus, according to the highlighted value=1 of weight in the y-axis[used both in the paper in 2006 and the users guide], the arrays which fall less from one(especially these with very low values), are these which have possible technical problems and/or the possible unwanted high variation we have discussed, and can be called as "less reliable" ?

2. If my above procedure is correct, then because the above arrays represent a relative important number of the total arrays, there is no point in excluding them from further analysis-just stick to the code above and include them in the analysis, but down-weighted right ?

3. Regarding the batch effect interprentation, the function removeBatchEffect & the explanation of the plotMDS function: i first created one MDS plot before limma and without removeBatchEffect and one after the function removeBatchEffect:

corrected <- removeBatchEffect(exprs(eset.2), batch=pairs) # the pairs factor from above

correct <- new("ExpressionSet", exprs = corrected)...

i) the main difference from the MDS plot before and after removing batch effect is that the cancer and control groups cluster better and more close to each of the two classes (with the exceptions of some samples and of course of the one clear sample outlier-Normal) ?

ii) And finally, the leading logFC dim1 in both plots represents (mostly) the biological variation ? that is the difference between cancer and control groups, while the leading log FC dim2 also represents the underlying biological differences between the two groups, but highly reduced in comparison to dim1, right ?

The links for the three plots:

Thank you all so much for your time, help and valuable suggestions !!

ADD REPLY • link 8.6 years ago svlachavas ▴ 830

1

Entering edit mode

Arrays with lower weights are of lower quality. Arrays with higher weights are of higher quality. I think they multiply to unity within each group, but I wouldn't read too much into whether they are greater or less than one. For example, a small number of arrays with very low weights will push all others up, even if some of those other arrays are of low quality.
Yes.
Yes, it does look better after removing the batch effect. However, it's still concerning that some cancer samples are mixed in with the normal group, and vice versa. You should have a close look at these samples, and check that they're not mislabelled or anything. As for what the dimensions mean; you can see that the first dimension corresponds to the difference between cancer and normal samples, while the second dimension represents the variability within each group (possibly related to anatomical location, you'll have to look into this yourself).

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

I see !! Thank you for your confirmation about the 2 first questions-regarding the 3rd question, i also have labeled and colored the samples based on their anatomic location so i will check this visualization info -but concerning the important issue about some mixed cancer with normal and the reverse:

I have also noticed this issue when i analyzed separately each dataset-especially on the one dataset. Definately i should check one more time these samples(i have checked them a lot of times), but from a "complete naive perspective", could this happen due to general heterogeneity of tumor biopsy specimens or even due to general variation ?

Anyway i should check also if these specific mixed-up samples belong to the same patients(matched).

ADD REPLY • link 8.6 years ago svlachavas ▴ 830

1

Entering edit mode

Dunno. You'll have to ask an expert for this type of data.

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

score 3 · Accepted Answer · 2015-09-16

You never 'control' variance. You use variance as a yardstick to say if a given fold change is big enough to reliably say the underlying population difference is different from zero. The larger your variance, the bigger the fold change has to be. But do note that we are using sample variance as our estimator, and sample variance is not a very efficient statistic. So it takes more data for the sample estimate to converge to the underlying population parameter we are trying to estimate. This is why an ANOVA model with multiple groups is more powerful than a t-test; you borrow strength from all the data in the model. But if one group is much more variable, then it may undermine the results. Or maybe that group is more reliably estimating the true variance, and the other groups have lower than expected variance.
Yes. You estimate weights and then use them in the lmFit() step.
Methods like WGCNA don't accept sample weights, so you have to do other things. Removing outliers is one option.
I'll let Aaron answer questions about suggestions he has made.

I applaud your interest in learning statistics, and wanting to do things for yourself. But do note that the primary purpose of this support site is not teaching people statistics, but helping people who have technical problems when running Bioconductor packages. I (and Aaron, Gordon, and some others) do sometimes veer off topic to help people with fundamental statistical questions, but you should not consider this to be an on demand Coursera teaching site.

If you really plan to continue analyzing data for your group, you should take it upon yourself to learn enough statistics that you can both competently analyze data, as well as explain to your colleagues what you have done, why you did it, and the tradeoffs you had to make. You cannot expect to get that knowledge here, by asking repeated questions about each analysis you undertake. Both Rafael Irizarray's group (on edX) and Jeff Leek/Roger Peng/Brian Caffo (on Coursera) have online classes that you can take. There is also Julian Faraway's R modeling book that you can get free on r-project.org. Those are more appropriate ways to get the knowledge you seek.