Question

Possible implementation of multifactorial contrasts in limma regarding a microarray dataset

1

Entering edit mode

svlachavas ▴ 830

@svlachavas-7225

Last seen 6 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

I decided to repost and create a new complete post-question from one of my previous post regarding one part of my comments-questions(C: Correct creation of design matrix in limma regarding a multi-level experiment), in order to get as much as possible feedback and evaluation of my implementation regarding my project analysis, from the specialists of the field concerning multifactorial analysis with limma. In summary, im showing a part of the phenotypic information of my merged microarray affymetrix dataset:

head(pData(eset_COMBAT))
Disease Location Meta_factor Study
St_1_WL57.CEL Normal sigmoid_colon 0 hgu133plus2
St_2_WL57.CEL Cancer sigmoid_colon 0 hgu133plus2
St_N_EC59.CEL Normal sigmoid_colon 0 hgu133plus2
St_T_EC59.CEL Cancer sigmoid_colon 0 hgu133plus2
St_N_EJ58.CEL Normal cecum 0 hgu133plus2
St_T_EJ58.CEL Cancer cecum 0 hgu133plus2

...[in total 60 samples]

In detail, I would like to incorporate the tumor location information, as I have mentioned in the above post[" in my current project we are highly motivated in implementing this information in our microarray analysis, as from both biology field and other perspectives, we are interested in identifying-if it is possible and regarding the results- any crusial and interesting gene expression alteration patterns between different anatomic locations of the patiens colorectal tumors involved in our analysis(i.e ascending vs descending colon, which could possibly underline fundamental functional and distinct pathways-groups of genes involved"). Thus, I want to perform two major analyses if possible:

A. To compare specific tumor (locations) vs their adjacent control samples (i.e ascending. Tumor vs ascending. Control)

B. If it is possible compare different tumor anatomic locations (i.e ascending vs descending tumor samples)

So,- if my idea/implementation although naive is correct without any incorrect points-, my main questions are:

1. In order to start incorporate this information, should I proceed like the limma guide[section 9.7, page 48]-that is:

combine <- factor(paste(eset_COMBAT$Disease, eset_COMBAT$Location, sep="."))

design <- model.matrix(~0 +combine)

colnames(design) <- levels(combine)

2. Secondly, to estimate the correlation between the matched samples on the same patient:

corfit <- duplicateCorrelation(eset_COMBAT, design, block=pairs) ?

OR regarding the block argument should i firstly incorporate this information in my phenoData object:

eset_COMBAT$Subjects <- pairs

and then corfit <- duplicateCorrelation(eset_COMBAT, design, block=eset_COMBAT$Subjects) ??

3. Finally, fit the selected model with

fit <- lmFit(eset_COMBAT,design,block=eset_COMBAT$Subjects,correlation=corfit$consensus)

and then highlight the specific contrasts of interest with makeContrasts..etc??

Any suggestions/corrections/evaluation of my above ideas-code would be essential for my project !!

limma multifactorial design design and contrast matrix affymetrix microarrays • 2.0k views

ADD COMMENT • link updated 8.7 years ago by Aaron Lun ★ 28k • written 8.7 years ago by svlachavas ▴ 830

score 3 · Accepted Answer · 2015-09-02

3

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 6 hours ago

The city by the bay

Based on the previous post, it seems you have a paired-sample design where each pair of normal and cancer samples are sourced from the same patient. I believe that a more appropriate design would be something like:

pairs <- factor(rep(1:30, each=2))
design <- model.matrix(~ 0 + pairs + Disease:Location, data=eset_COMBAT)
design <- design[,-grep("Normal", colnames(design))] # get full rank

The first 30 coefficients correspond to patient-specific blocking factors, while the last few coefficients represent the location-specific log-fold change of cancer over normal samples. This increases the flexibility of the model, as it means that the cancer can have different effects in different locations. Dropping each of these coefficients will test for the effect of the cancer in the corresponding location, which answers your question A.

Similarly, using makeContrasts to compare one coefficient to another will test for a differential effect of cancer between locations. This should answer your question B (note that you need to change the coefficient names though, as makeContrasts doesn't like expressions with colons in them). I'd argue that it's more relevant to test for differences in the cancer/normal log-fold change than directly testing for differences in cancer sample expression between locations. This is because you've got matched normal samples for each patient that corrects for patient-specific biases - you might as well use them.

You don't have to use duplicateCorrelation anywhere with this approach. This avoids some of its limitations; check out A: LIMMA: interpreting interaction terms in an unbalanced factorial design with rep for a brief description.

ADD COMMENT • link 8.7 years ago • updated 7.8 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

thank you for your detailed answer. Despite being a bit "noob", i would like to ask you some further questions/explanations in order to understand fully your implementation and dont ask every time a different subject-because this part is extremely important-

1. Correctly, as you have mentioned, for each patient tissue from both the tumor and the adjucent control sample has been pooled. Thats why, i have performed already a paired analysis, as Mr Gordon Smyth corrected without the use of the tumor location, in order to compare total tumor vs control samples

2. Regarding the "more appropriate design you mention"-with the 3 lines of code-:-if i have undertand well(without having tested your code in order to understand some basic parts) :

A. Your above procedure tests for differences between each tumor anatomic location group(i.e asceding tumor colon samples) vs their adjucent control samples ? That is the use of Disease:Location(:) to indicate the interaction ? or it also returns a coefficient like "conditionCancer" representing the "total effect size"(if i describe it correctly) between cancer and control samples ? Im asking this because i have also from the previous post acquired a topTable regarding DEG probesets.

Thus, should i re-run my limma analysis with your above code ?

B. Secondly, the makeContrasts() you indicate can also lead in the same result for my question A? Or if i would like to perform, regarding my question B, comparisons like: asceding_colon.tumor samples vs descending_tumor samples, then i have to use this approach ? And if yes, which would be my design matrix for this case ? The above from your answer ?

3. Finally, regarding your answer about duplicateCorrelation: in order to have a complete knowledge on this matter, the reason you are proposing that i dont have to use duplicateCorrelation() anywere, is that it will exploit the variation between the patients in the model, so i will probably loose any important differences regarding my comparisons??

ADD REPLY • link 8.7 years ago svlachavas ▴ 830

2

Entering edit mode

Your original analysis does not allow for location-specific cancer effects. So, it's not as flexible to what I've proposed in my answer above.
Yes. Interaction. Yes. Yes, though it's easier to just drop a coefficient. No. You would need to use duplicateCorrelation, though I feel it's more relevant to test for differences in cancer/normal log-fold changes between tumour locations. No.
The real question is how you want to model variability between patients. As I've said, the cancer/normal log-fold change seems to be more relevant for your analysis, so you should be modelling variability in the log-fold change between patients. In contrast, duplicateCorrelation will model variability in absolute expression between patients.

ADD REPLY • link 8.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

So, to sum up :

i will run your code, evaluate my results and i will provide feedback and any other questions that maybe arise.

Finally, regarding your answer in my second question-as you correctly again mentioned, our project interest is mainly in observing any interest gene expression patterns between different tumor locations, vs their adjucent control samples(with of course the necessary requirement, that this does not fit in all possible locations-i.e for the left colic flexure, there is only one cancer and its control sample). My second possible implementation about comparing directly "different tumors without their control samples", resulted from the notion that we thought that could be also some interesting observations regarding the direct comparison of tumor samples of different anatomic locations.

Thus, to conclude, if i want also to perform the B.part of the comparisons, i have to use makeContrasts and duplicate correlation, along with my original post code right ?:

combine <- factor(paste(eset_COMBAT$Disease, eset_COMBAT$Location, sep="."))

design <- model.matrix(~0 +combine)

colnames(design) <- levels(combine).....

ADD REPLY • link 8.7 years ago svlachavas ▴ 830

0

Entering edit mode

Dear Aaron,

i runned your code above, so i would like to mention my final questions, in order to correctly construct my final model and comparisons without any misconseption:

pairs <- factor(rep(1:30, each=2))

design <- model.matrix(~ 0 + pairs + Disease:Location, data=eset_COMBAT)

design <- design[,-grep("Normal", colnames(design))]

1. After the above three blocks, as you mentioned : colnames(design)

[1] "pairs1"
[2] "pairs2"
[3] "pairs3"
[4] "pairs4"
[5] "pairs5"...(total 30 pairs) and then

[31] "DiseaseCancer:Locationascending_colon"
[32] "DiseaseCancer:Locationcecum"
[33] "DiseaseCancer:Locationdescending_colon"
[34] "DiseaseCancer:Locationleft_colic_flexure"
[35] "DiseaseCancer:Locationrectosigmoid_colon"
[36] "DiseaseCancer:Locationrectum"
[37] "DiseaseCancer:Locationright_colic_flexure"
[38] "DiseaseCancer:Locationsigmoid_colon"
[39] "DiseaseCancer:Locationtransverse_colon"

Moreover, as you have stated above the coefficients from 31-39 "represent the location-specific log-fold change of cancer over normal samples".

Thus, to continue fit my selected model, i should continue with:

fit <- lmFit(eset_COMBAT, design)

fit2 <- eBayes(fit, trend=TRUE) ?? and i dont have to define any contrasts with makeContrasts() as the interaction term (:) includes them ??

For example, if i type: top2 <- topTable(fit2, coef="DiseaseCancer:Locationascending_colon", number=nrow(fit2), adjust.method="fdr", sort.by="none")

The selected coefficient represents the log-fold change of the 8 ascending colon tumor samples versus their adjucent control samples(the other 8 samples) ?

2. If im also interested in a single coefficient like "DiseaseCancer" which represents the total log-fold change of all the cancer vs the normal samples, i have to use the approach discussed in the previous post without incorporating the location information, correctly ??

3. If for a subsequent analysis i would like to testing directly for differences among different tumor samples regarding their respective location(above question B) i have to create something like the code posted firstly in my question with duplicateCorrelation ? and then makeContrasts?

4. Finally, i would like to post some two optional blocks of code Mr. Smyth posted in another post regarding non-specific filtering:

pairs <- factor(rep(1:30, each=2))

design <- model.matrix(~ 0 + pairs + Disease:Location, data=eset_COMBAT)

design <- design[,-grep("Normal", colnames(design))]

fit <- lmFit(eset_COMBAT, design)

keep <- fit$Amean > median(fit$Amean)
fit <- eBayes(fit[keep,], trend=TRUE)

Should i use the two final lines of code with the above model incorporation the location specific log-fold change of the paired samples, or is too strict because the effect would vary for each coefficient-location ? and i could use it only in the previous post without the location information ??

Thank you for your patience and your consideration on this crusial topic !!!

ADD REPLY • link 8.7 years ago svlachavas ▴ 830

1

Entering edit mode

Yes.
Yes, though the sensibility of doing so is questionable. That design assumes a consistent effect of cancer across locations. It's not clear how you would interpret the results if the effect is different between locations.
Yes. However, a direct comparison between tumour samples from different locations is suboptimal as it fails to consider the matched normal samples. It would be better to compare the cancer/normal log-fold changes between locations.
I don't really know what you're asking here, but you should filter out low-abundance probes prior to the empirical Bayes procedure. You should do this regardless of whether your experimental design has location in it. Filtering based on the median of A-values seems a bit aggressive to me; check out some of the case studies in the limma user's guide.

ADD REPLY • link 8.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

thank you one more time for your great contribution in my analysis !! I have some final considerations-question updates on your comments above & for the complete post(i hope i have not already disturb you):

1. Firstly, in your code blocks you used : design <- design[,-grep("Normal", colnames(design))] and you mentioned to "get full rank"- Thus, if i leave the design matrix without this removal, i have also coefficients like:

"DiseaseNormal:Locationascending_colon"-So these coefficients represent something like the baseline level(i.e "like" the intercept terms of the interaction comparison ?(I just insist on this to get more confidence about my above question 1 because i have never used the : term in a design matrix)

2. Regarding your answer on the above number 2.: you fairly argue about the usage of the location information. On the other hand, because we didnt have this information at the begining of the analysis, i thought that creating firstly a "naive" model [condition <- factor(eset.2$Disease, levels=c("Normal","Cancer")) ;pairs <- factor(rep(1:30, each = 2)) ; design <- model.matrix(~condition +pairs) ] i would have a first "general" comparison-one coefficient of cancer vs normal samples, in order to give me a first list of DEG genes. And then with this "pool" of DEGs, somehow interpret the results with the subsets of the resulted DEGs from the location specific comparisons(maybe find common genes..etc)

3. Moreover, one other important issue that i would like your opinion is after i create your design model(with the full rank), and create the model, i will have the coefficients from 31-39 representing the specific location comparisons. So dropping each of them in topTable will give me the possible DEG candidates regarding cancer vs control samples for this location.

However, if i would like to find possible-common DE genes(if any) across any of these conditions, how i would proceed ?

normally i would something like : common <- topTable(fit2, number=nrow(fit2)) # fit2 after eBayes or topTableF (although i have never used it).

Again my main concern is that there are also the coefficients from the pairs-blocking factor -also, im troubled because two comparisons regarding two locations(transverse_colon and rectosigmoid) have very few samples for any comparisons(2 samples-1 cancer vs its adjucent normal) and 4 samples(2 vs 2) respectively), so it may be unuseless to include them in the common gene task, as it may reduce dramatically the number of(any common). My ultimate goal about this step is to subset my merged eset based on these possible common genes across these contrasts, and then use a heatmap to see any interesting pattern across location-specific samples.

So how could i deal with this problem ??

4. Finally, regarding the non-specific intensity filtering, i found the post with Mr. Gordon Smyths code:

C: Trouble identifying differentially expressed genes on Affymetrix Human Gene 2.0.

Before i use limma, i implemented a non-specific intensity filtering, like the one Mr Johannes Rainer kindly proposed

https://support.bioconductor.org/p/65138/#65139- With which, from 12.136 probesets remained the 9884 probesets-but it still contained low-intensity probesets, thats why i asked you about using keep <- fit$Amean > median(fit$Amean). However, to report, after this step i remain with the half of the 9984 probesets to test for DE(4992). Usually, this step calculates the average log intensity for each probeset across all samples, and then computes a median value based on the vector of all mean values of each probeset as an input ? Correct ?

Thank you one more time for your crusial help !!!

ADD REPLY • link 8.7 years ago svlachavas ▴ 830

1

Entering edit mode

Ignore these coefficients, as they won't be estimable. There's no point trying to interpret them.
Well, now that you have location information, you should be using it.
To find common DE genes for two or more locations, drop each of the corresponding coefficients individually, define a DE list at a given FDR threshold, and then intersect those lists across locations. The DE genes remaining in the intersected list will be those that are common (it might be worthwhile making sure that the signs of the log-fold changes are also the same). The patient-specific blocking factors just absorb patient-specific variability in absolute expression, you don't need to worry about them when you're comparing normal to cancer within the same patient. Also, don't worry about the different sample sizes. Fewer samples means reduced DE detection power, but it won't affect the validity of the results.
I can't speak for what other people do. My best advice is to follow the user's guide. For anything else, your mileage may vary.

ADD REPLY • link 8.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

So, use the full rank design matrix, and get rid of the coefficients which include the term "DiseaseNormal", as far as with the interaction term the coefficients which have the term "DiseaseCancer" have already include the difference for the cancer vs normal samples regarding their respective same location.

Regarding the common genes, you are proposing create for instanse 9 deg lists(each one from topTable), the same number as the different location-tumor specific coefficients . I would like to ask if this is also possible with decideTests() somehow, or it is irrelevant ?

Finally regarding the non-specic intensity filtering, i will follow your advice and check the methodologies in users guide

ADD REPLY • link 8.7 years ago svlachavas ▴ 830

1

Entering edit mode

You can run decideTests to get an integer vector of -1, 0 or 1 for each comparison; you can get the intersection by asking for those genes with a value of 1 or -1 in both vectors.

ADD REPLY • link 8.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

because i have not used decideTests before, according to your idea, in order to have a similar approach with topTable, should i use the default arguments:

decideTests(fit2 ,method="separate",adjust.method="fdr" ,p.value=0.05) -and this will characterize DE genes if their adjusted p value is less than 0.05 correct? or i could also use method="nestedF" ?

and the interprentation of t statistic with -1,0 or 1 is for downregulated, non-DE and upregulated ? concering the fact that usually for instanse a negative t-statistic indicates a negative log-FC(if it is significant) ?

ADD REPLY • link 8.7 years ago svlachavas ▴ 830

0

Entering edit mode

Yes and yes. Read the documentation if you're not sure.

ADD REPLY • link 8.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

i checked it before posting and disturbing you again-i just could not specify the exact difference of method="separate", with method="nestedF"

ADD REPLY • link 8.7 years ago svlachavas ▴ 830

1

Entering edit mode

Read the user's guide. You'll see that nestedF is still experimental. So, stick with separate.

Edit: For a bit more detail, the idea with nestedF is to identify those genes that have significant differences in any of the contrasts via a F-test (i.e., after dropping any of the cancer:location coefficients - obviously, you don't want to drop the patient blocking factors, so you'd have to run contrasts.fit prior to using decideTests if you wanted to do this). DE genes are identified after applying the BH method on the resulting p-values and picking a FDR threshold. For these genes, the function goes back to the t-statistics for each contrast and tries to figure out which contrasts contributed to rejection in the F-test. With separate, significant genes are identified by directly applying the BH method on the p-values from the t-tests in each separate comparison.

The nestedF approach may provide more detection power as it uses information from multiple contrasts. However, as the user's guide says, formal FDR control is only provided across genes, not across contrasts. You need the FDR to be controlled in each contrast if you want to detect genes that are DE in each location, prior to intersection of DE lists. Thus, I'd be inclined to use separate.

ADD REPLY • link 8.7 years ago • updated 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron, i read carefully your answer, as the chapter 13 of the limma vignette. Thus, to make some final(i hope comments) :

1. So, if im not mistaken, as nestedF tests all the contrasts created prior, it would not suit in my case as it also includes the not estimable pair coefficients you mention right ? And if i wanted anyway to test it, i would have to use duplicate correlation, make.Contrasts and contrasts.fit like the first question B correct ?

Moreover, just to verify it, from the vignette of limma in page 62 it reports in the paragraph of method:separate : "Another disadvantage is that the raw p-value cutoff corresponding to significance can be very different for different contrasts, depending on the number of DE probes"--but the argument p.value=0.05 performs cutoff on the adjusted p-values, like the topTable, correct ??

2. Secondly, except the formal way you proposed,of creating 9 different DE lists from topTable, at a specific FDR threshold(like adjusted p val < 0.05), i also tried to use decideTests. In detail, after the above code(fit2 object created with eBayes):

results <- decideTests(fit2,,adjust.method="fdr" ,p.value=0.05)

> colnames(results)[31:39]
[1] "DiseaseCancer:Locationascending_colon"
[2] "DiseaseCancer:Locationcecum"
[3] "DiseaseCancer:Locationdescending_colon"
[4] "DiseaseCancer:Locationleft_colic_flexure"
[5] "DiseaseCancer:Locationrectosigmoid_colon"
[6] "DiseaseCancer:Locationrectum"
[7] "DiseaseCancer:Locationright_colic_flexure"
[8] "DiseaseCancer:Locationsigmoid_colon"
[9] "DiseaseCancer:Locationtransverse_colon"

Then, i naively tried :

u <- intersect(results[,"DiseaseCancer:Locationascending_colon"], results[,"DiseaseCancer:Locationdescending_colon"])

u
[1] 0

Thus, i wonder if then it would have any results for trying to identify total common DE across all tumor locations, if already the above two have no common DEGs ?

Or i should also try with the approach of topTable ?

The last thing i thought that maybe contributes of the non-common resulted genes above, is that i used also the aggresive filtering you have mentioned, and maybe i could have removed possible common probesets

3. Finally,

from the post report genes in intersection of Venn diagram with topTable?,

you had proposed to find common genes(although was for two contrasts-coefficients):

intersected <- rowSums(results[, c("br.low", "sd.low")] != 0L) == 2L

could i use something similar here ? or it is irrelevant ?

ADD REPLY • link 8.6 years ago svlachavas ▴ 830

0

Entering edit mode

Dear Aaron,

please excuse me for returning with longer messages, but i have some feedback and your opinion on some crusial matters that resulted. In detail, i created based on your instruction, 9 DEG lists representing each one tumor location specific contrast between cancer and control samples, with an FDR threshold < 0.05: (im just summarize the code to check again any naive possibility of mistakes):

pairs <- factor(rep(1:30, each=2))
design <- model.matrix(~ 0 + pairs + Disease:Location, data=eset.2) # eset.2- resulted from above eset_COMBAT with a first initial non-specific intensity filtering
design <- design[,-grep("Normal", colnames(design))] # get full rank
colnames(design)
fit <- lmFit(eset.2, design)
keep <- fit$Amean > median(fit$Amean)
fit2 <- eBayes(fit[keep,], trend=TRUE)

and then for instance, for the ascending_colon location:(total 5 cancer vs their 5 adjucent controls)

top2_ascending <- topTable(fit2, coef="DiseaseCancer:Locationascending_colon", number=nrow(fit2), adjust.method="fdr", sort.by="none",p.value=0.05)

Although the above coefficient returns a lot of genes, with smaller p values from 0.01, and even with a lfc for just testing-however, none of the other coefficients(all the rest 8) returned either a single gene/probeset with adjusted p value less than 0.05. I just noticed that in some of these coefficients, there were some raw p.values less than 0.05, but not a great number.

Thus, which is your opinion-suggestion on this matter ? regarding the fact, that also in other locations with the most total samples(i.e. cecum=6 vs 6, sigmoid 5 vs 5) returned no results ?

In other words, is there a possibility that there is a strong biological effect which is depicted with a strong pattern & also number of DEG genes on this specific location, while on the other locations there is no such effect ? Or it might counts that the merged eset is consisted of only the common probesets from the two datasets ?Also from your evaluation, as the tutorials, i think there is no problem with the code. To mention, i also tried with the less stricter non-specific intensity filtering procedures, but nothing changed. Finally, to pinpoint, i also used a customCDF for the annotation.

ADD REPLY • link 8.6 years ago svlachavas ▴ 830

1

Entering edit mode

The answer to this question depends a lot on the biology of the system, of which I'm not an expert. The only advice I can give involves general observations; one, you'd generally expect a lot of DE between cancer and normal samples; and two, human data is highly variable. This variability may be masking the DE at the other locations. I'd suggest creating a MDS plot after running removeBatchEffect to remove patient-specific expression. See if there are any outliers or substructure that might be inflating the variance estimates.

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

I understand Aaron-it is a complicated question to aswer, thats why i needed your opinion about this specific issue. Regarding the variability you have highlighted, i have never thought that could alter so much the results. When i use the model without the tumor location in the design matrix, i get a lot of DEG genes, even again with a logFC cutoff for checking. Regarding the plotMDS, i send you the link to the one i created after the merge of the two datasets[but the function is from the inSilicoMerging package]:

plotMDS {inSilicoMerging} R Documentation
Create double-labeled MDS plot from (merged) ExpressionSet

Description

Create Multidimensional Scaling (MDS) plot from ExpressionSet. Very similar to Principal Component Analysis (PCA) plots all samples are plotted in a two-dimensional space where both axis represent the two principle axis of expression variation. In this plot each sample can be labeled with a color and with a symbol.

Usage

plotMDS(eset, colLabel, symLabel, legend=TRUE, file=NULL, ...)

and the link to the file to also inspect it :

https://www.dropbox.com/s/02f3zgizqm5s0el/plotMDS_revised_newcdf.jpeg?dl=0

There a sample(which i located the specific patient)-red triangle on the top that seems to be an outlier-but i didnt exclude it, because i thought(maybe wrongfully for this case) that the adjusting for pairs covers this possible artifact !! So based on the interprentation of the picture(also in a PCA plot the same sample seems an outlier or it lies far from the other samples) i should change any step of my analysis pipeline, for the specific tumor location implementation ?

Thank you a lot for your help one more time !!

ADD REPLY • link 8.6 years ago svlachavas ▴ 830

1

Entering edit mode

Again, I can't really give much of an opinion - you'll have to find someone who's spent more time with the data. I'm not familiar with their plotMDS function, either. Here's some general comments:

The DE in the simpler model (i.e., without location-specificity) may be either driven wholly by DE in ascending_colon by itself; or, it may be due to the fact that all samples are being used in the cancer/normal comparison, which provides more power to detect DE.
You should be looking at making a MDS plot out of the cancer/normal log-fold change within each patient, rather than the absolute expression values. Differences in absolute expression don't matter when you're blocking on patient. Outlier patients can then be identified and dropped.
Try analyzing each data set by itself, to avoid any inflation of the variability from merging the data sets. This makes sure that platform effects aren't involved.

I don't think I can give any more specific advice, short of analyzing the data set myself.

ADD REPLY • link 8.6 years ago Aaron Lun ★ 28k

0

Entering edit mode

Dear Aaron,

thank you again for your approach and comments. I have already analysed each dataset separately.Moreover:

Our current goal is to somehow perform the same analysis with a merged dataset, as our further goal is except identifying common functional enrichment pathways or annotations, also construct a classifier for the whole merged dataset based on common probesets-and thus have more samples-as the two separate have different annotation(as you have also noticed from a previous post). That why i naively insisted on the model without the tumor location in the start, as my intention is to use a subset of the features(i.e.the DEG genes) to construct a classifier(but this is a different topic so i wont analyze)

Next, for simplicity, i just post their github link with the detailed function-as im not an expert, but i can see that it also exploits the multidimensional scaling that the default function plotMDS of limma does:

[https://github.com/Bioconductor-mirror/inSilicoMerging/blob/release-3.1/R/plotMDS.R]

Finally, regarding your proposal of constructing " a MDS plot out of the cancer/normal log-fold change within each patient", how can i proceed with this in order to detect any possible outlier patients(like the specifc sample in the above plot) ? something like the default function of limma:

plotMDS(x=fit2$coefficients[,"DiseaseCancer:Locationascending_colon"]...) ??

or it is irrelevant and you mean something different ?

ADD REPLY • link 8.6 years ago svlachavas ▴ 830