Dear Bioconductor community,
I decided to repost and create a new complete post-question from one of my previous post regarding one part of my comments-questions(C: Correct creation of design matrix in limma regarding a multi-level experiment), in order to get as much as possible feedback and evaluation of my implementation regarding my project analysis, from the specialists of the field concerning multifactorial analysis with limma. In summary, im showing a part of the phenotypic information of my merged microarray affymetrix dataset:
head(pData(eset_COMBAT))
Disease Location Meta_factor Study
St_1_WL57.CEL Normal sigmoid_colon 0 hgu133plus2
St_2_WL57.CEL Cancer sigmoid_colon 0 hgu133plus2
St_N_EC59.CEL Normal sigmoid_colon 0 hgu133plus2
St_T_EC59.CEL Cancer sigmoid_colon 0 hgu133plus2
St_N_EJ58.CEL Normal cecum 0 hgu133plus2
St_T_EJ58.CEL Cancer cecum 0 hgu133plus2
...[in total 60 samples]
In detail, I would like to incorporate the tumor location information, as I have mentioned in the above post[" in my current project we are highly motivated in implementing this information in our microarray analysis, as from both biology field and other perspectives, we are interested in identifying-if it is possible and regarding the results- any crusial and interesting gene expression alteration patterns between different anatomic locations of the patiens colorectal tumors involved in our analysis(i.e ascending vs descending colon, which could possibly underline fundamental functional and distinct pathways-groups of genes involved"). Thus, I want to perform two major analyses if possible:
A. To compare specific tumor (locations) vs their adjacent control samples (i.e ascending. Tumor vs ascending. Control)
B. If it is possible compare different tumor anatomic locations (i.e ascending vs descending tumor samples)
So,- if my idea/implementation although naive is correct without any incorrect points-, my main questions are:
1. In order to start incorporate this information, should I proceed like the limma guide[section 9.7, page 48]-that is:
combine <- factor(paste(eset_COMBAT$Disease, eset_COMBAT$Location, sep="."))
design <- model.matrix(~0 +combine)
colnames(design) <- levels(combine)
2. Secondly, to estimate the correlation between the matched samples on the same patient:
corfit <- duplicateCorrelation(eset_COMBAT, design, block=pairs) ?
OR regarding the block argument should i firstly incorporate this information in my phenoData object:
eset_COMBAT$Subjects <- pairs
and then corfit <- duplicateCorrelation(eset_COMBAT, design, block=eset_COMBAT$Subjects) ??
3. Finally, fit the selected model with
fit <- lmFit(eset_COMBAT,design,block=eset_COMBAT$Subjects,correlation=corfit$consensus)
and then highlight the specific contrasts of interest with makeContrasts..etc??
Any suggestions/corrections/evaluation of my above ideas-code would be essential for my project !!
Dear Aaron,
thank you for your detailed answer. Despite being a bit "noob", i would like to ask you some further questions/explanations in order to understand fully your implementation and dont ask every time a different subject-because this part is extremely important-
1. Correctly, as you have mentioned, for each patient tissue from both the tumor and the adjucent control sample has been pooled. Thats why, i have performed already a paired analysis, as Mr Gordon Smyth corrected without the use of the tumor location, in order to compare total tumor vs control samples
2. Regarding the "more appropriate design you mention"-with the 3 lines of code-:-if i have undertand well(without having tested your code in order to understand some basic parts) :
A. Your above procedure tests for differences between each tumor anatomic location group(i.e asceding tumor colon samples) vs their adjucent control samples ? That is the use of Disease:Location(:) to indicate the interaction ? or it also returns a coefficient like "conditionCancer" representing the "total effect size"(if i describe it correctly) between cancer and control samples ? Im asking this because i have also from the previous post acquired a topTable regarding DEG probesets.
Thus, should i re-run my limma analysis with your above code ?
B. Secondly, the makeContrasts() you indicate can also lead in the same result for my question A? Or if i would like to perform, regarding my question B, comparisons like: asceding_colon.tumor samples vs descending_tumor samples, then i have to use this approach ? And if yes, which would be my design matrix for this case ? The above from your answer ?
3. Finally, regarding your answer about duplicateCorrelation: in order to have a complete knowledge on this matter, the reason you are proposing that i dont have to use duplicateCorrelation() anywere, is that it will exploit the variation between the patients in the model, so i will probably loose any important differences regarding my comparisons??
duplicateCorrelation
, though I feel it's more relevant to test for differences in cancer/normal log-fold changes between tumour locations. No.duplicateCorrelation
will model variability in absolute expression between patients.So, to sum up :
i will run your code, evaluate my results and i will provide feedback and any other questions that maybe arise.
Finally, regarding your answer in my second question-as you correctly again mentioned, our project interest is mainly in observing any interest gene expression patterns between different tumor locations, vs their adjucent control samples(with of course the necessary requirement, that this does not fit in all possible locations-i.e for the left colic flexure, there is only one cancer and its control sample). My second possible implementation about comparing directly "different tumors without their control samples", resulted from the notion that we thought that could be also some interesting observations regarding the direct comparison of tumor samples of different anatomic locations.
Thus, to conclude, if i want also to perform the B.part of the comparisons, i have to use makeContrasts and duplicate correlation, along with my original post code right ?:
combine <- factor(paste(eset_COMBAT$Disease, eset_COMBAT$Location, sep="."))
design <- model.matrix(~0 +combine)
colnames(design) <- levels(combine).....
Dear Aaron,
i runned your code above, so i would like to mention my final questions, in order to correctly construct my final model and comparisons without any misconseption:
pairs <- factor(rep(1:30, each=2))
design <- model.matrix(~ 0 + pairs + Disease:Location, data=eset_COMBAT)
design <- design[,-grep("Normal", colnames(design))]
1. After the above three blocks, as you mentioned : colnames(design)
[1] "pairs1"
[2] "pairs2"
[3] "pairs3"
[4] "pairs4"
[5] "pairs5"...(total 30 pairs) and then
[31] "DiseaseCancer:Locationascending_colon"
[32] "DiseaseCancer:Locationcecum"
[33] "DiseaseCancer:Locationdescending_colon"
[34] "DiseaseCancer:Locationleft_colic_flexure"
[35] "DiseaseCancer:Locationrectosigmoid_colon"
[36] "DiseaseCancer:Locationrectum"
[37] "DiseaseCancer:Locationright_colic_flexure"
[38] "DiseaseCancer:Locationsigmoid_colon"
[39] "DiseaseCancer:Locationtransverse_colon"
Moreover, as you have stated above the coefficients from 31-39 "represent the location-specific log-fold change of cancer over normal samples".
Thus, to continue fit my selected model, i should continue with:
fit <- lmFit(eset_COMBAT, design)
fit2 <- eBayes(fit, trend=TRUE) ?? and i dont have to define any contrasts with makeContrasts() as the interaction term (:) includes them ??
For example, if i type: top2 <- topTable(fit2, coef="DiseaseCancer:Locationascending_colon", number=nrow(fit2), adjust.method="fdr", sort.by="none")
The selected coefficient represents the log-fold change of the 8 ascending colon tumor samples versus their adjucent control samples(the other 8 samples) ?
2. If im also interested in a single coefficient like "DiseaseCancer" which represents the total log-fold change of all the cancer vs the normal samples, i have to use the approach discussed in the previous post without incorporating the location information, correctly ??
3. If for a subsequent analysis i would like to testing directly for differences among different tumor samples regarding their respective location(above question B) i have to create something like the code posted firstly in my question with duplicateCorrelation ? and then makeContrasts?
4. Finally, i would like to post some two optional blocks of code Mr. Smyth posted in another post regarding non-specific filtering:
pairs <- factor(rep(1:30, each=2))
design <- model.matrix(~ 0 + pairs + Disease:Location, data=eset_COMBAT)
design <- design[,-grep("Normal", colnames(design))]
fit <- lmFit(eset_COMBAT, design)
keep <- fit$Amean > median(fit$Amean)
fit <- eBayes(fit[keep,], trend=TRUE)
Should i use the two final lines of code with the above model incorporation the location specific log-fold change of the paired samples, or is too strict because the effect would vary for each coefficient-location ? and i could use it only in the previous post without the location information ??
Thank you for your patience and your consideration on this crusial topic !!!
limma
user's guide.Dear Aaron,
thank you one more time for your great contribution in my analysis !! I have some final considerations-question updates on your comments above & for the complete post(i hope i have not already disturb you):
1. Firstly, in your code blocks you used : design <- design[,-grep("Normal", colnames(design))] and you mentioned to "get full rank"- Thus, if i leave the design matrix without this removal, i have also coefficients like:
"DiseaseNormal:Locationascending_colon"-So these coefficients represent something like the baseline level(i.e "like" the intercept terms of the interaction comparison ?(I just insist on this to get more confidence about my above question 1 because i have never used the : term in a design matrix)
2. Regarding your answer on the above number 2.: you fairly argue about the usage of the location information. On the other hand, because we didnt have this information at the begining of the analysis, i thought that creating firstly a "naive" model [condition <- factor(eset.2$Disease, levels=c("Normal","Cancer")) ;pairs <- factor(rep(1:30, each = 2)) ; design <- model.matrix(~condition +pairs) ] i would have a first "general" comparison-one coefficient of cancer vs normal samples, in order to give me a first list of DEG genes. And then with this "pool" of DEGs, somehow interpret the results with the subsets of the resulted DEGs from the location specific comparisons(maybe find common genes..etc)
3. Moreover, one other important issue that i would like your opinion is after i create your design model(with the full rank), and create the model, i will have the coefficients from 31-39 representing the specific location comparisons. So dropping each of them in topTable will give me the possible DEG candidates regarding cancer vs control samples for this location.
However, if i would like to find possible-common DE genes(if any) across any of these conditions, how i would proceed ?
normally i would something like : common <- topTable(fit2, number=nrow(fit2)) # fit2 after eBayes or topTableF (although i have never used it).
Again my main concern is that there are also the coefficients from the pairs-blocking factor -also, im troubled because two comparisons regarding two locations(transverse_colon and rectosigmoid) have very few samples for any comparisons(2 samples-1 cancer vs its adjucent normal) and 4 samples(2 vs 2) respectively), so it may be unuseless to include them in the common gene task, as it may reduce dramatically the number of(any common). My ultimate goal about this step is to subset my merged eset based on these possible common genes across these contrasts, and then use a heatmap to see any interesting pattern across location-specific samples.
So how could i deal with this problem ??
4. Finally, regarding the non-specific intensity filtering, i found the post with Mr. Gordon Smyths code:
C: Trouble identifying differentially expressed genes on Affymetrix Human Gene 2.0.
Before i use limma, i implemented a non-specific intensity filtering, like the one Mr Johannes Rainer kindly proposed
https://support.bioconductor.org/p/65138/#65139- With which, from 12.136 probesets remained the 9884 probesets-but it still contained low-intensity probesets, thats why i asked you about using keep <- fit$Amean > median(fit$Amean). However, to report, after this step i remain with the half of the 9984 probesets to test for DE(4992). Usually, this step calculates the average log intensity for each probeset across all samples, and then computes a median value based on the vector of all mean values of each probeset as an input ? Correct ?
Thank you one more time for your crusial help !!!
So, use the full rank design matrix, and get rid of the coefficients which include the term "DiseaseNormal", as far as with the interaction term the coefficients which have the term "DiseaseCancer" have already include the difference for the cancer vs normal samples regarding their respective same location.
Regarding the common genes, you are proposing create for instanse 9 deg lists(each one from topTable), the same number as the different location-tumor specific coefficients . I would like to ask if this is also possible with decideTests() somehow, or it is irrelevant ?
Finally regarding the non-specic intensity filtering, i will follow your advice and check the methodologies in users guide
You can run
decideTests
to get an integer vector of -1, 0 or 1 for each comparison; you can get the intersection by asking for those genes with a value of 1 or -1 in both vectors.because i have not used decideTests before, according to your idea, in order to have a similar approach with topTable, should i use the default arguments:
decideTests(fit2 ,method="separate",adjust.method="fdr" ,p.value=0.05) -and this will characterize DE genes if their adjusted p value is less than 0.05 correct? or i could also use method="nestedF" ?
and the interprentation of t statistic with -1,0 or 1 is for downregulated, non-DE and upregulated ? concering the fact that usually for instanse a negative t-statistic indicates a negative log-FC(if it is significant) ?
Yes and yes. Read the documentation if you're not sure.
i checked it before posting and disturbing you again-i just could not specify the exact difference of method="separate", with method="nestedF"
Read the user's guide. You'll see that
nestedF
is still experimental. So, stick withseparate
.Edit: For a bit more detail, the idea with
nestedF
is to identify those genes that have significant differences in any of the contrasts via a F-test (i.e., after dropping any of the cancer:location coefficients - obviously, you don't want to drop the patient blocking factors, so you'd have to runcontrasts.fit
prior to usingdecideTests
if you wanted to do this). DE genes are identified after applying the BH method on the resulting p-values and picking a FDR threshold. For these genes, the function goes back to the t-statistics for each contrast and tries to figure out which contrasts contributed to rejection in the F-test. Withseparate
, significant genes are identified by directly applying the BH method on the p-values from the t-tests in each separate comparison.The
nestedF
approach may provide more detection power as it uses information from multiple contrasts. However, as the user's guide says, formal FDR control is only provided across genes, not across contrasts. You need the FDR to be controlled in each contrast if you want to detect genes that are DE in each location, prior to intersection of DE lists. Thus, I'd be inclined to useseparate
.Dear Aaron, i read carefully your answer, as the chapter 13 of the limma vignette. Thus, to make some final(i hope comments) :
1. So, if im not mistaken, as nestedF tests all the contrasts created prior, it would not suit in my case as it also includes the not estimable pair coefficients you mention right ? And if i wanted anyway to test it, i would have to use duplicate correlation, make.Contrasts and contrasts.fit like the first question B correct ?
Moreover, just to verify it, from the vignette of limma in page 62 it reports in the paragraph of method:separate : "Another disadvantage is that the raw p-value cutoff corresponding to significance can be very different for different contrasts, depending on the number of DE probes"--but the argument p.value=0.05 performs cutoff on the adjusted p-values, like the topTable, correct ??
2. Secondly, except the formal way you proposed,of creating 9 different DE lists from topTable, at a specific FDR threshold(like adjusted p val < 0.05), i also tried to use decideTests. In detail, after the above code(fit2 object created with eBayes):
results <- decideTests(fit2,,adjust.method="fdr" ,p.value=0.05)
> colnames(results)[31:39]
[1] "DiseaseCancer:Locationascending_colon"
[2] "DiseaseCancer:Locationcecum"
[3] "DiseaseCancer:Locationdescending_colon"
[4] "DiseaseCancer:Locationleft_colic_flexure"
[5] "DiseaseCancer:Locationrectosigmoid_colon"
[6] "DiseaseCancer:Locationrectum"
[7] "DiseaseCancer:Locationright_colic_flexure"
[8] "DiseaseCancer:Locationsigmoid_colon"
[9] "DiseaseCancer:Locationtransverse_colon"
Then, i naively tried :
u <- intersect(results[,"DiseaseCancer:Locationascending_colon"], results[,"DiseaseCancer:Locationdescending_colon"])
u
[1] 0
Thus, i wonder if then it would have any results for trying to identify total common DE across all tumor locations, if already the above two have no common DEGs ?
Or i should also try with the approach of topTable ?
The last thing i thought that maybe contributes of the non-common resulted genes above, is that i used also the aggresive filtering you have mentioned, and maybe i could have removed possible common probesets
3. Finally,
from the post report genes in intersection of Venn diagram with topTable?,
you had proposed to find common genes(although was for two contrasts-coefficients):
could i use something similar here ? or it is irrelevant ?
Dear Aaron,
please excuse me for returning with longer messages, but i have some feedback and your opinion on some crusial matters that resulted. In detail, i created based on your instruction, 9 DEG lists representing each one tumor location specific contrast between cancer and control samples, with an FDR threshold < 0.05: (im just summarize the code to check again any naive possibility of mistakes):
pairs <- factor(rep(1:30, each=2))
design <- model.matrix(~ 0 + pairs + Disease:Location, data=eset.2) # eset.2- resulted from above eset_COMBAT with a first initial non-specific intensity filtering
design <- design[,-grep("Normal", colnames(design))] # get full rank
colnames(design)
fit <- lmFit(eset.2, design)
keep <- fit$Amean > median(fit$Amean)
fit2 <- eBayes(fit[keep,], trend=TRUE)
and then for instance, for the ascending_colon location:(total 5 cancer vs their 5 adjucent controls)
top2_ascending <- topTable(fit2, coef="DiseaseCancer:Locationascending_colon", number=nrow(fit2), adjust.method="fdr", sort.by="none",p.value=0.05)
Although the above coefficient returns a lot of genes, with smaller p values from 0.01, and even with a lfc for just testing-however, none of the other coefficients(all the rest 8) returned either a single gene/probeset with adjusted p value less than 0.05. I just noticed that in some of these coefficients, there were some raw p.values less than 0.05, but not a great number.
Thus, which is your opinion-suggestion on this matter ? regarding the fact, that also in other locations with the most total samples(i.e. cecum=6 vs 6, sigmoid 5 vs 5) returned no results ?
In other words, is there a possibility that there is a strong biological effect which is depicted with a strong pattern & also number of DEG genes on this specific location, while on the other locations there is no such effect ? Or it might counts that the merged eset is consisted of only the common probesets from the two datasets ?Also from your evaluation, as the tutorials, i think there is no problem with the code. To mention, i also tried with the less stricter non-specific intensity filtering procedures, but nothing changed. Finally, to pinpoint, i also used a customCDF for the annotation.
The answer to this question depends a lot on the biology of the system, of which I'm not an expert. The only advice I can give involves general observations; one, you'd generally expect a lot of DE between cancer and normal samples; and two, human data is highly variable. This variability may be masking the DE at the other locations. I'd suggest creating a MDS plot after running
removeBatchEffect
to remove patient-specific expression. See if there are any outliers or substructure that might be inflating the variance estimates.I understand Aaron-it is a complicated question to aswer, thats why i needed your opinion about this specific issue. Regarding the variability you have highlighted, i have never thought that could alter so much the results. When i use the model without the tumor location in the design matrix, i get a lot of DEG genes, even again with a logFC cutoff for checking. Regarding the plotMDS, i send you the link to the one i created after the merge of the two datasets[but the function is from the inSilicoMerging package]:
plotMDS {inSilicoMerging} R Documentation
Create double-labeled MDS plot from (merged) ExpressionSet
Description
Create Multidimensional Scaling (MDS) plot from ExpressionSet. Very similar to Principal Component Analysis (PCA) plots all samples are plotted in a two-dimensional space where both axis represent the two principle axis of expression variation. In this plot each sample can be labeled with a color and with a symbol.
Usage
plotMDS(eset, colLabel, symLabel, legend=TRUE, file=NULL, ...)
and the link to the file to also inspect it :
https://www.dropbox.com/s/02f3zgizqm5s0el/plotMDS_revised_newcdf.jpeg?dl=0
There a sample(which i located the specific patient)-red triangle on the top that seems to be an outlier-but i didnt exclude it, because i thought(maybe wrongfully for this case) that the adjusting for pairs covers this possible artifact !! So based on the interprentation of the picture(also in a PCA plot the same sample seems an outlier or it lies far from the other samples) i should change any step of my analysis pipeline, for the specific tumor location implementation ?
Thank you a lot for your help one more time !!
Again, I can't really give much of an opinion - you'll have to find someone who's spent more time with the data. I'm not familiar with their
plotMDS
function, either. Here's some general comments:ascending_colon
by itself; or, it may be due to the fact that all samples are being used in the cancer/normal comparison, which provides more power to detect DE.patient
. Outlier patients can then be identified and dropped.I don't think I can give any more specific advice, short of analyzing the data set myself.
Dear Aaron,
thank you again for your approach and comments. I have already analysed each dataset separately.Moreover:
[https://github.com/Bioconductor-mirror/inSilicoMerging/blob/release-3.1/R/plotMDS.R]
plotMDS(x=fit2$coefficients[,"DiseaseCancer:Locationascending_colon"]...) ??
or it is irrelevant and you mean something different ?