I have recently analyzed two different microarray datasets using the same pipeline in R. Both datasets have the same variables, and the comparison was cancer vs. normal samples in order to find DE genes. Both datasets come from the Affymetrix platform but use different GeneChips: hgu133a and hgu133plus2. After annotating the results from both datasets, I found 278 genes in common with the same probe IDs (from topTable in limma). My question is whether it is possible and appropriate to combine both datasets and the expression values for these specific genes, in order to infer common patterns or similar expression in these genes (for instance, with a heatmap). I'm concerned that merging different datasets involves many pitfalls or serious batch effects, but my goal here is to test whether any important information regarding colon cancer can be excluded from both of these datasets.
I'm not sure what you mean by excluding important information. What are you trying to achieve by combining the two datasets? One obvious application would be to find genes that are significant in both datasets, and thus, more likely to be genuinely DE.
Yes, I would like to find common genes or groups of genes that "behave similarly" and have common expression patterns in both datasets. That's why I posted the question, to hear every possible idea.
Directly combining the two datasets into a single limma analysis would be unwise, due to batch effects and the fact that two different chips are involved. Instead, I'd suggest doing some sort of meta-analysis. As I mentioned before, the obvious approach would be to identify genes that are DE in both experiments. This can be done informally by intersecting the two DE lists, or more rigorously by using an intersection-union test.
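As a minimal sketch of both options in base R: the two data frames below are hypothetical stand-ins for the `topTable` output of each experiment, with made-up values and an assumed `ProbeID` column (in real output the probe IDs are often the row names instead). The intersection-union test is shown in one common form, where a gene's combined p-value is the larger of its two p-values, so it is only called DE if it is significant in both experiments. The raw `P.Value` cutoff is for illustration only; in practice you would usually threshold on adjusted p-values.

```r
# Hypothetical stand-ins for the two topTable() results (values are made up).
tt_a <- data.frame(
  ProbeID = c("p1", "p2", "p3", "p4"),
  logFC   = c(1.8, -2.1, 0.9, -1.2),
  P.Value = c(0.001, 0.0004, 0.03, 0.2),
  stringsAsFactors = FALSE
)
tt_plus2 <- data.frame(
  ProbeID = c("p2", "p3", "p4", "p5"),
  logFC   = c(-1.7, -0.8, -1.0, 2.2),
  P.Value = c(0.002, 0.01, 0.04, 0.0001),
  stringsAsFactors = FALSE
)

# Informal approach: intersect the probe IDs called DE in each experiment.
alpha     <- 0.05
de_a      <- tt_a$ProbeID[tt_a$P.Value < alpha]
de_plus2  <- tt_plus2$ProbeID[tt_plus2$P.Value < alpha]
common_de <- intersect(de_a, de_plus2)

# Intersection-union test: keep probes present in both tables and take the
# larger of the two p-values as the combined p-value for each probe.
both <- merge(tt_a, tt_plus2, by = "ProbeID", suffixes = c(".a", ".plus2"))
both$P.IUT     <- pmax(both$P.Value.a, both$P.Value.plus2)
both$adj.P.IUT <- p.adjust(both$P.IUT, method = "BH")

# Flag probes whose log-fold changes point in the same direction.
both$same.direction <- sign(both$logFC.a) == sign(both$logFC.plus2)

print(common_de)
print(both)
```

The `same.direction` column is an extra check beyond the test itself: a probe that is significant in both experiments but changes in opposite directions is probably not a consistent finding.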
Through this approach, you can identify genes that are consistently detected in both experiments. If the two experiments involve the same cancer type, then the intersected subset is unlikely to provide more biological information than either DE list on its own; however, the genes in this subset are more likely to be genuinely DE. If the two experiments involve different cancers, then the subset might be biologically interesting, e.g., to find common genes that are dysregulated across different cancers.
Another strategy might be to use the DE list in one experiment to define a gene set. You can then use ROAST to test for DE for those genes in the other experiment. This will tell you whether the DE pattern is broadly similar between the two experiments.
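A rough sketch of the ROAST idea, using simulated data in place of the real experiments: `y`, `group` and `de_other` are hypothetical placeholders for the normalized expression matrix of one experiment, its sample labels, and the probe IDs of the DE list from the other experiment. It assumes that probe IDs can be matched directly between the two chips, and the planted shift in the simulated data is only there so that the example has something to detect.

```r
library(limma)

set.seed(1)
n_genes <- 200

# Hypothetical log-expression matrix for one experiment (probes x samples).
y <- matrix(rnorm(n_genes * 6), nrow = n_genes,
            dimnames = list(paste0("probe", seq_len(n_genes)),
                            paste0("s", 1:6)))
group  <- factor(rep(c("normal", "cancer"), each = 3),
                 levels = c("normal", "cancer"))
design <- model.matrix(~ group)

# Hypothetical DE list taken from the other experiment.
de_other <- paste0("probe", 1:20)

# Plant an up-regulation in those probes so the simulated set is truly DE.
y[de_other, group == "cancer"] <- y[de_other, group == "cancer"] + 1

# Logical index marking which rows of y belong to the gene set.
index <- rownames(y) %in% de_other

# Test the set for DE in this experiment; coefficient 2 is cancer vs normal.
res <- roast(y, index = index, design = design, contrast = 2, nrot = 999)
print(res)
```

With real data, `y` would be the normalized expression object used for the limma fit, and `design` and `contrast` would be the same ones used there. `roast` also has a `gene.weights` argument for directional weights, which could carry the sign of the logFC from the first experiment if you want to test that genes move in the same direction rather than just being DE.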
Thank you both for your answers and suggestions. Dear Aron, both datasets refer to colon cancer, but if I obtain more information about the patients, there may be different subtypes of colon cancer, and thus the subset could be more biologically interesting (as well as more genuinely DE). I used Microsoft Access because I didn't know how to intersect in R, and comparing the two data frames of DE genes from the two datasets (and the two different platforms), I found 281 common DE genes between one dataset with 1248 DE genes (hgu133a) and the other with 1149 DE genes (hgu133plus2). Moreover, the majority of these common genes showed the same behaviour in terms of logFC (upregulation or downregulation). So I guess this subset of genes is more genuinely DE. Dear Mr MacDonald, I will definitely check the above package, as from the vignette it looks very interesting.
Please excuse me for writing after 8 weeks. Regarding the above methodology: I am currently trying to learn and test other ways of comparing the two DEG lists above in order to strengthen my results. Because I'm not experienced in R, I would like to ask how I could use roast to implement your idea above. I have used mroast in the past with help from a vignette, but that was for testing differentially expressed KEGG pathways with pathview.
OK, I understand. I first tried to post it here for simplicity, because you proposed the idea and I had already posted the specific question here. I will also post it as a separate question.