Question: Appropriate methodologies for using various inputs(GOs, gene-sets) for mroast testing
0
3.6 years ago by
svlachavas630
Greece/Athens/National Hellenic Research Foundation
svlachavas630 wrote:

Dear Community,

in addition to one of my previous posts about mroast testing for findind "differentially activated" KEGG pathways, i would like to make two very important questions regarding some useful implementations about my current analysis in a illumina microarray dataset(Human HT-12 v4 beadchip) which i have analyzed and pre-processed with limma. My first question is if i could similarly test for multiple gene sets regarding Gene Ontology via mroast. My design matrix for my expressionset is also mentioned in other post(C: Correct construction of design matrix in limma for multiple contrasts for gene e)

My first approach is similar to my post for KEGG with mroast(C: Appropriate use of the function mroast to find KEGG pathways for differential ex) :

library(illuminaHumanv4.db)

x <- illuminaHumanv4GO2PROBE

mapped_probes <- mappedkeys(x)

xx <- as.list(x[mapped_probes])

indices <- ids2indices(xx, rownames(filtered.2)) # filtered.2= normalized & filtered-EListRaw-class

res <- mroast(filtered.2, indices, design, contrast=6) # contrast=6 in the design matrix in the above link represents the difference between the combined therapy of two compounds vs DMSO(control group)

My second approach is based on an edx lesson(PH525x series) for gene set testing in R, which i slightly modified-the motivation for the second approach is that i wanted to define more the minimum length for each gene set for testing:

go2probe <- as.list(illuminaHumanv4GO2PROBE)
govector <- unlist(go2probe)
golengths <- sapply(go2probe, length)
idxvector <- match(govector, rownames(filtered.2))
idx <- split(idxvector, rep(names(go2probe),golengths))
idxclean <- lapply(idx,function(x) x[is.na(x)])
idxlengths <- sapply(idxclean,length)
idxsub <- idxclean[idxlengths>=10]
res <- mroast(filtered.2,idxsub,design,contrast=6)

and then for both cases use the GO.db database to find each GO term to which specific term is annotated(i.e. MF, BP or CC).

To sum up, my first question is (because i use R for the past 6 months) if both the above methodologies are appropriate for the use of mroast for finding "DE GO terms" ?

Moreover, my second and also very crusial question, is if any of these above methods could be also used with other curated gene-sets as inputs for mroast in R, such as the molecular signature databases in the Broad Institute (http://bioinf.wehi.edu.au/software/MSigDB/) ?? My main reason for asking is that besides various pathway analysis methodologies and repositories like KEGG or Reactome, as my current dataset describes the use of combinatorial inhibitors in a specific metastatic breast cancer cell line, i found extremely interesting and possibly useful to use specific molecular signatures from the Broad Institute, such as the C6 oncogenic signatures. Or these are completely different and irrelevant for my specific analysis ?

modified 3.6 years ago by Aaron Lun22k • written 3.6 years ago by svlachavas630
Answer: Appropriate methodologies for using various inputs(GOs, gene-sets) for mroast te
1
3.6 years ago by
Aaron Lun22k
Cambridge, United Kingdom
Aaron Lun22k wrote:

For your first question, your first approach seems a lot better. Your second approach involves a lot of (unnecessary) fiddling, so you'll have to be pretty sure that you haven't introduced errors anywhere. If you want to restrict to GO terms with a minimum size, you can just use your first approach and do:

set.size <- sapply(indices, FUN=length)
indices2 <- indices[set.size >= 10]

Application of ROAST will do a self-contained gene set test for each GO term. This will test whether any of the genes in the set are differentially expressed. Thus, a significant result can be interpreted as differential expression across the genes for that term, i.e., a "DE GO term", as it were.

For your second question, if you want to use other gene sets as inputs into ROAST, you're free to do so. It seems entirely reasonable to try to test for DE in the molecular signatures, especially for oncogenic signatures when you're studying cancer.

Dear Aaron,

thank you for your useful recommendations !! Regarding the second part of my question, for instance if i choose the C6 oncogenic signatures, how i can imprort and correctly use the "human_c6_v4.rdata" ?? I used the function  load("C:/Users/Efstathios/Desktop/human_c6_v4.rdata") but i couldnt manage to read the data object

1

That command should work; it should load into memory a list called Hs.c6, where each element of the list is a named set that contains Entrez gene IDs. You probably need to match your Illumina probe IDs with their Entrez IDs first:

entrezid <- select(illuminaHumanv4.db, keytype="PROBEID",
column="ENTREZID", keys=rownames(filtered.2))

and then match the IDs with the elements in Hs.c6:

indices <- ids2indices(Hs.c6, entrezid$ENTREZID) --- Edit: to account for duplicated probe IDs, you'll need to match them back to the row names. This will give you the index of filtered.2 corresponding to each probe ID and, subsequently, each Entrez ID in the set. original.index <- match(entrezid$PROBEID, rownames(filtered.2))
indices <- lapply(indices, FUN=function(x) { original.index[x] })

Dear Aaron,

your code works fine and i get the match of the probe ENTREZIDs with these of each gene set. Only i have two considerations i would like to ask you and discuss:

1. Firstly, as my filtered.2 eset contains duplicate probeIDs, after the function select i get both duplicate PROBEIDs, as duplicate ENTREZIDs. In your opinion sould i use the filtered.2 eset after i have removed any duplicate and probeIDs matching to NAs, or this should not be a problem for mroast ? I mean, if there are double probeIDs counted in each gene set or it is not something i have to concern ?

2. Secondly, regarding the output of mroast (i.e the gene set $GLI1_UP.V1_UP), i searched that for the example of C6 gene set, additional information can be found at (http://www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=C6). Thus, for example if the total direction for an example gene set like AKT_UP.V1_DN is up(upregulated), and from description "Genes down-regulated in mouse prostate by transgenic expression of human AKT1 gene[GeneID=207] vs controls" it means that this specific group of genes found downregulated in prostate cancer are totally upregulated in my expression set ? Or i should interpret differently the results ? Thank you for your condideration on this matter ADD REPLYlink written 3.6 years ago by svlachavas630 1 mroast automatically handles duplicate probe IDs for the same gene. This is because the correlations between probes from the same gene effectively downweights their effect. Even if a gene has many probes in the set, the contribution of those duplicate probes to the ROAST result will be limited if the probes are highly correlated (as later probes provide no more information than the first). Manual removal of duplicate probes is unnecessary. In fact, removal is unwise in the presence of weak correlations, where different probes may actually provide additional information. For your second question, your interpretation seems correct. If the overall direction reported by mroast is "Up", then the genes in the set are generally up-regulated across your specified contrast. Whether this is consistent with the annotation of the set depends on the nature of your contrast and the biology of your system. ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Aaron Lun22k Thank you again Aaron for your answers !! To sum up,i just need to run the code with ids2indices as you described above, then order the results of mroast(i.e with PValue mixed or adjustedPvalue) and then based on the info of each gene set found DE to interprent the description of the set with the total perturbation in my expressionset. ADD REPLYlink written 3.6 years ago by svlachavas630 Dear Aaron, one more naive question: regarding the use of molecular signatures from the Broad Insitute: based on your code: load("C:/Users/Efstathios/Desktop/human_c6_v4.rdata") entrezid <- select(illuminaHumanv4.db, keytype="PROBEID", column="ENTREZID", keys=rownames(filtered.2)) indices <- ids2indices(Hs.c6, entrezid$ENTREZID)
set.size <- sapply(indices, FUN=length)
indices2 <- indices[set.size >= 10]
res <- mroast(filtered.2, indices2, design, contrast=4)

But after the final function mroast it gives me this error:

Error in y[index, , drop = FALSE] : subscript out of bounds

I also used another number of contrast like 3 or 2, as my design matrix except from the first column which is the intercept has other 6 colums which represent various coefficients.

Moreover, my filtered.2 object is of class "EList", and with rownames:

[1] "ILMN_3241953" "ILMN_1755321" "ILMN_1698554" "ILMN_2061446" "ILMN_1676336" "ILMN_3237396"

Any ideas or suggestions regarding this specific error ??

1

i think i have found the problem:

dim(filtered.2)
[1] 13358    12

sum(duplicated(entrezid\$PROBEID))
[1] 806

-So there are except ENTREZIDs also duplicate PROBEIDs mathing to the same entrezid-

entrezid <- entrezid[!duplicated(entrezid[,1]),]

when i removed the duplicated probeIDs then the code worked fine, but do you agree with my approach ??

1

Don't do it that way, because if you remove duplicates in entrezid, you'll eliminate some relationships that might be interesting. For example, consider a probe A that is associated with genes X and Y. This shows up as duplicate rows A:X and A:Y in the entrezid table. However, if you eliminate duplicate entries of A, you'll throw out the A:Y relationship that might be interesting. More specifically, gene sets containing Y would fail to include probe A after processing into indices.