Possible TF binding site enrichment analysis through mroast and conversion problems
1
1
Entering edit mode
svlachavas ▴ 780
@svlachavas-7225
Last seen 8 hours ago
Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

except the analysis i have performed to identify potential master regulators in one microarray dataset i have analyzed, i would also like to perform some kind of TF-binding site analysis /& promoter analysis to also identify enriched TF-binding sites(which could also strengthen my results for the possible TFs i have identify) and also important cis-motifs. I have found through MsigDB (http://www.broadinstitute.org/gsea/msigdb/collections.jsp#C6) the c3: motif gene sets, which i would like to use as an input in mroast with my expressionset in similar way to ony older post i have created(https://support.bioconductor.org/p/68727/#69012). However, this time i dont want to download the complete rdata file which contains 3 different genes sets, and i intend to download the motif gene sets & the TF-gene sets to perform two separate inputs to feed for limma. Thus, as i can only download in gmt file(symbol or EntrezID), how i can convert the  gmt file into an appropriate form of a list in R, in order to match with the probeIDs in my eset ?

Furthermore, there are any other web-tools to perform similar work and use as to further validate my results ??

Any help or recommendation would be essential !!!

2
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 22 hours ago
The city by the bay

Is there a particular reason you don't want to use WEHI's rdata file for the C3 sets? You can just subset the Hs.c3 list for any specific motifs that you're interested in (watch out for gene sets that contain $ in their names; use the [ or [[ operators to be safe) and follow the procedures that you used for your previous post (C: Appropriate methodologies for using various inputs(GOs, gene-sets) for mroast te). The file size isn't particularly large, either. Why bother wrangling with the GMT file type? ADD COMMENT 0 Entering edit mode Dear Aaron, thank you for your response. The initial reason for which i wanted to download a specific gmt file instead of the whole rdata file, was to also perform a similar analysis and download from the c2 gene-set only the genes that belong to the KEGG and not perform the analysis on the entire different gene categories. Regarding the fact of subseting, i havent thought of your proposal but i dont know how much more clear it would be to perform subsetting after the import of the rdata. Maybe, i could perform the entire analysis on the whole rdata set, and then rank the total results with the mixed PValue and select the top30 results, which might include different gene-sets. One other important question regarding this specific use of the C3 gene-sets. Could i perform in your opinion the mroast analysis but instead use the different DEG lists from my different coefficients as you saw in my previous posts in limma(each biological treatment vs DMSO) ? My thinking was that if for instance from my DEG list of 415 genes, can be produced some gene-sets that match a specific TF-binding site-or motif(even to try also for KEGG paths), then it could give more strength about my results of co-regulatory analysis i have also performed ---or my thinking is wrong as mroast is a self-contained test, thus maybe it needs the whole gene-set in order to evaluate better the coordinated differential expression between a specific set of genes ? Please excuse me for my many questions but i believe that they are very crusial to interpret appropriately my analysis. I mean, that instead of using the whole gene expression set, could i use the different DEG lists to see if each pool of DEG genes can result in some groups of DEG which co-regulate together/or match together in pathway for instance-Thus, in this way can further validate my results of enrichment analysis ?? ADD REPLY 1 Entering edit mode If you just want the KEGG pathways, you can load in the Rdata file and subset like so: kegg.only <- Hs.c2[grep("KEGG", names(Hs.c2))] Then you can just use this subsetted list in ROAST. This ensures that you only analyze the gene sets of interest. The same logic applies to any of the other gene set lists - for example, the C3 gene set list can be split into MIR and TF target gene sets by looking for the "MIR" string in the list names. I find this easier than messing around with GMT files. For your second question; if I understand correctly, you want to restrict ROAST to use only those genes that are DE in your comparisons and also present in your motif set. This is unwise, as you're constructing the gene set based on DE. If you test the set for DE with ROAST using the same data, it's obviously going to be significant, even if the DE was spurious. This will result in loss of type I error control. In short, the construction of the gene set must be independent of the DE testing. Use the entire motif gene set for each TF, rather than modifying it based on the DE you observe from your data. ADD REPLY 0 Entering edit mode Dear Aaron, thank you for your description, especially for the second part of the question regarding the methodology of mroast. I was confused in the start but after i also read the paper, i understand that mroast is different from gene-wise testing and if i use my DEG list instead of the whole expression set, it will result in the problematic situations you have described. Thus, to sum up with 3 quick update questions to stop disturbing you: 1. As i use the whole expression set and the design matrix from my previous post (https://support.bioconductor.org/p/68307/#68531), each time i use a different contrast the mroast will test the specific coefficient from the design matrix(i.e., contrast=6, that is the difference between C10_OHT vs the DMSO control group) to find gene-sets which are DE between this specific contrast, right ? 2. Secondly, as i have mentioned the problem/presence of a batch effect regarding the different biological replicates, for which i have the variable replicate in targets group, can mroast take into account this information ? Or as it is included in the design matrix i dont have to specify it in mroast ? 3. Finally, if i want to visualize DEG GO terms from mroast, in a similar way as someone can visualize DE pathways with pathview, RamiGO is only choise ? Because i saw from its vignette, that it can be used with the c5 gene set from MsigDB, but i have never used Thank you for your consideration on this matter !!! ADD REPLY 1 Entering edit mode 1. Yes. 2. It's included in the design matrix that you supply to roast, so you don't need to specify anything extra. 3. Visualization is up to you - limma only provides barcodeplot for looking at ROAST output. ADD REPLY 0 Entering edit mode Dear Aaron, regarding the barcodeplot, it is designed only for roast or also for mroast ? In detail. i searched the function and it states : "This function plots the positions of one or two gene sets in a ranked list of statistics"... Thus, could i use it with the results of mroast ? For instance, after following your instructions head(res) # res the ordered by mixed P-value output of mroast NGenes PropDown PropUp Direction PValue FDR PValue.Mixed GO:0070062 2286 0.2082240 0.3171479 Up 0.001 0.003906349 0.001 GO:0044822 1414 0.3500707 0.1343706 Down 0.001 0.003906349 0.001 GO:0010467 972 0.3158436 0.1265432 Down 0.001 0.003906349 0.001 GO:0005730 877 0.3580388 0.1915621 Down 0.001 0.003906349 0.001 GO:0016032 706 0.2592068 0.1543909 Down 0.001 0.003906349 0.001 GO:0005615 496 0.1995968 0.3850806 Up 0.001 0.003906349 0.001 FDR.Mixed GO:0070062 0.001 GO:0044822 0.001 GO:0010467 0.001 GO:0005730 0.001 GO:0016032 0.001 GO:0005615 0.001 ADD REPLY 1 Entering edit mode The barcodeplot function will visualize the behaviour of one gene set at a time. If you have multiple gene sets from mroast, you'll have to generate a series of plots by calling barcodeplot on each gene set separately. Visualization of two gene sets at a time is possible, but this is easiest to interpret for related sets that are changing in opposing directions. For example, one GO term might represent up-regulation of a process while another might represent down-regulation of the same process - you'll have to find these paired terms yourself if you want to visualize them on a barcode plot. Note that barcodeplot doesn't actually take the output of mroast - you can run it without running either roast or mroast. Construction of a barcode plot visualizes the behaviour of the set, while running ROAST determines the statistical significance of DE across the set. While related, these are two different procedures. ADD REPLY 0 Entering edit mode Thank you for your explanation-moreover, because im a bit struggling to use the function with mroast, how can i use a specific gene-set found DE in mroast(i.e. PValue mixed < 0.01)-such as the "GO:0070062" from above ? ADD REPLY 1 Entering edit mode Just supply the gene set indices for your desired term as the index argument in barcodeplot. This should be as simple as taking whatever list you supplied to mroast, subsetting that list to keep only "GO:0070062", and supplying the resulting index vector to barcodeplot. As for the statistics; you need to get the t-statistics for your comparison of interest. For example, if you're interested in coefficient 6 to compare C10 + OHT to DMSO, then you can supply fit$t[,6] as the statistics argument in the call to barcodeplot (assuming fit is the output from eBayes).

0
Entering edit mode

I see-thats why you mentioned above that the barcodeplot function doesnt in fact "feed" with the result of mroast/roast but more generally to a desired gene-set-also but the term fit\$t you mean the output of ebayes right ?