I have RNASeq data quantified by Salmon against the Ensembl transcriptome, and imported with tximport. I'd like to run this through GAGE. I have some questions.
Does GAGE require normalized input? All ?gage() says is:
" exprs: an expression matrix or matrix-like data structure, with
genes as rows and samples as columns."
Normalization is not mentioned anywhere in gage.pdf or dataPrep.pdf. Normalization IS mentioned in RNA-seqWorkflow.pdf. How did the author of that document know to normalize their counts before GAGE?
Next, assuming that I need normalized data, how do I get that from my txi? Is that what txi$abundance is? I gather from the documentation that txi$abundance is TPM, and TPM is a normalization method, so I assume that this is just good to go. Is this assumption correct?
So, with more hope than confidence I'm going to attempt the GAGE and see what happens. I have 4 conditions in my experiment, only doing 2 comparisons.
gage.data <- txi.salmon$abundance colnames(gage.data) <- rownames(sampleTable) kg.mus <- kegg.gsets("mmu") kegg.gs<-kg.mus$kg.sets[kg.mus$sigmet.idx] kegg.gs.sym <- lapply(kegg.gs, function(X) { mapIds(org.Mm.eg.db, keys=X, column="SYMBOL", keytype="ENTREZID")}) E11_ref <- grep("E11.5 F20HET", colnames(gage.data)) E12_ref <- grep("E12.5 F20HET", colnames(gage.data)) E11_test <- grep("E11.5 F9.20DKO", colnames(gage.data)) E12_test <- grep("E12.5 F9.20DKO", colnames(gage.data)) E11_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E11_ref,samp=E11_test, compare="unpaired") E12_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E12_ref,samp=E12_test, compare="unpaired") > E11_kegg$greater[1:3,1:5] p.geomean stat.mean p.val q.val set.size mmu00190 Oxidative phosphorylation 0.01915897 1.2753756 0.01473745 0.8648741 120 mmu04260 Cardiac muscle contraction 0.11135302 0.9193310 0.05730777 0.8648741 74 mmu04723 Retrograde endocannabinoid signaling 0.19837221 0.5629723 0.16567317 0.8648741 141 > E11_kegg$less[1:3,1:5] p.geomean stat.mean p.val q.val set.size mmu03040 Spliceosome 0.03489858 -1.607514 0.002886588 0.5581727 130 mmu03050 Proteasome 0.05356778 -1.254981 0.017127011 0.5581727 46 mmu03013 RNA transport 0.10930116 -1.021654 0.039141231 0.5581727 166 > E12_kegg$greater[1:3,1:5] p.geomean stat.mean p.val q.val set.size mmu03050 Proteasome 0.08137625 1.2713452 0.01503186 0.6074224 46 mmu04350 TGF-beta signaling pathway 0.13299879 1.0604810 0.03375249 0.6074224 84 mmu04530 Tight junction 0.15116806 0.9769626 0.04572445 0.6074224 167 > E12_kegg$less[1:3,1:5] p.geomean stat.mean p.val q.val set.size mmu03010 Ribosome 3.613262e-10 -6.5393759 1.133191e-26 2.538349e-24 136 mmu04217 Necroptosis 1.492073e-01 -0.9598561 4.879629e-02 9.056753e-01 175 mmu04978 Mineral absorption 1.819284e-01 -0.9111373 5.894268e-02 9.056753e-01 44
My q.vals are all crap except one, mmu03010 Ribosome, and only at E12. Let's see if this result is realistic by checking against my results table.
> ribo_genes <- kegg.gs.sym$"mmu03010 Ribosome" > ribo_res_E12 <- res_E12[na.omit(match(ribo_genes, row.names(res_E12))),] > ribo_res_E12[order(ribo_res_E12$padj),] log2 fold change (MLE): condition E12.5 F9.20DKO vs E12.5 F20HET Wald test p-value: condition E12.5 F9.20DKO vs E12.5 F20HET DataFrame with 136 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> Rpl3 5844.2119 -0.7370311 0.1509027 -4.884147 1.038773e-06 0.0002949744 Rpl30 3114.4671 -0.7273430 0.2222087 -3.273242 1.063214e-03 0.0583008082 Rps23 4427.1601 -0.5449952 0.1776096 -3.068501 2.151355e-03 0.0897922536 Mrps11 818.0561 -0.4252406 0.1483233 -2.866985 4.144027e-03 0.1401217085 Rps4x 5558.1320 -0.3596796 0.1294615 -2.778275 5.464830e-03 0.1664784148
So I have 3 genes with a padj<.1, out of a set of 136. I doubt this is highly significant.
I feel like I'm doing something wrong here, and I'm not sure what it is. My first thought is something to do with normalization. The documentation isn't clear as I mentioned above. My second thought is to do with that na.omit() after matching identifiers. Interestingly...
> count_missing_IDs <- unlist(lapply(lapply(lapply(lapply(kegg.gs.sym, match, row.names(res_E12)), is.na), which), length)) > count_missing_IDs[which.max(count_missing_IDs)] mmu03010 Ribosome 41
I don't know if that means anything, but it's a suggestive coincidence. I don't know how to follow up on this.
Thanks, I reran GAGE using the counts from my dds. This definitely changed the numbers, but didn't change the GAGE analysis
Still just one significant pathway, which seems very unlikely to be actually significant. Does it mean anything that many of the q.val are the same, or is this just an artifact?
Many q values being the same is a natural consequence of the BH method. I have a post explaining this hidden here somewhere on the support site.
I’d say you should choose your input to GAGE based not on results but on what the method is designed to have as input. I don’t know the correct answer but I wanted to point you to the different matrices you could export.
I figured it might be.
I wouldn't normally massage input data to get desired results. But since I don't rightly know what gage() expects, inspecting the output for absurdity is perhaps a hint I'm doing something wrong.