I have RNASeq data quantified by Salmon against the Ensembl transcriptome, and imported with tximport. I'd like to run this through GAGE. I have some questions.
Does GAGE require normalized input? All ?gage() says is:
" exprs: an expression matrix or matrix-like data structure, with
genes as rows and samples as columns."
Normalization is not mentioned anywhere in gage.pdf or dataPrep.pdf. Normalization IS mentioned in RNA-seqWorkflow.pdf. How did the author of that document know to normalize their counts before GAGE?
Next, assuming that I need normalized data, how do I get that from my txi? Is that what txi$abundance is? I gather from the documentation that txi$abundance is TPM, and TPM is a normalization method, so I assume that this is just good to go. Is this assumption correct?
So, with more hope than confidence I'm going to attempt the GAGE and see what happens. I have 4 conditions in my experiment, only doing 2 comparisons.
gage.data <- txi.salmon$abundance
colnames(gage.data) <- rownames(sampleTable)
kg.mus <- kegg.gsets("mmu")
kegg.gs<-kg.mus$kg.sets[kg.mus$sigmet.idx]
kegg.gs.sym <- lapply(kegg.gs, function(X) { mapIds(org.Mm.eg.db, keys=X, column="SYMBOL", keytype="ENTREZID")})
E11_ref <- grep("E11.5 F20HET", colnames(gage.data))
E12_ref <- grep("E12.5 F20HET", colnames(gage.data))
E11_test <- grep("E11.5 F9.20DKO", colnames(gage.data))
E12_test <- grep("E12.5 F9.20DKO", colnames(gage.data))
E11_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E11_ref,samp=E11_test, compare="unpaired")
E12_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E12_ref,samp=E12_test, compare="unpaired")
> E11_kegg$greater[1:3,1:5]
p.geomean stat.mean p.val q.val set.size
mmu00190 Oxidative phosphorylation 0.01915897 1.2753756 0.01473745 0.8648741 120
mmu04260 Cardiac muscle contraction 0.11135302 0.9193310 0.05730777 0.8648741 74
mmu04723 Retrograde endocannabinoid signaling 0.19837221 0.5629723 0.16567317 0.8648741 141
> E11_kegg$less[1:3,1:5]
p.geomean stat.mean p.val q.val set.size
mmu03040 Spliceosome 0.03489858 -1.607514 0.002886588 0.5581727 130
mmu03050 Proteasome 0.05356778 -1.254981 0.017127011 0.5581727 46
mmu03013 RNA transport 0.10930116 -1.021654 0.039141231 0.5581727 166
> E12_kegg$greater[1:3,1:5]
p.geomean stat.mean p.val q.val set.size
mmu03050 Proteasome 0.08137625 1.2713452 0.01503186 0.6074224 46
mmu04350 TGF-beta signaling pathway 0.13299879 1.0604810 0.03375249 0.6074224 84
mmu04530 Tight junction 0.15116806 0.9769626 0.04572445 0.6074224 167
> E12_kegg$less[1:3,1:5]
p.geomean stat.mean p.val q.val set.size
mmu03010 Ribosome 3.613262e-10 -6.5393759 1.133191e-26 2.538349e-24 136
mmu04217 Necroptosis 1.492073e-01 -0.9598561 4.879629e-02 9.056753e-01 175
mmu04978 Mineral absorption 1.819284e-01 -0.9111373 5.894268e-02 9.056753e-01 44
My q.vals are all crap except one, mmu03010 Ribosome, and only at E12. Let's see if this result is realistic by checking against my results table.
> ribo_genes <- kegg.gs.sym$"mmu03010 Ribosome"
> ribo_res_E12 <- res_E12[na.omit(match(ribo_genes, row.names(res_E12))),]
> ribo_res_E12[order(ribo_res_E12$padj),]
log2 fold change (MLE): condition E12.5 F9.20DKO vs E12.5 F20HET
Wald test p-value: condition E12.5 F9.20DKO vs E12.5 F20HET
DataFrame with 136 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
Rpl3 5844.2119 -0.7370311 0.1509027 -4.884147 1.038773e-06 0.0002949744
Rpl30 3114.4671 -0.7273430 0.2222087 -3.273242 1.063214e-03 0.0583008082
Rps23 4427.1601 -0.5449952 0.1776096 -3.068501 2.151355e-03 0.0897922536
Mrps11 818.0561 -0.4252406 0.1483233 -2.866985 4.144027e-03 0.1401217085
Rps4x 5558.1320 -0.3596796 0.1294615 -2.778275 5.464830e-03 0.1664784148
So I have 3 genes with a padj<.1, out of a set of 136. I doubt this is highly significant.
I feel like I'm doing something wrong here, and I'm not sure what it is. My first thought is something to do with normalization. The documentation isn't clear as I mentioned above. My second thought is to do with that na.omit() after matching identifiers. Interestingly...
> count_missing_IDs <- unlist(lapply(lapply(lapply(lapply(kegg.gs.sym, match, row.names(res_E12)), is.na), which), length))
> count_missing_IDs[which.max(count_missing_IDs)]
mmu03010 Ribosome
41
I don't know if that means anything, but it's a suggestive coincidence. I don't know how to follow up on this.

Thanks, I reran GAGE using the counts from my dds. This definitely changed the numbers, but didn't change the GAGE analysis
> gage.data<- counts(dds2, normalized=TRUE) > E11_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E11_ref,samp=E11_test, compare="unpaired") > E12_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E12_ref,samp=E12_test, compare="unpaired") > E11_kegg$less[1:3,1:5] p.geomean stat.mean p.val q.val set.size mmu03040 Spliceosome 0.06147842 -1.4554939 0.006097321 0.6178698 130 mmu04141 Protein processing in endoplasmic reticulum 0.11860644 -1.0269354 0.038206715 0.6178698 164 mmu03013 RNA transport 0.15677785 -0.9327364 0.053667302 0.6178698 166 > E11_kegg$greater[1:3,1:5] p.geomean stat.mean p.val q.val set.size mmu03010 Ribosome 0.001179756 1.1408798 0.02635548 0.7988174 136 mmu00190 Oxidative phosphorylation 0.044616079 1.0849046 0.03158992 0.7988174 120 mmu04260 Cardiac muscle contraction 0.254409951 0.4984543 0.19530930 0.7988174 74 > E12_kegg$greater[1:3,1:5] p.geomean stat.mean p.val q.val set.size mmu04350 TGF-beta signaling pathway 0.1884577 0.8585151 0.06923999 0.7043309 84 mmu04137 Mitophagy - animal 0.2192353 0.7454941 0.09935115 0.7043309 63 mmu03050 Proteasome 0.1207776 0.6629597 0.13137149 0.7043309 46 > E12_kegg$less[1:3,1:5] p.geomean stat.mean p.val q.val set.size mmu03010 Ribosome 2.013326e-08 -5.0347143 4.660289e-17 1.043905e-14 136 mmu04217 Necroptosis 1.171161e-01 -0.9363680 5.320313e-02 7.291190e-01 175 mmu04216 Ferroptosis 1.948577e-01 -0.8308787 7.707273e-02 7.291190e-01 40Still just one significant pathway, which seems very unlikely to be actually significant. Does it mean anything that many of the q.val are the same, or is this just an artifact?
Many q values being the same is a natural consequence of the BH method. I have a post explaining this hidden here somewhere on the support site.
I’d say you should choose your input to GAGE based not on results but on what the method is designed to have as input. I don’t know the correct answer but I wanted to point you to the different matrices you could export.
I figured it might be.
I wouldn't normally massage input data to get desired results. But since I don't rightly know what gage() expects, inspecting the output for absurdity is perhaps a hint I'm doing something wrong.