#### The support.bioconductor.org editor has been updated to markdown! Please see more info at: Tutorial: Updated Support Site Editor

Question: GAGE after tximport
0
15 months ago by
Ed Siefker220
United States
Ed Siefker220 wrote:

I have RNASeq data quantified by Salmon against the Ensembl transcriptome, and imported with tximport.  I'd like to run this through GAGE.  I have some questions.

Does GAGE require normalized input?  All ?gage() says is:
"   exprs: an expression matrix or matrix-like data structure, with
genes as rows and samples as columns."

Normalization is not mentioned anywhere in gage.pdf or dataPrep.pdf.  Normalization IS mentioned in RNA-seqWorkflow.pdf.  How did the author of that document know to normalize their counts before GAGE?

Next, assuming that I need normalized data, how do I get that from my txi?  Is that what txi$abundance is? I gather from the documentation that txi$abundance is TPM, and TPM is a normalization method, so I assume that this is just good to go.  Is this assumption correct?

So, with more hope than confidence I'm going to attempt the GAGE and see what happens. I have 4 conditions in my experiment, only doing 2 comparisons.

gage.data <- txi.salmon$abundance colnames(gage.data) <- rownames(sampleTable) kg.mus <- kegg.gsets("mmu") kegg.gs<-kg.mus$kg.sets[kg.mus$sigmet.idx] kegg.gs.sym <- lapply(kegg.gs, function(X) { mapIds(org.Mm.eg.db, keys=X, column="SYMBOL", keytype="ENTREZID")}) E11_ref <- grep("E11.5 F20HET", colnames(gage.data)) E12_ref <- grep("E12.5 F20HET", colnames(gage.data)) E11_test <- grep("E11.5 F9.20DKO", colnames(gage.data)) E12_test <- grep("E12.5 F9.20DKO", colnames(gage.data)) E11_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E11_ref,samp=E11_test, compare="unpaired") E12_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E12_ref,samp=E12_test, compare="unpaired") > E11_kegg$greater[1:3,1:5]
p.geomean stat.mean      p.val     q.val set.size
mmu00190 Oxidative phosphorylation            0.01915897 1.2753756 0.01473745 0.8648741      120
mmu04260 Cardiac muscle contraction           0.11135302 0.9193310 0.05730777 0.8648741       74
mmu04723 Retrograde endocannabinoid signaling 0.19837221 0.5629723 0.16567317 0.8648741      141
> E11_kegg$less[1:3,1:5] p.geomean stat.mean p.val q.val set.size mmu03040 Spliceosome 0.03489858 -1.607514 0.002886588 0.5581727 130 mmu03050 Proteasome 0.05356778 -1.254981 0.017127011 0.5581727 46 mmu03013 RNA transport 0.10930116 -1.021654 0.039141231 0.5581727 166 > E12_kegg$greater[1:3,1:5]
p.geomean stat.mean      p.val     q.val set.size
mmu03050 Proteasome                 0.08137625 1.2713452 0.01503186 0.6074224       46
mmu04350 TGF-beta signaling pathway 0.13299879 1.0604810 0.03375249 0.6074224       84
mmu04530 Tight junction             0.15116806 0.9769626 0.04572445 0.6074224      167
> E12_kegg$less[1:3,1:5] p.geomean stat.mean p.val q.val set.size mmu03010 Ribosome 3.613262e-10 -6.5393759 1.133191e-26 2.538349e-24 136 mmu04217 Necroptosis 1.492073e-01 -0.9598561 4.879629e-02 9.056753e-01 175 mmu04978 Mineral absorption 1.819284e-01 -0.9111373 5.894268e-02 9.056753e-01 44 My q.vals are all crap except one, mmu03010 Ribosome, and only at E12. Let's see if this result is realistic by checking against my results table. > ribo_genes <- kegg.gs.sym$"mmu03010 Ribosome"
> ribo_res_E12 <- res_E12[na.omit(match(ribo_genes, row.names(res_E12))),]
> ribo_res_E12[order(ribo_res_E12$padj),] log2 fold change (MLE): condition E12.5 F9.20DKO vs E12.5 F20HET Wald test p-value: condition E12.5 F9.20DKO vs E12.5 F20HET DataFrame with 136 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> Rpl3 5844.2119 -0.7370311 0.1509027 -4.884147 1.038773e-06 0.0002949744 Rpl30 3114.4671 -0.7273430 0.2222087 -3.273242 1.063214e-03 0.0583008082 Rps23 4427.1601 -0.5449952 0.1776096 -3.068501 2.151355e-03 0.0897922536 Mrps11 818.0561 -0.4252406 0.1483233 -2.866985 4.144027e-03 0.1401217085 Rps4x 5558.1320 -0.3596796 0.1294615 -2.778275 5.464830e-03 0.1664784148 So I have 3 genes with a padj<.1, out of a set of 136. I doubt this is highly significant. I feel like I'm doing something wrong here, and I'm not sure what it is. My first thought is something to do with normalization. The documentation isn't clear as I mentioned above. My second thought is to do with that na.omit() after matching identifiers. Interestingly... > count_missing_IDs <- unlist(lapply(lapply(lapply(lapply(kegg.gs.sym, match, row.names(res_E12)), is.na), which), length)) > count_missing_IDs[which.max(count_missing_IDs)] mmu03010 Ribosome 41  I don't know if that means anything, but it's a suggestive coincidence. I don't know how to follow up on this. gage tximport • 422 views ADD COMMENTlink modified 15 months ago by Luo Weijun1.4k • written 15 months ago by Ed Siefker220 Answer: GAGE after tximport 2 15 months ago by Luo Weijun1.4k United States Luo Weijun1.4k wrote: GAGE does expect normalized expression data as input. It is essentially a pathway (gene set) level differential expression analysis. Your results make perfect sense. Sometimes, you get many significant pathways, sometimes none. Methods like GAGE or GSEA may select pathway with no individual genes significantly perturbed, but including many genes with small yet coordinated changes. you may check the GAGE paper for details: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-161. The pathway (mmu03010 Ribosome) is clearly a true positive, highly significant and consistently selected! Congratulations, you’ve got cool results, but you don’t believe it. To further convince yourself, you may visualize the gene expression changes in a pathway as scatterplot or heatmap using geneData() function or using pathview package. For details: http://bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc/gage.pdf http://bioconductor.org/packages/release/bioc/html/pathview.html If convenient, you can post your pathview graph, scatterplot or heatmap (for mmu03010 Ribosome) as a follow up. HTHs. ADD COMMENTlink modified 15 months ago • written 15 months ago by Luo Weijun1.4k Answer: GAGE after tximport 1 15 months ago by Michael Love22k United States Michael Love22k wrote: Not a complete answer, but you have a few options of what to use downstream of tximport for different methods (we've got the DE methods covered in the vignette) TPM are in txi$abundance. The column sums are set to 1e6, which means that it's not robustly normalized for library size: a single gene with a large TPM and a change across samples will greatly shift the TPM for all the other genes.

You can build a DESeqDataSetFromTximport, and then get sequencing-depth-normalized counts, counts(dds, normalized=TRUE), or transformed counts (on the log2 scale, and normalized for sequencing depth):

http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization

Thanks, I reran GAGE using the counts from my dds.  This definitely changed the numbers, but didn't change the GAGE analysis

> gage.data<- counts(dds2, normalized=TRUE)
> E11_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E11_ref,samp=E11_test, compare="unpaired")
> E12_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E12_ref,samp=E12_test, compare="unpaired")
> E11_kegg$less[1:3,1:5] p.geomean stat.mean p.val q.val set.size mmu03040 Spliceosome 0.06147842 -1.4554939 0.006097321 0.6178698 130 mmu04141 Protein processing in endoplasmic reticulum 0.11860644 -1.0269354 0.038206715 0.6178698 164 mmu03013 RNA transport 0.15677785 -0.9327364 0.053667302 0.6178698 166 > E11_kegg$greater[1:3,1:5]
p.geomean stat.mean      p.val     q.val set.size
mmu03010 Ribosome                   0.001179756 1.1408798 0.02635548 0.7988174      136
mmu00190 Oxidative phosphorylation  0.044616079 1.0849046 0.03158992 0.7988174      120
mmu04260 Cardiac muscle contraction 0.254409951 0.4984543 0.19530930 0.7988174       74
> E12_kegg$greater[1:3,1:5] p.geomean stat.mean p.val q.val set.size mmu04350 TGF-beta signaling pathway 0.1884577 0.8585151 0.06923999 0.7043309 84 mmu04137 Mitophagy - animal 0.2192353 0.7454941 0.09935115 0.7043309 63 mmu03050 Proteasome 0.1207776 0.6629597 0.13137149 0.7043309 46 > E12_kegg$less[1:3,1:5]
p.geomean  stat.mean        p.val        q.val set.size
mmu03010 Ribosome    2.013326e-08 -5.0347143 4.660289e-17 1.043905e-14      136
mmu04217 Necroptosis 1.171161e-01 -0.9363680 5.320313e-02 7.291190e-01      175
mmu04216 Ferroptosis 1.948577e-01 -0.8308787 7.707273e-02 7.291190e-01       40

Still just one significant pathway, which seems very unlikely to be actually significant.  Does it mean anything that many of the q.val are the same, or is this just an artifact?

1

Many q values being the same is a natural consequence of the BH method. I have a post explaining this hidden here somewhere on the support site.

I’d say you should choose your input to GAGE based not on results but on what the method is designed to have as input.  I don’t know the correct answer but I wanted to point you to the different matrices you could export.

I figured it might be.

I wouldn't normally massage input data to get desired results.  But since I don't rightly know what gage() expects, inspecting the output for absurdity is perhaps a hint I'm doing something wrong.