Question

GAGE after tximport

0

Entering edit mode

Ed Siefker ▴ 230

@ed-siefker-5136

Last seen 13 months ago

United States

I have RNASeq data quantified by Salmon against the Ensembl transcriptome, and imported with tximport. I'd like to run this through GAGE. I have some questions.

Does GAGE require normalized input? All ?gage() says is:
" exprs: an expression matrix or matrix-like data structure, with
genes as rows and samples as columns."

Normalization is not mentioned anywhere in gage.pdf or dataPrep.pdf. Normalization IS mentioned in RNA-seqWorkflow.pdf. How did the author of that document know to normalize their counts before GAGE?

Next, assuming that I need normalized data, how do I get that from my txi? Is that what txi$abundance is? I gather from the documentation that txi$abundance is TPM, and TPM is a normalization method, so I assume that this is just good to go. Is this assumption correct?

So, with more hope than confidence I'm going to attempt the GAGE and see what happens. I have 4 conditions in my experiment, only doing 2 comparisons.

gage.data <- txi.salmon$abundance
colnames(gage.data) <- rownames(sampleTable)
kg.mus <- kegg.gsets("mmu")
kegg.gs<-kg.mus$kg.sets[kg.mus$sigmet.idx]
kegg.gs.sym <- lapply(kegg.gs, function(X) { mapIds(org.Mm.eg.db, keys=X, column="SYMBOL", keytype="ENTREZID")})
E11_ref <- grep("E11.5 F20HET", colnames(gage.data))
E12_ref <- grep("E12.5 F20HET", colnames(gage.data))
E11_test <- grep("E11.5 F9.20DKO", colnames(gage.data))
E12_test <- grep("E12.5 F9.20DKO", colnames(gage.data))
E11_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E11_ref,samp=E11_test, compare="unpaired")
E12_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E12_ref,samp=E12_test, compare="unpaired")

> E11_kegg$greater[1:3,1:5]
                                               p.geomean stat.mean      p.val     q.val set.size
mmu00190 Oxidative phosphorylation            0.01915897 1.2753756 0.01473745 0.8648741      120
mmu04260 Cardiac muscle contraction           0.11135302 0.9193310 0.05730777 0.8648741       74
mmu04723 Retrograde endocannabinoid signaling 0.19837221 0.5629723 0.16567317 0.8648741      141
> E11_kegg$less[1:3,1:5]
                        p.geomean stat.mean       p.val     q.val set.size
mmu03040 Spliceosome   0.03489858 -1.607514 0.002886588 0.5581727      130
mmu03050 Proteasome    0.05356778 -1.254981 0.017127011 0.5581727       46
mmu03013 RNA transport 0.10930116 -1.021654 0.039141231 0.5581727      166
> E12_kegg$greater[1:3,1:5]
                                     p.geomean stat.mean      p.val     q.val set.size
mmu03050 Proteasome                 0.08137625 1.2713452 0.01503186 0.6074224       46
mmu04350 TGF-beta signaling pathway 0.13299879 1.0604810 0.03375249 0.6074224       84
mmu04530 Tight junction             0.15116806 0.9769626 0.04572445 0.6074224      167
> E12_kegg$less[1:3,1:5]
                               p.geomean  stat.mean        p.val        q.val set.size
mmu03010 Ribosome           3.613262e-10 -6.5393759 1.133191e-26 2.538349e-24      136
mmu04217 Necroptosis        1.492073e-01 -0.9598561 4.879629e-02 9.056753e-01      175
mmu04978 Mineral absorption 1.819284e-01 -0.9111373 5.894268e-02 9.056753e-01       44

My q.vals are all crap except one, mmu03010 Ribosome, and only at E12. Let's see if this result is realistic by checking against my results table.

> ribo_genes <- kegg.gs.sym$"mmu03010 Ribosome"
> ribo_res_E12 <- res_E12[na.omit(match(ribo_genes, row.names(res_E12))),]
> ribo_res_E12[order(ribo_res_E12$padj),]
log2 fold change (MLE): condition E12.5 F9.20DKO vs E12.5 F20HET
Wald test p-value: condition E12.5 F9.20DKO vs E12.5 F20HET
DataFrame with 136 rows and 6 columns
             baseMean log2FoldChange     lfcSE        stat       pvalue         padj
            <numeric>      <numeric> <numeric>   <numeric>    <numeric>    <numeric>
Rpl3        5844.2119     -0.7370311 0.1509027   -4.884147 1.038773e-06 0.0002949744
Rpl30       3114.4671     -0.7273430 0.2222087   -3.273242 1.063214e-03 0.0583008082
Rps23       4427.1601     -0.5449952 0.1776096   -3.068501 2.151355e-03 0.0897922536
Mrps11       818.0561     -0.4252406 0.1483233   -2.866985 4.144027e-03 0.1401217085
Rps4x       5558.1320     -0.3596796 0.1294615   -2.778275 5.464830e-03 0.1664784148

So I have 3 genes with a padj<.1, out of a set of 136. I doubt this is highly significant.

I feel like I'm doing something wrong here, and I'm not sure what it is. My first thought is something to do with normalization. The documentation isn't clear as I mentioned above. My second thought is to do with that na.omit() after matching identifiers. Interestingly...

>  count_missing_IDs <- unlist(lapply(lapply(lapply(lapply(kegg.gs.sym, match, row.names(res_E12)), is.na), which), length))
> count_missing_IDs[which.max(count_missing_IDs)]
mmu03010 Ribosome
               41

I don't know if that means anything, but it's a suggestive coincidence. I don't know how to follow up on this.

gage tximport • 2.0k views

ADD COMMENT • link updated 7.1 years ago by Luo Weijun ★ 1.6k • written 7.1 years ago by Ed Siefker ▴ 230

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 5 days ago

United States

Not a complete answer, but you have a few options of what to use downstream of tximport for different methods (we've got the DE methods covered in the vignette)

TPM are in txi$abundance. The column sums are set to 1e6, which means that it's not robustly normalized for library size: a single gene with a large TPM and a change across samples will greatly shift the TPM for all the other genes.

You can build a DESeqDataSetFromTximport, and then get sequencing-depth-normalized counts, counts(dds, normalized=TRUE), or transformed counts (on the log2 scale, and normalized for sequencing depth):

http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization

ADD COMMENT • link 7.1 years ago Michael Love 43k

0

Entering edit mode

Thanks, I reran GAGE using the counts from my dds. This definitely changed the numbers, but didn't change the GAGE analysis

> gage.data<- counts(dds2, normalized=TRUE)
> E11_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E11_ref,samp=E11_test, compare="unpaired")
> E12_kegg <- gage(gage.data, gsets=kegg.gs.sym,ref=E12_ref,samp=E12_test, compare="unpaired")
> E11_kegg$less[1:3,1:5]

                                                      p.geomean  stat.mean       p.val     q.val set.size
mmu03040 Spliceosome                                 0.06147842 -1.4554939 0.006097321 0.6178698      130
mmu04141 Protein processing in endoplasmic reticulum 0.11860644 -1.0269354 0.038206715 0.6178698      164
mmu03013 RNA transport                               0.15677785 -0.9327364 0.053667302 0.6178698      166
> E11_kegg$greater[1:3,1:5]
                                      p.geomean stat.mean      p.val     q.val set.size
mmu03010 Ribosome                   0.001179756 1.1408798 0.02635548 0.7988174      136
mmu00190 Oxidative phosphorylation  0.044616079 1.0849046 0.03158992 0.7988174      120
mmu04260 Cardiac muscle contraction 0.254409951 0.4984543 0.19530930 0.7988174       74
> E12_kegg$greater[1:3,1:5]
                                    p.geomean stat.mean      p.val     q.val set.size
mmu04350 TGF-beta signaling pathway 0.1884577 0.8585151 0.06923999 0.7043309       84
mmu04137 Mitophagy - animal         0.2192353 0.7454941 0.09935115 0.7043309       63
mmu03050 Proteasome                 0.1207776 0.6629597 0.13137149 0.7043309       46
> E12_kegg$less[1:3,1:5]
                        p.geomean  stat.mean        p.val        q.val set.size
mmu03010 Ribosome    2.013326e-08 -5.0347143 4.660289e-17 1.043905e-14      136
mmu04217 Necroptosis 1.171161e-01 -0.9363680 5.320313e-02 7.291190e-01      175
mmu04216 Ferroptosis 1.948577e-01 -0.8308787 7.707273e-02 7.291190e-01       40

Still just one significant pathway, which seems very unlikely to be actually significant. Does it mean anything that many of the q.val are the same, or is this just an artifact?

ADD REPLY • link 7.1 years ago Ed Siefker ▴ 230

1

Entering edit mode

Many q values being the same is a natural consequence of the BH method. I have a post explaining this hidden here somewhere on the support site.

I’d say you should choose your input to GAGE based not on results but on what the method is designed to have as input. I don’t know the correct answer but I wanted to point you to the different matrices you could export.

ADD REPLY • link 7.1 years ago Michael Love 43k

0

Entering edit mode

I figured it might be.

I wouldn't normally massage input data to get desired results. But since I don't rightly know what gage() expects, inspecting the output for absurdity is perhaps a hint I'm doing something wrong.

ADD REPLY • link 7.1 years ago Ed Siefker ▴ 230

score 2 · Accepted Answer · 2017-11-06

GAGE does expect normalized expression data as input. It is essentially a pathway (gene set) level differential expression analysis.

Your results make perfect sense. Sometimes, you get many significant pathways, sometimes none. Methods like GAGE or GSEA may select pathway with no individual genes significantly perturbed, but including many genes with small yet coordinated changes. you may check the GAGE paper for details: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-161.

The pathway (mmu03010 Ribosome) is clearly a true positive, highly significant and consistently selected! Congratulations, you’ve got cool results, but you don’t believe it.

To further convince yourself, you may visualize the gene expression changes in a pathway as scatterplot or heatmap using geneData() function or using pathview package. For details:

http://bioconductor.org/packages/release/bioc/vignettes/gage/inst/doc/gage.pdf

http://bioconductor.org/packages/release/bioc/html/pathview.html

If convenient, you can post your pathview graph, scatterplot or heatmap (for mmu03010 Ribosome) as a follow up.

HTHs.