goseq: probability weighting function returns NA for bias and weighting information for some genes (mm10); why?
mjnolte • 0
@mjnolte-8784
Last seen 7.5 years ago
United States

Relevant sessionInfo(): R version 3.2.2; BiocInstaller_1.18.4 ; goseq_1.20.0

I am relatively new to R and to Bioconductor packages. I am using goseq for GO enrichment analysis of mouse RNA-seq data. I have generated the appropriate named vector to begin the analysis, for example condition1. Every element of condition1 is either 1 (DE) or 0 (non-DE). However, when this vector is passed to the nullp function to create a data frame containing bias and weighting information, many genes have no data: NA in both the bias.data and pwf columns. This affects 2077 genes, or about 10% of the total. The mm10 genome is supported according to the supportedGenomes() function, but while nullp is running the following message is printed to the console:

Can't find mm10/ensGene length data in genLenDataBase...
Found the annotaion package, TxDb.Mmusculus.UCSC.mm10.knownGene
Trying to get the gene lengths from it.

Related Questions:
(1) Does "NA" in bias.data mean there is no length information for that particular Ensembl gene? And if so, how is that possible, given that the gene has been given an identifier?
(2) Is there missing gene length data in TxDb.Mmusculus.UCSC.mm10.knownGene that might be causing the NAs?
(3) I'm assuming that if a gene has NA for bias.data and for pwf it is not used in downstream functions when GO term enrichment is carried out, correct?
(4) Is there a way I can "force" the use of these genes in the GO term enrichment analysis despite their NA pwf values?
(5) I'm assuming that at this stage in the goseq pipeline the NAs have nothing to do with whether the genes have GO annotations, correct?
(I cannot find responses in the goseq User's Guide)

Below: just some snippets of what I'm seeing on my console ... didn't want to provide the whole condition1 vector!

pwf.condition1 <- nullp(condition1, "mm10", "ensGene")

                   DEgenes bias.data       pwf
ENSMUSG00000027014       1      2445 0.2320392
ENSMUSG00000032028       1      2637 0.2352164
ENSMUSG00000078139       1        NA        NA
ENSMUSG00000056234       1      4025 0.2525322
ENSMUSG00000087107       1      1408 0.2159865
ENSMUSG00000027597       1      2541 0.2336145

sum(is.na(pwf.condition1$bias.data)) # 2077
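
To see which genes are affected rather than just how many, the NA rows can be pulled out by ID. The following is a minimal base-R sketch on a toy data frame that only mimics the layout of the nullp output above (three rows copied from the snippet); with the real pwf.condition1 the same lines should apply, but that is an assumption, not something checked against goseq itself.

```r
# Toy stand-in for the nullp() output shown above (three rows copied from the post).
pwf.condition1 <- data.frame(
  DEgenes   = c(1, 1, 1),
  bias.data = c(2445, 2637, NA),
  pwf       = c(0.2320392, 0.2352164, NA),
  row.names = c("ENSMUSG00000027014", "ENSMUSG00000032028", "ENSMUSG00000078139")
)

# Logical index of genes for which no length was found.
no_length <- is.na(pwf.condition1$bias.data)

# IDs of the affected genes, e.g. for cross-checking against the annotation.
missing_ids <- rownames(pwf.condition1)[no_length]

# TRUE if a missing pwf always coincides with a missing length.
same_rows <- identical(no_length, is.na(pwf.condition1$pwf))

# Fraction of genes without length information.
frac_missing <- mean(no_length)
```

Looking up a handful of missing_ids by hand is a quick way to judge whether they share something in common, such as having no counterpart in the knownGene annotation.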

goseq pwf • 1.3k views
Last seen 3.9 years ago
Australia

Hi,

Try installing TxDb.Mmusculus.UCSC.mm10.ensGene. I suspect the NAs arise because there's a conversion between Ensembl IDs and knownGene IDs. If there is no knownGene ID for a particular Ensembl gene, its length can't be found.

Installing that package might fix the NAs in the pwf table, but it's also possible that (for the same reason) GO terms won't be assigned to some genes, as the GO term lookup also requires a conversion of gene IDs. The only way around that would be to provide goseq with the GO term IDs manually (for example, if you obtain them from BioMart).
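
As a rough sketch of what providing the GO terms manually could look like: goseq() has a gene2cat argument that accepts a gene-to-category mapping, either as a two-column data frame or as a named list. The example below builds such a list in base R. The go_table data frame and its column names are hypothetical, standing in for whatever a BioMart export looks like, and the gene/GO pairings are placeholders rather than real annotations.

```r
# go_table: hypothetical two-column gene-to-GO table, e.g. exported from BioMart.
# The pairings below are placeholders, not real annotations.
go_table <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000027014", "ENSMUSG00000027014",
                      "ENSMUSG00000078139", "ENSMUSG00000056234"),
  go_id           = c("GO:0005515", "GO:0005634", "GO:0005515", ""),
  stringsAsFactors = FALSE
)

# Drop rows with an empty GO ID (genes that have no annotation in the export).
go_table <- go_table[go_table$go_id != "", ]

# Named list: one element per gene, holding that gene's GO IDs.
gene2cat <- split(go_table$go_id, go_table$ensembl_gene_id)

# With goseq loaded and the real pwf data frame, the call would then be
# along these lines (not run here):
# GO.wall <- goseq(pwf.condition1, gene2cat = gene2cat)
```

Genes that do not appear in the mapping have no category, so they still cannot contribute to any term; the manual mapping only bypasses the ID conversion step.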

Cheers,