Relevant sessionInfo(): R version 3.2.2; BiocInstaller_1.18.4; goseq_1.20.0
I am relatively new to R and to Bioconductor packages. I am using goseq for GO enrichment analysis of mouse RNA-seq data. I have generated the named vector needed to begin the analysis (for example, condition1), in which every element is either 1 (DE) or 0 (non-DE). However, when this vector is passed to the nullp function to create a data frame containing bias and weighting information, many genes have no data, that is, NA in both the bias.data and pwf columns. This affects 2077 genes, or about 10% of the total. The mm10 genome is supported according to the supportedGenomes() function, but while nullp is running the following message is printed to the console:
Can't find mm10/ensGene length data in genLenDataBase...
Found the annotaion package, TxDb.Mmusculus.UCSC.mm10.knownGene
Trying to get the gene lengths from it.
Related Questions:
(1) Does "NA" in bias.data mean there is no length information for that particular Ensembl gene? And if so, how is that possible, given that the gene has been given an identifier?
(2) Is there missing gene length data in TxDb.Mmusculus.UCSC.mm10.knownGene that might be causing the NAs?
(3) I'm assuming that if a gene has NA for bias.data and for pwf it is not used in downstream functions when GO term enrichment is carried out, correct?
(4) Is there a way I can "force" the use of these genes in the GO term enrichment analysis despite their NA pwf values?
(5) I'm assuming that at this stage in the goseq pipeline the NAs have nothing to do with whether the genes have GO annotations, correct?
(I cannot find responses in the goseq User's Guide)
Below: just some snippets of what I'm seeing on my console ... didn't want to provide the whole condition1 vector!
pwf.condition1 <- nullp(condition1, "mm10", "ensGene")
                   DEgenes bias.data       pwf
ENSMUSG00000027014       1      2445 0.2320392
ENSMUSG00000032028       1      2637 0.2352164
ENSMUSG00000078139       1        NA        NA
ENSMUSG00000056234       1      4025 0.2525322
ENSMUSG00000087107       1      1408 0.2159865
ENSMUSG00000027597       1      2541 0.2336145
sum(is.na(pwf.condition1$bias.data)) # 2077
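For anyone who wants to reproduce the check without my data, here is a minimal base R sketch of how the NA rows can be counted and listed. The pwf.toy data frame is a hypothetical stand-in built from the six rows shown above; it is not real nullp output. The variable names (no.length, missing.ids, and so on) are only illustrative.

```r
# Toy stand-in for the data frame returned by nullp(); the values are copied
# from the console snippet above and are NOT a real goseq result.
pwf.toy <- data.frame(
  DEgenes   = c(1, 1, 1, 1, 1, 1),
  bias.data = c(2445, 2637, NA, 4025, 1408, 2541),
  pwf       = c(0.2320392, 0.2352164, NA, 0.2525322, 0.2159865, 0.2336145),
  row.names = c("ENSMUSG00000027014", "ENSMUSG00000032028",
                "ENSMUSG00000078139", "ENSMUSG00000056234",
                "ENSMUSG00000087107", "ENSMUSG00000027597")
)

# Logical index of genes that have no length (bias) data
no.length <- is.na(pwf.toy$bias.data)

# How many genes are affected, and what fraction of the total
n.missing    <- sum(no.length)
frac.missing <- mean(no.length)

# Identifiers of the affected genes, e.g. for looking them up in Ensembl
missing.ids <- rownames(pwf.toy)[no.length]

# Does NA in bias.data always coincide with NA in pwf?
same.na <- identical(is.na(pwf.toy$bias.data), is.na(pwf.toy$pwf))

print(n.missing)
print(missing.ids)
print(same.na)
```

On the real object, replacing pwf.toy with pwf.condition1 should give the list of the 2077 identifiers that have no length data.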