Following on the question goseq pwf length bias plot: help interpreting plot, but with a similar and yet slightly different problem:
I am also using goseq with a manually compiled annotation, and am getting a strange plot similar to the one described by the author above (but I'm not prefiltering more than I should be, *I think*):
What does this plot mean? Should I be using as my background *all* gene lengths, including those not tested by limma, for example because of insufficient read counts /isexpr <- rowSums(cpm(y)>0.7) >= 3 where y is my DGElist/?
What I'm doing:
1. Using the command-line version of featureCounts to count reads to collapsed transcripts (i.e. genes), importing the resulting table into R and getting a clean data frame with counts (I'm using limma voom for differential expression), and a named vector with gene lengths.
2. I am then running the following code (as a first step before testing GO enrichment of genes upregulated in contrast 1 of interest):
goBiasLength <- counttableCompleteLen.vector[order(match(names(counttableCompleteLen.vector),names(results[,"Contrast1"])),na.last=NA)] #na,last=NA removes those that have no matches Contrast1goUP <- as.numeric((results[,"Contrast1"] > 0))
names(Contrast14goUP) <- names(results[,"Contrast1"])
pwf.Contrast1 <- nullp(Contrast14goUP, bias.data=goBiasLength,plot.fit=TRUE)
3. To give you an idea what these are:
head(counttableCompleteLen.vector, n=3L)
Gene1 Gene2 Gene3 1070 110 6094
class(counttableCompleteLen.vector) [1] "numeric"
head(results[,"Contrast1"], n=3L) Gene100 Gene101 Gene102 0 0 0
class(results) [1] "TestResults" attr(,"package") [1] "limma"
Gene identifiers are unique. Thanks in advance!
sessionInfo() R version 3.3.0 (2016-05-03) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.1 LTS locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 [6] LC_MESSAGES=en_AU.UTF-8 LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] graphics grDevices utils datasets stats methods base other attached packages: [1] goseq_1.24.0 geneLenDataBase_1.8.0 BiasedUrn_1.07 edgeR_3.14.0 rJava_0.9-8 limma_3.28.14 [7] biomaRt_2.28.0 reshape2_1.4.1 RColorBrewer_1.1-2 data.table_1.9.6 ggplot2_1.0.1 loaded via a namespace (and not attached): [1] Rcpp_0.12.1 locfit_1.5-9.1 lattice_0.20-29 GO.db_3.3.0 Rsamtools_1.24.0 [6] Biostrings_2.40.2 digest_0.6.8 GenomeInfoDb_1.8.1 plyr_1.8.3 chron_2.3-47 [11] acepack_1.3-3.3 stats4_3.3.0 RSQLite_1.0.0 zlibbioc_1.18.0 GenomicFeatures_1.24.3 [16] annotate_1.50.0 S4Vectors_0.10.1 Matrix_1.2-6 rpart_4.1-8 proto_0.3-10 [21] splines_3.3.0 BiocParallel_1.6.2 geneplotter_1.50.0 stringr_1.0.0 foreign_0.8-61 [26] RCurl_1.95-4.8 munsell_0.4.2 rtracklayer_1.32.1 BiocGenerics_0.18.0 mgcv_1.8-12 [31] nnet_7.3-8 SummarizedExperiment_1.2.3 gridExtra_2.2.1 Hmisc_3.17-0 IRanges_2.6.1 [36] XML_3.98-1.4 GenomicAlignments_1.8.3 MASS_7.3-34 bitops_1.0-6 grid_3.3.0 [41] nlme_3.1-117 xtable_1.8-2 gtable_0.1.2 DBI_0.4-1 magrittr_1.5 [46] scales_0.3.0 stringi_1.0-1 XVector_0.12.0 genefilter_1.54.2 latticeExtra_0.6-28 [51] Formula_1.2-1 tools_3.3.0 Biobase_2.32.0 DESeq2_1.12.3 parallel_3.3.0 [56] survival_2.38-1 AnnotationDbi_1.34.3 colorspace_1.2-6 cluster_1.15.3 GenomicRanges_1.24.2 [61] knitr_1.13