Question

goseq pwf length bias plot: interpreting plot

1

Entering edit mode

Darya Vanichkina ▴ 120

@darya-vanichkina-6050

Last seen 7.1 years ago

Australia/Centenary Institute Universit…

Following on the question goseq pwf length bias plot: help interpreting plot, but with a similar and yet slightly different problem:

I am also using goseq with a manually compiled annotation, and am getting a strange plot similar to the one described by the author above (but I'm not prefiltering more than I should be, *I think*):

What does this plot mean? Should I be using as my background *all* gene lengths, including those not tested by limma, for example because of insufficient read counts /isexpr <- rowSums(cpm(y)>0.7) >= 3 where y is my DGElist/?

What I'm doing:

1. Using the command-line version of featureCounts to count reads to collapsed transcripts (i.e. genes), importing the resulting table into R and getting a clean data frame with counts (I'm using limma voom for differential expression), and a named vector with gene lengths.

2. I am then running the following code (as a first step before testing GO enrichment of genes upregulated in contrast 1 of interest):

goBiasLength <- counttableCompleteLen.vector[order(match(names(counttableCompleteLen.vector),names(results[,"Contrast1"])),na.last=NA)] #na,last=NA removes those that have no matches
Contrast1goUP <- as.numeric((results[,"Contrast1"] > 0))

names(Contrast14goUP) <- names(results[,"Contrast1"])

pwf.Contrast1 <- nullp(Contrast14goUP, bias.data=goBiasLength,plot.fit=TRUE)

3. To give you an idea what these are:

head(counttableCompleteLen.vector, n=3L)

Gene1 Gene2 Gene3 
              1070                110               6094

class(counttableCompleteLen.vector)
[1] "numeric"

head(results[,"Contrast1"], n=3L)
Gene100 Gene101 Gene102 
                 0                  0                  0

class(results)
[1] "TestResults"
attr(,"package")
[1] "limma"

Gene identifiers are unique. Thanks in advance!

sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.1 LTS

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8   
 [6] LC_MESSAGES=en_AU.UTF-8    LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] graphics  grDevices utils     datasets  stats     methods   base     

other attached packages:
 [1] goseq_1.24.0          geneLenDataBase_1.8.0 BiasedUrn_1.07        edgeR_3.14.0          rJava_0.9-8           limma_3.28.14        
 [7] biomaRt_2.28.0        reshape2_1.4.1        RColorBrewer_1.1-2    data.table_1.9.6      ggplot2_1.0.1        

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.1                locfit_1.5-9.1             lattice_0.20-29            GO.db_3.3.0                Rsamtools_1.24.0          
 [6] Biostrings_2.40.2          digest_0.6.8               GenomeInfoDb_1.8.1         plyr_1.8.3                 chron_2.3-47              
[11] acepack_1.3-3.3            stats4_3.3.0               RSQLite_1.0.0              zlibbioc_1.18.0            GenomicFeatures_1.24.3    
[16] annotate_1.50.0            S4Vectors_0.10.1           Matrix_1.2-6               rpart_4.1-8                proto_0.3-10              
[21] splines_3.3.0              BiocParallel_1.6.2         geneplotter_1.50.0         stringr_1.0.0              foreign_0.8-61            
[26] RCurl_1.95-4.8             munsell_0.4.2              rtracklayer_1.32.1         BiocGenerics_0.18.0        mgcv_1.8-12               
[31] nnet_7.3-8                 SummarizedExperiment_1.2.3 gridExtra_2.2.1            Hmisc_3.17-0               IRanges_2.6.1             
[36] XML_3.98-1.4               GenomicAlignments_1.8.3    MASS_7.3-34                bitops_1.0-6               grid_3.3.0                
[41] nlme_3.1-117               xtable_1.8-2               gtable_0.1.2               DBI_0.4-1                  magrittr_1.5              
[46] scales_0.3.0               stringi_1.0-1              XVector_0.12.0             genefilter_1.54.2          latticeExtra_0.6-28       
[51] Formula_1.2-1              tools_3.3.0                Biobase_2.32.0             DESeq2_1.12.3              parallel_3.3.0            
[56] survival_2.38-1            AnnotationDbi_1.34.3       colorspace_1.2-6           cluster_1.15.3             GenomicRanges_1.24.2      
[61] knitr_1.13

goseq • 1.7k views

ADD COMMENT • link updated 7.8 years ago by Gordon Smyth 50k • written 7.8 years ago by Darya Vanichkina ▴ 120

score 2 · Answer 1 · 2016-07-09

I don't entirely follow your code, but this may help:

No, you should not be using the genes that were filtered as not expressed.
However, you should use all non-filtered genes together to estimate the length bias trend. Do not separate up and down genes for this purpose. You do not need separate trends for up and down genes.
If the trend still seems to be decreasing, then just do an ordinary GO analysis without trend correction.