Hello
I have been experimenting with the script in RiboProfiling pacakge vigenette and have noticed an unexpected beahviour of aroundPromoter().
In line 84 of the script we have the call:
oneBinRanges <- aroundPromoter(txdb, alnGRanges, percBestExpressed=0.001)
This call returns a GRanges covering 2 CDSs (note that percBestExpressed=0.001 is a very stringent threshold compared to the default of 0.03, hence the small number of CDSs) , and the ranges covered are indeed around the translation start site.
However, when I re-ran this line after changing percBestExpressed to 0.01, I got among others the following ranges:
oneBinRanges[42:82] GRanges object with 41 ranges and 2 metadata columns: seqnames ranges strand | values idSeq <Rle> <IRanges> <Rle> | <numeric> <integer> 2023 chr1 [8926540, 8926540] - | 0 12858 2023 chr1 [8926541, 8926541] - | 0 12858 2023 chr1 [8926542, 8926542] - | 0 12858 2023 chr1 [8926543, 8926543] - | 0 12858 2023 chr1 [8926544, 8926544] - | 0 12858 ... ... ... ... . ... ... 2023 chr1 [8926576, 8926576] - | 0 12858 2023 chr1 [8926577, 8926577] - | 0 12858 2023 chr1 [8926578, 8926578] - | 0 12858 2023 chr1 [8926579, 8926579] - | 0 12858 2023 chr1 [8926580, 8926580] - | 0 12858 ------- seqinfo: 1 sequence from an unspecified genome; no seqlengths
This region fall in the boundary of exon 4 of UCSC gene uc001api.2 which is not a translation start site.
Furthermore, when I call readStartCov() I get these reuslts:
Note that for length 29 there are three peaks, one of them downstream the translation start site.
Note that I run the script with the original TxDb and bam file indicate in the vignette. The only change I introduced was in the value of percBestExpressed.
I also got similar results when running this code on another bam file and another TxDb. When I used the same bam and gff with plastid I get offsets that are more similar to what I expect (all peaks upstream the translation start site, mostly around -12 or -13).
Having looked at the source code of aroundPromoter() I think the problem is the lines:
#choose the longest transcript per gene (very rapid) maxWidthTranscBestExpGene <- max(width(cdsByBestExpressed)) cdsByBestExprLongTransc <- cdsByBestExpressed[width(cdsByBestExpressed) == maxWidthTranscBestExpGene, ]
From what I see, for multi-exon CDSs, running this code chooses the longest exon, not longest transcript.
Am I missing something?
Thanks in advance
Dolev Rahat
The Alexander Silberman Institute of Life Sciences,
Hebrew University of Jerusalem Israel
sessionInfo() R version 3.4.2 (2017-09-28) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 16.04.3 LTS Matrix products: default BLAS: /usr/lib/libblas/libblas.so.3.6.0 LAPACK: /usr/lib/lapack/liblapack.so.3.6.0 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=he_IL.UTF-8 [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=he_IL.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=he_IL.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=he_IL.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods [9] base other attached packages: [1] ggplot2_2.2.1 GenomicAlignments_1.12.2 SummarizedExperiment_1.6.5 [4] DelayedArray_0.2.7 matrixStats_0.52.2 Rsamtools_1.28.0 [7] GenomicFeatures_1.28.5 AnnotationDbi_1.38.2 Biobase_2.36.2 [10] GenomicRanges_1.28.6 GenomeInfoDb_1.12.3 RiboProfiling_1.6.0 [13] Biostrings_2.44.2 XVector_0.16.0 IRanges_2.10.5 [16] S4Vectors_0.14.7 BiocGenerics_0.22.1 loaded via a namespace (and not attached): [1] httr_1.3.1 AnnotationHub_2.8.2 [3] bit64_0.9-7 splines_3.4.2 [5] gsubfn_0.6-6 shiny_1.0.5 [7] Formula_1.2-2 interactiveDisplayBase_1.14.0 [9] latticeExtra_0.6-28 RBGL_1.52.0 [11] blob_1.1.0 BSgenome_1.44.2 [13] GenomeInfoDbData_0.99.0 yaml_2.1.14 [15] RSQLite_2.0 backports_1.1.1 [17] lattice_0.20-35 biovizBase_1.24.0 [19] chron_2.3-51 digest_0.6.12 [21] RColorBrewer_1.1-2 checkmate_1.8.4 [23] colorspace_1.3-2 ggbio_1.24.1 [25] httpuv_1.3.5 htmltools_0.3.6 [27] Matrix_1.2-11 plyr_1.8.4 [29] OrganismDbi_1.18.1 XML_3.98-1.9 [31] biomaRt_2.32.1 zlibbioc_1.22.0 [33] xtable_1.8-2 scales_0.5.0 [35] BiocParallel_1.10.1 htmlTable_1.9 [37] tibble_1.3.4 AnnotationFilter_1.0.0 [39] sqldf_0.4-11 nnet_7.3-12 [41] lazyeval_0.2.0 proto_1.0.0 [43] mime_0.5 survival_2.41-3 [45] magrittr_1.5 memoise_1.1.0 [47] GGally_1.3.2 foreign_0.8-69 [49] graph_1.54.0 BiocInstaller_1.26.1 [51] tools_3.4.2 data.table_1.10.4-2 [53] stringr_1.2.0 munsell_0.4.3 [55] cluster_2.0.6 ensembldb_2.0.4 [57] compiler_3.4.2 rlang_0.1.2 [59] grid_3.4.2 RCurl_1.95-4.8 [61] dichromat_2.0-0 VariantAnnotation_1.22.3 [63] htmlwidgets_0.9 bitops_1.0-6 [65] base64enc_0.1-3 gtable_0.2.0 [67] curl_3.0 DBI_0.7 [69] reshape_0.8.7 R6_2.2.2 [71] reshape2_1.4.2 gridExtra_2.3 [73] knitr_1.17 rtracklayer_1.36.6 [75] bit_1.1-12 Hmisc_4.0-3 [77] ProtGenerics_1.8.0 stringi_1.1.5 [79] Rcpp_0.12.13 rpart_4.1-11 [81] acepack_1.4.1