Hi, I have usually used keepSeqlevels()
to subset the list of transcript or exon (GRangesList built by transciptBy()
and exonsBy()
) and keep the transcripts of interest. Not until recently did I find out that many of genes that I am interested in are missing after using keepSeqlevels()
and those missing genes are meant to be kept. Let me give an example:
library(TxDb.Hsapiens.UCSC.hg38.knownGene) txbygene <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, "gene") tx <- keepSeqlevels(txbygene, "chr1")
I thought doing this, the resulted tx
would have all the genes in chromosome 1, but it is not true. Gene "343068" is missing. I believe it is because this gene contains transcript from both "chr1" and "chr1_KI270766v1_alt" and therefore keepSeqlevels() excludes it. Please see below:
> txbygene["343068"] GRangesList object of length 1: $343068 GRanges object with 3 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr1 [13254212, 13263314] + | 773 uc001auu.2 [2] chr1_KI270766v1_alt [ 115913, 206761] + | 182686 uc061fwy.1 [3] chr1_KI270766v1_alt [ 197666, 206903] + | 182687 uc032ncd.1 > tx["343068"] Error: subscript contains invalid names
I found it by surprised since I have not found such information or warnings in the man page. I don't know if this would be changed in the future, but at least a warning in the man page would be helpful.
Thanks,
Chao-Jen
> sessionInfo() R Under development (unstable) (2016-11-01 r71616) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0 [2] GenomicFeatures_1.26.0 [3] AnnotationDbi_1.37.0 [4] DESeq2_1.14.0 [5] SummarizedExperiment_1.4.0 [6] Biobase_2.35.0 [7] GenomicRanges_1.26.1 [8] GenomeInfoDb_1.10.0 [9] IRanges_2.8.0 [10] S4Vectors_0.12.0 [11] BiocGenerics_0.21.0 [12] BiocInstaller_1.24.0 loaded via a namespace (and not attached): [1] genefilter_1.56.0 locfit_1.5-9.1 splines_3.4.0 [4] lattice_0.20-34 colorspace_1.2-7 htmltools_0.3.5 [7] rtracklayer_1.34.0 chron_2.3-47 survival_2.40-1 [10] XML_3.98-1.4 foreign_0.8-67 DBI_0.5-1 [13] BiocParallel_1.8.1 RColorBrewer_1.1-2 plyr_1.8.4 [16] stringr_1.1.0 zlibbioc_1.20.0 Biostrings_2.42.0 [19] munsell_0.4.3 gtable_0.2.0 latticeExtra_0.6-28 [22] knitr_1.14 geneplotter_1.52.0 biomaRt_2.30.0 [25] htmlTable_1.7 Rcpp_0.12.7 acepack_1.4.1 [28] xtable_1.8-2 scales_0.4.0 Hmisc_4.0-0 [31] annotate_1.52.0 XVector_0.14.0 Rsamtools_1.26.1 [34] gridExtra_2.2.1 ggplot2_2.1.0 digest_0.6.10 [37] stringi_1.1.2 grid_3.4.0 tools_3.4.0 [40] bitops_1.0-6 magrittr_1.5 RCurl_1.95-4.8 [43] RSQLite_1.0.0 Formula_1.2-1 cluster_2.0.5 [46] Matrix_1.2-7.1 data.table_1.9.6 rpart_4.1-10 [49] GenomicAlignments_1.10.0 nnet_7.3-12 >
Hi Chao-Jen,
Just a note to say we saw this and are discussing it. Currently the behavior of the higher-level keep/dropSeqlevels() is based on "seqlevels<-" which does not allow cherry picking from List elements with mixed seqlevels. This is documented on the ?seqlevels man page under the 'force' argument and the examples section 'DROP SEQLEVELS OF A LIST-LIKE OBJECT'.
Herve will probably add the option to keep selective ranges in mixed List elements and I'll propagate that change to the high level helpers (and improve the documentation). We'll post back when this is done.
Val