Question: keepSeqlevels for GRangesList: drop elements with multiple seqnames
gravatar for Chao-Jen Wong
3.1 years ago by
USA/Seattle/Fred Hutchinson Cancer Research Center
Chao-Jen Wong30 wrote:

Hi, I have usually used keepSeqlevels() to subset the list of transcript or exon (GRangesList built by transciptBy() and exonsBy()) and keep the transcripts of interest. Not until recently did I find out that many of genes that I am interested in are missing after using keepSeqlevels() and those missing genes are meant to be kept.  Let me give an example:

txbygene <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, "gene")
tx <- keepSeqlevels(txbygene, "chr1")


I thought doing this, the resulted tx would have all the genes in chromosome 1, but it is not true.  Gene "343068"  is missing. I believe it is because this gene contains transcript from both "chr1" and "chr1_KI270766v1_alt" and therefore keepSeqlevels() excludes it. Please see below:

> txbygene["343068"]
GRangesList object of length 1:
GRanges object with 3 ranges and 2 metadata columns:
                 seqnames               ranges strand |     tx_id     tx_name
                    <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]                chr1 [13254212, 13263314]      + |       773  uc001auu.2
  [2] chr1_KI270766v1_alt [  115913,   206761]      + |    182686  uc061fwy.1
  [3] chr1_KI270766v1_alt [  197666,   206903]      + |    182687  uc032ncd.1

> tx["343068"]
Error: subscript contains invalid names

I found it by surprised since I have not found such information or warnings in the man page. I don't know if this would be changed in the future, but at least a warning in the man page would be helpful. 





> sessionInfo()
R Under development (unstable) (2016-11-01 r71616)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0
 [2] GenomicFeatures_1.26.0
 [3] AnnotationDbi_1.37.0
 [4] DESeq2_1.14.0
 [5] SummarizedExperiment_1.4.0
 [6] Biobase_2.35.0
 [7] GenomicRanges_1.26.1
 [8] GenomeInfoDb_1.10.0
 [9] IRanges_2.8.0
[10] S4Vectors_0.12.0
[11] BiocGenerics_0.21.0
[12] BiocInstaller_1.24.0

loaded via a namespace (and not attached):
 [1] genefilter_1.56.0        locfit_1.5-9.1           splines_3.4.0
 [4] lattice_0.20-34          colorspace_1.2-7         htmltools_0.3.5
 [7] rtracklayer_1.34.0       chron_2.3-47             survival_2.40-1
[10] XML_3.98-1.4             foreign_0.8-67           DBI_0.5-1
[13] BiocParallel_1.8.1       RColorBrewer_1.1-2       plyr_1.8.4
[16] stringr_1.1.0            zlibbioc_1.20.0          Biostrings_2.42.0
[19] munsell_0.4.3            gtable_0.2.0             latticeExtra_0.6-28
[22] knitr_1.14               geneplotter_1.52.0       biomaRt_2.30.0
[25] htmlTable_1.7            Rcpp_0.12.7              acepack_1.4.1
[28] xtable_1.8-2             scales_0.4.0             Hmisc_4.0-0
[31] annotate_1.52.0          XVector_0.14.0           Rsamtools_1.26.1
[34] gridExtra_2.2.1          ggplot2_2.1.0            digest_0.6.10
[37] stringi_1.1.2            grid_3.4.0               tools_3.4.0
[40] bitops_1.0-6             magrittr_1.5             RCurl_1.95-4.8
[43] RSQLite_1.0.0            Formula_1.2-1            cluster_2.0.5
[46] Matrix_1.2-7.1           data.table_1.9.6         rpart_4.1-10
[49] GenomicAlignments_1.10.0 nnet_7.3-12


genomeinfodb keepseqlevels • 1.1k views
ADD COMMENTlink modified 3.1 years ago by Valerie Obenchain6.7k • written 3.1 years ago by Chao-Jen Wong30

Hi Chao-Jen,

Just a note to say we saw this and are discussing it. Currently the behavior of the higher-level keep/dropSeqlevels() is based on "seqlevels<-" which does not allow cherry picking from List elements with mixed seqlevels. This is documented on the ?seqlevels man page under the 'force' argument and the examples section 'DROP SEQLEVELS OF A LIST-LIKE OBJECT'.

Herve will probably add the option to keep selective ranges in mixed List elements and I'll propagate that change to the high level helpers (and improve the documentation). We'll post back when this is done.



ADD REPLYlink written 3.1 years ago by Valerie Obenchain6.7k
Answer: keepSeqlevels for GRangesList: drop elements with multiple seqnames
gravatar for Valerie Obenchain
3.1 years ago by
United States
Valerie Obenchain6.7k wrote:


Herve has replaced 'force=TRUE' with 'pruning.mode="coarse"' in seqlevels() setter and these changes have been propagated to the high-level wrappers. The new 'pruning.mode' argument allows more flexibility in specifying which seqlevels should be kept/dropped. Here is a summary from the last svn commit:

r124181 | | 2016-11-15 22:04:23 -0800 (Tue, 15 Nov 2016) | 8 lines

Add 'pruning.mode' argument to the keepSeqlevels(), dropSeqlevels(), and
keepStandardChromosomes() functions. IMPORTANT NOTE: Like for the seqlevels()
setter, the default pruning mode is "error", which means that now these
functions fail when some of the seqlevels to drop from 'x' are in use. The
old behavior was to silently prune 'x' (doing "coarse" pruning). This is an
imporant change in behavior which was motivated by this post from Chao-Jen
Wong: keepSeqlevels for GRangesList: drop elements with multiple seqnames

See ?seqlevels for complete details. Changes are in GenomeInfoDb 1.11.6.


ADD COMMENTlink written 3.1 years ago by Valerie Obenchain6.7k

Thanks a lot for making the changes! 

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Chao-Jen Wong30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 226 users visited in the last hour