keepSeqlevels for GRangesList: drop elements with multiple seqnames
1
1
Entering edit mode
@chao-jen-wong-7035
Last seen 17 months ago
USA/Seattle/Fred Hutchinson Cancer Rese…

Hi, I have usually used keepSeqlevels() to subset the list of transcript or exon (GRangesList built by transciptBy() and exonsBy()) and keep the transcripts of interest. Not until recently did I find out that many of genes that I am interested in are missing after using keepSeqlevels() and those missing genes are meant to be kept.  Let me give an example:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txbygene <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, "gene")
tx <- keepSeqlevels(txbygene, "chr1")

 

I thought doing this, the resulted tx would have all the genes in chromosome 1, but it is not true.  Gene "343068"  is missing. I believe it is because this gene contains transcript from both "chr1" and "chr1_KI270766v1_alt" and therefore keepSeqlevels() excludes it. Please see below:

> txbygene["343068"]
GRangesList object of length 1:
$343068
GRanges object with 3 ranges and 2 metadata columns:
                 seqnames               ranges strand |     tx_id     tx_name
                    <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]                chr1 [13254212, 13263314]      + |       773  uc001auu.2
  [2] chr1_KI270766v1_alt [  115913,   206761]      + |    182686  uc061fwy.1
  [3] chr1_KI270766v1_alt [  197666,   206903]      + |    182687  uc032ncd.1

> tx["343068"]
Error: subscript contains invalid names

I found it by surprised since I have not found such information or warnings in the man page. I don't know if this would be changed in the future, but at least a warning in the man page would be helpful. 

 

Thanks,

Chao-Jen

 

> sessionInfo()
R Under development (unstable) (2016-11-01 r71616)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0
 [2] GenomicFeatures_1.26.0
 [3] AnnotationDbi_1.37.0
 [4] DESeq2_1.14.0
 [5] SummarizedExperiment_1.4.0
 [6] Biobase_2.35.0
 [7] GenomicRanges_1.26.1
 [8] GenomeInfoDb_1.10.0
 [9] IRanges_2.8.0
[10] S4Vectors_0.12.0
[11] BiocGenerics_0.21.0
[12] BiocInstaller_1.24.0

loaded via a namespace (and not attached):
 [1] genefilter_1.56.0        locfit_1.5-9.1           splines_3.4.0
 [4] lattice_0.20-34          colorspace_1.2-7         htmltools_0.3.5
 [7] rtracklayer_1.34.0       chron_2.3-47             survival_2.40-1
[10] XML_3.98-1.4             foreign_0.8-67           DBI_0.5-1
[13] BiocParallel_1.8.1       RColorBrewer_1.1-2       plyr_1.8.4
[16] stringr_1.1.0            zlibbioc_1.20.0          Biostrings_2.42.0
[19] munsell_0.4.3            gtable_0.2.0             latticeExtra_0.6-28
[22] knitr_1.14               geneplotter_1.52.0       biomaRt_2.30.0
[25] htmlTable_1.7            Rcpp_0.12.7              acepack_1.4.1
[28] xtable_1.8-2             scales_0.4.0             Hmisc_4.0-0
[31] annotate_1.52.0          XVector_0.14.0           Rsamtools_1.26.1
[34] gridExtra_2.2.1          ggplot2_2.1.0            digest_0.6.10
[37] stringi_1.1.2            grid_3.4.0               tools_3.4.0
[40] bitops_1.0-6             magrittr_1.5             RCurl_1.95-4.8
[43] RSQLite_1.0.0            Formula_1.2-1            cluster_2.0.5
[46] Matrix_1.2-7.1           data.table_1.9.6         rpart_4.1-10
[49] GenomicAlignments_1.10.0 nnet_7.3-12
>

 

GenomeInfoDb keepSeqlevels • 3.5k views
ADD COMMENT
1
Entering edit mode

Hi Chao-Jen,

Just a note to say we saw this and are discussing it. Currently the behavior of the higher-level keep/dropSeqlevels() is based on "seqlevels<-" which does not allow cherry picking from List elements with mixed seqlevels. This is documented on the ?seqlevels man page under the 'force' argument and the examples section 'DROP SEQLEVELS OF A LIST-LIKE OBJECT'.

Herve will probably add the option to keep selective ranges in mixed List elements and I'll propagate that change to the high level helpers (and improve the documentation). We'll post back when this is done.

Val

 

ADD REPLY
3
Entering edit mode
@valerie-obenchain-4275
Last seen 3.0 years ago
United States

Hi,

Herve has replaced 'force=TRUE' with 'pruning.mode="coarse"' in seqlevels() setter and these changes have been propagated to the high-level wrappers. The new 'pruning.mode' argument allows more flexibility in specifying which seqlevels should be kept/dropped. Here is a summary from the last svn commit:

------------------------------------------------------------------------
r124181 | hpages@fhcrc.org | 2016-11-15 22:04:23 -0800 (Tue, 15 Nov 2016) | 8 lines

Add 'pruning.mode' argument to the keepSeqlevels(), dropSeqlevels(), and
keepStandardChromosomes() functions. IMPORTANT NOTE: Like for the seqlevels()
setter, the default pruning mode is "error", which means that now these
functions fail when some of the seqlevels to drop from 'x' are in use. The
old behavior was to silently prune 'x' (doing "coarse" pruning). This is an
imporant change in behavior which was motivated by this post from Chao-Jen
Wong: keepSeqlevels for GRangesList: drop elements with multiple seqnames


See ?seqlevels for complete details. Changes are in GenomeInfoDb 1.11.6.

Val

ADD COMMENT
0
Entering edit mode

Thanks a lot for making the changes! 

ADD REPLY

Login before adding your answer.

Traffic: 368 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6