Question

keepSeqlevels for GRangesList: drop elements with multiple seqnames

1

Entering edit mode

Chao-Jen Wong ▴ 40

@chao-jen-wong-7035

Last seen 21 months ago

USA/Seattle/Fred Hutchinson Cancer Rese…

Hi, I have usually used keepSeqlevels() to subset the list of transcript or exon (GRangesList built by transciptBy() and exonsBy()) and keep the transcripts of interest. Not until recently did I find out that many of genes that I am interested in are missing after using keepSeqlevels() and those missing genes are meant to be kept. Let me give an example:

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txbygene <- transcriptsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, "gene")
tx <- keepSeqlevels(txbygene, "chr1")

I thought doing this, the resulted tx would have all the genes in chromosome 1, but it is not true. Gene "343068" is missing. I believe it is because this gene contains transcript from both "chr1" and "chr1_KI270766v1_alt" and therefore keepSeqlevels() excludes it. Please see below:

> txbygene["343068"]
GRangesList object of length 1:
$343068
GRanges object with 3 ranges and 2 metadata columns:
                 seqnames               ranges strand |     tx_id     tx_name
                    <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]                chr1 [13254212, 13263314]      + |       773  uc001auu.2
  [2] chr1_KI270766v1_alt [  115913,   206761]      + |    182686  uc061fwy.1
  [3] chr1_KI270766v1_alt [  197666,   206903]      + |    182687  uc032ncd.1

> tx["343068"]
Error: subscript contains invalid names

I found it by surprised since I have not found such information or warnings in the man page. I don't know if this would be changed in the future, but at least a warning in the man page would be helpful.

Thanks,

Chao-Jen

> sessionInfo()
R Under development (unstable) (2016-11-01 r71616)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.4.0
 [2] GenomicFeatures_1.26.0
 [3] AnnotationDbi_1.37.0
 [4] DESeq2_1.14.0
 [5] SummarizedExperiment_1.4.0
 [6] Biobase_2.35.0
 [7] GenomicRanges_1.26.1
 [8] GenomeInfoDb_1.10.0
 [9] IRanges_2.8.0
[10] S4Vectors_0.12.0
[11] BiocGenerics_0.21.0
[12] BiocInstaller_1.24.0

loaded via a namespace (and not attached):
 [1] genefilter_1.56.0        locfit_1.5-9.1           splines_3.4.0
 [4] lattice_0.20-34          colorspace_1.2-7         htmltools_0.3.5
 [7] rtracklayer_1.34.0       chron_2.3-47             survival_2.40-1
[10] XML_3.98-1.4             foreign_0.8-67           DBI_0.5-1
[13] BiocParallel_1.8.1       RColorBrewer_1.1-2       plyr_1.8.4
[16] stringr_1.1.0            zlibbioc_1.20.0          Biostrings_2.42.0
[19] munsell_0.4.3            gtable_0.2.0             latticeExtra_0.6-28
[22] knitr_1.14               geneplotter_1.52.0       biomaRt_2.30.0
[25] htmlTable_1.7            Rcpp_0.12.7              acepack_1.4.1
[28] xtable_1.8-2             scales_0.4.0             Hmisc_4.0-0
[31] annotate_1.52.0          XVector_0.14.0           Rsamtools_1.26.1
[34] gridExtra_2.2.1          ggplot2_2.1.0            digest_0.6.10
[37] stringi_1.1.2            grid_3.4.0               tools_3.4.0
[40] bitops_1.0-6             magrittr_1.5             RCurl_1.95-4.8
[43] RSQLite_1.0.0            Formula_1.2-1            cluster_2.0.5
[46] Matrix_1.2-7.1           data.table_1.9.6         rpart_4.1-10
[49] GenomicAlignments_1.10.0 nnet_7.3-12
>

GenomeInfoDb keepSeqlevels • 3.7k views

ADD COMMENT • link updated 8.4 years ago by Valerie Obenchain ★ 6.8k • written 8.4 years ago by Chao-Jen Wong ▴ 40

1

Entering edit mode

Hi Chao-Jen,

Just a note to say we saw this and are discussing it. Currently the behavior of the higher-level keep/dropSeqlevels() is based on "seqlevels<-" which does not allow cherry picking from List elements with mixed seqlevels. This is documented on the ?seqlevels man page under the 'force' argument and the examples section 'DROP SEQLEVELS OF A LIST-LIKE OBJECT'.

Herve will probably add the option to keep selective ranges in mixed List elements and I'll propagate that change to the high level helpers (and improve the documentation). We'll post back when this is done.

Val

ADD REPLY • link 8.4 years ago Valerie Obenchain ★ 6.8k

score 3 · Answer 1 · 2016-11-16

Hi,

Herve has replaced 'force=TRUE' with 'pruning.mode="coarse"' in seqlevels() setter and these changes have been propagated to the high-level wrappers. The new 'pruning.mode' argument allows more flexibility in specifying which seqlevels should be kept/dropped. Here is a summary from the last svn commit:

------------------------------------------------------------------------
r124181 | hpages@fhcrc.org | 2016-11-15 22:04:23 -0800 (Tue, 15 Nov 2016) | 8 lines

Add 'pruning.mode' argument to the keepSeqlevels(), dropSeqlevels(), and
keepStandardChromosomes() functions. IMPORTANT NOTE: Like for the seqlevels()
setter, the default pruning mode is "error", which means that now these
functions fail when some of the seqlevels to drop from 'x' are in use. The
old behavior was to silently prune 'x' (doing "coarse" pruning). This is an
imporant change in behavior which was motivated by this post from Chao-Jen
Wong: keepSeqlevels for GRangesList: drop elements with multiple seqnames

See ?seqlevels for complete details. Changes are in GenomeInfoDb 1.11.6.

Val