Hi
I wonder if the following might be a bug.
As in my previous post, I start with
> library( GenomicRanges ) > library( TxDb.Hsapiens.UCSC.hg19.knownGene ) > transcripts <- transcriptsBy( TxDb.Hsapiens.UCSC.hg19.knownGene )
This object contains all kinds of extra contigs:
> seqlevels(transcripts) [1] "chr1" "chr2" "chr3" [4] "chr4" "chr5" "chr6" [7] "chr7" "chr8" "chr9" [...] [22] "chr22" "chrX" "chrY" [25] "chrM" "chr1_gl000191_random" "chr1_gl000192_random" [28] "chr4_ctg9_hap1" "chr4_gl000193_random" "chr4_gl000194_random" [31] "chr6_apd_hap1" "chr6_cox_hap2" "chr6_dbb_hap3" [34] "chr6_mann_hap4" "chr6_mcf_hap5" "chr6_qbl_hap6" [37] "chr6_ssto_hap7" "chr7_gl000195_random" "chr8_gl000196_random" [...]
I keep only the useful ones
> realchroms <- seqlevels(transcripts)[1:25] > realchroms [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" [19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY" "chrM"
For a GRanges object, I can now remove all ranges on the extra contigs. For example (using a gene from the HLA):
> hlab <- transcripts[["3106"]]
> hlab
GRanges object with 36 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr6 [31237743, 31324989] - | 26240 uc031snl.1
[2] chr6 [31238850, 31324989] - | 26241 uc003ntf.2
[3] chr6 [31321649, 31324450] - | 26243 uc003ntg.1
[4] chr6 [31321649, 31324989] - | 26244 uc003nth.2
[5] chr6 [31322256, 31323369] - | 26245 uc003nti.1
... ... ... ... ... ... ...
[32] chr6_ssto_hap7 [2655408, 2658208] - | 82522 uc011jii.2
[33] chr6_ssto_hap7 [2655408, 2658748] - | 82523 uc011jij.2
[34] chr6_ssto_hap7 [2656015, 2657128] - | 82524 uc011jik.1
[35] chr6_ssto_hap7 [2656853, 2657905] - | 82525 uc011jil.1
[36] chr6_ssto_hap7 [2657058, 2658492] - | 82526 uc011jim.1
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
> seqlevels( hlab, force=TRUE ) <- realchroms
> hlab
GRanges object with 7 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr6 [31237743, 31324989] - | 26240 uc031snl.1
[2] chr6 [31238850, 31324989] - | 26241 uc003ntf.2
[3] chr6 [31321649, 31324450] - | 26243 uc003ntg.1
[4] chr6 [31321649, 31324989] - | 26244 uc003nth.2
[5] chr6 [31322256, 31323369] - | 26245 uc003nti.1
[6] chr6 [31323094, 31324145] - | 26246 uc010jsn.1
[7] chr6 [31323299, 31324734] - | 26247 uc010jso.2
-------
seqinfo: 25 sequences (1 circular) from hg19 genome
If I do the same on the whole GRangesList object, however, the whole gene vanishes, not only the ranges those from the alternative contigs:
> seqlevels( transcripts, force=TRUE ) <- realchroms
> transcripts
GRangesList object of length 23191:
$1
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr19 [58858172, 58864865] - | 70455 uc002qsd.4
[2] chr19 [58859832, 58874214] - | 70456 uc002qsf.2
$10
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
[1] chr8 [18248755, 18258723] + | 31944 uc003wyw.1
$100
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
[1] chr20 [43248163, 43280376] - | 72132 uc002xmj.3
...
<23188 more elements>
-------
seqinfo: 25 sequences (1 circular) from hg19 genome
> transcripts[["3106"]]
NULL
So: Is this intended or a bug? And: How do I now get rid of the extra sequences?
Thanks
Simon
> sessionInfo() R version 3.2.2 (2015-08-14) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 14.04.3 LTS locale: [1] LC_CTYPE=fi_FI.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 [2] GenomicFeatures_1.22.8 [3] AnnotationDbi_1.32.3 [4] Biobase_2.30.0 [5] GenomicRanges_1.22.3 [6] GenomeInfoDb_1.6.1 [7] IRanges_2.4.6 [8] S4Vectors_0.8.7 [9] BiocGenerics_0.16.1 loaded via a namespace (and not attached): [1] XVector_0.10.0 zlibbioc_1.16.0 [3] GenomicAlignments_1.6.3 BiocParallel_1.4.3 [5] tools_3.2.2 SummarizedExperiment_1.0.2 [7] DBI_0.3.1 lambda.r_1.1.7 [9] futile.logger_1.4.1 rtracklayer_1.30.1 [11] futile.options_1.0.0 bitops_1.0-6 [13] RCurl_1.95-4.7 biomaRt_2.26.1 [15] RSQLite_1.0.0 Biostrings_2.38.3 [17] Rsamtools_1.22.0 XML_3.98-1.3

Looks like a bug to me, but just wanted to point out that
keepStandardChromosomes(transcripts)might be more convenient.Thanks. This is more convenient.
(How would I have found about this, by the way? I mean, if I started reading from the GenomicRanges and the AnnotationDbi vignettes.)
Probably best to start reading the high-level workflows, like this one, instead of individual package vignettes, which will only discuss features of that package. There are over a dozen packages constituting the infrastructure now.