GRangesList and `seqlevels<-` with force=TRUE
1
0
Entering edit mode
Simon Anders ★ 3.7k
@simon-anders-3855
Last seen 3.7 years ago
Zentrum für Molekularbiologie, Universi…

Hi

I wonder if the following might be a bug.

As in my previous post, I start with

> library( GenomicRanges )
> library( TxDb.Hsapiens.UCSC.hg19.knownGene )
> transcripts <- transcriptsBy( TxDb.Hsapiens.UCSC.hg19.knownGene )

This object contains all kinds of extra contigs:

> seqlevels(transcripts)
 [1] "chr1"                  "chr2"                  "chr3"                 
 [4] "chr4"                  "chr5"                  "chr6"                 
 [7] "chr7"                  "chr8"                  "chr9"                 
[...]
[22] "chr22"                 "chrX"                  "chrY"                 
[25] "chrM"                  "chr1_gl000191_random"  "chr1_gl000192_random" 
[28] "chr4_ctg9_hap1"        "chr4_gl000193_random"  "chr4_gl000194_random" 
[31] "chr6_apd_hap1"         "chr6_cox_hap2"         "chr6_dbb_hap3"        
[34] "chr6_mann_hap4"        "chr6_mcf_hap5"         "chr6_qbl_hap6"        
[37] "chr6_ssto_hap7"        "chr7_gl000195_random"  "chr8_gl000196_random" 
[...]

I keep only the useful ones

> realchroms <- seqlevels(transcripts)[1:25]
> realchroms
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM" 

For a GRanges object, I can now remove all ranges on the extra contigs. For example (using a gene from the HLA):

> hlab <- transcripts[["3106"]]
> hlab
GRanges object with 36 ranges and 2 metadata columns:
             seqnames               ranges strand   |     tx_id     tx_name
                <Rle>            <IRanges>  <Rle>   | <integer> <character>
   [1]           chr6 [31237743, 31324989]      -   |     26240  uc031snl.1
   [2]           chr6 [31238850, 31324989]      -   |     26241  uc003ntf.2
   [3]           chr6 [31321649, 31324450]      -   |     26243  uc003ntg.1
   [4]           chr6 [31321649, 31324989]      -   |     26244  uc003nth.2
   [5]           chr6 [31322256, 31323369]      -   |     26245  uc003nti.1
   ...            ...                  ...    ... ...       ...         ...
  [32] chr6_ssto_hap7   [2655408, 2658208]      -   |     82522  uc011jii.2
  [33] chr6_ssto_hap7   [2655408, 2658748]      -   |     82523  uc011jij.2
  [34] chr6_ssto_hap7   [2656015, 2657128]      -   |     82524  uc011jik.1
  [35] chr6_ssto_hap7   [2656853, 2657905]      -   |     82525  uc011jil.1
  [36] chr6_ssto_hap7   [2657058, 2658492]      -   |     82526  uc011jim.1
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

> seqlevels( hlab, force=TRUE ) <- realchroms
> hlab
GRanges object with 7 ranges and 2 metadata columns:
      seqnames               ranges strand |     tx_id     tx_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]     chr6 [31237743, 31324989]      - |     26240  uc031snl.1
  [2]     chr6 [31238850, 31324989]      - |     26241  uc003ntf.2
  [3]     chr6 [31321649, 31324450]      - |     26243  uc003ntg.1
  [4]     chr6 [31321649, 31324989]      - |     26244  uc003nth.2
  [5]     chr6 [31322256, 31323369]      - |     26245  uc003nti.1
  [6]     chr6 [31323094, 31324145]      - |     26246  uc010jsn.1
  [7]     chr6 [31323299, 31324734]      - |     26247  uc010jso.2
  -------
  seqinfo: 25 sequences (1 circular) from hg19 genome

If I do the same on the whole GRangesList object, however, the whole gene vanishes, not only the ranges those from the alternative contigs:

> seqlevels( transcripts, force=TRUE ) <- realchroms
> transcripts
GRangesList object of length 23191:
$1
GRanges object with 2 ranges and 2 metadata columns:
      seqnames               ranges strand |     tx_id     tx_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]    chr19 [58858172, 58864865]      - |     70455  uc002qsd.4
  [2]    chr19 [58859832, 58874214]      - |     70456  uc002qsf.2

$10
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand | tx_id    tx_name
  [1]     chr8 [18248755, 18258723]      + | 31944 uc003wyw.1

$100
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand | tx_id    tx_name
  [1]    chr20 [43248163, 43280376]      - | 72132 uc002xmj.3

...
<23188 more elements>
-------
seqinfo: 25 sequences (1 circular) from hg19 genome

> transcripts[["3106"]]
NULL

So: Is this intended or a bug? And: How do I now get rid of the extra sequences?

Thanks

   Simon

 

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.3 LTS

locale:
 [1] LC_CTYPE=fi_FI.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[2] GenomicFeatures_1.22.8                 
[3] AnnotationDbi_1.32.3                   
[4] Biobase_2.30.0                         
[5] GenomicRanges_1.22.3                   
[6] GenomeInfoDb_1.6.1                     
[7] IRanges_2.4.6                          
[8] S4Vectors_0.8.7                        
[9] BiocGenerics_0.16.1                    

loaded via a namespace (and not attached):
 [1] XVector_0.10.0             zlibbioc_1.16.0           
 [3] GenomicAlignments_1.6.3    BiocParallel_1.4.3        
 [5] tools_3.2.2                SummarizedExperiment_1.0.2
 [7] DBI_0.3.1                  lambda.r_1.1.7            
 [9] futile.logger_1.4.1        rtracklayer_1.30.1        
[11] futile.options_1.0.0       bitops_1.0-6              
[13] RCurl_1.95-4.7             biomaRt_2.26.1            
[15] RSQLite_1.0.0              Biostrings_2.38.3         
[17] Rsamtools_1.22.0           XML_3.98-1.3              

 

GRangesList GenomicRanges • 1.8k views
ADD COMMENT
2
Entering edit mode

Looks like a bug to me, but just wanted to point out that keepStandardChromosomes(transcripts) might be more convenient.

ADD REPLY
0
Entering edit mode

Thanks. This is more convenient.

(How would I have found about this, by the way? I mean, if I started reading from the GenomicRanges and the AnnotationDbi vignettes.)

ADD REPLY
0
Entering edit mode

Probably best to start reading the high-level workflows, like this one, instead of individual package vignettes, which will only discuss features of that package. There are over a dozen packages constituting the infrastructure now.

ADD REPLY
2
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Hi Simon,

The vanishing of the whole gene is intended. This is the semantic of the seqlevels() setter on List objects. See B. DROP SEQLEVELS OF A LIST-LIKE OBJECT in the examples section of ?seqlevels for more information. It explains how to get rid of feature components located on given chromosomes without getting rid of the whole feature.

The reason for the current semantic of seqlevels<- on List objects is to not break the structure of the feature. The structure is not important when the feature is a gene and the components are transcripts (your case) but VERY important when the feature is a transcript and the components are exons (e.g. exonsBy(txdb, "tx")).

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Thanks. That makes sense.

ADD REPLY

Login before adding your answer.

Traffic: 796 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6