Question

Cannot extract TxFeatures for a defined GRanges object

1

Entering edit mode

oolongoni ▴ 40

@oolongoni-2289

Last seen 8.0 years ago

California

I wish to extract a TxFeatures object for the MYC gene. I've successfully used this code for another gene (CD19). I've read the Genomic Ranges and TxDB vignettes through but missing the subtleties...

Perhaps I do not understand the clearly `TxFeatures` object? Help is greatly appreciated!

## make a EnsDb class object
library(EnsDb.Hsapiens.v86)
edb <- EnsDb.Hsapiens.v86

### Option 1 GRanges objects derived from ensembldb
MYC <- genes(edb, filter = list(GenenameFilter("MYC"), GeneBiotypeFilter("protein_coding")))
seqlevels(MYC)
> MYC
GRanges object with 1 range and 6 metadata columns:
                  seqnames                 ranges strand |         gene_id   gene_name    entrezid
                     <Rle>              <IRanges>  <Rle> |     <character> <character> <character>
  ENSG00000136997        8 [127735434, 127741434]      + | ENSG00000136997         MYC        4609
                    gene_biotype seq_coord_system      symbol
                     <character>      <character> <character>
  ENSG00000136997 protein_coding       chromosome         MYC
  -------
  seqinfo: 1 sequence from GRCh38 genome

## Option 2: construct GRange object by hand
myc <- GRanges(
      seqnames = Rle(8, 1),
      ranges = IRanges(start=127735434, end=127741434, names=1),
      strand = Rle(strand("+"), 1)
)
> myc
GRanges object with 1 range and 0 metadata columns:
    seqnames                 ranges strand
       <Rle>              <IRanges>  <Rle>
  1        8 [127735434, 127741434]      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

## make a TxDb object #####
library(SGSeq)
hsaEnsembl <- makeTxDbFromBiomart(dataset="hsapiens_gene_ensembl")
seqlevels(hsaEnsembl) <- "8"
## extract TxFeatures from TxFeatures subset
txf <- convertToTxFeatures(hsaEnsembl)

Apparently nothing overlapped?

> txf_MYC <- txf[txf %over% MYC]
> txf_MYC

TxFeatures object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand     type          txName        geneName
      <Rle> <IRanges>  <Rle> <factor> <CharacterList> <CharacterList>
  -------
  seqinfo: 555 sequences (1 circular) from an unspecified genome

> txf_myc <- txf[txf %over% myc]

> txf_myc
TxFeatures object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand     type          txName        geneName
      <Rle> <IRanges>  <Rle> <factor> <CharacterList> <CharacterList>
  -------
  seqinfo: 555 sequences (1 circular) from an unspecified genome

If I test with another GRanges for another gene, it works fine.

bcl6 <- GRanges(
  seqnames=Rle(3,1),
  ranges=IRanges(start=187721377, end=187745727,names=1),
  strand=Rle(strand("-"),1)
)
> bcl6
GRanges object with 1 range and 0 metadata columns:
    seqnames                 ranges strand
       <Rle>              <IRanges>  <Rle>
  1        3 [187721377, 187745727]      -
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Resetting seqlevels my TxDb object
> seqlevels(hsaEnsembl) <- seqlevels0(hsaEnsembl)
> txf <- convertToTxFeatures(hsaEnsembl)
> txf_bcl6 <- txf[txf %over% bcl6]
> txf_bcl6
TxFeatures object with 44 ranges and 0 metadata columns:
       seqnames                 ranges strand     type                                              txName        geneName
          <Rle>              <IRanges>  <Rle> <factor>                                     <CharacterList> <CharacterList>
   [1]        3 [187721377, 187722601]      -        L                     ENST00000621333,ENST00000406870 ENSG00000113916
   [2]        3 [187721387, 187722601]      -        L                                     ENST00000419510 ENSG00000113916
   [3]        3 [187722398, 187722601]      -        L                                     ENST00000232014 ENSG00000113916
   [4]        3 [187722432, 187722601]      -        L                                     ENST00000450123 ENSG00000113916
   [5]        3 [187722601, 187724941]      -        J ENST00000621333,ENST00000406870,ENST00000419510,... ENSG00000113916
   ...      ...                    ...    ...      ...                                                 ...             ...
  [40]        3 [187745410, 187745437]      -        F                                     ENST00000480458 ENSG00000113916
  [41]        3 [187745410, 187745459]      -        F                                     ENST00000496823 ENSG00000113916
  [42]        3 [187745410, 187745472]      -        F                                     ENST00000470319 ENSG00000113916
  [43]        3 [187745410, 187745725]      -        F                                     ENST00000621333 ENSG00000113916
  [44]        3 [187745410, 187745727]      -        F                                     ENST00000406870 ENSG00000113916
  -------
  seqinfo: 1 sequence from an unspecified genome

TxDB genomicranges genomicfeatures • 2.4k views

ADD COMMENT • link updated 8.2 years ago by Hervé Pagès 16k • written 8.2 years ago by oolongoni ▴ 40

score 1 · Answer 1 · 2017-10-30

Hi,

As discussed on the bioc-devel mailing list in this thread:

https://stat.ethz.ch/pipermail/bioc-devel/2017-October/012215.html

there seems to be a problem with the BioMart service that causes part of the data requested by makeTxDbFromBiomart() to get lost while on its way from Ensembl to the TxDb object.

As announced in the same thread, I just added makeTxDbFromEnsembl() to GenomicFeatures. Can be used instead of makeTxDbFromBiomart() to make a TxDb object from Ensembl data. makeTxDbFromEnsembl() bypasses BioMart by querying directly the Ensembl MySQL server. Only lightly tested but so far it seems faster and more reliable than makeTxDbFromBiomart(). See the thread on bioc-devel for more information.

Cheers,

H.

score 0 · Answer 2 · 2017-10-25

0

Entering edit mode

Leonard Goldstein ▴ 260

@leonard-goldstein-6845

Last seen 19 months ago

Australia

It looks like MYC is missing from the TxDb object you created, so this is a question about the makeTxDbFromBiomart function in the GenomicFeatures package. Maybe repost your question and add GenomicFeatures as one of the tags, and hopefully someone can help. Alternatively you can try using one of the existing TxDb packages.

ADD COMMENT • link 8.2 years ago Leonard Goldstein ▴ 260

score 0 · Answer 3 · 2017-10-25

It doesn't appear to be the TxDb:

> tx <- makeTxDbFromBiomart()

> txs <- transcriptsBy(tx)
> subsetByOverlaps(txs, GRanges("8", IRanges(127735434,127741434)))
GRangesList object of length 1:
$ENSG00000136997
GRanges object with 9 ranges and 2 metadata columns:
      seqnames                 ranges strand |     tx_id         tx_name
         <Rle>              <IRanges>  <Rle> | <integer>     <character>
  [1]        8 [127735434, 127740477]      + |    101876 ENST00000259523
  [2]        8 [127735473, 127735817]      + |    101877 ENST00000641252
  [3]        8 [127735519, 127738772]      + |    101878 ENST00000517291
  [4]        8 [127736046, 127736612]      + |    101879 ENST00000641036
  [5]        8 [127736069, 127741434]      + |    101880 ENST00000621592
  [6]        8 [127736084, 127741434]      + |    101881 ENST00000377970
  [7]        8 [127736220, 127741372]      + |    101882 ENST00000524013
  [8]        8 [127736231, 127738475]      + |    101883 ENST00000520751
  [9]        8 [127736594, 127740958]      + |    101884 ENST00000613283

-------
seqinfo: 555 sequences (1 circular) from an unspecified genome

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] ensembldb_2.1.10       AnnotationFilter_1.1.3 GenomicFeatures_1.29.7
 [4] AnnotationDbi_1.39.1   Biobase_2.37.2         GenomicRanges_1.29.6  
 [7] GenomeInfoDb_1.13.4    IRanges_2.11.7         S4Vectors_0.15.5      
[10] BiocGenerics_0.23.0    edgeR_3.19.3           limma_3.33.3          

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.7.5    progress_1.1.2               
 [3] locfit_1.5-9.1                lattice_0.20-35              
 [5] htmltools_0.3.6               rtracklayer_1.37.2           
 [7] yaml_2.1.14                   interactiveDisplayBase_1.15.0
 [9] blob_1.1.0                    XML_3.98-1.9                 
[11] rlang_0.1.1                   DBI_0.7                      
[13] BiocParallel_1.11.4           bit64_0.9-7                  
[15] matrixStats_0.52.2            GenomeInfoDbData_0.99.1      
[17] ProtGenerics_1.9.0            zlibbioc_1.23.0              
[19] Biostrings_2.45.2             memoise_1.1.0                
[21] biomaRt_2.33.3                httpuv_1.3.5                 
[23] BiocInstaller_1.26.1          curl_2.7                     
[25] Rcpp_0.12.11                  xtable_1.8-2                 
[27] DelayedArray_0.3.16           XVector_0.17.0               
[29] mime_0.5                      bit_1.1-12                   
[31] Rsamtools_1.29.0              AnnotationHub_2.9.5          
[33] digest_0.6.12                 shiny_1.0.3                  
[35] grid_3.4.1                    tools_3.4.1                  
[37] bitops_1.0-6                  magrittr_1.5                 
[39] RCurl_1.95-4.8                lazyeval_0.2.0               
[41] tibble_1.3.3                  RSQLite_2.0                  
[43] pkgconfig_2.0.1               Matrix_1.2-10                
[45] prettyunits_1.0.2             assertthat_0.2.0             
[47] httr_1.2.1                    R6_2.2.2                     
[49] GenomicAlignments_1.13.2      compiler_3.4.1

score 0 · Answer 4 · 2017-10-25

exonsBy seems to be dropping a lot of genes, including MYC:

> tx <- makeTxDbFromBiomart()
> txs_by_gene <- transcriptsBy(tx, "gene")
> exs_by_gene <- exonsBy(tx, "gene")
> length(txs_by_gene)
[1] 63967
> length(exs_by_gene)
[1] 36751
> subsetByOverlaps(txs_by_gene, GRanges("8", IRanges(127735434,127741434)))
GRangesList object of length 1:
$ENSG00000136997 
GRanges object with 9 ranges and 2 metadata columns:
      seqnames                 ranges strand |     tx_id         tx_name
         <Rle>              <IRanges>  <Rle> | <integer>     <character>
  [1]        8 [127735434, 127740477]      + |    101876 ENST00000259523
  [2]        8 [127735473, 127735817]      + |    101877 ENST00000641252
  [3]        8 [127735519, 127738772]      + |    101878 ENST00000517291
  [4]        8 [127736046, 127736612]      + |    101879 ENST00000641036
  [5]        8 [127736069, 127741434]      + |    101880 ENST00000621592
  [6]        8 [127736084, 127741434]      + |    101881 ENST00000377970
  [7]        8 [127736220, 127741372]      + |    101882 ENST00000524013
  [8]        8 [127736231, 127738475]      + |    101883 ENST00000520751
  [9]        8 [127736594, 127740958]      + |    101884 ENST00000613283

-------
seqinfo: 555 sequences (1 circular) from an unspecified genome
> subsetByOverlaps(exs_by_gene, GRanges("8", IRanges(127735434,127741434)))
GRangesList object of length 0:
<0 elements>

-------
seqinfo: 555 sequences (1 circular) from an unspecified genome
> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] GenomicFeatures_1.28.5 AnnotationDbi_1.38.2   Biobase_2.36.2        
[4] GenomicRanges_1.28.6   GenomeInfoDb_1.12.3    IRanges_2.10.5        
[7] S4Vectors_0.14.7       BiocGenerics_0.22.1   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13               XVector_0.16.0            
 [3] GenomicAlignments_1.12.2   zlibbioc_1.22.0           
 [5] BiocParallel_1.10.1        bit_1.1-12                
 [7] lattice_0.20-35            rlang_0.1.2               
 [9] blob_1.1.0                 tools_3.4.1               
[11] grid_3.4.1                 SummarizedExperiment_1.6.5
[13] DBI_0.7                    matrixStats_0.52.2        
[15] bit64_0.9-7                digest_0.6.12             
[17] tibble_1.3.4               Matrix_1.2-11             
[19] GenomeInfoDbData_0.99.0    rtracklayer_1.36.6        
[21] bitops_1.0-6               RCurl_1.95-4.8            
[23] biomaRt_2.32.1             memoise_1.1.0             
[25] RSQLite_2.0                DelayedArray_0.2.7        
[27] compiler_3.4.1             Rsamtools_1.28.0          
[29] Biostrings_2.44.2          XML_3.98-1.9              
[31] pkgconfig_2.0.1           
>