Question

Splitting a TxDB or GRanges by Chromosome

0

Entering edit mode

jon.klonowski ▴ 40

@8f2f6acd

Last seen 5 months ago

United States

Hi, I am trying to split a TxDB or GRanges by chromsome.

The TxDB object is made from gencode (https://www.gencodegenes.org/human/) using:

txdb <- makeTxDbFromGFF()

Looking at the manual, to just use chr1 data for example, I would just use:

seqlevels(txdb) <- "chr1"

However, when I run:

seqlevels(txdb)

I get an error:

Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'seqlevels': invalid DB file

Ok fine, whatever, lets try to use the GRanges object I made from this TxDB object instead:

#Exons by transcript
Exons <- exonsBy(Gencode, by = "tx", use.names=TRUE)

Where running seqlevels(Exons) outputs the names of 22 chromosomes, chrX, chrY and chrM. Perfect. However, since this is now not a txDB object, I cant use seqlevels(txdb) <- "chr1" Looking at the web I saw that people recommended using split():

ExonsSplitByChr <- split(Exons, seqlevels(Exons))

While I now have a list of lists for each chromosome, the slits each have ~4100 entries. I doubt that the genome has a perfectly matching amount of transcripts per chromosome....

Thank you for the help!

``` #Session Info:

R version 4.0.5 (2021-03-31) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Big Sur 10.16

Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods
[9] base

other attached packages: [1] GenomicFeatures_1.42.3 AnnotationDbi_1.52.0 Biobase_2.50.0
[4] GenomicRanges_1.42.0 GenomeInfoDb_1.26.7 IRanges_2.24.1
[7] S4Vectors_0.28.1 BiocGenerics_0.36.1

loaded via a namespace (and not attached): [1] Rcpp_1.0.7 here_1.0.1
[3] lattice_0.20-45 png_0.1-7
[5] prettyunits_1.1.1 Rsamtools_2.6.0
[7] Biostrings_2.58.0 rprojroot_2.0.2
[9] assertthat_0.2.1 utf8_1.2.2
[11] BiocFileCache_1.14.0 R6_2.5.1
[13] RSQLite_2.2.9 httr_1.4.2
[15] pillar_1.6.4 zlibbioc_1.36.0
[17] rlang_0.4.12 progress_1.2.2
[19] curl_4.3.2 rstudioapi_0.13
[21] blob_1.2.2 Matrix_1.3-4
[23] reticulate_1.22 BiocParallel_1.24.1
[25] stringr_1.4.0 RCurl_1.98-1.5
[27] bit_4.0.4 biomaRt_2.46.3
[29] tinytex_0.35 DelayedArray_0.16.3
[31] compiler_4.0.5 rtracklayer_1.50.0
[33] xfun_0.28 pkgconfig_2.0.3
[35] askpass_1.1 openssl_1.4.5
[37] tidyselect_1.1.1 SummarizedExperiment_1.20.0 [39] tibble_3.1.6 GenomeInfoDbData_1.2.4
[41] matrixStats_0.61.0 XML_3.99-0.8
[43] fansi_0.5.0 crayon_1.4.2
[45] dplyr_1.0.7 dbplyr_2.1.1
[47] GenomicAlignments_1.26.0 bitops_1.0-7
[49] rappdirs_0.3.3 grid_4.0.5
[51] jsonlite_1.7.2 lifecycle_1.0.1
[53] DBI_1.1.1 magrittr_2.0.1
[55] stringi_1.7.6 cachem_1.0.6
[57] XVector_0.30.0 xml2_1.3.3
[59] ellipsis_0.3.2 generics_0.1.1
[61] vctrs_0.3.8 tools_4.0.5
[63] bit64_4.0.5 glue_1.5.1
[65] purrr_0.3.4 hms_1.1.1
[67] MatrixGenerics_1.2.1 fastmap_1.1.0
[69] memoise_2.0.1

```

GenomicRanges split chromosome TxDb • 1.8k views

ADD COMMENT • link updated 2.3 years ago by Hervé Pagès 16k • written 2.3 years ago by jon.klonowski ▴ 40

0

Entering edit mode

Exons is a GRangesList object and calling split() on a GRangesList object is not a good idea. I don't even know if that works properly or what that is supposed to return.

If all you want to do is mask all transcripts that are not on chr1, then seqlevels(txdb) <- "chr1" is definitely the way to go. So we should probably focus on the first problem you had and try to figure out why seqlevels(txdb) does not work on your TxDb object. Care to share the details of how you got txdb in the first place? Seeing the exact code you used to generate that object would help. Thanks

ADD REPLY • link 2.3 years ago Hervé Pagès 16k

0

Entering edit mode

yes definitely!

So i download the basic gene model from here: https://www.gencodegenes.org/human/

then I

Gencode <- makeTxDbFromGFF("./gencode.v39.basic.annotation.gtf.gz")
saveDb(Gencode, file="gencode.v39.basic.annotation.sqlite")
Gencode <- loadDb("gencode.v39.basic.annotation.sqlite")

and I use it from there.

ADD REPLY • link 2.3 years ago jon.klonowski ▴ 40

0

Entering edit mode

No problem here:

gtf_file <- "gencode.v39.basic.annotation.gtf.gz"
url <- paste0("https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/", gtf_file)
download.file(url, destfile=gtf_file)

library(GenomicFeatures)

txdb <- makeTxDbFromGFF(gtf_file)
# Import genomic features from the file as a GRanges object ... OK
# Prepare the 'metadata' data frame ... OK
# Make the TxDb object ... OK
# Warning message:
# In .get_cds_IDX(mcols0$type, mcols0$phase) :
#   The "phase" metadata column contains non-NA values for features of type
#   stop_codon. This information was ignored.

seqlevels(txdb)
#  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
# [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
# [19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM" 

seqlevels(txdb) <- "chr1"
seqlevels(txdb)

ex_by_tx <- exonsBy(txdb, by="tx", use.names=TRUE)

seqlevels(ex_by_tx)
# [1] "chr1"

ex_by_tx
# GRangesList object of length 10466:
# $ENST00000456328.2
# GRanges object with 3 ranges and 3 metadata columns:
#       seqnames      ranges strand |   exon_id         exon_name exon_rank
#          <Rle>   <IRanges>  <Rle> | <integer>       <character> <integer>
#   [1]     chr1 11869-12227      + |         1 ENSE00002234944.1         1
#   [2]     chr1 12613-12721      + |         5 ENSE00003582793.1         2
#   [3]     chr1 13221-14409      + |         8 ENSE00002312635.1         3
#   -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths
#
# $ENST00000450305.2
# GRanges object with 6 ranges and 3 metadata columns:
#       seqnames      ranges strand |   exon_id         exon_name exon_rank
#          <Rle>   <IRanges>  <Rle> | <integer>       <character> <integer>
#   [1]     chr1 12010-12057      + |         2 ENSE00001948541.1         1
#   [2]     chr1 12179-12227      + |         3 ENSE00001671638.2         2
#   [3]     chr1 12613-12697      + |         4 ENSE00001758273.2         3
#   [4]     chr1 12975-13052      + |         6 ENSE00001799933.2         4
#   [5]     chr1 13221-13374      + |         7 ENSE00001746346.2         5
#   [6]     chr1 13453-13670      + |         9 ENSE00001863096.1         6
#   -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths
#
# $ENST00000473358.1
# GRanges object with 3 ranges and 3 metadata columns:
#       seqnames      ranges strand |   exon_id         exon_name exon_rank
#          <Rle>   <IRanges>  <Rle> | <integer>       <character> <integer>
#   [1]     chr1 29554-30039      + |        10 ENSE00001947070.1         1
#   [2]     chr1 30564-30667      + |        13 ENSE00001922571.1         2
#   [3]     chr1 30976-31097      + |        14 ENSE00001827679.1         3
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths
# 
# ...
# <10463 more elements>

Please show your sessionInfo(). Here is mine.

sessionInfo():

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.1.1/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] GenomicFeatures_1.46.1 AnnotationDbi_1.56.2   Biobase_2.54.0        
[4] GenomicRanges_1.46.1   GenomeInfoDb_1.30.0    IRanges_2.28.0        
[7] S4Vectors_0.32.3       BiocGenerics_0.40.0   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7                  lattice_0.20-45            
 [3] prettyunits_1.1.1           png_0.1-7                  
 [5] Rsamtools_2.10.0            Biostrings_2.62.0          
 [7] assertthat_0.2.1            digest_0.6.29              
 [9] utf8_1.2.2                  BiocFileCache_2.2.0        
[11] R6_2.5.1                    RSQLite_2.2.9              
[13] httr_1.4.2                  pillar_1.6.4               
[15] zlibbioc_1.40.0             rlang_0.4.12               
[17] progress_1.2.2              curl_4.3.2                 
[19] blob_1.2.2                  Matrix_1.4-0               
[21] BiocParallel_1.28.3         stringr_1.4.0              
[23] RCurl_1.98-1.5              bit_4.0.4                  
[25] biomaRt_2.50.1              DelayedArray_0.20.0        
[27] RMariaDB_1.2.0              compiler_4.1.1             
[29] rtracklayer_1.54.0          pkgconfig_2.0.3            
[31] tidyselect_1.1.1            KEGGREST_1.34.0            
[33] SummarizedExperiment_1.24.0 tibble_3.1.6               
[35] GenomeInfoDbData_1.2.7      matrixStats_0.61.0         
[37] XML_3.99-0.8                fansi_0.5.0                
[39] crayon_1.4.2                dplyr_1.0.7                
[41] dbplyr_2.1.1                GenomicAlignments_1.30.0   
[43] bitops_1.0-7                rappdirs_0.3.3             
[45] grid_4.1.1                  lifecycle_1.0.1            
[47] DBI_1.1.1                   magrittr_2.0.1             
[49] stringi_1.7.6               cachem_1.0.6               
[51] XVector_0.34.0              xml2_1.3.3                 
[53] ellipsis_0.3.2              filelock_1.0.2             
[55] generics_0.1.1              vctrs_0.3.8                
[57] rjson_0.2.20                restfulr_0.0.13            
[59] tools_4.1.1                 bit64_4.0.5                
[61] glue_1.5.1                  purrr_0.3.4                
[63] MatrixGenerics_1.6.0        hms_1.1.1                  
[65] parallel_4.1.1              fastmap_1.1.0              
[67] yaml_2.2.1                  memoise_2.0.1              
[69] BiocIO_1.4.0

ADD REPLY • link 2.3 years ago Hervé Pagès 16k

0

Entering edit mode

Ok so it was because I used GTF and not the GFF3. The vinaigrettes said the command makeTxDbFromGFF() worked for both GTF and GFF3 so I did not think it mattered but I guess the command cant pick upcertain things if it is a GTF instead of a GFF3

ADD REPLY • link 2.3 years ago jon.klonowski ▴ 40

score 3 · Accepted Answer · 2021-12-14

Everything should work the same with either the GTF or GFF3 file, including the seqlevels() getter and setter (makes no difference):

gtf_file <- "gencode.v39.basic.annotation.gtf.gz"
gff3_file <- "gencode.v39.basic.annotation.gff3.gz"

url <- paste0("https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/", gtf_file)
download.file(url, destfile=gtf_file)

url <- paste0("https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/", gff3_file)
download.file(url, destfile=gff3_file)

library(GenomicFeatures)
txdb1 <- makeTxDbFromGFF(gtf_file)
txdb2 <- makeTxDbFromGFF(gff3_file)

seqlevels(txdb1)
#  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
# [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
# [19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM" 
seqlevels(txdb2)
#  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
# [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
# [19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM" 

tx1 <- transcripts(txdb1)
tx2 <- transcripts(txdb2)

## Same number of transcripts:
length(tx1)
# [1] 114090
length(tx2)
# [1] 114090

## The transcripts are the same:
all(tx1 == tx2)
# [1] TRUE

seqlevels(txdb1) <- seqlevels(txdb2) <- "chr1"
seqlevels(txdb1)
# [1] "chr1"
seqlevels(txdb2)
# [1] "chr1"

ex_by_tx1 <- exonsBy(txdb1, by="tx", use.names=TRUE)
ex_by_tx2 <- exonsBy(txdb2, by="tx", use.names=TRUE)

## Comparing the exons in each transcript (takes about 2.5 min on my laptop):
ok <- mapply(function(x, y) all(x == y), ex_by_tx1, ex_by_tx2)
all(ok)
# [1] TRUE

Cheers,

H.

sessionInfo():

R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 21.10

Matrix products: default
BLAS:   /home/hpages/R/R-4.1.1/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.1.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] GenomicFeatures_1.46.1 AnnotationDbi_1.56.2   Biobase_2.54.0        
[4] GenomicRanges_1.46.1   GenomeInfoDb_1.30.0    IRanges_2.28.0        
[7] S4Vectors_0.32.3       BiocGenerics_0.40.0   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7                  lattice_0.20-45            
 [3] prettyunits_1.1.1           png_0.1-7                  
 [5] Rsamtools_2.10.0            Biostrings_2.62.0          
 [7] assertthat_0.2.1            digest_0.6.29              
 [9] utf8_1.2.2                  BiocFileCache_2.2.0        
[11] R6_2.5.1                    RSQLite_2.2.9              
[13] httr_1.4.2                  pillar_1.6.4               
[15] zlibbioc_1.40.0             rlang_0.4.12               
[17] progress_1.2.2              curl_4.3.2                 
[19] blob_1.2.2                  Matrix_1.4-0               
[21] BiocParallel_1.28.3         stringr_1.4.0              
[23] RCurl_1.98-1.5              bit_4.0.4                  
[25] biomaRt_2.50.1              DelayedArray_0.20.0        
[27] compiler_4.1.1              rtracklayer_1.54.0         
[29] pkgconfig_2.0.3             tidyselect_1.1.1           
[31] KEGGREST_1.34.0             SummarizedExperiment_1.24.0
[33] tibble_3.1.6                GenomeInfoDbData_1.2.7     
[35] matrixStats_0.61.0          XML_3.99-0.8               
[37] fansi_0.5.0                 crayon_1.4.2               
[39] dplyr_1.0.7                 dbplyr_2.1.1               
[41] GenomicAlignments_1.30.0    bitops_1.0-7               
[43] rappdirs_0.3.3              grid_4.1.1                 
[45] lifecycle_1.0.1             DBI_1.1.1                  
[47] magrittr_2.0.1              stringi_1.7.6              
[49] cachem_1.0.6                XVector_0.34.0             
[51] xml2_1.3.3                  ellipsis_0.3.2             
[53] filelock_1.0.2              generics_0.1.1             
[55] vctrs_0.3.8                 rjson_0.2.20               
[57] restfulr_0.0.13             tools_4.1.1                
[59] bit64_4.0.5                 glue_1.5.1                 
[61] purrr_0.3.4                 MatrixGenerics_1.6.0       
[63] hms_1.1.1                   parallel_4.1.1             
[65] fastmap_1.1.0               yaml_2.2.1                 
[67] memoise_2.0.1               BiocIO_1.4.0