as.data.table method for GRanges modifies the original object by reference
0
1
Entering edit mode
alex.gos90 ▴ 10
@alexgos90-13597
Last seen 22 months ago
Germany

When using the as.data.table() function from the data.table package to create a data.table out of a GRanges object,
the default method is called (https://github.com/Rdatatable/data.table/blob/master/R/as.data.table.R#L8-L10).

This leads to the problem that the new data.table modifies the original data inside the GRanges by reference, leading to unexpected behavior. 

I have a minimal example showing this: 

    library(GenomicRanges)
    library(data.table)

    ## create small Granges object
    gr <- GRanges(seqnames = "chr10",
              ranges = IRanges(start = c(1:10),end = c(11:20)),
              strand = "*")

    ## coerce to data.table (should return a copy of original data, shouldn't it?)
    dt <- as.data.table(gr)

    ## modify using functional form assignment by reference 
    dt[1:3, `:=`(seqnames=unique(seqnames),
                             start=min(start),
                             end=max(end),
                             #width=2,
                             strand = "*")]

    ## the data.table is changed,
    dt


#    seqnames start end width strand
#1:    chr10     1  13    11      *
#2:    chr10     1  13    11      *
#3:    chr10     1  13    11      *
#4:    chr10     4  14    11      *
#5:    chr10     5  15    11      *
#6:    chr10     6  16    11      *
#7:    chr10     7  17    11      *
#8:    chr10     8  18    11      *
#9:    chr10     9  19    11      *
#10:    chr10    10  20    11      *
    ## but Granges is modified as well (somehow end(gr[1:3]) equals dt[1:3]$width).
    gr

#GRanges object with 10 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]    chr10  [ 1, 11]      *
#   [2]    chr10  [ 1, 11]      *
#   [3]    chr10  [ 1, 11]      *
#   [4]    chr10  [ 4, 14]      *
#   [5]    chr10  [ 5, 15]      *
#   [6]    chr10  [ 6, 16]      *
#   [7]    chr10  [ 7, 17]      *
#   [8]    chr10  [ 8, 18]      *
#   [9]    chr10  [ 9, 19]      *
#  [10]    chr10  [10, 20]      *
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths
    ## this gets even clearer when modifying dt$width
    dt[4:6, `:=`(#seqnames=unique(seqnames),
             #start=min(start),
             #end=max(end),
             width=2#,
             #strand = "*"
             )]

    ## changing the width in data.table adapts the end of the gr
    gr
#GRanges object with 10 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]    chr10  [ 1, 11]      *
#   [2]    chr10  [ 1, 11]      *
#   [3]    chr10  [ 1, 11]      *
#   [4]    chr10  [ 4,  5]      *
#   [5]    chr10  [ 5,  6]      *
#   [6]    chr10  [ 6,  7]      *
#   [7]    chr10  [ 7, 17]      *
#   [8]    chr10  [ 8, 18]      *
#   [9]    chr10  [ 9, 19]      *
#  [10]    chr10  [10, 20]      *
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths

 

Right now my solution is to create a copy before continuing to modify the data.table:

    gr2 <- GRanges(seqnames = "chr10",
              ranges = IRanges(start = c(1:10),end = c(11:20)),
              strand = "*")

    dt2 <- copy(as.data.table(gr))

    dt2[4:6, `:=`(width=2)]

    dt2
#    seqnames start end width strand
# 1:    chr10     1  11    11      *
# 2:    chr10     1  11    11      *
# 3:    chr10     1  11    11      *
# 4:    chr10     4   5     2      *
# 5:    chr10     5   6     2      *
# 6:    chr10     6   7     2      *
# 7:    chr10     7  17    11      *
# 8:    chr10     8  18    11      *
# 9:    chr10     9  19    11      *
#10:    chr10    10  20    11      *

    gr2

 

#GRanges object with 10 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]    chr10  [ 1, 11]      *
#   [2]    chr10  [ 2, 12]      *
#   [3]    chr10  [ 3, 13]      *
#   [4]    chr10  [ 4, 14]      *
#   [5]    chr10  [ 5, 15]      *
#   [6]    chr10  [ 6, 16]      *
#   [7]    chr10  [ 7, 17]      *
#   [8]    chr10  [ 8, 18]      *
#   [9]    chr10  [ 9, 19]      *
#  [10]    chr10  [10, 20]      *
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths

 

But since not everybody is aware of this behavior I wanted to post this here, 
maybe there is an easy way to prevent this behavior. 

 

 

 

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4    genomation_1.8.0     methylKit_1.2.0      GenomicRanges_1.28.4 GenomeInfoDb_1.12.2  IRanges_2.10.2       S4Vectors_0.14.3    
[8] BiocGenerics_0.22.0 

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.6.3 qvalue_2.8.0               gtools_3.5.0               reshape2_1.4.2             splines_3.4.0             
 [6] lattice_0.20-35            seqPattern_1.8.0           colorspace_1.3-2           rtracklayer_1.36.4         XML_3.98-1.9              
[11] rlang_0.1.1                R.oo_1.21.0                R.utils_2.5.0              BiocParallel_1.10.1        fastseg_1.22.0            
[16] matrixStats_0.52.2         GenomeInfoDbData_0.99.0    plyr_1.8.4                 stringr_1.2.0              zlibbioc_1.22.0           
[21] Biostrings_2.44.2          munsell_0.4.3              gtable_0.2.0               R.methodsS3_1.7.1          coda_0.19-1               
[26] Biobase_2.36.2             Rcpp_0.12.11               KernSmooth_2.23-15         readr_1.1.1                scales_0.4.1              
[31] BSgenome_1.44.0            limma_3.32.3               plotrix_3.6-5              DelayedArray_0.2.7         XVector_0.16.0            
[36] Rsamtools_1.28.0           impute_1.50.1              hms_0.3                    ggplot2_2.2.1              stringi_1.1.5             
[41] numDeriv_2016.8-1          tools_3.4.0                bitops_1.0-6               bbmle_1.0.19               magrittr_1.5              
[46] lazyeval_0.2.0             RCurl_1.95-4.8             tibble_1.3.3               MASS_7.3-47                Matrix_1.2-9              
[51] gridBase_0.4-7             emdbook_1.3.9              R6_2.2.2                   mclust_5.3                 GenomicAlignments_1.12.1  
[56] compiler_3.4.0   
genomicranges data.table • 2.4k views
ADD COMMENT

Login before adding your answer.

Traffic: 674 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6