Search
Question: as.data.table method for GRanges modifies the original object by reference
1
gravatar for alex.gos90
3 months ago by
alex.gos9010
alex.gos9010 wrote:

When using the as.data.table() function from the data.table package to create a data.table out of a GRanges object,
the default method is called (https://github.com/Rdatatable/data.table/blob/master/R/as.data.table.R#L8-L10).

This leads to the problem that the new data.table modifies the original data inside the GRanges by reference, leading to unexpected behavior. 

I have a minimal example showing this: 

    library(GenomicRanges)
    library(data.table)

    ## create small Granges object
    gr <- GRanges(seqnames = "chr10",
              ranges = IRanges(start = c(1:10),end = c(11:20)),
              strand = "*")

    ## coerce to data.table (should return a copy of original data, shouldn't it?)
    dt <- as.data.table(gr)

    ## modify using functional form assignment by reference 
    dt[1:3, `:=`(seqnames=unique(seqnames),
                             start=min(start),
                             end=max(end),
                             #width=2,
                             strand = "*")]

    ## the data.table is changed,
    dt


#    seqnames start end width strand
#1:    chr10     1  13    11      *
#2:    chr10     1  13    11      *
#3:    chr10     1  13    11      *
#4:    chr10     4  14    11      *
#5:    chr10     5  15    11      *
#6:    chr10     6  16    11      *
#7:    chr10     7  17    11      *
#8:    chr10     8  18    11      *
#9:    chr10     9  19    11      *
#10:    chr10    10  20    11      *
    ## but Granges is modified as well (somehow end(gr[1:3]) equals dt[1:3]$width).
    gr

#GRanges object with 10 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]    chr10  [ 1, 11]      *
#   [2]    chr10  [ 1, 11]      *
#   [3]    chr10  [ 1, 11]      *
#   [4]    chr10  [ 4, 14]      *
#   [5]    chr10  [ 5, 15]      *
#   [6]    chr10  [ 6, 16]      *
#   [7]    chr10  [ 7, 17]      *
#   [8]    chr10  [ 8, 18]      *
#   [9]    chr10  [ 9, 19]      *
#  [10]    chr10  [10, 20]      *
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths
    ## this gets even clearer when modifying dt$width
    dt[4:6, `:=`(#seqnames=unique(seqnames),
             #start=min(start),
             #end=max(end),
             width=2#,
             #strand = "*"
             )]

    ## changing the width in data.table adapts the end of the gr
    gr
#GRanges object with 10 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]    chr10  [ 1, 11]      *
#   [2]    chr10  [ 1, 11]      *
#   [3]    chr10  [ 1, 11]      *
#   [4]    chr10  [ 4,  5]      *
#   [5]    chr10  [ 5,  6]      *
#   [6]    chr10  [ 6,  7]      *
#   [7]    chr10  [ 7, 17]      *
#   [8]    chr10  [ 8, 18]      *
#   [9]    chr10  [ 9, 19]      *
#  [10]    chr10  [10, 20]      *
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths

 

Right now my solution is to create a copy before continuing to modify the data.table:

    gr2 <- GRanges(seqnames = "chr10",
              ranges = IRanges(start = c(1:10),end = c(11:20)),
              strand = "*")

    dt2 <- copy(as.data.table(gr))

    dt2[4:6, `:=`(width=2)]

    dt2
#    seqnames start end width strand
# 1:    chr10     1  11    11      *
# 2:    chr10     1  11    11      *
# 3:    chr10     1  11    11      *
# 4:    chr10     4   5     2      *
# 5:    chr10     5   6     2      *
# 6:    chr10     6   7     2      *
# 7:    chr10     7  17    11      *
# 8:    chr10     8  18    11      *
# 9:    chr10     9  19    11      *
#10:    chr10    10  20    11      *

    gr2

 

#GRanges object with 10 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]    chr10  [ 1, 11]      *
#   [2]    chr10  [ 2, 12]      *
#   [3]    chr10  [ 3, 13]      *
#   [4]    chr10  [ 4, 14]      *
#   [5]    chr10  [ 5, 15]      *
#   [6]    chr10  [ 6, 16]      *
#   [7]    chr10  [ 7, 17]      *
#   [8]    chr10  [ 8, 18]      *
#   [9]    chr10  [ 9, 19]      *
#  [10]    chr10  [10, 20]      *
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths

 

But since not everybody is aware of this behavior I wanted to post this here, 
maybe there is an easy way to prevent this behavior. 

 

 

 

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4    genomation_1.8.0     methylKit_1.2.0      GenomicRanges_1.28.4 GenomeInfoDb_1.12.2  IRanges_2.10.2       S4Vectors_0.14.3    
[8] BiocGenerics_0.22.0 

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.6.3 qvalue_2.8.0               gtools_3.5.0               reshape2_1.4.2             splines_3.4.0             
 [6] lattice_0.20-35            seqPattern_1.8.0           colorspace_1.3-2           rtracklayer_1.36.4         XML_3.98-1.9              
[11] rlang_0.1.1                R.oo_1.21.0                R.utils_2.5.0              BiocParallel_1.10.1        fastseg_1.22.0            
[16] matrixStats_0.52.2         GenomeInfoDbData_0.99.0    plyr_1.8.4                 stringr_1.2.0              zlibbioc_1.22.0           
[21] Biostrings_2.44.2          munsell_0.4.3              gtable_0.2.0               R.methodsS3_1.7.1          coda_0.19-1               
[26] Biobase_2.36.2             Rcpp_0.12.11               KernSmooth_2.23-15         readr_1.1.1                scales_0.4.1              
[31] BSgenome_1.44.0            limma_3.32.3               plotrix_3.6-5              DelayedArray_0.2.7         XVector_0.16.0            
[36] Rsamtools_1.28.0           impute_1.50.1              hms_0.3                    ggplot2_2.2.1              stringi_1.1.5             
[41] numDeriv_2016.8-1          tools_3.4.0                bitops_1.0-6               bbmle_1.0.19               magrittr_1.5              
[46] lazyeval_0.2.0             RCurl_1.95-4.8             tibble_1.3.3               MASS_7.3-47                Matrix_1.2-9              
[51] gridBase_0.4-7             emdbook_1.3.9              R6_2.2.2                   mclust_5.3                 GenomicAlignments_1.12.1  
[56] compiler_3.4.0   
ADD COMMENTlink modified 3 months ago • written 3 months ago by alex.gos9010
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 141 users visited in the last hour