When using the as.data.table() function from the data.table package to create a data.table out of a GRanges object,
the default method is called (https://github.com/Rdatatable/data.table/blob/master/R/as.data.table.R#L8-L10).
This leads to the problem that the new data.table modifies the original data inside the GRanges by reference, leading to unexpected behavior.
I have a minimal example showing this:
library(GenomicRanges) library(data.table) ## create small Granges object gr <- GRanges(seqnames = "chr10", ranges = IRanges(start = c(1:10),end = c(11:20)), strand = "*") ## coerce to data.table (should return a copy of original data, shouldn't it?) dt <- as.data.table(gr) ## modify using functional form assignment by reference dt[1:3, `:=`(seqnames=unique(seqnames), start=min(start), end=max(end), #width=2, strand = "*")] ## the data.table is changed, dt # seqnames start end width strand #1: chr10 1 13 11 * #2: chr10 1 13 11 * #3: chr10 1 13 11 * #4: chr10 4 14 11 * #5: chr10 5 15 11 * #6: chr10 6 16 11 * #7: chr10 7 17 11 * #8: chr10 8 18 11 * #9: chr10 9 19 11 * #10: chr10 10 20 11 *
## but Granges is modified as well (somehow end(gr[1:3]) equals dt[1:3]$width). gr
#GRanges object with 10 ranges and 0 metadata columns: # seqnames ranges strand # <Rle> <IRanges> <Rle> # [1] chr10 [ 1, 11] * # [2] chr10 [ 1, 11] * # [3] chr10 [ 1, 11] * # [4] chr10 [ 4, 14] * # [5] chr10 [ 5, 15] * # [6] chr10 [ 6, 16] * # [7] chr10 [ 7, 17] * # [8] chr10 [ 8, 18] * # [9] chr10 [ 9, 19] * # [10] chr10 [10, 20] * # ------- # seqinfo: 1 sequence from an unspecified genome; no seqlengths
## this gets even clearer when modifying dt$width dt[4:6, `:=`(#seqnames=unique(seqnames), #start=min(start), #end=max(end), width=2#, #strand = "*" )] ## changing the width in data.table adapts the end of the gr gr
#GRanges object with 10 ranges and 0 metadata columns: # seqnames ranges strand # <Rle> <IRanges> <Rle> # [1] chr10 [ 1, 11] * # [2] chr10 [ 1, 11] * # [3] chr10 [ 1, 11] * # [4] chr10 [ 4, 5] * # [5] chr10 [ 5, 6] * # [6] chr10 [ 6, 7] * # [7] chr10 [ 7, 17] * # [8] chr10 [ 8, 18] * # [9] chr10 [ 9, 19] * # [10] chr10 [10, 20] * # ------- # seqinfo: 1 sequence from an unspecified genome; no seqlengths
Right now my solution is to create a copy before continuing to modify the data.table:
gr2 <- GRanges(seqnames = "chr10", ranges = IRanges(start = c(1:10),end = c(11:20)), strand = "*") dt2 <- copy(as.data.table(gr)) dt2[4:6, `:=`(width=2)] dt2
# seqnames start end width strand # 1: chr10 1 11 11 * # 2: chr10 1 11 11 * # 3: chr10 1 11 11 * # 4: chr10 4 5 2 * # 5: chr10 5 6 2 * # 6: chr10 6 7 2 * # 7: chr10 7 17 11 * # 8: chr10 8 18 11 * # 9: chr10 9 19 11 * #10: chr10 10 20 11 *
gr2
#GRanges object with 10 ranges and 0 metadata columns: # seqnames ranges strand # <Rle> <IRanges> <Rle> # [1] chr10 [ 1, 11] * # [2] chr10 [ 2, 12] * # [3] chr10 [ 3, 13] * # [4] chr10 [ 4, 14] * # [5] chr10 [ 5, 15] * # [6] chr10 [ 6, 16] * # [7] chr10 [ 7, 17] * # [8] chr10 [ 8, 18] * # [9] chr10 [ 9, 19] * # [10] chr10 [10, 20] * # ------- # seqinfo: 1 sequence from an unspecified genome; no seqlengths
But since not everybody is aware of this behavior I wanted to post this here,
maybe there is an easy way to prevent this behavior.
> sessionInfo() R version 3.4.0 (2017-04-21) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core) Matrix products: default BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid parallel stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] data.table_1.10.4 genomation_1.8.0 methylKit_1.2.0 GenomicRanges_1.28.4 GenomeInfoDb_1.12.2 IRanges_2.10.2 S4Vectors_0.14.3 [8] BiocGenerics_0.22.0 loaded via a namespace (and not attached): [1] SummarizedExperiment_1.6.3 qvalue_2.8.0 gtools_3.5.0 reshape2_1.4.2 splines_3.4.0 [6] lattice_0.20-35 seqPattern_1.8.0 colorspace_1.3-2 rtracklayer_1.36.4 XML_3.98-1.9 [11] rlang_0.1.1 R.oo_1.21.0 R.utils_2.5.0 BiocParallel_1.10.1 fastseg_1.22.0 [16] matrixStats_0.52.2 GenomeInfoDbData_0.99.0 plyr_1.8.4 stringr_1.2.0 zlibbioc_1.22.0 [21] Biostrings_2.44.2 munsell_0.4.3 gtable_0.2.0 R.methodsS3_1.7.1 coda_0.19-1 [26] Biobase_2.36.2 Rcpp_0.12.11 KernSmooth_2.23-15 readr_1.1.1 scales_0.4.1 [31] BSgenome_1.44.0 limma_3.32.3 plotrix_3.6-5 DelayedArray_0.2.7 XVector_0.16.0 [36] Rsamtools_1.28.0 impute_1.50.1 hms_0.3 ggplot2_2.2.1 stringi_1.1.5 [41] numDeriv_2016.8-1 tools_3.4.0 bitops_1.0-6 bbmle_1.0.19 magrittr_1.5 [46] lazyeval_0.2.0 RCurl_1.95-4.8 tibble_1.3.3 MASS_7.3-47 Matrix_1.2-9 [51] gridBase_0.4-7 emdbook_1.3.9 R6_2.2.2 mclust_5.3 GenomicAlignments_1.12.1 [56] compiler_3.4.0