Question

updateObject() fails for old serialized GRanges objects

Jeff Johnston ▴ 90

We have a number of serialized GRanges objects in RData format that updateObject() in Bioconductor 3.0 cannot handle. All of them appear to have been created with Bioconductor 2.8; objects created with version 2.9 work fine. Here's the output from calling updateObject():

> load("samples/toll10b_k27ac_2to4h_1.ranges.RData")
> updateObject(toll10b_k27ac_2to4h_1.ranges, verbose=TRUE)
updateObject(object = 'GRanges')
Error in names(ans) <- seqnames(x) :
  'names' attribute [12] must be the same length as the vector [0]

I can provide a download link for the saved object if necessary. The seqnames(), start(), end() and strand() accessors all work on the loaded object, so I am able to recreate it. But this issue has prompted me to re-examine whether serializing Bioconductor objects to disk and expecting them to be accessible in all future versions is realistic.

We prefer saving sequencing results as serialized GRanges because it is extremely fast to load them back into R, as opposed to re-importing them from their source BAM. Also, we often perform some read filtering, so the resulting GRanges differ from the source BAM files. Since our projects span multiple Bioconductor releases, we easily end up with collections of GRanges objects from various versions of Bioconductor over time.

Until now, updateObject() has kept us from running into any issues, and this first issue might be due to a simple bug, but I would like to hear if anyone thinks there might be a better format for storing GRanges-type information on disk over the long term. We really only have two major requirements:

1. The format can be quickly loaded into R as a GRanges object

2. The resulting GRanges object has the correct seqlengths() set

Thanks,

Jeff

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicRanges_1.18.1 GenomeInfoDb_1.2.2   IRanges_2.0.0        S4Vectors_0.4.0      BiocGenerics_0.12.0  setwidth_1.0-3      

loaded via a namespace (and not attached):
[1] XVector_0.6.0

genomicranges • 1.3k views

updated 11.0 years ago by Hervé Pagès 16k • written 11.0 years ago by Jeff Johnston ▴ 90

score 3 · Accepted Answer · 2014-11-13

Hi Jeff,

Please provide a download link for the saved object.

Serializing objects to disk and expecting them to be accessible in all future versions is probably a reasonable expectation for standard R objects like atomic vectors (character, numeric, etc.), factors, lists, data.frames, and also for S3 objects. (But even that statement might need confirmation from the R core team.) For S4 objects in general, you cannot expect this. Bioconductor objects are mostly S4 objects, and keeping them compatible with future versions of Bioconductor requires long-term commitment from the maintainer of the class. Core classes like eSet, GRanges, SummarizedExperiment and DNAStringSet are maintained by core members of the project who are committed to keeping them compatible with as many future versions as possible. However, note that some classes that were once considered core (e.g. RangedData, RangedDataList, GenomeData, GenomeDataList) no longer are, because they've been superseded by other classes (e.g. GRanges, GRangesList). Hence they will go away at some point in the future.
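To make the point about standard R objects concrete, here is a minimal sketch of one way to keep the on-disk representation free of S4 classes: flatten the GRanges into a plain data.frame plus a named seqlengths vector, save that with saveRDS(), and rebuild the GRanges after loading. The helper names granges_to_plain() and plain_to_granges() are made up for this illustration and are not part of any package, and the sketch assumes there are no metadata columns to preserve (they would be dropped as written). Treat it as an untested-on-your-data starting point rather than a recommended format.

```r
library(GenomicRanges)

# Hypothetical helper: reduce a GRanges to base R objects only
# (a data.frame of coordinates and a named integer vector of seqlengths).
# Metadata columns are NOT carried over in this sketch.
granges_to_plain <- function(gr) {
  list(
    ranges = data.frame(
      seqnames = as.character(seqnames(gr)),
      start    = start(gr),
      end      = end(gr),
      strand   = as.character(strand(gr)),
      stringsAsFactors = FALSE
    ),
    seqlengths = seqlengths(gr)
  )
}

# Hypothetical helper: rebuild a GRanges from the plain representation,
# using whatever class definition the current Bioconductor release provides.
plain_to_granges <- function(plain) {
  GRanges(
    seqnames   = plain$ranges$seqnames,
    ranges     = IRanges(start = plain$ranges$start, end = plain$ranges$end),
    strand     = plain$ranges$strand,
    seqlengths = plain$seqlengths
  )
}

# Round trip through a file on disk (toy ranges, illustrative seqlengths).
gr <- GRanges(
  seqnames   = c("chr2L", "chr2L", "chrX"),
  ranges     = IRanges(start = c(1L, 100L, 50L), width = c(10L, 20L, 30L)),
  strand     = c("+", "-", "*"),
  seqlengths = c(chr2L = 23011544L, chrX = 22422827L)
)
path <- tempfile(fileext = ".rds")
saveRDS(granges_to_plain(gr), path)
gr2 <- plain_to_granges(readRDS(path))
```

The trade-off is an extra constructor call at load time, in exchange for a file that contains nothing but base R types.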

IMO a good practice is to always keep around the source data and recipe that were used to generate the serialized object. The recipe is more important than the object itself. It not only allows you to regenerate the object when the class definition has changed (after maybe some adjustments to the recipe) but it's also the ultimate reference for knowing exactly what went into the object (e.g. what kind of filtering was applied to the data).

Cheers,

H.