In a function for plotting pairwise genome alignments, I am concatenating all sequence levels of some GRanges objects. This can create seqlengths that are longer than what R's integer format can support. Unfortunately, my attempts to use the numeric format instead did not work as seqlengths() coerces its input to integer format as in the example below.
> gr <- GRanges("chr1:1-2")
> seqlengths(gr) <- 3e9
Warning message:
In .normarg_seqlengths(value, seqnames(x)) :
NAs introduced by coercion to integer range
And I could not cheat the system as a validation function will block me from putting other things than integers in the S4 slot.
> gr@seqinfo@seqlengths <- 3e9
Error in (function (cl, name, valueClass) :
assignment of an object of class “numeric” is not valid for @‘seqlengths’ in an object of class “Seqinfo”; is(value, "integer") is not TRUE
Would it be possible to allow integer numeric values as seqlengths, for instance after validating them with the is.wholenumber workaround suggested in help(integer)?
is.wholenumber <-
function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
That change would be easy but is not desirable. The implications of allowing seqlengths greater than 2^31 are deep. Once you allow this, there's no reason to not allow genomic ranges that are also longer than 2^31 nucleotides. For example people would expect to be able to store the "chr1:1-3e9" range in their GRanges object. Unfortunately this is not possible at the moment because the start/end/width of a GRanges/IRanges object must be integer vectors. Changing this would require revisiting/adapting a lot of C code. It would also mean that things like Biostrings::getSeq() would now have the ability to return a DNAStringSet object where some sequences are longer than 2^31. And this in turn would require revisiting/adapting most of Biostrings C code as well as supporting long Vector derivatives in general, which would be a huge endeavor!
Do you really need to store the seqlengths of the concatenated sequences? If not, you could you just set them to NAs.
If you really need to cheat the system, use slot(gr@seqinfo, "seqlengths", check=FALSE) <- 3e9. All bets are off when you do things like this e.g. depending on what you're going to do with this invalid object, you might get all kinds of errors, even crashes. So use at your own risks!
Thanks Hervé for the detailed answer. Since the purpose is only to produce a plot, I will transform the GRanges to a transient data structure instead of modifying it.
Thanks Hervé for the detailed answer. Since the purpose is only to produce a plot, I will transform the GRanges to a transient data structure instead of modifying it.