Search
Question: problem with @ranges@start in DNAStringSet
0
2.6 years ago by
France
TimothéeFlutre70 wrote:

I am writing a unitary test with testthat for a function handling VCF files. However, I encounter a problem with ref(vcf), which comes from DNAStringSet@ranges@start.

Edit: more specifically, expect_equal(as.character(ref(observed)), as.character(ref(expected))) passes, but expect_equal(ref(observed), ref(expected)) fails, with the following message

Attributes: < Component "ranges": Attributes: < Component "start": Mean relative difference: 3 > >

To understand how this structure works, here is the example given at the bottom of the CollapsedVCF help page.

vcf <- VCF(rowRanges = GRanges("chr1", IRanges(1:4*3, width=c(1, 2, 1, 1))))
ref(vcf) <- DNAStringSet(c("G", c("AA"), "T", "G"))

When I do str(ref(vcf)), here is what I get:

Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots
..@ pool           :Formal class 'SharedRaw_Pool' [package "XVector"] with 2 slots
.. .. ..@ xp_list                    :List of 1
.. .. .. ..$:<externalptr> .. .. ..@ .link_to_cached_object_list:List of 1 .. .. .. ..$ :<environment: 0x2f8b6d0>
..@ ranges         :Formal class 'GroupedIRanges' [package "XVector"] with 7 slots
.. .. ..@ group          : int [1:4] 1 1 1 1
.. .. ..@ start          : int [1:4] 1 2 4 5
.. .. ..@ width          : int [1:4] 1 2 1 1
.. .. ..@ NAMES          : NULL
.. .. ..@ elementType    : chr "integer"
.. .. ..@ metadata       : list()
..@ elementType    : chr "DNAString"


Can someone explain to me what does the vector 1 2 4 5 from @ranges@start correspond to, as it doesn't correspond to the start coordinates of the variants?

Edit: thanks to the comments by Martin Morgan below, I may have found the problem. When I filter some records like this, vcf[-c(1)], the @ranges@start vector becomes 2 4 5 whereas I would have expected it to become 1 3 4. That is, the first element was correctly removed, but the DNAString wasn't updated. This doesn't seem to be the case when using VariantAnnotation::filterVcf(), hence creating the difference between the two objects in my test, observed and expected. Is there a way to avoid this?

modified 2.6 years ago • written 2.6 years ago by TimothéeFlutre70

Generally, the internal structure of an object (anything accessed by @) should be considered 'none of your business'; in your unit test and code, use the accessors to extract relevant information.

I understand. But when I test two VCF objects for equality with expect_equal from testthat, it fails because of a difference in @ranges@start. I thought that a few hints about what @ranges@start corresponds to may be enough to help me. Or do you prefer me to show the whole code of the tested function as well as the test?

1

It isn't really helpful to show the full code, but rather to figure out how to illustrate the problem with a minimal example (e.g., by starting with the full code and simplifying it); often in the process of creating this simple reproducible example the problem becomes apparent.

I'd explore the object using it's API, e.g., comparing ref(vcf) and then as.character(ref(vcf)).

The underlying model of a DNAStringSet is of a single "DNAString" that has offsets corresponding to each element in the DNAStringSet. So in the example the DNAString is "GAATG" and the interpretation is that this DNAString represents a DNAStringSet with elements starting at position 1, 2, 4, 5. This is actually a fundamental scheme underlying the Compressed*List class (e.g., IntegerList(), GRangesList()), where there is a single underlying object that obeys a 'vector-like' interface, and then a partitioning of that vector into elements; I believe the implementation differs between DNAStringSet and the *List, but the fact that I don't know is the beauty of using the API rather than relying on the underlying structure.

Thanks for your explanations. Yours as well as the one from Hervé helped me a lot.

2
2.6 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi,

expect_equal(as.character(ref(observed)),
as.character(ref(expected)))


passes and that's all what matters. If you worry about performance, you could replace it with testing that all(ref(observed) == ref(expected)) is TRUE, which is going to be much more efficient because it avoids the costly coercion from DNAStringSet to character. However, ref(observed) == ref(expected) doesn't compare the names or the metadata columns, only the DNA sequences in the 2 objects, so these are things you would need to check explicitly if they matter to you.

Now the fact that expect_equal(ref(observed), ref(expected)) fails is irrelevant because there is more than 1 possible internal representation for the same DNAStringSet object. For example:

dna1 <- DNAStringSet(c("AC", "AC"))
dna2 <- as(matchPattern("AC", subseq(DNAString("TTTACAC"), start=2)),
"DNAStringSet")

dna1 and dna2 represent the same thing (i.e. they are indistinguishable from a user point of view):

dna1
#   A DNAStringSet instance of length 2
#     width seq
# [1]     2 AC
# [2]     2 AC
dna2
#   A DNAStringSet instance of length 2
#     width seq
# [1]     2 AC
# [2]     2 AC

However their internal representations differ:

dna1@ranges@start
# [1] 1 3
dna2@ranges@start
# [1] 4 6

The bottom line is that it's not a good idea to compare equality of internal representations in your tests. It's better to check that your results look the expected way by inspecting them thru the user interface. More generally speaking, and as Martin said, the internal representation of S4 objects is "none of the end-user business", and the less you know about it the better.

Hope this makes sense,

Cheers,

H.

As shown in my edit at the very end, it indeed seems to correspond to what you're talking about the internal state of objects. I thought that using expect_equal instead of expect_identical would be enough. From now on, I will try to be less stringent. Thanks a lot for the explanations!

0
2.6 years ago by
b.nota290
Netherlands
b.nota290 wrote:

I think the @start vector 1 2 4 5 are the start positions of your variants, right? They do correspond to your @width, starting from 1.

The start coordinates of the variants are given by 1:4*3 (inside IRanges on the first line of code), that is 3 6 9 12, not 1 2 4 5.