problem with @ranges@start in DNAStringSet
2
0
Entering edit mode
@timotheeflutre-6727
Last seen 5.6 years ago
France

I am writing a unitary test with testthat for a function handling VCF files. However, I encounter a problem with ref(vcf), which comes from DNAStringSet@ranges@start.

Edit: more specifically, expect_equal(as.character(ref(observed)), as.character(ref(expected))) passes, but expect_equal(ref(observed), ref(expected)) fails, with the following message

Attributes: < Component "ranges": Attributes: < Component "start": Mean relative difference: 3 > >

To understand how this structure works, here is the example given at the bottom of the CollapsedVCF help page.

vcf <- VCF(rowRanges = GRanges("chr1", IRanges(1:4*3, width=c(1, 2, 1, 1))))
ref(vcf) <- DNAStringSet(c("G", c("AA"), "T", "G"))

When I do str(ref(vcf)), here is what I get:

Formal class 'DNAStringSet' [package "Biostrings"] with 5 slots
  ..@ pool           :Formal class 'SharedRaw_Pool' [package "XVector"] with 2 slots
  .. .. ..@ xp_list                    :List of 1
  .. .. .. ..$ :<externalptr>
  .. .. ..@ .link_to_cached_object_list:List of 1
  .. .. .. ..$ :<environment: 0x2f8b6d0>
  ..@ ranges         :Formal class 'GroupedIRanges' [package "XVector"] with 7 slots
  .. .. ..@ group          : int [1:4] 1 1 1 1
  .. .. ..@ start          : int [1:4] 1 2 4 5
  .. .. ..@ width          : int [1:4] 1 2 1 1
  .. .. ..@ NAMES          : NULL
  .. .. ..@ elementType    : chr "integer"
  .. .. ..@ elementMetadata: NULL
  .. .. ..@ metadata       : list()
  ..@ elementType    : chr "DNAString"
  ..@ elementMetadata: NULL
  ..@ metadata       : list()

Can someone explain to me what does the vector 1 2 4 5 from @ranges@start correspond to, as it doesn't correspond to the start coordinates of the variants?

Edit: thanks to the comments by Martin Morgan below, I may have found the problem. When I filter some records like this, vcf[-c(1)], the @ranges@start vector becomes 2 4 5 whereas I would have expected it to become 1 3 4. That is, the first element was correctly removed, but the DNAString wasn't updated. This doesn't seem to be the case when using VariantAnnotation::filterVcf(), hence creating the difference between the two objects in my test, observed and expected. Is there a way to avoid this?

vcf testthat dnastringset • 2.0k views
ADD COMMENT
0
Entering edit mode

Generally, the internal structure of an object (anything accessed by @) should be considered 'none of your business'; in your unit test and code, use the accessors to extract relevant information.

ADD REPLY
0
Entering edit mode

I understand. But when I test two VCF objects for equality with expect_equal from testthat, it fails because of a difference in @ranges@start. I thought that a few hints about what @ranges@start corresponds to may be enough to help me. Or do you prefer me to show the whole code of the tested function as well as the test?

ADD REPLY
1
Entering edit mode

It isn't really helpful to show the full code, but rather to figure out how to illustrate the problem with a minimal example (e.g., by starting with the full code and simplifying it); often in the process of creating this simple reproducible example the problem becomes apparent.

I'd explore the object using it's API, e.g., comparing ref(vcf) and then as.character(ref(vcf)).

The underlying model of a DNAStringSet is of a single "DNAString" that has offsets corresponding to each element in the DNAStringSet. So in the example the DNAString is "GAATG" and the interpretation is that this DNAString represents a DNAStringSet with elements starting at position 1, 2, 4, 5. This is actually a fundamental scheme underlying the Compressed*List class (e.g., IntegerList(), GRangesList()), where there is a single underlying object that obeys a 'vector-like' interface, and then a partitioning of that vector into elements; I believe the implementation differs between DNAStringSet and the *List, but the fact that I don't know is the beauty of using the API rather than relying on the underlying structure.

ADD REPLY
0
Entering edit mode

Thanks for your explanations. Yours as well as the one from Hervé helped me a lot.

ADD REPLY
2
Entering edit mode
@herve-pages-1542
Last seen 6 days ago
Seattle, WA, United States

Hi,

expect_equal(as.character(ref(observed)),
             as.character(ref(expected)))

passes and that's all what matters. If you worry about performance, you could replace it with testing that all(ref(observed) == ref(expected)) is TRUE, which is going to be much more efficient because it avoids the costly coercion from DNAStringSet to character. However, ref(observed) == ref(expected) doesn't compare the names or the metadata columns, only the DNA sequences in the 2 objects, so these are things you would need to check explicitly if they matter to you. 

Now the fact that expect_equal(ref(observed), ref(expected)) fails is irrelevant because there is more than 1 possible internal representation for the same DNAStringSet object. For example:

dna1 <- DNAStringSet(c("AC", "AC"))
dna2 <- as(matchPattern("AC", subseq(DNAString("TTTACAC"), start=2)),
           "DNAStringSet")

dna1 and dna2 represent the same thing (i.e. they are indistinguishable from a user point of view):

dna1
#   A DNAStringSet instance of length 2
#     width seq
# [1]     2 AC
# [2]     2 AC
dna2
#   A DNAStringSet instance of length 2
#     width seq
# [1]     2 AC
# [2]     2 AC

However their internal representations differ:

dna1@ranges@start
# [1] 1 3
dna2@ranges@start
# [1] 4 6

The bottom line is that it's not a good idea to compare equality of internal representations in your tests. It's better to check that your results look the expected way by inspecting them thru the user interface. More generally speaking, and as Martin said, the internal representation of S4 objects is "none of the end-user business", and the less you know about it the better.

Hope this makes sense,

Cheers,

H.

ADD COMMENT
0
Entering edit mode

As shown in my edit at the very end, it indeed seems to correspond to what you're talking about the internal state of objects. I thought that using expect_equal instead of expect_identical would be enough. From now on, I will try to be less stringent. Thanks a lot for the explanations!

ADD REPLY
0
Entering edit mode
b.nota ▴ 370
@bnota-7379
Last seen 4.3 years ago
Netherlands

I think the @start vector 1 2 4 5 are the start positions of your variants, right? They do correspond to your @width, starting from 1.

 

ADD COMMENT
0
Entering edit mode

The start coordinates of the variants are given by 1:4*3 (inside IRanges on the first line of code), that is 3 6 9 12, not 1 2 4 5.

ADD REPLY

Login before adding your answer.

Traffic: 595 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6