VariantAnnotation produces duplicate names
1
0
Entering edit mode
@daniel-cameron-10086
Last seen 2.7 years ago
Australia

VCF requires unique row names, but the formula used to generate placeholder row names can produce duplicates (Such as for pindel with default parameters run on the NA12878 Illumina platinum genomic WGS data). Is this intentional, or just an oversight due to the lack of structural variant data out there?

Reproduction test case:

library(VariantAnnotation)

library(testthat)

#' loads a VCF containing the given records #' @param record string vector of record to write .testrecord <- function(record) { filename=tempfile(fileext=".vcf") write(paste0(c( "##fileformat=VCFv4.2", "##ALT=<ID=DEL,Description=\"Deletion\">", "##INFO=<ID=SVLEN,Number=.,Type=Integer,Description=\"Difference in length between REF and ALT alleles\">", "##contig=<ID=chr1,length=249250621>", "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT", record), collapse="\n"), file=filename) vcf <- readVcf(.testfile(filename), "") # readVcf holds a file handle open so we won't be able to delete the # file even if we tried #file.remove(filename) return(vcf) }

# the forum will likely mangle the TABs in the in-line VCF lines test_that("Placeholder names are unique", { expect_false(any(duplicated(row.names(.testrecord(c(

"chr1 100 . A <DEL> . . SVLEN=-1", "chr1 100 . A <DEL> . . SVLEN=-2")))))) })

 

sessionInfo()

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                       LC_TIME=English_Australia.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] StructuralVariantAnnotation_0.1 VariantAnnotation_1.16.4        Rsamtools_1.22.0                Biostrings_2.38.4               XVector_0.10.0                  SummarizedExperiment_1.0.2     
 [7] Biobase_2.30.0                  GenomicRanges_1.22.4            GenomeInfoDb_1.6.3              IRanges_2.4.8                   S4Vectors_0.8.11                BiocGenerics_0.16.1            
[13] assertthat_0.1                  roxygen2_5.0.1                  testthat_0.11.0                 devtools_1.10.0                

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4                       futile.logger_1.4.1               GenomicFeatures_1.22.13           bitops_1.0-6                      futile.options_1.0.0              tools_3.2.2                      
 [7] zlibbioc_1.16.0                   biomaRt_2.26.1                    digest_0.6.9                      memoise_1.0.0                     RSQLite_1.0.0                     BSgenome_1.38.0                  
[13] DBI_0.3.1                         rtracklayer_1.30.4                withr_1.0.1                       stringr_1.0.0                     BSgenome.Hsapiens.UCSC.hg19_1.4.0 AnnotationDbi_1.32.3             
[19] XML_3.98-1.4                      BiocParallel_1.4.3                lambda.r_1.1.7                    magrittr_1.5                      GenomicAlignments_1.6.3           stringi_1.0-1                    
[25] RCurl_1.95-4.8                    crayon_1.3.1

variantannotation • 1.9k views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 3.0 years ago
United States

Hi,

I don't have .testfile() so I've modified this line in .testrecord():

vcf <- readVcf(.testfile(filename), "")

to this:

vcf <- readVcf(filename, "")

With that change we create the test vcf which emits a warning:

> testvcf1 <- 
+ .testrecord(c("chr1    100    .    A    <DEL>    .    .    SVLEN=-1",
+               "chr1    100    .    A    <DEL>    .    .    SVLEN=-2"))
Warning message:
In .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)

The VCF rowRanges() data is not parsed properly. Everything from 'chr1' to 'SVLEN' has been parsed into the seqnames:

> rowRanges(testvcf1)
GRanges object with 2 ranges and 5 metadata columns:
                                               seqnames    ranges strand |
                                                  <Rle> <IRanges>  <Rle> |
   chr1    100    .    A    <DEL>    .    .    SVLEN=-1   [0, -1]      * |
   chr1    100    .    A    <DEL>    .    .    SVLEN=-2   [0, -1]      * |
   paramRangeID            REF                ALT      QUAL      FILTER
       <factor> <DNAStringSet> <DNAStringSetList> <numeric> <character>
           <NA>                                           0           
           <NA>                                           0           
  -------
  seqinfo: 3 sequences from  genome


fixed() is also not parsed properly:

> fixed(testvcf1)
DataFrame with 2 rows and 4 columns
             REF                ALT      QUAL      FILTER
  <DNAStringSet> <DNAStringSetList> <numeric> <character>
1                                           0           
2                                           0   

To avoid format issues I'd recommend writeVcf() to make a test file but if you must use .testrecord() you need to add the tabs in the character string.

Create a new test vcf (no warnings this time):

> testvcf2 <-
+ .testrecord(c("chr1\t100\t.\tA\tG\t.\t.\tSVLEN=-1",
+               "chr1\t100\t.\tA\tG\t.\t.\tSVLEN=-2"))

Now we have a properly parsed VCF object with rownames and fixed(), info() etc.:

> rowRanges(testvcf2)
GRanges object with 2 ranges and 5 metadata columns:
               seqnames     ranges strand | paramRangeID            REF
                  <Rle>  <IRanges>  <Rle> |     <factor> <DNAStringSet>
  chr1:100_A/G     chr1 [100, 100]      * |         <NA>              A
  chr1:100_A/G     chr1 [100, 100]      * |         <NA>              A
                              ALT      QUAL      FILTER
               <DNAStringSetList> <numeric> <character>
  chr1:100_A/G                  G      <NA>           .
  chr1:100_A/G                  G      <NA>           .
  -------
  seqinfo: 1 sequence from  genome
> fixed(testvcf2)
DataFrame with 2 rows and 4 columns
             REF                ALT      QUAL      FILTER
  <DNAStringSet> <DNAStringSetList> <numeric> <character>
1              A                  G        NA           .
2              A                  G        NA           .

Back to your question about duplicate rownames on the object. It's not illegal to have duplicate rownames or ranges in a GRanges object:

> GRanges("chr1", IRanges(rep(1, 3), width=1, names=rep("FOO", 3)))
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  FOO     chr1    [1, 1]      *
  FOO     chr1    [1, 1]      *
  FOO     chr1    [1, 1]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Before GRanges supported duplicate rownames that restriction was carried over to VCF. I think it's been several releases now that duplicates have been allowed. Maybe you're remembering this restriction from an older version? If you're getting an error that says VCF data must have unique rownames please let me know and show a small example.

Thanks.

Valerie

ADD COMMENT
0
Entering edit mode

The issue is that VariantAnnotation automatically creates names for the VCF rows that were not assigned any name. If these names are are used to write a VCF file, the resultant VCF will not be valid, because uniqueness is required by the VCF specifications.

In my code, I'm using the VCF ID as a unique record identifier as it is guaranteed to be unique. I added code to generate unique IDs if the ID was NA/"."/"" then ran into this unexpected VariantAnnotation behaviour. As far as I can tell, VariantAnnotation doesn't have an API that will tell the user whether the ID for any given record was from the VCF file itself, or created by the VariantAnnotation package.

Is it possible to either a) not assign any name when a VCF record has a "." in the ID column, b) assign names that conform to the VCF ID field specifications, and/or c) expose the ID field of the original VCF?

Thanks

ADD REPLY
0
Entering edit mode

(a) No, unfortunately not. It's more of an all or nothing thing. You probably know readVcf() has a 'row.names' arg and it's TRUE by default. If the ID is missing a rowname is created from CHROM,POS,REF,ALT. This was an attempt to create unique rownames/identifiers but as you've found doesn't work with some structural variants (and possibly others). The behavior is documented under the 'rowRanges' section of ?readVcf:

     rowRanges: The CHROM, POS, ID and REF fields are used to create a
          ‘GRanges’ object. Ranges are created using POS as the start
          value and width of the reference allele (REF). The IDs become
          the rownames. If they are missing (i.e., ‘.’) a string of
          CHROM:POS_REF/ALT is used instead.  The ‘genome’ argument is
          stored in the seqinfo of the ‘GRanges’ and can be accessed
          with ‘genome(<VCF>)’.

Setting 'row.names=FALSE' makes all rownames NULL whether the ID was missing or not. So, this is only useful if all the IDs are missing.

(b) and (c) When 'row.names=TRUE', the ID field in the original VCF becomes the rownames on rowRanges() so this information is retained and exposed.

Because (I think) you want to keep ID's that are not missing, one option is to just replace the constructed rownames with "." (or whatever). The rownames for missing ID fields will always have a colon, underscore and forward slash. As long as your valid IDs don't have this you could do something like

names(vcf)[grepl(":", names(vcf)), fixed=TRUE] <- "."

 

Valerie

 

ADD REPLY
0
Entering edit mode

Thanks very much for your informative response. Are you open to patch submission on the VariantAnnotation package? In this particular case, make.unique() on just the variants generated by VariantAnnotation would given uniqueness without breaking the existing naming convention. I've done this in my own code, but it would be nice to offer such a solution for inclusion in VariantAnnotation itself.

ADD REPLY
0
Entering edit mode

Sure, always open to patches. Please include a unit test and send to me at valerie.obenchain@roswellpark.org.

Valerie

ADD REPLY
0
Entering edit mode

https://samtools.github.io/hts-specs/VCFv4.2.pdf

1.4.1.3 ID - identifier: Semi-colon separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. (String, no white-space or semi-colons permitted) 

ADD REPLY
0
Entering edit mode

>If you're getting an error that says VCF data must have unique rownames please let me know and show a small example.

Here is a small example showing the problem:

writeVcf(.testrecord(c("chr1\t100\t.\tA\t<DEL>\t.\t.\tSVLEN=-1", "chr1\t100\t.\tA\t<DEL>\t.\t.\tSVLEN=-100")), "test.vcf")

Round-tripping a valid VCF through VariantAnnotation results in an output VCF that not only is not identical to the input (not a problem since VCF does not define INFO column ordering or newlines), but does not conform to the VCF specifications since the IDs written are not unique.

ADD REPLY

Login before adding your answer.

Traffic: 502 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6