Duplicated rows in readVcf output
1
0
Entering edit mode
@nicolasorode-10096
Last seen 8.0 years ago

I am trying to extract two rows from a vcf (available here) and it seems that readVcf duplicates some of the rows in the output vcf. Here is an example:

library(VariantAnnotation)

## Defines the positions of the two rows to extract
chr1.gr <- GRanges("chr1", IRanges(c(83921054,83921055), width=1))
params <- ScanVcfParam(which=chr1.gr)

## Creates a vcf object
vcf <- readVcf(TabixFile(file.bgz), "hg19", params)
dim(vcf)

Which shows that there are three rows (two at positions 83921054 and one at position 83921055). I realize that this is an indel which might mess things up, but I would expect only two rows still.

Is-there a way to avoid this duplication or alternatively to remove the duplicated lines in the output vcf object?

variantannotation readvcf • 1.6k views
ADD COMMENT
0
Entering edit mode

This presumably happens because that row overlaps both of the ranges you provide, so it is retrieved twice.

I don't know if its the recommended way to handle this, but you can remove the duplicate rows by matching on the row names.  Something like this:

vcf.reduced <- vcf[ -duplicated( rownames(vcf) ) ]
> dim(vcf.reduced)
[1]  2 11

 

ADD REPLY
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States

The GRanges you're using has 2 ranges. When multiple ranges are given
in a 'param' readVcf() returns all records that overlap each range. This is
your 'which':

chr1.gr <- GRanges("chr1", IRanges(c(83921054,83921055), width=1))
> chr1.gr
GRanges object with 2 ranges and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]     chr1 [83921054, 83921054]      *
  [2]     chr1 [83921055, 83921055]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

To see this read in all the data

vcf <- readVcf("file.bgz", "")

then overlap with chr1.gr. The first range hits record 177 and
the second hits both 177 and 178. This is where the 'duplication' is
coming from.

findOverlaps(chr1.gr, rowRanges(vcf))
> findOverlaps(chr1.gr, rowRanges(vcf))
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1         177
  [2]         2         177
  [3]         2         178
  -------
  queryLength: 2 / subjectLength: 1158

You probably want a single range spanning those positions:

gr <- GRanges("chr1", IRanges(start=83921054, end=83921055))
> gr
GRanges object with 1 range and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]     chr1 [83921054, 83921055]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> findOverlaps(gr, rowRanges(vcf))
Hits object with 2 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1         177
  [2]         1         178
  -------
  queryLength: 1 / subjectLength: 1158

So in the end

param <- ScanVcfParam(which = gr)
vcf <- readVcf("file.bgz", "", param=param)
> rowRanges(vcf)
GRanges object with 2 ranges and 5 metadata columns:
                      seqnames               ranges strand | paramRangeID
                         <Rle>            <IRanges>  <Rle> |     <factor>
  chr1:83921054_TAG/T     chr1 [83921054, 83921056]      * |         <NA>
    chr1:83921055_A/G     chr1 [83921055, 83921055]      * |         <NA>
                                 REF                ALT      QUAL      FILTER
                      <DNAStringSet> <DNAStringSetList> <numeric> <character>
  chr1:83921054_TAG/T            TAG                  T  11633.23           .
    chr1:83921055_A/G              A                  G   4896.58           .
  -------
  seqinfo: 1 sequence from  genome

Valerie

ADD COMMENT

Login before adding your answer.

Traffic: 947 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6