Question

Duplicated rows in readVcf output

0

Entering edit mode

nicolas.o.rode • 0

@nicolasorode-10096

Last seen 9.2 years ago

I am trying to extract two rows from a vcf (available here) and it seems that readVcf duplicates some of the rows in the output vcf. Here is an example:

library(VariantAnnotation)

## Defines the positions of the two rows to extract
chr1.gr <- GRanges("chr1", IRanges(c(83921054,83921055), width=1))
params <- ScanVcfParam(which=chr1.gr)

## Creates a vcf object
vcf <- readVcf(TabixFile(file.bgz), "hg19", params)
dim(vcf)

Which shows that there are three rows (two at positions 83921054 and one at position 83921055). I realize that this is an indel which might mess things up, but I would expect only two rows still.

Is-there a way to avoid this duplication or alternatively to remove the duplicated lines in the output vcf object?

variantannotation readvcf • 1.9k views

ADD COMMENT • link updated 9.2 years ago by Valerie Obenchain ★ 6.8k • written 9.2 years ago by nicolas.o.rode • 0

0

Entering edit mode

This presumably happens because that row overlaps both of the ranges you provide, so it is retrieved twice.

I don't know if its the recommended way to handle this, but you can remove the duplicate rows by matching on the row names. Something like this:

vcf.reduced <- vcf[ -duplicated( rownames(vcf) ) ]

> dim(vcf.reduced)
[1]  2 11

ADD REPLY • link 9.2 years ago Mike Smith ★ 6.6k

score 0 · Answer 1 · 2016-04-14

The GRanges you're using has 2 ranges. When multiple ranges are given
in a 'param' readVcf() returns all records that overlap each range. This is
your 'which':

chr1.gr <- GRanges("chr1", IRanges(c(83921054,83921055), width=1))
> chr1.gr
GRanges object with 2 ranges and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]     chr1 [83921054, 83921054]      *
  [2]     chr1 [83921055, 83921055]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

To see this read in all the data

vcf <- readVcf("file.bgz", "")

then overlap with chr1.gr. The first range hits record 177 and
the second hits both 177 and 178. This is where the 'duplication' is
coming from.

findOverlaps(chr1.gr, rowRanges(vcf))
> findOverlaps(chr1.gr, rowRanges(vcf))
Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1         177
  [2]         2         177
  [3]         2         178
  -------
  queryLength: 2 / subjectLength: 1158

You probably want a single range spanning those positions:

gr <- GRanges("chr1", IRanges(start=83921054, end=83921055))
> gr
GRanges object with 1 range and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]     chr1 [83921054, 83921055]      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> findOverlaps(gr, rowRanges(vcf))
Hits object with 2 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1         177
  [2]         1         178
  -------
  queryLength: 1 / subjectLength: 1158

So in the end

param <- ScanVcfParam(which = gr)
vcf <- readVcf("file.bgz", "", param=param)
> rowRanges(vcf)
GRanges object with 2 ranges and 5 metadata columns:
                      seqnames               ranges strand | paramRangeID
                         <Rle>            <IRanges>  <Rle> |     <factor>
  chr1:83921054_TAG/T     chr1 [83921054, 83921056]      * |         <NA>
    chr1:83921055_A/G     chr1 [83921055, 83921055]      * |         <NA>
                                 REF                ALT      QUAL      FILTER
                      <DNAStringSet> <DNAStringSetList> <numeric> <character>
  chr1:83921054_TAG/T            TAG                  T  11633.23           .
    chr1:83921055_A/G              A                  G   4896.58           .
  -------
  seqinfo: 1 sequence from  genome

Valerie