Import or Subset VCF PASS Variants
1
1
Entering edit mode
Dario Strbenac ★ 1.5k
@dario-strbenac-5916
Last seen 1 day ago
Australia

How can I do the simple task of subsetting a VCF object to only the PASS entries (without converting to a user-friendly VRanges object)? subset looks like the function to use, but the description of it sounds a lot like developer documentation rather than end-user documentation and the Examples section has no example of its usage. I don't know how to use it. I also thought about filtering at the importation stage using ScanVcfParam, but the which parameter can only be a GRanges object, not a condition like FILTER == PASS.

VariantAnnotation subset ScanVcfParam • 1.7k views
ADD COMMENT
2
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States

After running

example(readVcf)

I have an object (note that it has two dimensions, with 5 variants and 3 samples)

> vcf
class: CollapsedVCF
dim: 5 3
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DFrame with 1 column: AF
info(header(vcf)):
      Number Type  Description
   AF A      Float Allele Frequency
geno(vcf):
  SimpleList of length 1: HQ
geno(header(vcf)):
      Number Type    Description
   HQ 2      Integer Haplotype Quality

As suggested by the display, the FILTER field is accessible as rowRanges(), and then with, e.g., $FILTER

> rowRanges(vcf)
GRanges object with 5 ranges and 5 metadata columns:
                 seqnames          ranges strand | paramRangeID            REF
                    <Rle>       <IRanges>  <Rle> |     <factor> <DNAStringSet>
       rs6054257       20           14370      * |        geneA              G
    20:17330_T/A       20           17330      * |        geneA              T
       rs6040355       20         1110696      * |        geneB              A
  20:1230237_T/.       20         1230237      * |        geneB              T
       microsat1       20 1234567-1234569      * |        geneB            GTC
                                ALT      QUAL      FILTER
                 <DNAStringSetList> <numeric> <character>
       rs6054257                  A        29        PASS
    20:17330_T/A                  A         3         q10
       rs6040355                G,T        67        PASS
  20:1230237_T/.                           47        PASS
       microsat1             G,GTCT        50        PASS
  -------
  seqinfo: 1 sequence from hg19 genome
> rowRanges(vcf)$FILTER
[1] "PASS" "q10"  "PASS" "PASS" "PASS"

We'd like to keep features (variants) with rowRanges(vcf)$FILTER == "PASS", and all samples, expecting 4 rows and 3 column

> vcf[ rowRanges(vcf)$FILTER == "PASS",]
class: CollapsedVCF
dim: 4 3
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DFrame with 1 column: AF
info(header(vcf)):
      Number Type  Description
   AF A      Float Allele Frequency
geno(vcf):
  SimpleList of length 1: HQ
geno(header(vcf)):
      Number Type    Description
   HQ 2      Integer Haplotype Quality

Is that what you were looking for?

subset is a convenience around the elements that occur exactly once on each row

> subset(vcf, FILTER == "PASS")
class: CollapsedVCF
dim: 4 3
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DFrame with 1 column: AF
info(header(vcf)):
      Number Type  Description
   AF A      Float Allele Frequency
geno(vcf):
  SimpleList of length 1: HQ
geno(header(vcf)):
      Number Type    Description
   HQ 2      Integer Haplotype Quality
ADD COMMENT
0
Entering edit mode

Thanks, that makes it perfectly clear.

ADD REPLY

Login before adding your answer.

Traffic: 933 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6