Question

readVcfAsVRanges + FilterRules

1

Entering edit mode

Sigve Nakken ▴ 50

@sigve-nakken-6575

Last seen 18 months ago

Norway

Hi,

I have successfully used the VariantAnnotation (1.12.9 now) for processing my multi-sample VCF files for some time now. Especially, I have utilized the combination of filterVcf (with prefilters) and readVcfAsVRanges to retrieve sample variant calls from my sequencing projects. I have not been able to filter on sample genotype data (e.g. read depth etc.) with these routines, so I implemented this routine as a third step.

Lately, as the number of calls and number of samples have increased, one step is getting substantially slow and eating up memory; readVcfAsVRanges. I guess the reason for this is the large set of variants that pass the preFilter (> 100,000 PASSed variants), and that the filters in filterVcf (which according to the tutorial could be applied on the genotype data) would not make any sense for multi-sample VCF files (correct me if am wrong).

Is there any way I can utilize readVcfAsVRanges in a more efficient manner, enabling me to skip (not load into memory) samples with a 'NULL' genotype?

best,

Sigve

variantannotation filterVcf readvcfasvranges filterRules • 2.0k views

ADD COMMENT • link updated 9.2 years ago by Martin Morgan 25k • written 9.2 years ago by Sigve Nakken ▴ 50

score 2 · Answer 1 · 2015-02-26

I'm not sure if I fully understand the problem, but perhaps GenomicFiles::reduceByYield is helpful? In psuedo-code:

X = TabixFile(..., yieldSize=...)
YIELD = function(x)
    ## 'x' is the TabixFile
    readVcfAsVRanges(x, "hg19", param=...)
MAP = function(x) {
    ## 'x' is the (moderately sized) VRanges object from YIELD
    keep = ... # keep interesting genotypes
    v[keep]
}
REDUCE = c           ## concatenate the output of successive calls to MAP
DONE = function(x)
    ## 'x' is the output of YIELD; DONE signals when the entire file is processed
    length(x) == 0

and then put it together as

vr = reduceByYield(X, YIELD, MAP, REDUCE, DONE)

(REDUCE and DONE are actually default behaviors, so can be omitted). The end result is a modest-sized VRanges object.

score 1 · Answer 2 · 2015-02-25

Hi Sigve,

One approach for reducing memory use is to set 'yieldSize' in a TabixFile object (VcfFile object in VariantAnnotation devel). Using a TabixFile as the 'x' arg in readVcfAsVRanges() will iterate through the file in chunks specified by 'yieldSize'.

> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> TabixFile(fl, index=character(), yieldSize=25000)
class: TabixFile 
path: /home/vobencha/R/R-rel/R-3-1-branch/library/VariantAnnotation/e.../ex2.vcf
isOpen: FALSE 
yieldSize: 25000

As for the filtering, filterVcf() operates by removing rows of the VCF that don't meet the criteria. The filters produce a logical vector, the same length as nrow(VCF), which is used as a pass/fail indicator. This row-based filtering is well suited for FIXED or INFO fields or single-sample files. You could create a filter that tested for a NULL genotype (or read depth cutoff) but ultimately need TRUE/FALSE for each row; if any sample fails the whole row is removed. If I understand correctly you want to remove samples with a NULL genotype for any of the variants. Unfortunately sample-specific filtering can't be done with the current filterVcf(). To filter by sample use ScanVcfParam(sample = c("samp1", "samp25", "samp50")) as the 'param' argument to readVcf() or readVcfAsVRanges().

As an fyi, in VariantAnnotation in devel the default 'param' has changed from VRangesScanVcfParam() which imports in minimal fields to ScanVcfParam() which imports in all fields. Since you're using 1.12.9 you should be getting minimal fields, just a heads up for future use.

If the 'yieldSize' approach doesn't work for you please show how you're calling the function and we'll go from there.

Valerie

score 0 · Answer 3 · 2015-02-26

Hi Valerie,

Thanks for the rapid response! Processing one sample at a time, using samples in readVcfAsVRanges, appears to be the approach that works for me. I tried the tabix-approach too, but that one gave strange results (not sure why).

I understand the row-based filtering logic used in filterVcf(), my point was also to note that while the second pass of filtering (VCF instance-based) in filterVcf() appears very useful (e.g. isSnp(), lowCoverage() etc.), it cannot really be utilized if the VCF file contains more than one sample (which is often the case for many users, I would guess). If you have a VCF with 10,000 variants and 100 sample columns, and where (on average) each variant has proper genotype data for two-three samples, readVcfAsVRanges loads a huge number of sample calls that ideally should have been skipped. From your response, I think you understand this case.

Sigve