Question: readVcfAsVRanges + FilterRules
1
4.6 years ago by
Sigve Nakken50
Norway
Sigve Nakken50 wrote:

Hi,

I have successfully used the VariantAnnotation (1.12.9 now) for processing my multi-sample VCF files for some time now. Especially, I have utilized the combination of filterVcf (with prefilters) and readVcfAsVRanges to retrieve sample variant calls from my sequencing projects. I have not been able to filter on sample genotype data (e.g. read depth etc.) with these routines, so I implemented this routine as a third step.

Lately, as the number of calls and number of samples have increased, one step is getting substantially slow and eating up memory; readVcfAsVRanges. I guess the reason for this is the large set of variants that pass the preFilter (> 100,000 PASSed variants), and that the filters in filterVcf (which according to the tutorial could be applied on the genotype data) would not make any sense for multi-sample VCF files (correct me if am wrong).

Is there any way I can utilize readVcfAsVRanges in a more efficient manner, enabling me to skip (not load into memory) samples with a 'NULL' genotype?

best,

Sigve

modified 4.6 years ago by Martin Morgan ♦♦ 23k • written 4.6 years ago by Sigve Nakken50
2
4.6 years ago by
Martin Morgan ♦♦ 23k
United States
Martin Morgan ♦♦ 23k wrote:

I'm not sure if I fully understand the problem, but perhaps GenomicFiles::reduceByYield is helpful? In psuedo-code:

X = TabixFile(..., yieldSize=...)
YIELD = function(x)
## 'x' is the TabixFile
MAP = function(x) {
## 'x' is the (moderately sized) VRanges object from YIELD
keep = ... # keep interesting genotypes
v[keep]
}
REDUCE = c           ## concatenate the output of successive calls to MAP
DONE = function(x)
## 'x' is the output of YIELD; DONE signals when the entire file is processed
length(x) == 0

and then put it together as

vr = reduceByYield(X, YIELD, MAP, REDUCE, DONE)

(REDUCE and DONE are actually default behaviors, so can be omitted). The end result is a modest-sized VRanges object.

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Martin Morgan ♦♦ 23k
1
4.6 years ago by
United States
Valerie Obenchain6.7k wrote:

Hi Sigve,

One approach for reducing memory use is to set 'yieldSize' in a TabixFile object (VcfFile object in VariantAnnotation devel). Using a TabixFile as the 'x' arg in readVcfAsVRanges() will iterate through the file in chunks specified by 'yieldSize'.

> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> TabixFile(fl, index=character(), yieldSize=25000)
class: TabixFile
path: /home/vobencha/R/R-rel/R-3-1-branch/library/VariantAnnotation/e.../ex2.vcf
isOpen: FALSE
yieldSize: 25000 

As for the filtering, filterVcf() operates by removing rows of the VCF that don't meet the criteria. The filters produce a logical vector, the same length  as nrow(VCF), which is used as a pass/fail indicator. This row-based filtering is well suited for FIXED or INFO fields or single-sample files. You could create a filter that tested for a NULL genotype (or read depth cutoff) but ultimately need TRUE/FALSE for each row; if any sample fails the whole row is removed. If I understand correctly you want to remove samples with a NULL genotype for any of the variants. Unfortunately sample-specific filtering can't be done with the current filterVcf(). To filter by sample use ScanVcfParam(sample = c("samp1", "samp25", "samp50")) as the 'param' argument to readVcf() or readVcfAsVRanges().

As an fyi, in VariantAnnotation in devel the default 'param' has changed from VRangesScanVcfParam() which imports in minimal fields to ScanVcfParam() which imports in all fields. Since you're using 1.12.9 you should be getting minimal fields, just a heads up for future use.

If the 'yieldSize' approach doesn't work for you please show how you're calling the function and we'll go from there.

Valerie

0
4.6 years ago by
Sigve Nakken50
Norway
Sigve Nakken50 wrote:

Hi Valerie,

Thanks for the rapid response! Processing one sample at a time, using samples in readVcfAsVRanges, appears to be the approach that works for me. I tried the tabix-approach too, but that one gave strange results (not sure why).

I understand the row-based filtering logic used in filterVcf(), my point was also to note that while the second pass of filtering (VCF instance-based) in filterVcf() appears very useful (e.g. isSnp(), lowCoverage() etc.), it cannot really be utilized if the VCF file contains more than one sample (which is often the case for many users, I would guess). If you have a VCF with 10,000 variants and 100 sample columns, and where (on average) each variant has proper genotype data for two-three samples, readVcfAsVRanges loads a huge number of sample calls that ideally should have been skipped. From your response, I think you understand this case.

Sigve

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Sigve Nakken50

I'm glad the sample approach worked. If you want to send the results of the tabix run (off line or here) I can take a look; that approach should work too.

I agree, most genotype filters are only meaningful for single-sample files (I'll add this to the docs). The case where genotype filtering does work for multiple samples is when the criteria is applied across samples and not to the individual, e.g.,  'keep rows where all samples have DP > 10' .

Valerie