Hi,
I have successfully used the VariantAnnotation package (currently 1.12.9) to process my multi-sample VCF files for some time now. In particular, I have combined filterVcf (with prefilters) and readVcfAsVRanges to retrieve sample variant calls from my sequencing projects. I have not been able to filter on sample genotype data (e.g. read depth) with these routines, so I implemented that filtering as a third step.
Lately, as the number of calls and the number of samples have increased, one step has become substantially slower and is eating up memory: readVcfAsVRanges. I guess the reason is the large set of variants that pass the prefilter (> 100,000 PASSed variants), and that the filters in filterVcf (which according to the tutorial could be applied to the genotype data) would not make any sense for multi-sample VCF files (correct me if I am wrong).
Is there any way I can utilize readVcfAsVRanges in a more efficient manner, enabling me to skip (not load into memory) samples with a 'NULL' genotype?
best,
Sigve
I'm glad the sample approach worked. If you want to send the results of the tabix run (offline or here) I can take a look; that approach should work too.
I agree, most genotype filters are only meaningful for single-sample files (I'll add this to the docs). Genotype filtering does work for multiple samples when the criterion is applied across samples rather than to an individual sample, e.g., 'keep rows where all samples have DP > 10'.
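As a rough sketch of such an across-sample filter (not tested against your files): the helper names, the yieldSize value, and the file paths passed in are placeholders I made up, and it assumes DP is a single-valued FORMAT field, so that geno(x)$DP is a variants-by-samples matrix.

```r
library(VariantAnnotation)

# Row-wise predicate on a variants-by-samples depth matrix: TRUE only when
# every sample has a non-missing DP strictly greater than min_dp.
# (Hypothetical helper name, for illustration only.)
all_samples_dp_above <- function(dp, min_dp = 10) {
  rowSums(!is.na(dp) & dp > min_dp) == ncol(dp)
}

# Filter function in the form filterVcf expects: it receives a VCF object
# for one chunk and returns one logical value per row.
dp_filter <- function(x) {
  all_samples_dp_above(geno(x)$DP)
}

# Wrapper that streams the file in chunks and writes the passing rows.
# vcf_path must point to a bgzipped, tabix-indexed VCF; all arguments are
# placeholders to be replaced with your own values.
filter_all_samples_by_dp <- function(vcf_path, genome, destination,
                                     yield_size = 10000L) {
  tabix_file <- TabixFile(vcf_path, yieldSize = yield_size)
  filterVcf(tabix_file, genome, destination,
            filters = FilterRules(list(allDP = dp_filter)))
}
```

Treating a missing DP as a failure is a choice made here; if samples without a call should be ignored instead, the predicate would need to count only the non-missing entries in each row.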
Valerie