Hi all,
I got from my collaborators a VCF file which contains all SNVs from all samples. The rows depict the SNVs and the columns depict the mutational status in all samples. Naturally, it is a sparse matrix. Here is an example line:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT do9157 do9158 do9159 do9160 do9161 do9162 do9163 chr1 3016583 chr1:3016583_A/G A G . PASS QSS=68;TQSS=1;NT=ref;QSS_NT=68;TQSS_NT=1;SGT=AA->AG;SOMATIC;VEP=intergenic_variant;CONTEXT=AGTxCCA;TXNSTR=N GT:DP:FDP:SDP:SUBDP:AU:CU:GU:TU .:.:.:.:.:.,0:.,0:.,0:.,0 .:.:.:.:.:.,0:.,0:.,0:.,0 .:.:.:.:.:.,0:.,0:.,0:.,0 .:.:.:.:.:.,0:.,0:.,0:.,0 .:.:.:.:.:.,0:.,0:.,0:.,0 0/1:65:1:0:0:41,42:0,0:23,24:0,0 .:.:.:.:.:.,0:.,0:.,0:.,0
My goal is to have a VRanges object for mutational signature analysis.
The whole file is about 2GB and contains 5,000,000 SNVs in 80 samples. Standard calls of readVcf or readVcfAsVRanges die as the amount of needed RAM exceeds my 32GB.
Method #1:
One problem is, that readVcfAsVRanges multiplies each line for each of the 80 samples. This is not always necessary as only every other line contains more than one entries. But each line contains at least one entry. Is there a way to evaluate the GT field for existence of "." or "0/1" while reading the file?
Method #2:
I tried one method, where I split the file parsing steps into sample-specific processes. This works well and I produce a list containing 80 VRanges objects. Here, I fail to concatenate the list to one VRanges object in the end.
Method #3:
File reading via TabixFile but the manual is quite unclear of how to proceed here.
tl;dr:
How can I create a VRanges object from a large file which contains all mutational statuses of all samples, so mostly dots and zeros I don't need?
Did you use
ScanVcfParam()
to selectively import only the information that you need? Also, for smoe use cases there arereadGeno()
and similar.Thanks for the quick answer!
Method#2: Yes, I basically filter for "." in the GT field:
But then I cannot append or merge the files from the list of VRanges objects together.
Currently, I'm having a look at the GenomicFiles package. Thank your for pointing this out to me. Can I ask you for a minimal example of how to loop over a file by yieldSize?