Best way to create a VRanges object from a large VCF file?
2
0
Entering edit mode
@rubendrews-12149
Last seen 5.1 years ago

Hi all,

I got from my collaborators a VCF file which contains all SNVs from all samples. The rows depict the SNVs and the columns depict the mutational status in all samples. Naturally, it is a sparse matrix. Here is an example line:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  do9157 do9158 do9159 do9160 do9161 do9162 do9163
chr1    3016583 chr1:3016583_A/G        A       G       .       PASS    QSS=68;TQSS=1;NT=ref;QSS_NT=68;TQSS_NT=1;SGT=AA->AG;SOMATIC;VEP=intergenic_variant;CONTEXT=AGTxCCA;TXNSTR=N     GT:DP:FDP:SDP:SUBDP:AU:CU:GU:TU .:.:.:.:.:.,0:.,0:.,0:.,0       .:.:.:.:.:.,0:.,0:.,0:.,0       .:.:.:.:.:.,0:.,0:.,0:.,0       .:.:.:.:.:.,0:.,0:.,0:.,0       .:.:.:.:.:.,0:.,0:.,0:.,0       0/1:65:1:0:0:41,42:0,0:23,24:0,0        .:.:.:.:.:.,0:.,0:.,0:.,0

My goal is to have a VRanges object for mutational signature analysis.

The whole file is about 2GB and contains 5,000,000 SNVs in 80 samples. Standard calls of readVcf or readVcfAsVRanges die as the amount of needed RAM exceeds my 32GB.

Method #1:

One problem is, that readVcfAsVRanges multiplies each line for each of the 80 samples. This is not always necessary as only every other line contains more than one entries. But each line contains at least one entry. Is there a way to evaluate the GT field for existence of "."  or "0/1" while reading the file?

Method #2:

I tried one method, where I split the file parsing steps into sample-specific processes. This works well and I produce a list containing 80 VRanges objects. Here, I fail to concatenate the list to one VRanges object in the end.

Method #3:

File reading via TabixFile but the manual is quite unclear of how to proceed here.

tl;dr:

How can I create a VRanges object from a large file which contains all mutational statuses of all samples, so mostly dots and zeros I don't need?

0
Entering edit mode

Did you use ScanVcfParam() to selectively import only the information that you need? Also, for smoe use cases there are readGeno() and similar.

0
Entering edit mode

Method#2: Yes, I basically filter for "." in the GT field:

vcfTemp=readVcfAsVRanges(vcfFile, genome="mm10", param)

vcfTempFiltered=vcfTemp[which(vcfTemp$GT!=".")]  But then I cannot append or merge the files from the list of VRanges objects together. Currently, I'm having a look at the GenomicFiles package. Thank your for pointing this out to me. Can I ask you for a minimal example of how to loop over a file by yieldSize? ADD REPLY 1 Entering edit mode @michael-lawrence-3846 Last seen 5 months ago United States You really don't want to concatenate iteratively in this case, as there is a lot of overhead to each contatenation (and if you really want to iterate with side effects, use a for() loop for clarity). You should use lapply() when reading, not sapply(), since you really just want a list. And you can cast the list to a VRangesList, which you can then stack to a VRanges with properly populated sample names: vcfBig <- stackSamples(List(vcf)) But why not use GenomicFiles? I've never used it before, but it's probably something like this: reader <- function(vcf) readVcfAsVRanges(vcf, genome="mm10", param) filter <- function(vr) vr[vr$GT!="."]
reducer <- function(vrl) stackSamples(List(vrl))
vcfBig <- reduceByYield(VcfFile(vcfFile, yieldSize=1e5), reader, filter, reducer, iterate=FALSE)

Btw, since I doubt you're dealing with haploid mice, the missing value for the genotypes should be "./.", not ".". So your VCF is incorrect.

0
Entering edit mode
@rubendrews-12149
Last seen 5.1 years ago

Decided to do it the quick-and-dirty way. For the sake of this community, I'll post my solution but it's not pretty. For my 5,000,000x80 it takes around 20-25 minutes.

inputLargeVcfFile=function(vcfFile){

# Get samples from VCF file
sampleNames=samples(hvcf)

vcf=VRanges()
vcf=sapply(sampleNames, function(sampleName) {
print(sampleName)
param=ScanVcfParam(samples=sampleName, trimEmpty=TRUE, info=c("CONTEXT", "SGT"), geno=c("GT", "AU", "CU", "GU", "TU"))
vcfTempFiltered=vcfTemp[which(vcfTemp\$GT!=".")]
}
)

vcfBig=VRanges()
vcfTemp=sapply(vcf, function(vcfSmall) vcfBig <<- append(vcfBig, vcfSmall))
rm(vcfTemp)

return(vcfBig)
}