Search
Question: Best way to create a VRanges object from a large VCF file?
0
gravatar for ruben.drews
6 months ago by
ruben.drews0 wrote:

Hi all,

I got from my collaborators a VCF file which contains all SNVs from all samples. The rows depict the SNVs and the columns depict the mutational status in all samples. Naturally, it is a sparse matrix. Here is an example line:

 

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  do9157 do9158 do9159 do9160 do9161 do9162 do9163
chr1    3016583 chr1:3016583_A/G        A       G       .       PASS    QSS=68;TQSS=1;NT=ref;QSS_NT=68;TQSS_NT=1;SGT=AA->AG;SOMATIC;VEP=intergenic_variant;CONTEXT=AGTxCCA;TXNSTR=N     GT:DP:FDP:SDP:SUBDP:AU:CU:GU:TU .:.:.:.:.:.,0:.,0:.,0:.,0       .:.:.:.:.:.,0:.,0:.,0:.,0       .:.:.:.:.:.,0:.,0:.,0:.,0       .:.:.:.:.:.,0:.,0:.,0:.,0       .:.:.:.:.:.,0:.,0:.,0:.,0       0/1:65:1:0:0:41,42:0,0:23,24:0,0        .:.:.:.:.:.,0:.,0:.,0:.,0

 

My goal is to have a VRanges object for mutational signature analysis.

The whole file is about 2GB and contains 5,000,000 SNVs in 80 samples. Standard calls of readVcf or readVcfAsVRanges die as the amount of needed RAM exceeds my 32GB. 

 

Method #1:

One problem is, that readVcfAsVRanges multiplies each line for each of the 80 samples. This is not always necessary as only every other line contains more than one entries. But each line contains at least one entry. Is there a way to evaluate the GT field for existence of "."  or "0/1" while reading the file?

 

Method #2:

 

I tried one method, where I split the file parsing steps into sample-specific processes. This works well and I produce a list containing 80 VRanges objects. Here, I fail to concatenate the list to one VRanges object in the end.

 

Method #3:

File reading via TabixFile but the manual is quite unclear of how to proceed here.

 

tl;dr:

How can I create a VRanges object from a large file which contains all mutational statuses of all samples, so mostly dots and zeros I don't need?

 

ADD COMMENTlink modified 6 months ago by Michael Lawrence9.3k • written 6 months ago by ruben.drews0

Did you use ScanVcfParam() to selectively import only the information that you need? Also, for smoe use cases there are readGeno() and similar.

ADD REPLYlink written 6 months ago by Martin Morgan ♦♦ 20k

Thanks for the quick answer!

 

Method#2: Yes, I basically filter for "." in the GT field:

vcfTemp=readVcfAsVRanges(vcfFile, genome="mm10", param)

vcfTempFiltered=vcfTemp[which(vcfTemp$GT!=".")]

But then I cannot append or merge the files from the list of VRanges objects together.

 

Currently, I'm having a look at the GenomicFiles package. Thank your for pointing this out to me. Can I ask you for a minimal example of how to loop over a file by yieldSize? 

 

 

ADD REPLYlink written 6 months ago by ruben.drews0
1
gravatar for Michael Lawrence
6 months ago by
United States
Michael Lawrence9.3k wrote:

You really don't want to concatenate iteratively in this case, as there is a lot of overhead to each contatenation (and if you really want to iterate with side effects, use a for() loop for clarity). You should use lapply() when reading, not sapply(), since you really just want a list. And you can cast the list to a VRangesList, which you can then stack to a VRanges with properly populated sample names:

vcfBig <- stackSamples(List(vcf))

But why not use GenomicFiles? I've never used it before, but it's probably something like this:

reader <- function(vcf) readVcfAsVRanges(vcf, genome="mm10", param)
filter <- function(vr) vr[vr$GT!="."]
reducer <- function(vrl) stackSamples(List(vrl))
vcfBig <- reduceByYield(VcfFile(vcfFile, yieldSize=1e5), reader, filter, reducer, iterate=FALSE)

Btw, since I doubt you're dealing with haploid mice, the missing value for the genotypes should be "./.", not ".". So your VCF is incorrect.

 

ADD COMMENTlink written 6 months ago by Michael Lawrence9.3k
0
gravatar for ruben.drews
6 months ago by
ruben.drews0 wrote:

Decided to do it the quick-and-dirty way. For the sake of this community, I'll post my solution but it's not pretty. For my 5,000,000x80 it takes around 20-25 minutes. 

 

inputLargeVcfFile=function(vcfFile){

    # Get samples from VCF file
    hvcf=scanVcfHeader(vcfFile)
    sampleNames=samples(hvcf)

    vcf=VRanges()
    vcf=sapply(sampleNames, function(sampleName) { 
        print(sampleName)
        param=ScanVcfParam(samples=sampleName, trimEmpty=TRUE, info=c("CONTEXT", "SGT"), geno=c("GT", "AU", "CU", "GU", "TU"))
        vcfTemp=readVcfAsVRanges(vcfFile, genome="mm10", param)
        vcfTempFiltered=vcfTemp[which(vcfTemp$GT!=".")]
        }
    )

    vcfBig=VRanges()
    vcfTemp=sapply(vcf, function(vcfSmall) vcfBig <<- append(vcfBig, vcfSmall))
    rm(vcfTemp)
    
    return(vcfBig)
}

 

Thanks Michael for your help!

ADD COMMENTlink written 6 months ago by ruben.drews0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 149 users visited in the last hour