VariantAnnotation - filterVcf - Prints multiple vcf headers
1
0
Entering edit mode
@stephen-sefick-3922
Last seen 8.0 years ago

Hello,

I am new to the VariantAnnotation package. I am trying to filter a 4-species vcf with the below code. I am having 2 problems: 1) I am not recovering the one site in my test file where all 4 species have RGQ>=30 (along with with variant sites that are not filtered), and 2) the header is of the vcf file is being printed multiple times (~55 times). I will provide any further information that would be helpful.

many thanks,

Stephen

code:

library("VariantAnnotation")

ref <- "ref"
x <- readVcf("test.vcf.gz", genome=ref)

##prefilter
##only those that have RGQ
isRGQ <- function(x) {
    grepl("RGQ", x, fixed=TRUE)
}

##filters
RGQ_filter <- function(x, value=30) {
    ## Filter on RGQ > value; default is 30

    test <- apply(geno(x)$RGQ>=value, 1, sum)==4
    out <- is.na(test) & test  
    out[is.na(out)] <- TRUE
    return(out)
}

prefilters <- FilterRules(list(RGQ_prefilter=isRGQ))
filters <- FilterRules(list(RGQ_filter=RGQ_filter))

file.gz <- "test.vcf.gz"
file.gz.tbi <- "test.vcf.gz.tbi"
destination.file <- "test2.vcf.gz"
size=1
tabix.file <- TabixFile(file.gz, yieldSize=size)
destination.file <- tempfile()

filterVcf(file=tabix.file, genome=ref, destination=destination.file,
          filters=filters, verbose=TRUE)
variantannotation • 1.3k views
ADD COMMENT
0
Entering edit mode

You'll want to use a yieldSize much larger than 1, say 10,000 - 100,000, so that you benefit from R's vectorized calculations.

ADD REPLY
0
Entering edit mode

Hello Martin,

When I used the TabixFile function, the vcf header was being written into the file multiple times. The function is running, but it is slow. Is there a way around this?

 

There is a report of this here: https://stat.ethz.ch/pipermail/bioc-devel/2015-August/007877.html.

 

I have 1.16.4 installed. Maybe a regression? Thank you for the help.

ADD REPLY
0
Entering edit mode

When I look at the filterVcf,character-method, I see

> selectMethod(filterVcf, "character")
Method Definition:

function (file, genome, destination, ..., verbose = TRUE, index = FALSE, 
    prefilters = FilterRules(), filters = FilterRules(), param = ScanVcfParam()) 
{
    if (file.exists(destination)) 
        stop(sprintf("file '%s' exists and will not be over-written", 
            destination))
    tbx <- open(TabixFile(file, yieldSize = 1e+05))
    on.exit(close(tbx))
    filterVcf(tbx, genome = genome, destination = destination, 
        ..., verbose = verbose, index = index, prefilters = prefilters, 
        filters = filters, param = param)
}
<environment: namespace:VariantAnnotation>

Signatures:
        file       
target  "character"
defined "character"

so actually 'under the hood' it is using TabixFile; it seems like the key is to open it before calling filterVcf(). But maybe it's good enough to just use file.gz -- is there a reason that you want to use TabixFile()explicitly? And again, using yieldSize=1 is very inefficient.

Also, I believe that you can use the param argument and ScanVcfParam() to select just the fields required for the filter

param <- ScanVcfParam(fixed=NA, info=NA, geno="RGQ")
filterVcf(..., param=param)
ADD REPLY
0
Entering edit mode

I was using yield size 1 to test. I am still experiencing the multiple header problem. Thank you for all of the help.

Relevant code snipette:

size=10

tabix.file <- open(TabixFile(file.gz, yieldSize=size))
    

#destination.file <- tempfile()

filterVcf(file=tabix.file, genome=ref, destination=destination.file, filters=filters, verbose=TRUE)

close(tabix.file)

ADD REPLY
0
Entering edit mode

Popping out to top level comment to give some more room. For a reproducible example I did this, modified from ?filterVcf

library(VariantAnnotation)

fl <- system.file(package="VariantAnnotation", "extdata", "chr22.vcf.gz")
lowCoverageExomeSNP <- function(x) grepl("LOWCOV,EXOME", x, fixed=TRUE)
pre <- FilterRules(list(lowCoverageExomeSNP = lowCoverageExomeSNP))
filt <- FilterRules(list(VTisSNP = function(x) info(x)$VT == "SNP"))

tbx <- open(TabixFile(fl, yieldSize=700))
out <- filterVcf(tbx, "hg19", tempfile(), prefilters=pre, filters=filt)
close(tbx)

The output (abbreviate) was

> out <- filterVcf(tbx, "hg19", tempfile(), prefilters=pre, filters=filt)
starting prefilter
prefiltering 200 records
prefiltering 400 records
...
prefiltering 10200 records
prefiltering 10376 records
prefiltered to /tmp/RtmpTixzNP/file4e7f38381211
prefilter compressing and indexing '/tmp/RtmpTixzNP/file4e7f38381211'
starting filter
filtering 200 records
filtering 400 records
filtering 600 records
filtering 794 records
completed filtering
>

which happened very quickly; there were no duplicate headers. Can you perform the same? If you experience problems, can you add the output of sessionInfo() to your post? Also, if you could provide your VCF file (or a small portion of it, on dropbox or similar, or to me at martin.morgan at roswellpark.org) then I will look into this further.

A speed-up is to replace the apply() function with rowSums()

ADD REPLY
0
Entering edit mode

This works fine. My script is close to working. I updated Bioconductor, and utilized your suggestions. I appreciate all of your help.

ADD REPLY
0
Entering edit mode
@stephen-sefick-3922
Last seen 8.0 years ago

This is now acting as expected. I stopped using the tabix.file <- TabixFile(file.gz, yieldSize=size), and use the actual file name.

filter has been modified to work as expected:

RGQ_filter <- function(x, value=30) {
    ## Filter on RGQ > value; default is 30
    ##test <- apply(geno(x)$RGQ>=value, 1, sum)==4
    ##out <- is.na(test) & test  
    out <- apply(geno(x)$RGQ>=value, 1, sum)==4
    out[is.na(out)] <- TRUE
    return(out)
}

 

HTH,

Stephen

ADD COMMENT

Login before adding your answer.

Traffic: 702 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6