Annotate VCFs with Cosmic (ExAC etc.) fields
1
0
Entering edit mode
@markusriester-9875
Last seen 2.5 years ago
United States

Hi,

I'm trying to annotate VCFs with info fields stored in other VCFs, for example provided by Cosmic or ExAC. Since this is a very common workflow, I was wondering if I missed functionality already provided in Bioconductor.

The code below is the solution I came up with, which seems rather over-complicated. For example, is it possible to make ScanVcfParam check the ALT fields automatically?

data(package = "COSMIC.67")
data(cosmic_67, package = "COSMIC.67")
dbVcfFile <- system.file("vcf", "cosmic_67.vcf.gz", package = "COSMIC.67")

vcf <- readVcf(example.vcf.file, 'hg19')

.annotateVcf <- function(vcf, dbVcfFile, infoFields, infoTypes, infoDescriptions, infoPrefix="") {

    # make sure chromsome styles are the same
    dbSeqStyle <- seqlevelsStyle(headerTabix(
        TabixFile(dbVcfFile))$seqnames)

    vcfRenamedSL <- vcf
    if (!length(intersect(seqlevelsStyle(vcf), dbSeqStyle))) {
        seqlevelsStyle(vcfRenamedSL) <- dbSeqStyle[1]
    }

    # query variants
    db.vcf <- readVcf(dbVcfFile, genome=genome(vcf)[1],
        ScanVcfParam(which = rowRanges(vcfRenamedSL),
            info=infoFields,
            fixed="ALT",
            geno=NA
        )
    )
    ov <- match(rowRanges(vcfRenamedSL), rowRanges(db.vcf))

    # annotate header
    field <- paste0(infoPrefix,infoFields)
    newInfo <- DataFrame(
        Number=1, Type=infoTypes,
        Description=infoDescriptions,
        row.names=field)
    info(header(vcf)) <- rbind(info(header(vcf)), newInfo)

    # add info fields
    for (i in seq_along(infoFields)) {
        info(vcf)[[field[i]]] <- info(db.vcf)[[infoFields[1]]][ov]
    }

    vcf
}

vcfAnnotated <- .annotateVcf(vcf, dbVcfFile, "CNT", "Integer", "Number of Cosmic hits", "Cosmic.")

Update: updated code with suggestions.

Thank you,

Markus

variantannotation cosmic exac • 3.6k views
ADD COMMENT
2
Entering edit mode

If you have your variants as VRanges objects, methods like match, %in%, %over% and friends also consider the alternative allele (see the help page):

'match(x)’: Like GRanges ‘match’, except matches on the combination of chromosome, start, width, and *alt*.

Personally, and just in case you are not exclusively bound to R for your entire analysis, I prefer running bcftools annotate externally for this type of task, especially when dealing with large VCF files. You can then import the annotated VCF into R/Bioc and process as normal.

ADD REPLY
0
Entering edit mode

Thanks a lot Julian, I didn't know that "match" works on VRanges, but indeed does. And yes, the code is pretty slow for larger VCFs, is there something obvious I can do to speed it up?

ADD REPLY
0
Entering edit mode

Avoid re-inventing the wheel, the tools doing what you want already exist. As suggested by Julian, bcftools annotate is one of those.

ADD REPLY
0
Entering edit mode

No need for patronizing. Upstream and downstream steps happen in R, so third party dependencies wouldn't be ideal either.

ADD REPLY
0
Entering edit mode

Scanning your code, I would guess (a) if there are many ranges, then it is often faster to read-then-subset (e.g., subsetByOverlaps(db.vcf, rowRanges(vcfRenamedSL)), rather than using the which= argument to ScanVcfParam() and (b) in the for loop at the end, collect all results then do the info update (info(vcf) <- ...) only once to avoid copying the vcf object on each iteration. It can be very helpful to simply debug() and then step through the function to see where the slow points are.

Actually it seems like most of the for loop is captured without iteration as

field <- paste0(infoPrefix,infoFields)
newInfo <- DataFrame(
    Number=1, Type=infoTypes,
    Description=infoDescriptions,
    row.names=field)
info(header(vcf)) <- rbind(info(header(vcf)), newInfo)
info(vcf)[[field]] <- NA

It seems like the remaining iteration can also be made to iterate only on small data

fld = info(vcf)[[field]]
for (i in seq_along(infoFields))
    fld[queryHits(ov)] <- info(db.vcf[subjectHits(ov)])[[infoFields[i]]]
info(vcf)[[field]] = fld

and even vectorized, but it would help to see a complete (toy) example of inputs and outputs.

ADD REPLY
0
Entering edit mode

Thanks a lot Martin! I updated the code. The bottleneck is the readVcf part, but I forgot to mention that these can be very large files and I assume they are bgzipped and tabixed. It's not terribly slow, but if you see ways for improving, that would be awesome.

ADD REPLY
1
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States

I wrote annotateVcf as follows, assuming that both vcf (the vcf to be annotated) and db.vcf were already in memory.

annotateVcf <- function(vcf, db.vcf, header) {
    if (!length(intersect(seqlevelsStyle(vcf), seqlevelsStyle(db.vcf))))
        seqlevelsStyle(vcfRenamedSL) <- seqlevelsStyle(db.vcf)[1]

    info(header(vcf)) <- rbind(info(header(vcf)), header)

    idx <- match(rowRanges(vcf), rowRanges(db.vcf))
     df <- info(db.vcf)[idx, infoFields, drop=FALSE]
    names(df) <- rownames(db.vcf_header)
    info(vcf)[, names(df)] <- df
    vcf
}

Maybe the only speedup is to avoid the iteration over infoFields, so the vcf is updated only once. I used it like this

infoFields <- "CNT"
infoPrefix <- "COSMIC"
dbVcfFile <- system.file(package = "COSMIC.67", "vcf", "cosmic_67.vcf.gz")

db.vcf <- readVcf(dbVcfFile, genome=genome(vcf)[1],
                  ScanVcfParam(info=infoFields, fixed="ALT", geno=NA))
header <- info(header(db.vcf))[infoFields, , drop=FALSE]
rownames(header) <- paste(infoPrefix, rownames(header), sep=".")

example.vcf.file <- system.file(package="VariantAnnotation",
                                "extdata", "chr22.vcf.gz")
vcf <- readVcf(example.vcf.file, 'hg19')
vcf <- annotateVcf(vcf, db.vcf, header)

and to deal with large files (needing annotation) I iterated through in chunks using VcfFile(yieldSize=) and GenomicFiles::reduceByYield()

out <- tempfile()
con <- file(out, open="a")              # append
reduceByYield(
    VcfFile(example.vcf.file, yieldSize=100000),
    readVcf,
    function(vcf) {
      vcf <- annotateVcf(vcf, db.vcf, header)
      writeVcf(vcf, con)
      length(vcf)
    })
close(con)

 

ADD COMMENT
0
Entering edit mode

Note that this does not check that the ALT is the same between the two VCFs. Could use VRanges to handle that.
 

ADD REPLY
0
Entering edit mode

Ah, right. Thanks for pointing this out Michael. And thanks again Martin, really nice code. That helped a lot.

ADD REPLY

Login before adding your answer.

Traffic: 564 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6