Matching Variants Between Two VCF Objects
2
0
Entering edit mode
Dario Strbenac ★ 1.5k
@dario-strbenac-5916
Last seen 1 day ago
Australia

If I have two VCF objects, how can I find the matches of one in the other, considering both the location and the alternative allele? I used match, but I find that it only matches on position and ignores whether the alterative allele is different (row ranges not shown).

> fixedQuery
DataFrame with 1 row and 4 columns
             REF                    ALT      QUAL      FILTER
  <DNAStringSet>        <CharacterList> <numeric> <character>
1              T .AAAAAAAAAAAAAAAAAAA..   4115.21        PASS
> fixedMatch
DataFrame with 1 row and 4 columns
             REF                    ALT      QUAL      FILTER
  <DNAStringSet>        <CharacterList> <numeric> <character>
1              T .TTTCCCCTGATGAGTGTCT..   4824.64        PASS

It's possible to use ref and alt and match those too, but I thought there might be some higher-level solution.

VariantAnnotation • 1.8k views
ADD COMMENT
2
Entering edit mode
nick.eagles ▴ 100
@nickeagles-24264
Last seen 4 months ago
United States

Hello,

Are you referring to 2 CollapsedVCF objects each with 1 column (sample)? I assume this is the case, since a CollapsedVCF containing more than one sample may have more than one allele for a particular position (potentially different alleles in each sample). I don't know of a high-level function for your use-case, so the best I can provide is a custom function that should do the job. The below function returns a CollapsedVCF whose rows are the sites from vcf1 whose ranges and genotype calls are present and equal as in vcf2.

getEqualOverlaps = function(vcf1, vcf2) {
    #  This function only makes sense for VCF files with one sample!
    stopifnot(ncol(vcf1) == 1 && ncol(vcf2) == 1)

    #  First take the ranges which are present in both VCF objects
    temp_vcf1 = subsetByOverlaps(vcf1, vcf2)

    #  The indices of each range in vcf2 along the rows of vcf1
    vcf2_indices = match(ranges(temp_vcf1), ranges(vcf2))

    #  Subset those equal ranges to those whose genotype calls match
    temp_vcf1 = temp_vcf1[geno(temp_vcf1)$GT[,1] == geno(vcf2)$GT[vcf2_indices,1],]

    return(temp_vcf1)
}

Best,

-Nick

P.S. I'm currently learn to help others.

Session info when testing this function:

> library('sessioninfo')
> session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.0.3 Patched (2020-11-29 r79529)
 os       CentOS Linux 7 (Core)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       US/Eastern
 date     2020-12-02

─ Packages ───────────────────────────────────────────────────────────────────
 package              * version  date       lib source
 AnnotationDbi          1.52.0   2020-10-27 [2] Bioconductor
 askpass                1.1      2019-01-13 [2] CRAN (R 4.0.3)
 assertthat             0.2.1    2019-03-21 [2] CRAN (R 4.0.3)
 Biobase              * 2.50.0   2020-10-27 [2] Bioconductor
 BiocFileCache          1.14.0   2020-10-27 [1] Bioconductor
 BiocGenerics         * 0.36.0   2020-10-27 [2] Bioconductor
 BiocParallel           1.24.1   2020-11-06 [2] Bioconductor
 biomaRt                2.46.0   2020-10-27 [2] Bioconductor
 Biostrings           * 2.58.0   2020-10-27 [2] Bioconductor
 bit                    4.0.4    2020-08-04 [2] CRAN (R 4.0.3)
 bit64                  4.0.5    2020-08-30 [2] CRAN (R 4.0.3)
 bitops                 1.0-6    2013-08-17 [2] CRAN (R 4.0.3)
 blob                   1.2.1    2020-01-20 [2] CRAN (R 4.0.3)
 BSgenome               1.58.0   2020-10-27 [2] Bioconductor
 cli                    2.2.0    2020-11-20 [2] CRAN (R 4.0.3)
 crayon                 1.3.4    2017-09-16 [2] CRAN (R 4.0.3)
 curl                   4.3      2019-12-02 [2] CRAN (R 4.0.3)
 DBI                    1.1.0    2019-12-15 [2] CRAN (R 4.0.3)
 dbplyr                 2.0.0    2020-11-03 [2] CRAN (R 4.0.3)
 DelayedArray           0.16.0   2020-10-27 [2] Bioconductor
 digest                 0.6.27   2020-10-24 [2] CRAN (R 4.0.3)
 dplyr                  1.0.2    2020-08-18 [2] CRAN (R 4.0.3)
 ellipsis               0.3.1    2020-05-15 [2] CRAN (R 4.0.3)
 fansi                  0.4.1    2020-01-08 [2] CRAN (R 4.0.3)
 generics               0.1.0    2020-10-31 [2] CRAN (R 4.0.3)
 GenomeInfoDb         * 1.26.1   2020-11-20 [2] Bioconductor
 GenomeInfoDbData       1.2.4    2020-11-30 [2] Bioconductor
 GenomicAlignments      1.26.0   2020-10-27 [2] Bioconductor
 GenomicFeatures        1.42.1   2020-11-12 [2] Bioconductor
 GenomicRanges        * 1.42.0   2020-10-27 [2] Bioconductor
 glue                   1.4.2    2020-08-27 [2] CRAN (R 4.0.3)
 hms                    0.5.3    2020-01-08 [2] CRAN (R 4.0.3)
 httr                   1.4.2    2020-07-20 [2] CRAN (R 4.0.3)
 IRanges              * 2.24.0   2020-10-27 [2] Bioconductor
 lattice                0.20-41  2020-04-02 [3] CRAN (R 4.0.3)
 lifecycle              0.2.0    2020-03-06 [2] CRAN (R 4.0.3)
 magrittr               2.0.1    2020-11-17 [2] CRAN (R 4.0.3)
 Matrix                 1.2-18   2019-11-27 [3] CRAN (R 4.0.3)
 MatrixGenerics       * 1.2.0    2020-10-27 [2] Bioconductor
 matrixStats          * 0.57.0   2020-09-25 [2] CRAN (R 4.0.3)
 memoise                1.1.0    2017-04-21 [2] CRAN (R 4.0.3)
 openssl                1.4.3    2020-09-18 [2] CRAN (R 4.0.3)
 pillar                 1.4.7    2020-11-20 [2] CRAN (R 4.0.3)
 pkgconfig              2.0.3    2019-09-22 [2] CRAN (R 4.0.3)
 prettyunits            1.1.1    2020-01-24 [2] CRAN (R 4.0.3)
 progress               1.2.2    2019-05-16 [2] CRAN (R 4.0.3)
 purrr                  0.3.4    2020-04-17 [2] CRAN (R 4.0.3)
 R6                     2.5.0    2020-10-28 [2] CRAN (R 4.0.3)
 rappdirs               0.3.1    2016-03-28 [2] CRAN (R 4.0.3)
 Rcpp                   1.0.5    2020-07-06 [2] CRAN (R 4.0.3)
 RCurl                  1.98-1.2 2020-04-18 [2] CRAN (R 4.0.3)
 rlang                  0.4.9    2020-11-26 [2] CRAN (R 4.0.3)
 Rsamtools            * 2.6.0    2020-10-27 [2] Bioconductor
 RSQLite                2.2.1    2020-09-30 [2] CRAN (R 4.0.3)
 rtracklayer            1.50.0   2020-10-27 [2] Bioconductor
 S4Vectors            * 0.28.0   2020-10-27 [2] Bioconductor
 sessioninfo          * 1.1.1    2018-11-05 [2] CRAN (R 4.0.3)
 stringi                1.5.3    2020-09-09 [2] CRAN (R 4.0.3)
 stringr                1.4.0    2019-02-10 [2] CRAN (R 4.0.3)
 SummarizedExperiment * 1.20.0   2020-10-27 [1] Bioconductor
 tibble                 3.0.4    2020-10-12 [2] CRAN (R 4.0.3)
 tidyselect             1.1.0    2020-05-11 [2] CRAN (R 4.0.3)
 VariantAnnotation    * 1.36.0   2020-10-27 [2] Bioconductor
 vctrs                  0.3.5    2020-11-17 [2] CRAN (R 4.0.3)
 withr                  2.3.0    2020-09-22 [2] CRAN (R 4.0.3)
 XML                    3.99-0.5 2020-07-23 [2] CRAN (R 4.0.3)
 xml2                   1.3.2    2020-04-23 [2] CRAN (R 4.0.3)
 XVector              * 0.30.0   2020-10-27 [2] Bioconductor
 zlibbioc               1.36.0   2020-10-27 [2] Bioconductor

[1] /users/neagles/R/4.0.x
[2] /jhpce/shared/jhpce/core/conda/miniconda3-4.6.14/envs/svnR-4.0.x/R/4.0.x/lib64/R/site-library
[3] /jhpce/shared/jhpce/core/conda/miniconda3-4.6.14/envs/svnR-4.0.x/R/4.0.x/lib64/R/library
ADD COMMENT

Login before adding your answer.

Traffic: 621 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6