Question

How to compare subset of samples from VCF files

0

Entering edit mode

Marco • 0

@dc86b518

Last seen 8 weeks ago

United States

Hi, I'm working on a SNP analysis project where the goal is to identify common/different polymorphisms between groups of samples. The VCF files I am using contains multiple samples with the different genotypes listed as comma separated alleles in the ALT field. I've been having a lot of challenges to subset the VCF into group of samples, trying to find common vs different alleles. Here's a simple mock sample exemplifying my challenges.

Here's a simple VCF for 3 individuals, Bob, Linda and Paul with different genotypes at 3 position on a given chromosome.

##fileformat=VCFv4.2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1,length=100000>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  Bob Linda   Paul
chr1    10352   id_1    T   TA  .   .   DP=100  GT  1/1 1/1 0/0
chr1    13813   id_2    TT  TG,T    .   .   DP=100  GT  0/0 0/1 0/2
chr1    14464   id_3    A   T,C .   .   DP=100  GT  0/1 0/2 0/1

My goal is to find the SNPs common between Bob and Linda. I am using the VariantAnnotation package as it seems the most obvious choice but no matter what I tried, I can't seem to subscript correctly one sample only SNPs. Here's what I though would make sense:

> library(VariantAnnotation)
> vcf <- readVcf("example.vcf") ## example.vcf being the small VCF above

## This is not working as the vcf is a CollapsedVCF
> vcf[,'Bob'] %in% vcf[,'Linda']
Error in validObject(ans) : invalid class "CollapsedVCF" object:
    'elementMetadata' slot must contain a zero-column DataFrame at all time

So I tried using the ExpandedVCF format, but I am getting all the SNPs for the 3 individuals (even though I only have Bob's subset...):

> vcfEx <- expand(vcf[,'Bob'])

> length(vcfEx)
[1] 5

> rowRanges(vcfEx)
GRanges object with 5 ranges and 5 metadata columns:
      seqnames      ranges strand | paramRangeID            REF            ALT      QUAL      FILTER
         <Rle>   <IRanges>  <Rle> |     <factor> <DNAStringSet> <DNAStringSet> <numeric> <character>
  [1]     chr1       10352      * |           NA              T             TA        NA           .
  [2]     chr1 13813-13814      * |           NA             TT             TG        NA           .
  [3]     chr1 13813-13814      * |           NA             TT              T        NA           .
  [4]     chr1       14464      * |           NA              A              T        NA           .
  [5]     chr1       14464      * |           NA              A              C        NA           .
  -------
  seqinfo: 1 sequence from an unspecified genome

So I tried converting to a VRanges but got the same issues:

> as(vcf[,'Bob'],'VRanges')
VRanges object with 5 ranges and 3 metadata columns:
      seqnames      ranges strand         ref              alt     totalDepth       refDepth       altDepth   sampleNames
         <Rle>   <IRanges>  <Rle> <character> <characterOrRle> <integerOrRle> <integerOrRle> <integerOrRle> <factorOrRle>
  [1]     chr1       10352      *           T               TA           <NA>           <NA>           <NA>           Bob
  [2]     chr1 13813-13814      *          TT               TG           <NA>           <NA>           <NA>           Bob
  [3]     chr1 13813-13814      *          TT                T           <NA>           <NA>           <NA>           Bob
  [4]     chr1       14464      *           A                T           <NA>           <NA>           <NA>           Bob
  [5]     chr1       14464      *           A                C           <NA>           <NA>           <NA>           Bob
      softFilterMatrix |      QUAL        DP          GT
              <matrix> | <numeric> <integer> <character>
  [1]                  |        NA       100         1/1
  [2]                  |        NA       100         0/0
  [3]                  |        NA       100         0/0
  [4]                  |        NA       100         0/1
  [5]                  |        NA       100         0/1
  -------
  seqinfo: 1 sequence from an unspecified genome
  hardFilters: NULL

What is it that I'm missing? This sounds like a fairly common problem to ask when dealing with multi-samples VCF but I can't seem to find any patterns to achieve that goals. In addition, the comparison that I need to do is not between 2 individuals but between 2 groups of individuals (ie what are the SNPs common in the Smith's family vs the Jonhson's complicated the matter even more).

At the end the goal is to generate group wise VCF files that will be use to create genealogy trees using the fastreeR package. But this is really hindering my ability to move forward.

Thanks for help

> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8        LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8
 [6] LC_MESSAGES=C.UTF-8    LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C           LC_TELEPHONE=C
[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] VariantAnnotation_1.50.0    Rsamtools_2.20.0            Biostrings_2.72.0           XVector_0.44.0
 [5] SummarizedExperiment_1.34.0 Biobase_2.64.0              GenomicRanges_1.56.0        GenomeInfoDb_1.40.1
 [9] IRanges_2.38.0              S4Vectors_0.42.0            MatrixGenerics_1.16.0       matrixStats_1.3.0
[13] BiocGenerics_0.50.0

loaded via a namespace (and not attached):
 [1] SparseArray_1.4.8        bitops_1.0-7             RSQLite_2.3.7            lattice_0.22-6
 [5] grid_4.4.0               fastmap_1.2.0            blob_1.2.4               jsonlite_1.8.8
 [9] Matrix_1.7-0             AnnotationDbi_1.66.0     restfulr_0.0.15          DBI_1.2.2
[13] httr_1.4.7               BSgenome_1.72.0          UCSC.utils_1.0.0         XML_3.99-0.16.1
[17] codetools_0.2-20         abind_1.4-5              cli_3.6.2                rlang_1.1.3
[21] crayon_1.5.2             bit64_4.0.5              yaml_2.3.8               cachem_1.1.0
[25] DelayedArray_0.30.1      GenomicFeatures_1.56.0   S4Arrays_1.4.1           tools_4.4.0
[29] parallel_4.4.0           BiocParallel_1.38.0      memoise_2.0.1            GenomeInfoDbData_1.2.12
[33] curl_5.2.1               vctrs_0.6.5              R6_2.5.1                 png_0.1-8
[37] BiocIO_1.14.0            rtracklayer_1.64.0       zlibbioc_1.50.0          KEGGREST_1.44.0
[41] bit_4.0.5                GenomicAlignments_1.40.0 rjson_0.2.21             compiler_4.4.0

VCF VariantAnnotation • 1.9k views

ADD COMMENT • link updated 17 months ago by Gordon Smyth 53k • written 17 months ago by Marco • 0

score 1 · Answer 1 · 2024-06-03

> library(VariantAnnotation)
> fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")

> vcf <- readVcf(fl, "hg19")

> ind <- geno(vcf)$GT[,1] == geno(vcf)$GT[,3]
> dim(vcf)
[1] 10376     5
> dim(vcf[ind,])
[1] 9005    5

Also please note that there is a help page for the CollapsedVCF class, which explains that the [ function subsets the object and returns an object of the same class. And it also makes no mention of the %in% function in the context of this class, so you should not assume that it will do the comparison you want to make.

Also, using %in% for something like this (where there are only a small number of choices for the comparison) isn't going to do what you want, in any case, as that function is literally asking if the values in the first vector are found in the second, which isn't what you are trying to do, which is identify identical values for each SNP in two subjects.

> huh <- letters[sample(1:26, 500, TRUE)]
> what <- letters[sample(1:26, 500, TRUE)]
> sum(huh %in% what)
[1] 500
> all.equal(huh, what)
[1] "475 string mismatches"
> sum(huh == what)
[1] 25