Hi, I'm working on a SNP analysis project where the goal is to identify common/different polymorphisms between groups of samples. The VCF files I am using contains multiple samples with the different genotypes listed as comma separated alleles in the ALT field. I've been having a lot of challenges to subset the VCF into group of samples, trying to find common vs different alleles. Here's a simple mock sample exemplifying my challenges.
Here's a simple VCF for 3 individuals, Bob, Linda and Paul with different genotypes at 3 position on a given chromosome.
##fileformat=VCFv4.2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1,length=100000>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Bob Linda Paul
chr1 10352 id_1 T TA . . DP=100 GT 1/1 1/1 0/0
chr1 13813 id_2 TT TG,T . . DP=100 GT 0/0 0/1 0/2
chr1 14464 id_3 A T,C . . DP=100 GT 0/1 0/2 0/1
My goal is to find the SNPs common between Bob and Linda. I am using the VariantAnnotation package as it seems the most obvious choice but no matter what I tried, I can't seem to subscript correctly one sample only SNPs. Here's what I though would make sense:
> library(VariantAnnotation)
> vcf <- readVcf("example.vcf") ## example.vcf being the small VCF above
## This is not working as the vcf is a CollapsedVCF
> vcf[,'Bob'] %in% vcf[,'Linda']
Error in validObject(ans) : invalid class "CollapsedVCF" object:
'elementMetadata' slot must contain a zero-column DataFrame at all time
So I tried using the ExpandedVCF
format, but I am getting all the SNPs for the 3 individuals (even though I only have Bob's subset...):
> vcfEx <- expand(vcf[,'Bob'])
> length(vcfEx)
[1] 5
> rowRanges(vcfEx)
GRanges object with 5 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID REF ALT QUAL FILTER
<Rle> <IRanges> <Rle> | <factor> <DNAStringSet> <DNAStringSet> <numeric> <character>
[1] chr1 10352 * | NA T TA NA .
[2] chr1 13813-13814 * | NA TT TG NA .
[3] chr1 13813-13814 * | NA TT T NA .
[4] chr1 14464 * | NA A T NA .
[5] chr1 14464 * | NA A C NA .
-------
seqinfo: 1 sequence from an unspecified genome
So I tried converting to a VRanges
but got the same issues:
> as(vcf[,'Bob'],'VRanges')
VRanges object with 5 ranges and 3 metadata columns:
seqnames ranges strand ref alt totalDepth refDepth altDepth sampleNames
<Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle> <integerOrRle> <integerOrRle> <factorOrRle>
[1] chr1 10352 * T TA <NA> <NA> <NA> Bob
[2] chr1 13813-13814 * TT TG <NA> <NA> <NA> Bob
[3] chr1 13813-13814 * TT T <NA> <NA> <NA> Bob
[4] chr1 14464 * A T <NA> <NA> <NA> Bob
[5] chr1 14464 * A C <NA> <NA> <NA> Bob
softFilterMatrix | QUAL DP GT
<matrix> | <numeric> <integer> <character>
[1] | NA 100 1/1
[2] | NA 100 0/0
[3] | NA 100 0/0
[4] | NA 100 0/1
[5] | NA 100 0/1
-------
seqinfo: 1 sequence from an unspecified genome
hardFilters: NULL
What is it that I'm missing? This sounds like a fairly common problem to ask when dealing with multi-samples VCF but I can't seem to find any patterns to achieve that goals. In addition, the comparison that I need to do is not between 2 individuals but between 2 groups of individuals (ie what are the SNPs common in the Smith's family vs the Jonhson's complicated the matter even more).
At the end the goal is to generate group wise VCF files that will be use to create genealogy trees using the fastreeR package. But this is really hindering my ability to move forward.
Thanks for help
> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.12.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8
[6] LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] VariantAnnotation_1.50.0 Rsamtools_2.20.0 Biostrings_2.72.0 XVector_0.44.0
[5] SummarizedExperiment_1.34.0 Biobase_2.64.0 GenomicRanges_1.56.0 GenomeInfoDb_1.40.1
[9] IRanges_2.38.0 S4Vectors_0.42.0 MatrixGenerics_1.16.0 matrixStats_1.3.0
[13] BiocGenerics_0.50.0
loaded via a namespace (and not attached):
[1] SparseArray_1.4.8 bitops_1.0-7 RSQLite_2.3.7 lattice_0.22-6
[5] grid_4.4.0 fastmap_1.2.0 blob_1.2.4 jsonlite_1.8.8
[9] Matrix_1.7-0 AnnotationDbi_1.66.0 restfulr_0.0.15 DBI_1.2.2
[13] httr_1.4.7 BSgenome_1.72.0 UCSC.utils_1.0.0 XML_3.99-0.16.1
[17] codetools_0.2-20 abind_1.4-5 cli_3.6.2 rlang_1.1.3
[21] crayon_1.5.2 bit64_4.0.5 yaml_2.3.8 cachem_1.1.0
[25] DelayedArray_0.30.1 GenomicFeatures_1.56.0 S4Arrays_1.4.1 tools_4.4.0
[29] parallel_4.4.0 BiocParallel_1.38.0 memoise_2.0.1 GenomeInfoDbData_1.2.12
[33] curl_5.2.1 vctrs_0.6.5 R6_2.5.1 png_0.1-8
[37] BiocIO_1.14.0 rtracklayer_1.64.0 zlibbioc_1.50.0 KEGGREST_1.44.0
[41] bit_4.0.5 GenomicAlignments_1.40.0 rjson_0.2.21 compiler_4.4.0
Yup, thanks James, makes a lot of sense! Not sure why I ended up on that path to expand the VCF or to use
match()
withVRanges
... Keep it simple... keep it DRY!