How to convert 1000 Genomes VCF files to a traditional SNP array format?
Hi, the 1000 Genomes Project provides its data as VCF files (VCF version 4.3). In these files, genotypes are encoded as diploid and multi-allelic, with values such as 0|0, 0|1, 0|2, 1|2, etc. In the SNP array format (HapMap data, for example), genotypes are bi-allelic and encoded as 0, 1, 2. Is there a way to convert the 1000 Genomes VCF files to the 0, 1, 2 format? Thanks.

This is not really a question about any Bioconductor software package; however, take a look at PLINK's --recode options (hint: try --recode 12, or --recode A if you want additive 0/1/2 allele counts).

Even VCFtools has an option to do this (--012).


If the goal is to do this 'on the fly' in R, then create a 'map' between the current and the desired encoding (I don't know whether this 'map' is the correct format for your purposes):

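# All allele pairs "a|b" for alleles 0, 1, 2
v = outer(0:2, 0:2, paste, sep="|")
# Keep one ordering of each pair (a <= b)
key = as.vector(v[upper.tri(v, diag = TRUE)])
# Integer codes 0..5, named by genotype string
value = seq_along(key) - 1L
map = setNames(value, key)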

so that

> map
0|0 0|1 1|1 0|2 1|2 2|2
  0   1   2   3   4   5
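
Note that these keys cover only one ordering of each allele pair, so a phased call such as 1|0 is not in the map and would come back as NA. If what you want is the array-style 0/1/2 count of non-reference alleles, a minimal sketch could look like the following. It is an assumption on my part that lumping all ALT alleles together as "non-reference" is acceptable for your analysis; the names `pairs`, `dosage` and `dosage_map` are made up for illustration.

```r
# All ordered allele pairs for up to two ALT alleles (0 = REF, 1 and 2 = ALT)
alleles <- 0:2
pairs <- expand.grid(a1 = alleles, a2 = alleles)

# Keys for both phased ("|") and unphased ("/") separators
key <- c(paste(pairs$a1, pairs$a2, sep = "|"),
         paste(pairs$a1, pairs$a2, sep = "/"))

# Dosage = number of non-reference alleles in the pair (0, 1 or 2).
# Assumption: every ALT allele counts the same, whichever ALT it is.
dosage <- rep((pairs$a1 > 0) + (pairs$a2 > 0), 2)
dosage_map <- setNames(as.integer(dosage), key)

# Both phase orderings of a heterozygote map to 1
dosage_map[c("0|0", "0|1", "1|0", "1|2", "2/2")]
```

With this map, 0|1 and 1|0 both give 1, and 1|2 gives 2. Anything not in the key set (missing calls such as .|., for instance) still comes back as NA.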

If you're only interested in the genotype matrix, an efficient operation is

library(VariantAnnotation)
# Read only the GT field as a character matrix (variants x samples)
gt <- readGeno(vcf_file, "GT")
gt[] <- map[gt]

This also works on a full VCF object or on a VRanges object, e.g.,

vcf = readVcf(vcf_file)
geno(vcf)[["GT"]][] = map[ geno(vcf)[["GT"]] ]

vr = readVcfAsVRanges(vcf_file)
vr$GT[] <- map[ vr$GT ]
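
One caveat with the `x[] <- map[x]` idiom: the genotype matrix is character, so assigning into it keeps the storage mode as character (the codes end up as "0", "1", ...). If you need a numeric matrix downstream, one option is to rebuild the matrix from the lookup. The sketch below uses a small toy matrix in place of real readGeno() output; the variant and sample names are hypothetical.

```r
# Same map as above, rebuilt so this snippet runs on its own
v <- outer(0:2, 0:2, paste, sep = "|")
key <- as.vector(v[upper.tri(v, diag = TRUE)])
map <- setNames(seq_along(key) - 1L, key)

# Toy character genotype matrix (variants x samples); names are hypothetical
gt <- matrix(c("0|0", "0|1", "1|1", "1|2", "1|0", ".|."),
             nrow = 3,
             dimnames = list(paste0("var", 1:3), paste0("sample", 1:2)))

# Look up each genotype by name; keys missing from the map become NA
gt_int <- matrix(unname(map[gt]), nrow = nrow(gt), dimnames = dimnames(gt))

storage.mode(gt_int)
gt_int
```

Here the 1|0 and .|. entries become NA because neither is a key in `map`, which is a quick way to spot genotypes the map does not cover.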

