How to convert 1000 Genomes VCF files to the traditional SNP array format?
syrttgump ▴ 20

Hi, the 1000 Genomes Project provides its data in VCF version 4.3. In these files, genotypes are encoded as diploid and multi-allelic, with values such as 0|0, 0|1, 0|2, 1|2, etc. In the SNP array format (such as HapMap data), values are bi-allelic and encoded as 0, 1, 2. Is there a way to convert the 1000 Genomes VCF files to the 0, 1, 2 format? Thanks.

SNP 1000genome • 509 views

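This is not really a question about any Bioconductor software package; however, take a look at PLINK's --recode options (hint: try --recode 12; note that this modifier recodes the allele labels as 1 and 2, whereas --recode A writes additive 0/1/2 allele counts).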

Even VCFtools has an option to do this (--012).



If the goal is to do this 'on the fly', then create a 'map' between the current and the desired encoding (I don't know whether 'map' is the correct format for your purposes):

v = outer(0:2, 0:2, paste, sep="|")
key = as.vector(v[upper.tri(v, diag = TRUE)])
value = seq_along(key) - 1L
map = setNames(value, key)

so that

> map
0|0 0|1 1|1 0|2 1|2 2|2
  0   1   2   3   4   5
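
The map above assigns one code per genotype class (0 to 5) and only contains keys whose first allele index is not larger than the second, so a phased call such as 1|0 would look up as NA. If the target is the array-style 0, 1, 2 coding asked about in the question, one possible variation is to map every ordered allele pair to its count of non-reference alleles. This is only a sketch in base R: the names `pairs`, `dosage` and `dosage_map` are made up for illustration, and treating every non-zero allele index as "alternate" is an assumption that may not suit multi-allelic sites.

```r
# Assumption: alleles are indexed 0..2 and any non-zero index counts as
# "alternate"; extend `alleles` if the file carries more ALT alleles.
alleles <- 0:2
pairs <- expand.grid(a1 = alleles, a2 = alleles)   # all ordered pairs

# Number of non-reference alleles in each pair: 0, 1 or 2
dosage <- (pairs$a1 > 0) + (pairs$a2 > 0)

# Keys for both phased ("|") and unphased ("/") genotype strings
dosage_map <- c(
    setNames(dosage, paste(pairs$a1, pairs$a2, sep = "|")),
    setNames(dosage, paste(pairs$a1, pairs$a2, sep = "/"))
)

dosage_map[c("0|0", "0|1", "1|0", "1|1", "1|2")]
# values: 0 1 1 2 2
```

Such a table can be used in place of `map` in the lookups that follow.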

If you're only interested in the genotype matrix, an efficient operation is

library(VariantAnnotation)  # provides readGeno(), readVcf(), readVcfAsVRanges()
gt <- readGeno(vcf_file, "GT")
gt[] <- map[gt]
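
One detail worth checking with this idiom: assigning into `gt[]` keeps the dimensions and dimnames, but because `gt` starts out as a character matrix, the recoded values are stored as character strings ("0", "1", ...). The sketch below illustrates this on a toy matrix standing in for the output of readGeno(); the matrix contents and the names `in_place` and `recoded` are invented for the example.

```r
# Toy genotype matrix (assumed stand-in for readGeno(vcf_file, "GT"))
gt <- matrix(c("0|0", "0|1", "1|1", "0|2"), nrow = 2,
             dimnames = list(c("var1", "var2"), c("S1", "S2")))

# Lookup table built the same way as in the answer
v <- outer(0:2, 0:2, paste, sep = "|")
key <- as.vector(v[upper.tri(v, diag = TRUE)])
map <- setNames(seq_along(key) - 1L, key)

# In-place assignment: shape is preserved, storage mode stays character
in_place <- gt
in_place[] <- map[gt]
typeof(in_place)   # "character"

# Rebuilding the matrix gives integer storage with the same dimnames
recoded <- matrix(map[gt], nrow = nrow(gt), dimnames = dimnames(gt))
typeof(recoded)    # "integer"
```

If numeric codes are needed downstream (for example for arithmetic or modelling), rebuilding the matrix or calling `storage.mode(in_place) <- "integer"` afterwards are two ways to get there.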

This also works on a full VCF object, e.g.,

vcf = readVcf(vcf_file)
geno(vcf)[["GT"]][] = map[ geno(vcf)[["GT"]] ]

vr = readVcfAsVRanges(vcf_file)
vr$GT[] <- map[ vr$GT ]
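
As an alternative to a lookup table, the allele count can also be computed directly from the genotype strings, which covers both phase orders, unphased calls and missing calls without enumerating keys. The function below is a hypothetical helper (`gt_to_dosage` is not part of VariantAnnotation) written in base R; it assumes diploid calls written as `a|b` or `a/b`, with `.` marking a missing allele, and counts any non-zero allele index as alternate.

```r
# Hypothetical helper: count non-reference alleles per diploid genotype.
gt_to_dosage <- function(gt) {
    a1 <- sub("[|/].*$", "", gt)   # allele before the separator
    a2 <- sub("^.*[|/]", "", gt)   # allele after the separator
    dosage <- (a1 != "0") + (a2 != "0")
    dosage[a1 == "." | a2 == "."] <- NA_integer_   # missing calls become NA
    # Restore the matrix shape and names of the input
    dim(dosage) <- dim(gt)
    dimnames(dosage) <- dimnames(gt)
    dosage
}

# Toy matrix standing in for a GT matrix read from a VCF file
gt <- matrix(c("0|0", "0|1", "1|0", "1|1", ".|.", "0/1"), nrow = 2,
             dimnames = list(c("rs1", "rs2"), c("S1", "S2", "S3")))
gt_to_dosage(gt)
```

For the toy input, row rs1 becomes 0, 1, NA and row rs2 becomes 1, 2, 1. Whether multi-allelic sites should be recoded this way or dropped beforehand depends on the downstream analysis.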
