How to convert 1000 Genomes VCF files to the traditional SNP array format?
syrttgump ▴ 20
@syrttgump-7367

Hi, the 1000 Genomes Project provides its data as VCF files (VCF version 4.3). In these files, genotypes are encoded as diploid and multi-allelic, with values such as 0|0, 0|1, 0|2, 1|2, etc. In the SNP array format (such as HapMap data), genotypes are bi-allelic and encoded as 0, 1, 2. Is there a way to convert the 1000 Genomes VCF files to the 0, 1, 2 format? Thanks.

@kevin

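This is not quite a question related to any Bioconductor software package; however, take a look at PLINK's --recode options (hint: try --recode A, which writes additive 0/1/2 allele counts, whereas --recode 12 recodes the alleles themselves as 1 and 2): https://www.cog-genomics.org/plink/1.9/data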

Even VCFtools has an option to do this (--012).

Kevin

@martin-morgan-1513

If the goal is to do this 'on the fly', then create a 'map' between the current and desired encoding (I don't know whether this particular 'map' is the correct encoding for your purposes)

v = outer(0:2, 0:2, paste, sep="|")
key = as.vector(v[upper.tri(v, diag = TRUE)])
value = seq_along(key) - 1L
map = setNames(value, key)

so that

> map
0|0 0|1 1|1 0|2 1|2 2|2
  0   1   2   3   4   5
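
The map above assigns a distinct code to each unordered genotype. If the target is instead the 0, 1, 2 coding from the question, one possible sketch is to count the non-reference alleles in each genotype. Treating every non-zero allele as 'alternate' and listing both phase orders (for example 1|0 as well as 0|1) are assumptions made here, and the name dosage_map is only illustrative.

```r
# All ordered allele pairs for alleles 0, 1, 2; phased data can contain
# both "0|1" and "1|0", so both orders are listed.
alleles <- 0:2
pairs <- expand.grid(a1 = alleles, a2 = alleles)

# Assumption: dosage = number of non-reference alleles, with any allele
# other than 0 counted as "alternate" (collapses multi-allelic sites).
dosage <- as.integer(pairs$a1 != 0) + as.integer(pairs$a2 != 0)

# Named integer vector: names are genotype strings, values are 0, 1 or 2.
dosage_map <- setNames(dosage, paste(pairs$a1, pairs$a2, sep = "|"))
dosage_map
```

Whether collapsing alleles 1 and 2 into a single 'alternate' class is appropriate depends on the downstream analysis; splitting multi-allelic sites beforehand is another option.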

If you're only interested in the genotype matrix, an efficient operation is

library(VariantAnnotation)
gt <- readGeno(vcf_file, "GT")
gt[] <- map[gt]
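
As a small illustration of the replacement step, the sketch below uses a made-up character matrix in place of the readGeno() output, together with a hypothetical bi-allelic map called toy_map. Assigning into coded[] keeps the dimensions and dimnames. Because the target matrix is character, the codes come back as strings, so the last step converts the storage mode to integer.

```r
# Toy genotype matrix standing in for readGeno() output; values are made up.
gt <- matrix(c("0|0", "0|1", "1|0", "1|1"), nrow = 2,
             dimnames = list(c("rs1", "rs2"), c("s1", "s2")))

# Hypothetical map covering both phase orders of a bi-allelic site.
toy_map <- c("0|0" = 0L, "0|1" = 1L, "1|0" = 1L, "1|1" = 2L)

coded <- gt
coded[] <- toy_map[gt]            # keeps dim and dimnames; still character
storage.mode(coded) <- "integer"  # convert "0", "1", "2" to integers
coded
```

Genotype strings that are missing from the map (for example "./.") come out as NA, which may or may not be the desired treatment of missing calls.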

This also works on a full VCF object, e.g.,

vcf = readVcf(vcf_file)
geno(vcf)[["GT"]][] = map[ geno(vcf)[["GT"]] ]

vr = readVcfAsVRanges(vcf_file)
vr$GT[] <- map[ vr$GT ]