Question: Subsetting info columns from a VCF and getting rsID for some row names?
0
gravatar for emily.mccann
3.5 years ago by
emily.mccann0 wrote:

I am working in R with the VariantAnnotation package and the vcf file from the ExAC consortium, ExAC.r0.3.1.sites.vep.vcf (release 0.3). My aim is to extract simple data frames for the "AC_Adj" and "AN_Adj" info columns. I have used the below code to achieve this:

 

# define paramaters on how to filter the vcf file
AC.adj.param <- ScanVcfParam(info="AC_Adj")
AN.adj.param <- ScanVcfParam(info="AN_Adj")

# load ALL allele counts (AN) and alt allele counts (AC) from exac vcf file 
raw.exac.AC.adj. <- readVcf("/Volumes/Emilly\ 1TB/ExAC.r0.3.1.sites.vep.vcf", "hg19", param=AC.adj.param)
raw.exac.AN.adj. <- readVcf("/Volumes/Emilly\ 1TB/ExAC.r0.3.1.sites.vep.vcf", "hg19", param=AN.adj.param)

# extract just the allele counts and corressponding chr location with allele tags from exac vcf file - in dataframe/s4 class
sclass.exac.AC.adj <- (info(raw.exac.AC.adj.))
sclass.exac.AN.adj <- (info(raw.exac.AN.adj.))

# make the allele counts into a plain data frame - chr positions are only row names and not their own column
df.exac.AC.adj <- as.data.frame(sclass.exac.AC.adj)
df.exac.AN.adj <- as.data.frame(sclass.exac.AN.adj)

# make the chr positions an actual column and remove the row names to make it cleaner
df.exac.AC.adj$chr.position <- row.names(df.exac.AC.adj)
df.exac.AN.adj$chr.position <- row.names(df.exac.AN.adj)

row.names(df.exac.AC.adj) <- NULL
row.names(df.exac.AN.adj) <- NULL

 

However, while most of the row names are correct, some of the row names are returned are rsIDs rather than chr:position.

Essentially, a subset of the data returned looks like this:

   AC_Adj chr.position
70      2      1:69395
71      9      1:69409
72   1985  rs140739101
73      4      1:69438
74      7  rs142004627
75      1      1:69474
76      1      1:69478
77      1      1:69486
78      2      1:69487
79     10      1:69489
80     59  rs150690004

While I would like the returned data frame to look like this:

   AC_Adj chr.position
70      2      1:69395
71      9      1:69409
72   1985   1:69428
73      4      1:69438
74      7      1:69453
75      1      1:69474
76      1      1:69478
77      1      1:69486
78      2      1:69487
79     10      1:69489
80     59      1:69496

I believe there is some error in my code which is introducing the bug but have not been able to identify it and would greatly appreciate any input as to why this is happening. There may also be a more efficient way to achieve my goal that I have overlooked. Many thanks for any help anyone can provide.

ADD COMMENTlink modified 3.5 years ago by Michael Lawrence11k • written 3.5 years ago by emily.mccann0
Answer: Subsetting info columns from a VCF and getting rsID for some row names?
0
gravatar for Michael Lawrence
3.5 years ago by
United States
Michael Lawrence11k wrote:

The rownames are probably coming from the ID column in the VCF. When missing, the position is used. To get the positions, use the rowRanges() accessor. You can combine the INFO columns with the ranges. 

gr <- rowRanges(raw.exac.AC.adj.)
mcols(gr) <- info(raw.exac.AC.adj.)

An alternative would to use VRanges:

vr <- as(raw.exac.AC.adj., "VRanges")

If you really want a data.frame, you can coerce either of those, but please be aware that Bioconductor is more than a means of converting semantically rich files to semantically poor data structures.

df <- as.data.frame(gr)
df$chr.position <- as.character(gr)

 

 

ADD COMMENTlink written 3.5 years ago by Michael Lawrence11k

Thank you , that very elegantly solves my issue. I will keep that in mind in future.

ADD REPLYlink written 3.5 years ago by emily.mccann0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 356 users visited in the last hour