Question: readVCF (VariantAnnotation) problem with mismatched header and info fields in vcf file
0
gravatar for jls2282
4.9 years ago by
jls22820
United States
jls22820 wrote:

 

I am having a problem reading the info field on vcf files generated by

the NextGene software. Here is an example:

vcfF is the vcf report as exported from NextGene

>hdr=scanVcfHeader(vcfF)
> rownames(info(hdr))
 [1] "NS"                                            "DP"                                            "AF"                                            "SGACOV"                                       
 [5] "SGASCORE"                                      "SGGENEDIR"                                     "SGANNOT"                                       "SGREFAA"                                      
 [9] "SGAAC"                                         "SGMRNAID"                                      "SGCDSID"                                       "SGTI"                                         
[13] "SGPI"                                          "SGGI"                                          "SGSEGDES"                                      "SGSEGPOS"                                     
[17] "SGAMBGAINPEN"                                  "SGAMBLOSSPEN"                                  "SG_Cosmic(Cosmic_68)_68_COSMIC_ID"             "SG_Cosmic(Cosmic_68)_68_COSMIC_Count"         
[21] "SG_dbNSFP(dbNSFP_2.0)_2_SIFT_score"            "SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HDIV_score"  "SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HDIV_pred"   "SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HVAR_score" 
[25] "SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HVAR_pred"   "SG_dbNSFP(dbNSFP_2.0)_2_LRT_score"             "SG_dbNSFP(dbNSFP_2.0)_2_LRT_pred"              "SG_dbNSFP(dbNSFP_2.0)_2_Mutation_Taster_score"
[29] "SG_dbNSFP(dbNSFP_2.0)_2_Mutation_Taster_pred"  "SG_dbNSFP(dbNSFP_2.0)_2_GERP_NR"               "SG_dbNSFP(dbNSFP_2.0)_2_GERP_RS"               "SG_dbNSFP(dbNSFP_2.0)_2_PhyloP"               
[33] "SG_dbNSFP(dbNSFP_2.0)_2_29way_logOdds"         "SG_dbNSFP(dbNSFP_2.0)_2_LRT_Omega"             "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_AF"            "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_AFR_AF"       
[37] "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_EUR_AF"        "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_AMR_AF"        "SG_dbNSFP(dbNSFP_2.0)_2_1000Gp1_ASN_AF"        "SG_dbNSFP(dbNSFP_2.0)_2_ESP5400_AA_AF"        
[41] "SG_dbNSFP(dbNSFP_2.0)_2_ESP5400_EA_AF

Note that fields [19:41] are in small caps, which match the header of the vcfF. However, the field descriptors in the info section of the vcfF are all in upper case, e.g.:

NS=1;DP=1196;AF=0.048;SGACOV=58;SGASCORE=0.000;SGGENEDIR=-;SGANNOT=CDS;SGREFAA=G;SGAAC=E;SGMRNAID=8;SGCDSID=7;SGTI=NM_005235.2;SGPI=NP_005226.1;SGGI=ERBB4;SGSEGDES=NT_005403__1..84213159__Genome_Sequence__Chr2;SGSEGPOS=62796652;SGAMBGAINPEN=0.000;SGAMBLOSSPEN=0.000;SG_DBNSFP(DBNSFP_2.0)_2_SIFT_SCORE=0.0300;NA;SG_DBNSFP(DBNSFP_2.0)_2_POLYPHEN2_HDIV_SCORE=1.0000;0.9810;0.9990;NA;SG_DBNSFP(DBNSFP_2.0)_2_POLYPHEN2_HDIV_PRED=D;.;SG_DBNSFP(DBNSFP_2.0)_2_POLYPHEN2_HVAR_SCORE=1.0000;0.7430;0.8600;NA;SG_DBNSFP(DBNSFP_2.0)_2_POLYPHEN2_HVAR_PRED=D;P;.;SG_DBNSFP(DBNSFP_2.0)_2_LRT_SCORE=0.0000;NA;SG_DBNSFP(DBNSFP_2.0)_2_LRT_PRED=D;.;SG_DBNSFP(DBNSFP_2.0)_2_MUTATION_TASTER_SCORE=NA;SG_DBNSFP(DBNSFP_2.0)_2_MUTATION_TASTER_PRED=.;SG_DBNSFP(DBNSFP_2.0)_2_GERP_NR=5.8300;SG_DBNSFP(DBNSFP_2.0)_2_GERP_RS=5.8300;SG_DBNSFP(DBNSFP_2.0)_2_PHYLOP=2.7610;SG_DBNSFP(DBNSFP_2.0)_2_29WAY_LOGODDS=20.1150;SG_DBNSFP(DBNSFP_2.0)_2_LRT_OMEGA=0.0000;NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_AFR_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_EUR_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_AMR_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_1000GP1_ASN_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_ESP5400_AA_AF=NA;SG_DBNSFP(DBNSFP_2.0)_2_ESP5400_EA_AF=NA

Not surprisingly, when the fields don't match exactly, we get all NA:

> info(readVcf(vcfF,'hg19'))[55,c(1:4,20:23)]
DataFrame with 1 row and 8 columns
                       NS        DP            AF        SGACOV SG_Cosmic(Cosmic_68)_68_COSMIC_Count SG_dbNSFP(dbNSFP_2.0)_2_SIFT_score SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HDIV_score
                <integer> <integer> <NumericList> <IntegerList>                            <integer>                          <numeric>                                    <numeric>
2:212587234_C/T         1      1196         0.048            58                                   NA                                 NA                                           NA
                SG_dbNSFP(dbNSFP_2.0)_2_PolyPhen2_HDIV_pred
                                                <character>
2:212587234_C/T                                          NA​


Any ideas on how to overcome this problem with the vcf file?
> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.12.4 Rsamtools_1.18.2         Biostrings_2.34.0        XVector_0.6.0            GenomicRanges_1.18.3     GenomeInfoDb_1.2.3       IRanges_2.0.0            S4Vectors_0.4.0         
 [9] Biobase_2.26.0           BiocGenerics_0.12.1      ROC_1.42.0               sva_3.12.0               genefilter_1.48.1        mgcv_1.8-3               nlme_3.1-118             BiocInstaller_1.16.1    
[17] data.table_1.9.4        

loaded via a namespace (and not attached):
 [1] annotate_1.44.0         AnnotationDbi_1.28.1    base64enc_0.1-2         BatchJobs_1.5           BBmisc_1.8              BiocParallel_1.0.0      biomaRt_2.22.0          bitops_1.0-6           
 [9] brew_1.0-6              BSgenome_1.34.0         checkmate_1.5.0         chron_2.3-45            codetools_0.2-9         DBI_0.3.1               digest_0.6.4            fail_1.2               
[17] foreach_1.4.2           GenomicAlignments_1.2.1 GenomicFeatures_1.18.2  grid_3.1.2              iterators_1.0.7         lattice_0.20-29         Matrix_1.1-4            plyr_1.8.1             
[25] Rcpp_0.11.3             RCurl_1.95-4.3          reshape2_1.4            RSQLite_1.0.0           rtracklayer_1.26.2      sendmailR_1.2-1         splines_3.1.2           stringr_0.6.2          
[33] survival_2.37-7         tools_3.1.2             XML_3.98-1.1            xtable_1.7-4            zlibbioc_1.12.0   
ADD COMMENTlink modified 4.9 years ago • written 4.9 years ago by jls22820
Answer: readVCF (VariantAnnotation) problem with mismatched header and info fields in vc
0
gravatar for jls2282
4.9 years ago by
jls22820
United States
jls22820 wrote:

OK, maybe not the most elegant solution, but here is code that fixes the vcf header on a vcfF file:

 

#Find 1st header block
library(data.table)
row=1
while(substr(fread(vcfF,nrows = 1,skip=row-1,header = F,sep="\n"),1,6)!="##INFO"){
  row=row+1
}
first row<-row
header1<-fread(vcfF,nrows=firstrow-1,header=F,sep='\n')
#Find INFO block
while(substr(fread(vcfF,nrows = 1,skip=row-1,header = F,sep='\n'),1,6)=="##INFO"){
  row=row+1
}
mid row<-row
header2<-data.table(NULL)
for(row in firstrow:(midrow-1)){
  line<-fread(vcfF,nrows=1,skip=row-1,header=F,sep=',')
  line$V1<-toupper(line$V1)
  header2<-c(header2,paste(line,collapse=","))
}
# find last INFO block
row<-midrow
while(substr(fread(vcfF,nrows = 1,skip=row-1,header = F,sep='\n'),1,2)=="##"){
  row=row+1
}
last row<-row
header3<-fread(vcfF,skip=midrow-1,nrows=lastrow-midrow,header=F,sep='\n')
# get tabbed body
body<-fread(vcfF,skip=lastrow-1,header=F,sep='\t')
# write to file
fw<-file(paste0(vcfF,"_fixed"),open = 'wt')
write.table(header1,fw,col.names=F,row.names=F,quote = F)
write.table(unlist(header2),fw,col.names=F,row.names=F,quote = F)
write.table(header3,fw,col.names=F,row.names=F,quote=F)
write.table(body,fw,sep = '\t',col.names = F,row.names=F,quote=F)
close(fw)
ADD COMMENTlink written 4.9 years ago by jls22820
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 136 users visited in the last hour