Question

VariantAnnotation VCF read error

JK ▴ 10

Hi,

By running

(vcf <- readVcf(vcffile, "hg38"))

in VariantAnnotation I get an error message

Error in DataFrame(Samples = seq_along(colnms), row.names = colnms) : duplicate row names

What may be causing this? I am not sure how I ended up with duplicate names in the VCF file. My VCF file was generated by merging several files using the vcftools vcf-merge function. Could this be the problem?

Thank you!

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.12.2 Rsamtools_1.18.1         Biostrings_2.34.0        XVector_0.6.0            GenomicRanges_1.18.1     GenomeInfoDb_1.2.2      
 [7] IRanges_2.0.0            S4Vectors_0.4.0          GWASTools_1.12.0         gdsfmt_1.0.4             ncdf_1.6.8               Biobase_2.26.0          
[13] BiocGenerics_0.12.0     

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.28.1    base64enc_0.1-2         BatchJobs_1.4           BBmisc_1.7              BiocParallel_1.0.0      biomaRt_2.22.0         
 [7] bitops_1.0-6            brew_1.0-6              BSgenome_1.34.0         checkmate_1.4           codetools_0.2-9         DBI_0.3.1              
[13] digest_0.6.4            DNAcopy_1.40.0          fail_1.2                foreach_1.4.2           GenomicAlignments_1.2.0 GenomicFeatures_1.18.2 
[19] grid_3.1.1              GWASExactHW_1.01        iterators_1.0.7         lattice_0.20-29         lmtest_0.9-33           quantreg_5.05          
[25] quantsmooth_1.32.0      RCurl_1.95-4.3          RSQLite_0.11.4          rtracklayer_1.26.1      sandwich_2.3-2          sendmailR_1.2-1        
[31] SparseM_1.05            splines_3.1.1           stringr_0.6.2           survival_2.37-7         tools_3.1.1             XML_3.98-1.1           
[37] zlibbioc_1.12.0         zoo_1.7-11

variantannotation readvcf • 2.4k views

ADD COMMENT • link updated 11.2 years ago by Valerie Obenchain ★ 6.8k • written 11.2 years ago by JK ▴ 10

score 0 · Answer 1 · 2014-11-03

Valerie Obenchain ★ 6.8k

Hi Jozsef,

The duplicate names are likely in the header FORMAT fields. If the file isn't too big, open it in an editor and look at the header lines marked with FORMAT. According to the VCF spec, the 'ID' key must be unique within a field type: lines starting with ##INFO should all have different 'ID' keys, and the same applies to lines starting with ##FORMAT.

You can try reading in just the header information, but if you have duplicate fields you may get an error:

> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> hdr <- scanVcfHeader(fl)

Another approach is to scan in the data and then look at the names of the FORMAT (i.e., 'geno') fields:

> scn <- scanVcf(fl)
> names(scn[[1]]$GENO)
[1] "GT" "GQ" "DP" "HQ"
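The uniqueness rule for 'ID' keys can also be checked on the raw header text without loading the file into VariantAnnotation. The following is a minimal base R sketch, not part of the package: the header lines are made up for illustration and `duplicate_ids()` is a hypothetical helper. On a real file the lines could be obtained with something like `grep("^##", readLines(vcf_path), value = TRUE)`, assuming the file is small enough to read whole.

```r
# Made-up header lines for illustration only.
hdr_lines <- c(
  "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Total depth\">",
  "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
  "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Read depth\">",
  "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"
)

# Hypothetical helper: return the ID keys that occur more than once
# among the header lines of one field type ("INFO" or "FORMAT").
duplicate_ids <- function(lines, field) {
  pattern <- paste0("^##", field, "=<ID=([^,>]+).*$")
  hits <- grep(pattern, lines, value = TRUE)   # lines of this field type
  ids <- sub(pattern, "\\1", hits)             # keep only the ID value
  unique(ids[duplicated(ids)])
}

dup_format <- duplicate_ids(hdr_lines, "FORMAT")  # "GT" is repeated
dup_info   <- duplicate_ids(hdr_lines, "INFO")    # nothing repeated
```

Note that 'DP' appearing once under INFO and once under FORMAT is not flagged, since uniqueness is checked per field type.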

Valerie

ADD COMMENT • link 11.2 years ago Valerie Obenchain ★ 6.8k

Valerie,

Thank you.

This is what I have:

> names(scn[[1]]$GENO)
[1] "GT" "AD" "DP" "GQ" "PL"
> hdr
class: VCFHeader 
samples(18): sample AGP002_output_filtered_sample ... AGP046_output_filtered_sample AGP061_output_filtered_sample
meta(2): fileformat reference
fixed(1): FILTER
info(18): AF BaseQRankSum ... AC AN
geno(5): GT AD DP GQ PL

Jozsef

ADD REPLY • link 11.2 years ago JK ▴ 10

Can you send the file (or a small portion of it) to me off-line? (vobencha@fhcrc.org)

Valerie

ADD REPLY • link 11.2 years ago Valerie Obenchain ★ 6.8k

Thanks for sending the file. (Testing done with VariantAnnotation 1.13.5 in devel.)

The duplicates were in the sample names, not the FORMAT field; sorry for steering you wrong there. You can view the samples by calling samples() on the header object:

> hdr <- scanVcfHeader(fl)
> samples(hdr)
 [1] "sample"                        "AGP002_output_filtered_sample"
 [3] "AGP003_output_filtered_sample" "AGP004_output_filtered_sample"
 [5] "AGP004_output_filtered_sample" "AGP007_output_filtered_sample"
 [7] "AGP009_output_filtered_sample" "AGP012_output_filtered_sample"
 [9] "AGP013_output_filtered_sample" "AGP022_output_filtered_sample"
[11] "AGP025_output_filtered_sample" "AGP027_output_filtered_sample"
[13] "AGP029_output_filtered_sample" "AGP040_output_filtered_sample"
[15] "AGP041_output_filtered_sample" "AGP044_output_filtered_sample"
[17] "AGP046_output_filtered_sample" "AGP061_output_filtered_sample"

The duplicated entry is 'AGP004_output_filtered_sample', which appears at positions 4 and 5. Also, it looks like the first 'sample' is missing a prefix. Maybe it should have 'AGP001_output_filtered' in front?

You can more easily see the duplicates with a self-match (the 5th name matches back to position 4, so 4 appears twice and 5 is missing):

> match(samples(hdr), samples(hdr))
 [1]  1  2  3  4  4  6  7  8  9 10 11 12 13 14 15 16 17 18
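As an alternative to the self-match, base R's `duplicated()` names the offending entries directly. This is a minimal sketch using the first six sample names from the output above; with VariantAnnotation loaded, `smp` would instead come from `samples(hdr)`. The `make.unique()` step is only one possible repair and is an assumption about what is wanted, since it renames columns rather than fixing how the file was produced.

```r
# First six sample names from the thread's samples(hdr) output.
smp <- c("sample",
         "AGP002_output_filtered_sample",
         "AGP003_output_filtered_sample",
         "AGP004_output_filtered_sample",
         "AGP004_output_filtered_sample",
         "AGP007_output_filtered_sample")

# duplicated() flags the second and later occurrences of a value.
dup_names <- unique(smp[duplicated(smp)])

# Positions of every occurrence of a duplicated name.
dup_pos <- which(smp %in% dup_names)

# Possible repair (assumption): make.unique() appends ".1", ".2", ...
# to repeated names so that all names become distinct.
fixed <- make.unique(smp)
```

Here `dup_names` is "AGP004_output_filtered_sample" and `dup_pos` is 4 and 5, in agreement with the self-match.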

There are also many extra tabs at the end of most header lines. You can see these by inspecting the header with meta(). Because VCF files are tab-delimited, these tabs are not treated as plain white space, so it would be good to remove them.

> names(meta(hdr))
[1] "META"                                                          
[2] "FILTER\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"         
[3] "FORMAT\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"         
...
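One way to strip the trailing tabs is to rewrite the meta lines in base R before reading the file. This is a rough sketch under stated assumptions: the file content below is invented for illustration, the file is small enough to hold in memory with `readLines()`, and only lines starting with "##" are modified so that the "#CHROM" line and the data records stay as they are.

```r
# Write a tiny VCF-like header with trailing tabs to a temporary file
# (made-up content, only to illustrate the cleanup).
raw <- c("##fileformat=VCFv4.1\t\t\t",
         "##FILTER=<ID=PASS,Description=\"All filters passed\">\t\t\t",
         "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2")
infile <- tempfile(fileext = ".vcf")
writeLines(raw, infile)

lines <- readLines(infile)

# Only touch meta lines (those starting with "##"); remove any run of
# tabs at the end of each of them.
is_meta <- startsWith(lines, "##")
lines[is_meta] <- sub("\t+$", "", lines[is_meta])

# Write the cleaned copy to a new file rather than overwriting the input.
outfile <- tempfile(fileext = ".vcf")
writeLines(lines, outfile)
cleaned <- readLines(outfile)
```

Writing to a separate output file keeps the original available for comparison if the cleaned copy still fails to load.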

(FYI, the release version of VariantAnnotation returns a DataFrame for meta(hdr) instead of the DataFrameList you'll see in devel. It's just a different packaging of the same information.)

Let me know if you still have problems after cleaning up the extra tabs and fixing the sample names.

Valerie

ADD REPLY • link 11.2 years ago Valerie Obenchain ★ 6.8k

Thanks. That was an error in the script used to generate this file.

ADD REPLY • link 11.2 years ago JK ▴ 10