Hello everyone,
I am trying to read a VCF file with the readVcf function of the VariantAnnotation package in R. Beforehand, I generated a tbi file from my original VCF file using the following commands in the shell:
Step 1, extract the markers of the 1st chromosome:
tabix -h SNPs.vcf.bgz chr1H > chr1.vcf
Step 2, create a tabix index for the VCF file created in step 1:
tabix -p vcf chr1.vcf
The next step is executed in R:
vcffile = open(VcfFile(file = "chr1.vcf",
index = "chr1.vcf.bgz.tbi",
yieldSize = 1000))
vcf = readVcf(vcffile)
This returns the following error:
Error: scanVcf: scanVcf: scanTabix: [internal] hmm.. this doesn't look like a tabix file, sorry
Can anybody tell me the meaning of this error? How can I find out what is wrong with the tabix file?
Thank you!
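One thing that may be worth checking, as an assumption rather than a confirmed diagnosis: tabix can only index bgzip-compressed (BGZF) files, and redirecting the output of `tabix -h` into chr1.vcf writes plain text. The sketch below tests whether a file starts with a BGZF header (gzip magic bytes followed by the "BC" extra subfield). The function name `is_bgzf` is hypothetical and only used for illustration.

```shell
# Return success (0) if the file starts with a BGZF (bgzip) block header.
# is_bgzf is a hypothetical helper name, not part of htslib.
is_bgzf() {
    # Bytes 0-3: gzip magic (1f 8b), deflate method (08), FEXTRA flag (04)
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    # Bytes 12-13: BGZF extra subfield identifier "BC" (hex 42 43)
    tag=$(head -c 14 "$1" | tail -c 2 | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "1f8b0804" ] && [ "$tag" = "4243" ]
}

# Usage (not executed here):
#   is_bgzf chr1.vcf && echo "bgzip-compressed" || echo "not bgzip-compressed"
```

If the check fails for chr1.vcf, compressing it with `bgzip chr1.vcf` and then indexing the resulting chr1.vcf.gz with `tabix -p vcf` would be the usual route. The file and index names passed to VcfFile would then need to match the compressed file.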
Maybe try using VariantAnnotation::indexVcf to create the needed index file?
Thank you for your answer! I am completely new to working with VCF files, so any hint is valuable for me.
returns:
and
returns an error:
Maybe there is an error in the vcf file? In the header of the vcf file it says that the fileformat is version 4.0. It seems that the SNPs are sorted according to their physical position...
Does adding the full name with the tbi at the end of the index file change the output?
It returns
I think there are some additional arguments that can be passed when using indexVcf. The function will take the additional arguments found in indexTabix; perhaps the format would be useful to pass in? It looks like there may be a specific format option for vcf4. See ?indexTabix in R for more information.
OK, this looks better. However, the initial error seems to remain
Next step:
Hmm. I'm wondering whether reading and indexing the original file works, and then filtering for chr1 afterwards? There is more information on doing this in the vignettes found on the landing page. Does creating the index on the full SNPs.vcf.bgz then allow you to read it in?
Or depending on what you intend to do you might be able to load the vcf without specifying the index?
Initially, this was what I intended to do. However, SNPs.vcf.bgz seems to be too large for indexing. Therefore, I came up with the solution to go for each chromosome by itself.
Would you mind sharing a subset of the VCF file so I can try to debug what is going on (lori.shepherd @ roswellpark.org)? To help narrow down what the issue might be: does loading the VCF without the index work?
VcfFile(file = "chr1.vcf")
Thank you for your offer! I will try to generate a subset. Thank you for your support!
As far as I understand, tabix only works for a smaller data set. A CSI index is then recommended if the number of SNPs exceeds a certain threshold.
~/htslib/bin/tabix -C -p vcf SNPs.vcf.bgz
This creates a CSI-indexed file. However, it seems that this cannot be read into R with VariantAnnotation...
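A note on the threshold mentioned above: as far as I know, the TBI limit concerns coordinate size rather than the number of records. The default tabix binning scheme covers positions up to 2^29 (536,870,912) bp, so a longer chromosome needs a CSI index. The sketch below reports the largest POS in an uncompressed VCF and compares it with that limit. The function names `max_pos` and `exceeds_tbi_limit` are hypothetical, and the input is assumed to be tab-delimited with POS in column 2.

```shell
# Print the largest POS (column 2) among the non-header lines of an
# uncompressed, tab-delimited VCF. max_pos is a hypothetical helper name.
max_pos() {
    awk -F '\t' '!/^#/ && ($2 + 0) > max { max = $2 + 0 } END { printf "%d\n", max }' "$1"
}

# Return success (0) if the largest position lies beyond 2^29 bp,
# the coordinate range covered by a default TBI index.
exceeds_tbi_limit() {
    [ "$(max_pos "$1")" -gt 536870912 ]
}

# Usage (not executed here):
#   exceeds_tbi_limit chr1.vcf && echo "needs CSI" || echo "TBI should be sufficient"
```

For a bgzip-compressed file, the same awk command could read from `bgzip -dc SNPs.vcf.bgz` instead of taking a file name.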
See Answer: VCF File too large for tabix. Best option to make it usable with R? for further discussion of ingestion of CSI-indexed VCF