readVcf does not recognize tabix index
0
0
Entering edit mode
naive • 0
@e3519b66
Last seen 14 months ago

Hello everyone,

I am trying to read a vcf file with the readVcf command of the package VariantAnnotation with R. Before, I generated a tbi file from my original vcf file using the following commadns in the shell:

1. extract the markers of the 1st chromosome tabix -h SNPs.vcf.bgz chr1H > chr1.vcf

2. create tabix index for the vcf file created in step 1 tabix -p vcf chr1.vcf

The next step is executed in R

vcffile = open(VcfFile(file = "chr1.vcf",
index = "chr1.vcf.bgz.tbi",
yieldSize = 1000))


This returns the following error: Error: scanVcf: scanVcf: scanTabix: [internal] hmm.. this doesn't look like a tabix file, sorry

Can anybody tell the meaning of this error? How can I find out what is wrong with the tabix file?

Thank you!

VariantAnnotation readvcf tabix • 952 views
0
Entering edit mode

Maybe try using VariantAnnotation::indexVcf to create the needed index file?

0
Entering edit mode

Thank you for your answer! I am completely new to working with vcf files so any hint is valuable for me.

 indexVcf(x = "chr1.vcf.bgz")


returns:

class: VcfFile
path: chr1.vcf.bgz
index: chr1.vcf.bgz.tbi
isOpen: FALSE
yieldSize: NA


and

vcffile = open(VcfFile(file = "chr1.vcf",
index = indexVcf(x = "chr1.vcf.bgz" ),
yieldSize = 1000))


returns an error:

Error in open.TabixFile(VcfFile(file = "chr1.vcf", index = indexVcf(x = "chr1.vcf.bgz"),  :
failed to open index file: chr1.vcf.bgz
In is.na(index) : is.na() applied to non-(list or vector) of type 'S4'


Maybe there is an error in the vcf file? In the header of the vcf file it says that the fileformat is version 4.0. It seems that the SNPs are sorted according to their physical position...

0
Entering edit mode

Does adding the full name with the tbi at the end of the index file change the output?

vcffile = open(VcfFile(file = "chr1.vcf",
index = indexVcf(x = "chr1.vcf.bgz.tbi" ),
yieldSize = 1000))

0
Entering edit mode

It returns

[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: "TBI?"
Warning message:
In is.na(index) : is.na() applied to non-(list or vector) of type 'S4'

0
Entering edit mode

think there are some additional arguments that can be passed when using indexVcf . The function will take the additional arguments found in indexTabix perhaps the format would be useful to pass in? It looks like there may be a specific format option for vcf4. See ?indexTabix in R for more information

0
Entering edit mode

OK, this looks better. However, the initial error seems to remain

vcffile = open(VcfFile(file = "chr1.vcf",
index = indexTabix(file = "chr1.vcf.bgz", format = "vcf4"),
yieldSize = 1000))
> vcffile
class: VcfFile
path: chr1.vcf
index: chr1.vcf.bgz.tbi
isOpen: TRUE


Next step:

>   vcf = readVcf(vcffile, "chr1H")
Error: scanVcf: scanVcf: scanTabix: [internal] hmm.. this doesn't look like a tabix file, sorry
path: chr1.vcf
index: chr1.vcf.bgz.tbi
path: chr1.vcf

0
Entering edit mode

hmm. I'm wondering if the reading and indexing of the original file works? And then filter for chr1 after? There is more information on doing this in the vignettes found on the landing page . Does creating the index on the full SNPs.vcf.bgz then allow you to read in?
Or depending on what you intend to do you might be able to load the vcf without specifying the index?

0
Entering edit mode

Initially, this was what I intended to do. However, SNPs.vcf.bgz seems to be too large for indexing. Therefore, I came up with the solution to go for each chromosome by itself.

1
Entering edit mode

Would you mind sharing a subset of the vcf file so I can try and debug what is going on (lori.shepherd @ roswellpark.org). So I can narrow it down what the issue might be, does loading the vcf without the index works? VcfFile(file = "chr1.vcf")

0
Entering edit mode

Thank you for your offer! I will try to generate a subset. Thank you for your support!

> VcfFile(file = "chr1.vcf")
class: VcfFile
path: chr1.vcf
index: NA
isOpen: FALSE
yieldSize: NA

0
Entering edit mode

As far as I understand, tabix only works for a smaller data set. CSI index is then recommended, if the number of SNPs excedes a certain threshold.

~/htslib/bin/tabix -C -p vcf SNPs.vcf.bgz

this creates a CSI indexed file. However, it seems that this cannot be read into R with VariantAnnotation...

0
Entering edit mode

See Answer: VCF File too large for tabix. Best option to make it usable with R? for further discussion of ingestion of CSI-indexed VCF