Problem reading VCF file using readVcf from package VariantAnnotation

0

Entering edit mode

UBod ▴ 290

@ubodenhofer-5425

Last seen 6 hours ago

University of Applied Sciences Upper Au…

Hi, I am trying to read genotype data from a large VCF file using the readVcf() function from the VariantAnnotation package. I am not reading the entire file (which would crash my R session because of a lack of memory). Instead, I am reading bunches of SNV data located in 200kbp regions which I specify by passing a GRanges object to ScanVcfParams() first. No matter what I do, I get the following error message: when the supplied 'genome' vector is named, the names must match the seqnames As far as I can make sense of this message, it seems that there is some mismatch between the genome characteristics in my GRanges object and the genome characteristics in the VCF file. I dissected the R object returned by scanVcfHeader() and indeed found some interesting mismatches: The genome in the VCF file is denoted as "b37" and the sequence names are not 100% compatible with hg19. The lengths of chromosomes 1-22, X, and Y do match, but the lengths of mitochondrial DNA (denoted "M" in gh19 and "MT" in b37) differ by 2. So I forced my GRanges object to be 100% compatible with the information stored in the VCF file (by copying seqlevels, genome, and seqlengths) and restricted my analysis to chromosomes 1-22 and X. However, I still get the same error message. I also tried to locate the error message in the source code of the VariantAnnotation package to understand better what the problem is, but I could not find it. It seems the message is produced by a function that VariantAnnotation calls from another package. Any idea? Thanks in advance and best regards, Ulrich ---------------------------------------------------------------------- -- *Dr. Ulrich Bodenhofer* Associate Professor Institute of Bioinformatics *Johannes Kepler University* Altenberger Str. 69 4040 Linz, Austria Tel. +43 732 2468 4526 Fax +43 732 2468 4539 bodenhofer at bioinf.jku.at <mailto:bodenhofer at="" bioinf.jku.at=""> http://www.bioinf.jku.at/ <http: www.bioinf.jku.at="">

VariantAnnotation VariantAnnotation VariantAnnotation VariantAnnotation • 3.0k views

ADD COMMENT • link updated 11.0 years ago by Valerie Obenchain ★ 6.8k • written 11.0 years ago by UBod ▴ 290

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Hi Ulrich, In addition to reading genomic regions with 'which' in ScanVcfParam(), you can also use 'yieldSize' in TabixFile() to iterate through a VCF file in chunks. See example at the bottom of ?readVcf man page or ?TabixFile. The error you are seeing is thrown from a function in GenomicRanges, specifically .normargSeqlengths(). This function is called when checking compatibility of the Seqinfo object(s). How was the GRanges 'param' created? Did it come from a different genome? I'm not clear on why the GRanges has potentially different chromosome names, lengths and genome? If the VCF file is 'b37', all data (names, lengths, genome if present) in the GRanges must be compatible with that. Please provide a sample of the code you've tried and the output error. Specifically, it would be helpful to see seqinfo(GRanges) and meta(header(vcf)). Valerie On 04/24/2013 04:53 AM, Ulrich Bodenhofer wrote: > Hi, > > I am trying to read genotype data from a large VCF file using the > readVcf() function from the VariantAnnotation package. I am not reading > the entire file (which would crash my R session because of a lack of > memory). Instead, I am reading bunches of SNV data located in 200kbp > regions which I specify by passing a GRanges object to ScanVcfParams() > first. No matter what I do, I get the following error message: > > when the supplied 'genome' vector is named, the names must match the > seqnames > > As far as I can make sense of this message, it seems that there is some > mismatch between the genome characteristics in my GRanges object and the > genome characteristics in the VCF file. I dissected the R object > returned by scanVcfHeader() and indeed found some interesting > mismatches: The genome in the VCF file is denoted as "b37" and the > sequence names are not 100% compatible with hg19. The lengths of > chromosomes 1-22, X, and Y do match, but the lengths of mitochondrial > DNA (denoted "M" in gh19 and "MT" in b37) differ by 2. So I forced my > GRanges object to be 100% compatible with the information stored in the > VCF file (by copying seqlevels, genome, and seqlengths) and restricted > my analysis to chromosomes 1-22 and X. However, I still get the same > error message. > > I also tried to locate the error message in the source code of the > VariantAnnotation package to understand better what the problem is, but > I could not find it. It seems the message is produced by a function that > VariantAnnotation calls from another package. > > Any idea? > > Thanks in advance and best regards, > Ulrich > > > -------------------------------------------------------------------- ---- > *Dr. Ulrich Bodenhofer* > Associate Professor > Institute of Bioinformatics > > *Johannes Kepler University* > Altenberger Str. 69 > 4040 Linz, Austria > > Tel. +43 732 2468 4526 > Fax +43 732 2468 4539 > bodenhofer at bioinf.jku.at <mailto:bodenhofer at="" bioinf.jku.at=""> > http://www.bioinf.jku.at/ <http: www.bioinf.jku.at=""> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 11.0 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie, First of all, thanks a lot for your very quick and helpful reply! Your hint to the function from GenomicRanges helped me to find the problem. So, finally, I can safely say that there is no problem in readVcf that caused this issue. The problem was that I supplied a single character string "b37" as genome argument to readVcf(). I did not pay attention to whether this string has a name attribute (which was, unfortunately, the case). Obviously, readVcf checks whether the names of the 'genome' argument match the sequence names in the VCF file. Knowing that, the error message makes perfect sense to me! I now stripped off the names from the string I am passing as 'genome' argument, and it works great! Interestingly, however, if I pass a whole vector to the 'genome' argument, the error persists even if the sequence names match perfectly. Anyway, I don't care, my solution to pass a single string without a name is absolutely acceptable for me. Thanks again and best regards, Ulrich On 04/24/2013 06:06 PM, Valerie Obenchain wrote: > Hi Ulrich, > > In addition to reading genomic regions with 'which' in ScanVcfParam(), > you can also use 'yieldSize' in TabixFile() to iterate through a VCF > file in chunks. See example at the bottom of ?readVcf man page or > ?TabixFile. > > The error you are seeing is thrown from a function in GenomicRanges, > specifically .normargSeqlengths(). This function is called when > checking compatibility of the Seqinfo object(s). > > How was the GRanges 'param' created? Did it come from a different > genome? I'm not clear on why the GRanges has potentially different > chromosome names, lengths and genome? If the VCF file is 'b37', all > data (names, lengths, genome if present) in the GRanges must be > compatible with that. > > Please provide a sample of the code you've tried and the output error. > Specifically, it would be helpful to see seqinfo(GRanges) and > meta(header(vcf)). > > Valerie > > On 04/24/2013 04:53 AM, Ulrich Bodenhofer wrote: >> Hi, >> >> I am trying to read genotype data from a large VCF file using the >> readVcf() function from the VariantAnnotation package. I am not reading >> the entire file (which would crash my R session because of a lack of >> memory). Instead, I am reading bunches of SNV data located in 200kbp >> regions which I specify by passing a GRanges object to ScanVcfParams() >> first. No matter what I do, I get the following error message: >> >> when the supplied 'genome' vector is named, the names must match the >> seqnames >> >> As far as I can make sense of this message, it seems that there is some >> mismatch between the genome characteristics in my GRanges object and the >> genome characteristics in the VCF file. I dissected the R object >> returned by scanVcfHeader() and indeed found some interesting >> mismatches: The genome in the VCF file is denoted as "b37" and the >> sequence names are not 100% compatible with hg19. The lengths of >> chromosomes 1-22, X, and Y do match, but the lengths of mitochondrial >> DNA (denoted "M" in gh19 and "MT" in b37) differ by 2. So I forced my >> GRanges object to be 100% compatible with the information stored in the >> VCF file (by copying seqlevels, genome, and seqlengths) and restricted >> my analysis to chromosomes 1-22 and X. However, I still get the same >> error message. >> >> I also tried to locate the error message in the source code of the >> VariantAnnotation package to understand better what the problem is, but >> I could not find it. It seems the message is produced by a function that >> VariantAnnotation calls from another package. >> >> Any idea? >> >> Thanks in advance and best regards, >> Ulrich

ADD REPLY • link 11.0 years ago UBod ▴ 290

Login before adding your answer.