VariantAnnotation: Specifying 'seqinfo' at import with 'readVcf'

0

Entering edit mode

Julian Gehring ★ 1.3k

@julian-gehring-5818

Last seen 5.0 years ago

Hi Valerie and Herve, When providing a 'Seqinfo' object as the 'genome' argument to 'readVcf', would it be possible to not have the requirement of the VCF header and the seqinfo object to have the same ordering? As the moment (VariantAnnotation_1.8.2), the import will fail if this is not the case. However, this requires the user to know the ordering of the seqnames in the VCF file beforehand. In the ideal case, the readVcf function could reorder the seqnames in the way as provided by the 'genome' argument. Best wishes Julian On 09/28/2013 12:57 AM, Hervé Pagès wrote: > Hi Julian, > > With the latest devel version of GenomicRanges (1.13.48, should become > available via biocLite() in the next 24 hours or so), setting the > seqinfo of a big VCF object should be much faster but your mileage > may vary. Let us know if you still find it slow enough to justify a > mechanism for letting the user specify the seqinfo upfront when using > readVcf() (e.g. the 'genome' arg could be renamed 'seqinfo' and accept > everything it supports now plus a Seqinfo object). > > Cheers, > H. > > On 09/24/2013 10:00 AM, Julian Gehring wrote: >> Hi Valerie, >> >>> Thanks for clarifying. No, there is not currently a way to do this. >>> >>> The 'seqinfo' on the rowData(vcf) should not be difficult to change. Can >>> you provide more detail as to (1) why you are changing it (did readVcf() >>> import something incorrectly or ?) and (2) what operations on the >>> 'seqinfo' are taking a long time. >> >> 'readVcf' did everything in a correct way. I needed to add the >> information about the genome, circularity, and seqlengths based on >> external annotitation, since it is not part of the VCF file. >> >> I can't tell you at the moment where 'seqinfo <-' spends most of its >> time. I'll profile it and get back to you. >> >> Thank your for your quick answer and your help! >> >> Best wishes >> Julian >> >> >>> Thanks. >>> Valerie >>> >>>> >>>> Best wishes >>>> Julian >>>> >>>> >>>> On 09/24/2013 06:31 PM, Valerie Obenchain wrote: >>>>> Hi Julian, >>>>> >>>>> On 09/24/2013 02:29 AM, Julian Gehring wrote: >>>>>> Hi, >>>>>> >>>>>> Is there a direct way to specifiy the 'seqinfo' of a genome for the >>>>>> import of a VCF file using 'readVcf'? >>>>> >>>>> I think the question is how to read in a subset of >>>>> chromosomes/positions >>>>> from a vcf file without an accompanying tabix index. You can't. >>>>> readVcf() requires an index when subsets are defined by >>>>> chromosome/position. However you can read in subsets defined by INFO >>>>> and/or GENO fields without an index. >>>>> >>>>> Approaches: >>>>> (1) create index with ?indexTabix and specify 'which' in ScanVcfParam >>>>> (2) use ?filterVcf to write out a new file of records of interest >>>>> >>>>>> I'm aware that one can change it >>>>>> with the 'seqinfo' method afterwards, but for large VCF files this >>>>>> can >>>>>> take a significant amount of time. >>>>> >>>>> What operation is taking along time? Subsetting the VCF object by >>>>> chromosome? >>>>> >>>>> Valerie >>>>> >>>>>> >>>>>> An alternative would be to sneak it in by the 'which' arguments, such >>>>>> as: >>>>>> >>>>>> readVcf(file, genome, ScanVcfParam(which = as(seq_info, "GRanges"))) >>>>>> >>>>>> but this requires the file to be indexed beforehand. >>>>>> >>>>>> Best wishes >>>>>> Julian >>>>>> >>>>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >

GenomicRanges GenomicRanges • 1.0k views

ADD COMMENT • link updated 10.5 years ago by Michael Lawrence ★ 11k • written 10.5 years ago by Julian Gehring ★ 1.3k

0

Entering edit mode

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

This was my doing. I guess merge,Seqinfo,Seqinfo does not support different orderings? Would help to see the actual error message and stack trace. Any advice Herve? Michael On Sun, Nov 3, 2013 at 6:55 AM, Julian Gehring <julian.gehring@embl.de>wrote: > Hi Valerie and Herve, > > When providing a 'Seqinfo' object as the 'genome' argument to 'readVcf', > would it be possible to not have the requirement of the VCF header and the > seqinfo object to have the same ordering? As the moment > (VariantAnnotation_1.8.2), the import will fail if this is not the case. > However, this requires the user to know the ordering of the seqnames in > the VCF file beforehand. In the ideal case, the readVcf function could > reorder the seqnames in the way as provided by the 'genome' argument. > > > Best wishes > Julian > > > On 09/28/2013 12:57 AM, Hervé Pagès wrote: > >> Hi Julian, >> >> With the latest devel version of GenomicRanges (1.13.48, should become >> available via biocLite() in the next 24 hours or so), setting the >> seqinfo of a big VCF object should be much faster but your mileage >> may vary. Let us know if you still find it slow enough to justify a >> mechanism for letting the user specify the seqinfo upfront when using >> readVcf() (e.g. the 'genome' arg could be renamed 'seqinfo' and accept >> everything it supports now plus a Seqinfo object). >> >> Cheers, >> H. >> >> On 09/24/2013 10:00 AM, Julian Gehring wrote: >> >>> Hi Valerie, >>> >>> Thanks for clarifying. No, there is not currently a way to do this. >>>> >>>> The 'seqinfo' on the rowData(vcf) should not be difficult to change. Can >>>> you provide more detail as to (1) why you are changing it (did readVcf() >>>> import something incorrectly or ?) and (2) what operations on the >>>> 'seqinfo' are taking a long time. >>>> >>> >>> 'readVcf' did everything in a correct way. I needed to add the >>> information about the genome, circularity, and seqlengths based on >>> external annotitation, since it is not part of the VCF file. >>> >>> I can't tell you at the moment where 'seqinfo <-' spends most of its >>> time. I'll profile it and get back to you. >>> >>> Thank your for your quick answer and your help! >>> >>> Best wishes >>> Julian >>> >>> >>> Thanks. >>>> Valerie >>>> >>>> >>>>> Best wishes >>>>> Julian >>>>> >>>>> >>>>> On 09/24/2013 06:31 PM, Valerie Obenchain wrote: >>>>> >>>>>> Hi Julian, >>>>>> >>>>>> On 09/24/2013 02:29 AM, Julian Gehring wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> Is there a direct way to specifiy the 'seqinfo' of a genome for the >>>>>>> import of a VCF file using 'readVcf'? >>>>>>> >>>>>> >>>>>> I think the question is how to read in a subset of >>>>>> chromosomes/positions >>>>>> from a vcf file without an accompanying tabix index. You can't. >>>>>> readVcf() requires an index when subsets are defined by >>>>>> chromosome/position. However you can read in subsets defined by INFO >>>>>> and/or GENO fields without an index. >>>>>> >>>>>> Approaches: >>>>>> (1) create index with ?indexTabix and specify 'which' in ScanVcfParam >>>>>> (2) use ?filterVcf to write out a new file of records of interest >>>>>> >>>>>> I'm aware that one can change it >>>>>>> with the 'seqinfo' method afterwards, but for large VCF files this >>>>>>> can >>>>>>> take a significant amount of time. >>>>>>> >>>>>> >>>>>> What operation is taking along time? Subsetting the VCF object by >>>>>> chromosome? >>>>>> >>>>>> Valerie >>>>>> >>>>>> >>>>>>> An alternative would be to sneak it in by the 'which' arguments, such >>>>>>> as: >>>>>>> >>>>>>> readVcf(file, genome, ScanVcfParam(which = as(seq_info, "GRanges"))) >>>>>>> >>>>>>> but this requires the file to be indexed beforehand. >>>>>>> >>>>>>> Best wishes >>>>>>> Julian >>>>>>> >>>>>>> >>>>>> >>>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >> >> > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 10.5 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael, As a minimal example, I'll try to import the non-coding COSMIC VCF (ftp://ngs.sanger.ac.uk/production/cosmic/CosmicNonCodingVariants_v67_ 20131024.vcf.gz) with a NCBI 37 seqinfo object. #+BEGIN_SRC R > seq_info Seqinfo of length 25 seqnames seqlengths isCircular genome 1 249250621 FALSE GRCh37 2 243199373 FALSE GRCh37 3 198022430 FALSE GRCh37 4 191154276 FALSE GRCh37 5 180915260 FALSE GRCh37 ... ... ... ... 21 48129895 FALSE GRCh37 22 51304566 FALSE GRCh37 X 155270560 FALSE GRCh37 Y 59373566 FALSE GRCh37 MT 16569 TRUE GRCh37 > x = readVcf("CosmicNonCodingVariants_v67_20131024.vcf.gz", seq_info) Warning in .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames Error in makeNewSeqnames(x, new2old = new2old, seqlevels(value)) : when 'new2old' is NULL, the first elements in the supplied 'seqlevels' must be identical to 'seqlevels(x)' > traceback() 8: stop("when 'new2old' is NULL, the first elements in the\n", " supplied 'seqlevels' must be identical to 'seqlevels(x)'") 7: makeNewSeqnames(x, new2old = new2old, seqlevels(value)) 6: `seqinfo<-`(`*tmp*`, value = <s4 object="" of="" class="" "seqinfo"="">) 5: `seqinfo<-`(`*tmp*`, value = <s4 object="" of="" class="" "seqinfo"="">) 4: .scanVcfToVCF(scanVcf(file, param = param), file, genome, param) 3: .readVcf(file, genome, param = ScanVcfParam()) 2: readVcf("CosmicNonCodingVariants_v67_20131024.vcf.gz", seq_info) 1: readVcf("CosmicNonCodingVariants_v67_20131024.vcf.gz", seq_info) #+END_SRC Please let me know if you need any more information. Best wishes Julian On 11/03/2013 07:12 PM, Michael Lawrence wrote: > This was my doing. I guess merge,Seqinfo,Seqinfo does not support > different orderings? Would help to see the actual error message and > stack trace. Any advice Herve? > > Michael > > > On Sun, Nov 3, 2013 at 6:55 AM, Julian Gehring <julian.gehring at="" embl.de=""> <mailto:julian.gehring at="" embl.de="">> wrote: > > Hi Valerie and Herve, > > When providing a 'Seqinfo' object as the 'genome' argument to > 'readVcf', would it be possible to not have the requirement of the > VCF header and the seqinfo object to have the same ordering? As the > moment (VariantAnnotation_1.8.2), the import will fail if this is > not the case. However, this requires the user to know the ordering > of the seqnames in the VCF file beforehand. In the ideal case, the > readVcf function could reorder the seqnames in the way as provided > by the 'genome' argument. > > > Best wishes > Julian > > > On 09/28/2013 12:57 AM, Hervé Pagès wrote: > > Hi Julian, > > With the latest devel version of GenomicRanges (1.13.48, should > become > available via biocLite() in the next 24 hours or so), setting the > seqinfo of a big VCF object should be much faster but your mileage > may vary. Let us know if you still find it slow enough to justify a > mechanism for letting the user specify the seqinfo upfront when > using > readVcf() (e.g. the 'genome' arg could be renamed 'seqinfo' and > accept > everything it supports now plus a Seqinfo object). > > Cheers, > H. > > On 09/24/2013 10:00 AM, Julian Gehring wrote: > > Hi Valerie, > > Thanks for clarifying. No, there is not currently a way > to do this. > > The 'seqinfo' on the rowData(vcf) should not be > difficult to change. Can > you provide more detail as to (1) why you are changing > it (did readVcf() > import something incorrectly or ?) and (2) what > operations on the > 'seqinfo' are taking a long time. > > > 'readVcf' did everything in a correct way. I needed to add the > information about the genome, circularity, and seqlengths > based on > external annotitation, since it is not part of the VCF file. > > I can't tell you at the moment where 'seqinfo <-' spends > most of its > time. I'll profile it and get back to you. > > Thank your for your quick answer and your help! > > Best wishes > Julian > > > Thanks. > Valerie > > > Best wishes > Julian > > > On 09/24/2013 06:31 PM, Valerie Obenchain wrote: > > Hi Julian, > > On 09/24/2013 02:29 AM, Julian Gehring wrote: > > Hi, > > Is there a direct way to specifiy the > 'seqinfo' of a genome for the > import of a VCF file using 'readVcf'? > > > I think the question is how to read in a subset of > chromosomes/positions > from a vcf file without an accompanying tabix > index. You can't. > readVcf() requires an index when subsets are > defined by > chromosome/position. However you can read in > subsets defined by INFO > and/or GENO fields without an index. > > Approaches: > (1) create index with ?indexTabix and specify > 'which' in ScanVcfParam > (2) use ?filterVcf to write out a new file of > records of interest > > I'm aware that one can change it > with the 'seqinfo' method afterwards, but > for large VCF files this > can > take a significant amount of time. > > > What operation is taking along time? Subsetting > the VCF object by > chromosome? > > Valerie > > > An alternative would be to sneak it in by > the 'which' arguments, such > as: > > readVcf(file, genome, ScanVcfParam(which = > as(seq_info, "GRanges"))) > > but this requires the file to be indexed > beforehand. > > Best wishes > Julian > > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > >

ADD REPLY • link 10.5 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

Herve, Do you think seqinfo<-() could be made smarter, i.e., when new2old is NULL, it could just assume the equivalent of seqlevels(x) <- seqlevels(merged), which should do what we want in this case, since merge(seqinfo(x), seqinfo(y)) makes it so seqlevels(x) precede those unique to seqlevels(y)? Michael On Sun, Nov 3, 2013 at 12:41 PM, Julian Gehring <julian.gehring@embl.de>wrote: > Hi Michael, > > As a minimal example, I'll try to import the non-coding COSMIC VCF ( > ftp://ngs.sanger.ac.uk/production/cosmic/CosmicNonCodingVariants_v67_ > 20131024.vcf.gz) with a NCBI 37 seqinfo object. > > #+BEGIN_SRC R > > > seq_info > Seqinfo of length 25 > seqnames seqlengths isCircular genome > 1 249250621 FALSE GRCh37 > 2 243199373 FALSE GRCh37 > 3 198022430 FALSE GRCh37 > 4 191154276 FALSE GRCh37 > 5 180915260 FALSE GRCh37 > ... ... ... ... > 21 48129895 FALSE GRCh37 > 22 51304566 FALSE GRCh37 > X 155270560 FALSE GRCh37 > Y 59373566 FALSE GRCh37 > MT 16569 TRUE GRCh37 > > > x = readVcf("CosmicNonCodingVariants_v67_20131024.vcf.gz", seq_info) > Warning in .bcfHeaderAsSimpleList(header) : > duplicate keys in header will be forced to unique rownames > Error in makeNewSeqnames(x, new2old = new2old, seqlevels(value)) : > when 'new2old' is NULL, the first elements in the > supplied 'seqlevels' must be identical to 'seqlevels(x)' > > > traceback() > > 8: stop("when 'new2old' is NULL, the first elements in the\n", " supplied > 'seqlevels' must be identical to 'seqlevels(x)'") > 7: makeNewSeqnames(x, new2old = new2old, seqlevels(value)) > 6: `seqinfo<-`(`*tmp*`, value = <s4 object="" of="" class="" "seqinfo"="">) > 5: `seqinfo<-`(`*tmp*`, value = <s4 object="" of="" class="" "seqinfo"="">) > 4: .scanVcfToVCF(scanVcf(file, param = param), file, genome, param) > 3: .readVcf(file, genome, param = ScanVcfParam()) > 2: readVcf("CosmicNonCodingVariants_v67_20131024.vcf.gz", seq_info) > 1: readVcf("CosmicNonCodingVariants_v67_20131024.vcf.gz", seq_info) > > #+END_SRC > > Please let me know if you need any more information. > > Best wishes > Julian > > > > On 11/03/2013 07:12 PM, Michael Lawrence wrote: > >> This was my doing. I guess merge,Seqinfo,Seqinfo does not support >> different orderings? Would help to see the actual error message and >> stack trace. Any advice Herve? >> >> Michael >> >> >> On Sun, Nov 3, 2013 at 6:55 AM, Julian Gehring <julian.gehring@embl.de>> <mailto:julian.gehring@embl.de>> wrote: >> >> Hi Valerie and Herve, >> >> When providing a 'Seqinfo' object as the 'genome' argument to >> 'readVcf', would it be possible to not have the requirement of the >> VCF header and the seqinfo object to have the same ordering? As the >> moment (VariantAnnotation_1.8.2), the import will fail if this is >> not the case. However, this requires the user to know the ordering >> of the seqnames in the VCF file beforehand. In the ideal case, the >> readVcf function could reorder the seqnames in the way as provided >> by the 'genome' argument. >> >> >> Best wishes >> Julian >> >> >> On 09/28/2013 12:57 AM, Hervé Pagès wrote: >> >> Hi Julian, >> >> With the latest devel version of GenomicRanges (1.13.48, should >> become >> available via biocLite() in the next 24 hours or so), setting the >> seqinfo of a big VCF object should be much faster but your mileage >> may vary. Let us know if you still find it slow enough to justify >> a >> mechanism for letting the user specify the seqinfo upfront when >> using >> readVcf() (e.g. the 'genome' arg could be renamed 'seqinfo' and >> accept >> everything it supports now plus a Seqinfo object). >> >> Cheers, >> H. >> >> On 09/24/2013 10:00 AM, Julian Gehring wrote: >> >> Hi Valerie, >> >> Thanks for clarifying. No, there is not currently a way >> to do this. >> >> The 'seqinfo' on the rowData(vcf) should not be >> difficult to change. Can >> you provide more detail as to (1) why you are changing >> it (did readVcf() >> import something incorrectly or ?) and (2) what >> operations on the >> 'seqinfo' are taking a long time. >> >> >> 'readVcf' did everything in a correct way. I needed to add >> the >> information about the genome, circularity, and seqlengths >> based on >> external annotitation, since it is not part of the VCF file. >> >> I can't tell you at the moment where 'seqinfo <-' spends >> most of its >> time. I'll profile it and get back to you. >> >> Thank your for your quick answer and your help! >> >> Best wishes >> Julian >> >> >> Thanks. >> Valerie >> >> >> Best wishes >> Julian >> >> >> On 09/24/2013 06:31 PM, Valerie Obenchain wrote: >> >> Hi Julian, >> >> On 09/24/2013 02:29 AM, Julian Gehring wrote: >> >> Hi, >> >> Is there a direct way to specifiy the >> 'seqinfo' of a genome for the >> import of a VCF file using 'readVcf'? >> >> >> I think the question is how to read in a subset of >> chromosomes/positions >> from a vcf file without an accompanying tabix >> index. You can't. >> readVcf() requires an index when subsets are >> defined by >> chromosome/position. However you can read in >> subsets defined by INFO >> and/or GENO fields without an index. >> >> Approaches: >> (1) create index with ?indexTabix and specify >> 'which' in ScanVcfParam >> (2) use ?filterVcf to write out a new file of >> records of interest >> >> I'm aware that one can change it >> with the 'seqinfo' method afterwards, but >> for large VCF files this >> can >> take a significant amount of time. >> >> >> What operation is taking along time? Subsetting >> the VCF object by >> chromosome? >> >> Valerie >> >> >> An alternative would be to sneak it in by >> the 'which' arguments, such >> as: >> >> readVcf(file, genome, ScanVcfParam(which = >> as(seq_info, "GRanges"))) >> >> but this requires the file to be indexed >> beforehand. >> >> Best wishes >> Julian >> >> >> >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-project.org>> > >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.__science.biology.informatics.__ >> conductor >> <http: news.gmane.org="" gmane.science.biology.informatics.="">> conductor> >> >> >> >> _________________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-project.org> >> https://stat.ethz.ch/mailman/__listinfo/bioconductor >> >> <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: >> http://news.gmane.org/gmane.__science.biology.informatics.__conductor< >> http://news.gmane.org/gmane.science.biology.informatics.conductor> >> >> >> [[alternative HTML version deleted]]

ADD REPLY • link 10.5 years ago Michael Lawrence ★ 11k

Login before adding your answer.