VariantAnnotation: Specifying 'seqinfo' at import with 'readVcf'

0

Entering edit mode

Julian Gehring ★ 1.3k

@julian-gehring-5818

Last seen 5.0 years ago

Hi, Is there a direct way to specifiy the 'seqinfo' of a genome for the import of a VCF file using 'readVcf'? I'm aware that one can change it with the 'seqinfo' method afterwards, but for large VCF files this can take a significant amount of time. An alternative would be to sneak it in by the 'which' arguments, such as: readVcf(file, genome, ScanVcfParam(which = as(seq_info, "GRanges"))) but this requires the file to be indexed beforehand. Best wishes Julian

• 1.2k views

ADD COMMENT • link updated 10.6 years ago by Valerie Obenchain ★ 6.8k • written 10.6 years ago by Julian Gehring ★ 1.3k

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 2.3 years ago

United States

Hi Julian, On 09/24/2013 02:29 AM, Julian Gehring wrote: > Hi, > > Is there a direct way to specifiy the 'seqinfo' of a genome for the > import of a VCF file using 'readVcf'? I think the question is how to read in a subset of chromosomes/positions from a vcf file without an accompanying tabix index. You can't. readVcf() requires an index when subsets are defined by chromosome/position. However you can read in subsets defined by INFO and/or GENO fields without an index. Approaches: (1) create index with ?indexTabix and specify 'which' in ScanVcfParam (2) use ?filterVcf to write out a new file of records of interest > I'm aware that one can change it > with the 'seqinfo' method afterwards, but for large VCF files this can > take a significant amount of time. What operation is taking along time? Subsetting the VCF object by chromosome? Valerie > > An alternative would be to sneak it in by the 'which' arguments, such as: > > readVcf(file, genome, ScanVcfParam(which = as(seq_info, "GRanges"))) > > but this requires the file to be indexed beforehand. > > Best wishes > Julian >

ADD COMMENT • link 10.6 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie, In this case, I'm not concerned about reading only a part of the VCF file. When I call 'readVCF', a 'GRanges' object gets created and also the corresponding 'seqinfo' slot. I was trying to find a way to feed the 'seqinfo' information directly in the construction of the VCF, rather than changing it after the VCF object already has been created. Is there a way to do this? Best wishes Julian On 09/24/2013 06:31 PM, Valerie Obenchain wrote: > Hi Julian, > > On 09/24/2013 02:29 AM, Julian Gehring wrote: >> Hi, >> >> Is there a direct way to specifiy the 'seqinfo' of a genome for the >> import of a VCF file using 'readVcf'? > > I think the question is how to read in a subset of chromosomes/positions > from a vcf file without an accompanying tabix index. You can't. > readVcf() requires an index when subsets are defined by > chromosome/position. However you can read in subsets defined by INFO > and/or GENO fields without an index. > > Approaches: > (1) create index with ?indexTabix and specify 'which' in ScanVcfParam > (2) use ?filterVcf to write out a new file of records of interest > >> I'm aware that one can change it >> with the 'seqinfo' method afterwards, but for large VCF files this can >> take a significant amount of time. > > What operation is taking along time? Subsetting the VCF object by > chromosome? > > Valerie > >> >> An alternative would be to sneak it in by the 'which' arguments, such as: >> >> readVcf(file, genome, ScanVcfParam(which = as(seq_info, "GRanges"))) >> >> but this requires the file to be indexed beforehand. >> >> Best wishes >> Julian >> >

ADD REPLY • link 10.6 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

On 09/24/2013 09:36 AM, Julian Gehring wrote: > Hi Valerie, > > In this case, I'm not concerned about reading only a part of the VCF file. > > When I call 'readVCF', a 'GRanges' object gets created and also the > corresponding 'seqinfo' slot. I was trying to find a way to feed the > 'seqinfo' information directly in the construction of the VCF, rather > than changing it after the VCF object already has been created. Is > there a way to do this? Thanks for clarifying. No, there is not currently a way to do this. The 'seqinfo' on the rowData(vcf) should not be difficult to change. Can you provide more detail as to (1) why you are changing it (did readVcf() import something incorrectly or ?) and (2) what operations on the 'seqinfo' are taking a long time. Thanks. Valerie > > Best wishes > Julian > > > On 09/24/2013 06:31 PM, Valerie Obenchain wrote: >> Hi Julian, >> >> On 09/24/2013 02:29 AM, Julian Gehring wrote: >>> Hi, >>> >>> Is there a direct way to specifiy the 'seqinfo' of a genome for the >>> import of a VCF file using 'readVcf'? >> >> I think the question is how to read in a subset of chromosomes/positions >> from a vcf file without an accompanying tabix index. You can't. >> readVcf() requires an index when subsets are defined by >> chromosome/position. However you can read in subsets defined by INFO >> and/or GENO fields without an index. >> >> Approaches: >> (1) create index with ?indexTabix and specify 'which' in ScanVcfParam >> (2) use ?filterVcf to write out a new file of records of interest >> >>> I'm aware that one can change it >>> with the 'seqinfo' method afterwards, but for large VCF files this can >>> take a significant amount of time. >> >> What operation is taking along time? Subsetting the VCF object by >> chromosome? >> >> Valerie >> >>> >>> An alternative would be to sneak it in by the 'which' arguments, such >>> as: >>> >>> readVcf(file, genome, ScanVcfParam(which = as(seq_info, "GRanges"))) >>> >>> but this requires the file to be indexed beforehand. >>> >>> Best wishes >>> Julian >>> >>

ADD REPLY • link 10.6 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Hi Valerie, > Thanks for clarifying. No, there is not currently a way to do this. > > The 'seqinfo' on the rowData(vcf) should not be difficult to change. Can > you provide more detail as to (1) why you are changing it (did readVcf() > import something incorrectly or ?) and (2) what operations on the > 'seqinfo' are taking a long time. 'readVcf' did everything in a correct way. I needed to add the information about the genome, circularity, and seqlengths based on external annotitation, since it is not part of the VCF file. I can't tell you at the moment where 'seqinfo <-' spends most of its time. I'll profile it and get back to you. Thank your for your quick answer and your help! Best wishes Julian > Thanks. > Valerie > >> >> Best wishes >> Julian >> >> >> On 09/24/2013 06:31 PM, Valerie Obenchain wrote: >>> Hi Julian, >>> >>> On 09/24/2013 02:29 AM, Julian Gehring wrote: >>>> Hi, >>>> >>>> Is there a direct way to specifiy the 'seqinfo' of a genome for the >>>> import of a VCF file using 'readVcf'? >>> >>> I think the question is how to read in a subset of chromosomes/positions >>> from a vcf file without an accompanying tabix index. You can't. >>> readVcf() requires an index when subsets are defined by >>> chromosome/position. However you can read in subsets defined by INFO >>> and/or GENO fields without an index. >>> >>> Approaches: >>> (1) create index with ?indexTabix and specify 'which' in ScanVcfParam >>> (2) use ?filterVcf to write out a new file of records of interest >>> >>>> I'm aware that one can change it >>>> with the 'seqinfo' method afterwards, but for large VCF files this can >>>> take a significant amount of time. >>> >>> What operation is taking along time? Subsetting the VCF object by >>> chromosome? >>> >>> Valerie >>> >>>> >>>> An alternative would be to sneak it in by the 'which' arguments, such >>>> as: >>>> >>>> readVcf(file, genome, ScanVcfParam(which = as(seq_info, "GRanges"))) >>>> >>>> but this requires the file to be indexed beforehand. >>>> >>>> Best wishes >>>> Julian >>>> >>> >

ADD REPLY • link 10.6 years ago Julian Gehring ★ 1.3k

0

Entering edit mode

Hi Julian, With the latest devel version of GenomicRanges (1.13.48, should become available via biocLite() in the next 24 hours or so), setting the seqinfo of a big VCF object should be much faster but your mileage may vary. Let us know if you still find it slow enough to justify a mechanism for letting the user specify the seqinfo upfront when using readVcf() (e.g. the 'genome' arg could be renamed 'seqinfo' and accept everything it supports now plus a Seqinfo object). Cheers, H. On 09/24/2013 10:00 AM, Julian Gehring wrote: > Hi Valerie, > >> Thanks for clarifying. No, there is not currently a way to do this. >> >> The 'seqinfo' on the rowData(vcf) should not be difficult to change. Can >> you provide more detail as to (1) why you are changing it (did readVcf() >> import something incorrectly or ?) and (2) what operations on the >> 'seqinfo' are taking a long time. > > 'readVcf' did everything in a correct way. I needed to add the > information about the genome, circularity, and seqlengths based on > external annotitation, since it is not part of the VCF file. > > I can't tell you at the moment where 'seqinfo <-' spends most of its > time. I'll profile it and get back to you. > > Thank your for your quick answer and your help! > > Best wishes > Julian > > >> Thanks. >> Valerie >> >>> >>> Best wishes >>> Julian >>> >>> >>> On 09/24/2013 06:31 PM, Valerie Obenchain wrote: >>>> Hi Julian, >>>> >>>> On 09/24/2013 02:29 AM, Julian Gehring wrote: >>>>> Hi, >>>>> >>>>> Is there a direct way to specifiy the 'seqinfo' of a genome for the >>>>> import of a VCF file using 'readVcf'? >>>> >>>> I think the question is how to read in a subset of >>>> chromosomes/positions >>>> from a vcf file without an accompanying tabix index. You can't. >>>> readVcf() requires an index when subsets are defined by >>>> chromosome/position. However you can read in subsets defined by INFO >>>> and/or GENO fields without an index. >>>> >>>> Approaches: >>>> (1) create index with ?indexTabix and specify 'which' in ScanVcfParam >>>> (2) use ?filterVcf to write out a new file of records of interest >>>> >>>>> I'm aware that one can change it >>>>> with the 'seqinfo' method afterwards, but for large VCF files this can >>>>> take a significant amount of time. >>>> >>>> What operation is taking along time? Subsetting the VCF object by >>>> chromosome? >>>> >>>> Valerie >>>> >>>>> >>>>> An alternative would be to sneak it in by the 'which' arguments, such >>>>> as: >>>>> >>>>> readVcf(file, genome, ScanVcfParam(which = as(seq_info, "GRanges"))) >>>>> >>>>> but this requires the file to be indexed beforehand. >>>>> >>>>> Best wishes >>>>> Julian >>>>> >>>> >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 10.6 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Valerie and Herve, Your latest changes improve the speed of this greatly, thanks for your work! Thinking more generally, do you think it would be worth generalizing the readVcf in the way you described? I'm not just thinking about speed here, but also about a more flexible way of interacting with the functions (given that the user has to specify the genome name anyway). Best wishes Julian On 09/28/2013 12:57 AM, Hervé Pagès wrote: > Hi Julian, > > With the latest devel version of GenomicRanges (1.13.48, should become > available via biocLite() in the next 24 hours or so), setting the > seqinfo of a big VCF object should be much faster but your mileage > may vary. Let us know if you still find it slow enough to justify a > mechanism for letting the user specify the seqinfo upfront when using > readVcf() (e.g. the 'genome' arg could be renamed 'seqinfo' and accept > everything it supports now plus a Seqinfo object). > > Cheers, > H. > > On 09/24/2013 10:00 AM, Julian Gehring wrote: >> Hi Valerie, >> >>> Thanks for clarifying. No, there is not currently a way to do this. >>> >>> The 'seqinfo' on the rowData(vcf) should not be difficult to change. Can >>> you provide more detail as to (1) why you are changing it (did readVcf() >>> import something incorrectly or ?) and (2) what operations on the >>> 'seqinfo' are taking a long time. >> >> 'readVcf' did everything in a correct way. I needed to add the >> information about the genome, circularity, and seqlengths based on >> external annotitation, since it is not part of the VCF file. >> >> I can't tell you at the moment where 'seqinfo <-' spends most of its >> time. I'll profile it and get back to you. >> >> Thank your for your quick answer and your help! >> >> Best wishes >> Julian >> >> >>> Thanks. >>> Valerie >>> >>>> >>>> Best wishes >>>> Julian >>>> >>>> >>>> On 09/24/2013 06:31 PM, Valerie Obenchain wrote: >>>>> Hi Julian, >>>>> >>>>> On 09/24/2013 02:29 AM, Julian Gehring wrote: >>>>>> Hi, >>>>>> >>>>>> Is there a direct way to specifiy the 'seqinfo' of a genome for the >>>>>> import of a VCF file using 'readVcf'? >>>>> >>>>> I think the question is how to read in a subset of >>>>> chromosomes/positions >>>>> from a vcf file without an accompanying tabix index. You can't. >>>>> readVcf() requires an index when subsets are defined by >>>>> chromosome/position. However you can read in subsets defined by INFO >>>>> and/or GENO fields without an index. >>>>> >>>>> Approaches: >>>>> (1) create index with ?indexTabix and specify 'which' in ScanVcfParam >>>>> (2) use ?filterVcf to write out a new file of records of interest >>>>> >>>>>> I'm aware that one can change it >>>>>> with the 'seqinfo' method afterwards, but for large VCF files this >>>>>> can >>>>>> take a significant amount of time. >>>>> >>>>> What operation is taking along time? Subsetting the VCF object by >>>>> chromosome? >>>>> >>>>> Valerie >>>>> >>>>>> >>>>>> An alternative would be to sneak it in by the 'which' arguments, such >>>>>> as: >>>>>> >>>>>> readVcf(file, genome, ScanVcfParam(which = as(seq_info, "GRanges"))) >>>>>> >>>>>> but this requires the file to be indexed beforehand. >>>>>> >>>>>> Best wishes >>>>>> Julian >>>>>> >>>>> >>> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 10.6 years ago Julian Gehring ★ 1.3k

Login before adding your answer.