seqlevels in VCF objects
2
0
Entering edit mode
Murat Tasan ▴ 70
@murat-tasan-5676
Last seen 9.6 years ago
hi all - i've encountered an annoying problem, and i'd like to avoid read/writing the many GBs required for the blunt-force solution... the 1000 Genomes project provides a collection of VCF files providing the genotypes for all found variants. after reading in the VCF files (via vcf <- readVcf(...)), i have an VCF object, but the info(vcf) object reveals the chromosome names (i.e. 'seqlevels') are "1", "2", ..., "X", "Y". Bioconductor's TxDb.Hsapiens.UCSC.hg19.knownGene object, however, uses the UCSC standard prefix for chromosome names: "chr1", "chr2", etc. in trying to subset(...) or predictCoding(...) the VCF data against the genome objects (including BSgenome.Hsapiens.UCSC.hg19) this causes an obvious failure. i tried re-setting the seqlevels of the VCF 'info' object like so (thinking the seqnames factor just indexes back on the seqlevels as a key): seqlevels(vcf at info) <- sprintf("chr%s", seqlevels(vcf at info)) but this doesn't seem to have any effect. any idea on how to make this bulk change of seqnames for data in VCF objects? cheers, -m
BSgenome BSgenome genomes BSgenome BSgenome genomes • 2.3k views
ADD COMMENT
0
Entering edit mode
@valerie-obenchain-4275
Last seen 2.3 years ago
United States
On 01/14/13 15:40, Murat Tasan wrote: > hi all - i've encountered an annoying problem, and i'd like to avoid > read/writing the many GBs required for the blunt-force solution... > > the 1000 Genomes project provides a collection of VCF files providing > the genotypes for all found variants. > after reading in the VCF files (via vcf<- readVcf(...)), i have an > VCF object, but the info(vcf) object reveals the chromosome names > (i.e. 'seqlevels') are "1", "2", ..., "X", "Y". > Bioconductor's TxDb.Hsapiens.UCSC.hg19.knownGene object, however, uses > the UCSC standard prefix for chromosome names: "chr1", "chr2", etc. > > in trying to subset(...) or predictCoding(...) the VCF data against > the genome objects (including BSgenome.Hsapiens.UCSC.hg19) this causes > an obvious failure. > > i tried re-setting the seqlevels of the VCF 'info' object like so > (thinking the seqnames factor just indexes back on the seqlevels as a > key): > > seqlevels(vcf at info)<- sprintf("chr%s", seqlevels(vcf at info)) > > but this doesn't seem to have any effect. > > any idea on how to make this bulk change of seqnames for data in VCF objects? There are examples of how to do this in the vignette and on the ?VCF and ?predictCoding man pages. In ?VCF there is a section in the examples dedicated to "Renaming seqlevels and subsetting". Renaming all seqlevels in a VCF can be done with renameSeqlevels() or seqlevels() and subsetting a VCF on seqlevels can be done with keepSeqlevels(). > fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") > vcf <- readVcf(fl, "hg19") > seqlevels(vcf) [1] "20" > seqlevels(vcf) <- "chr20" > seqlevels(vcf) [1] "chr20" Valerie > > cheers, > > -m > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
ah, great, thanks! somehow i was intuitively expecting that i'd have to set seqlevels on the 'info' sub-object, rather than the VCF object itself. thanks much! -m On Mon, Jan 14, 2013 at 9:25 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> wrote: > On 01/14/13 15:40, Murat Tasan wrote: >> >> hi all - i've encountered an annoying problem, and i'd like to avoid >> read/writing the many GBs required for the blunt-force solution... >> >> the 1000 Genomes project provides a collection of VCF files providing >> the genotypes for all found variants. >> after reading in the VCF files (via vcf<- readVcf(...)), i have an >> VCF object, but the info(vcf) object reveals the chromosome names >> (i.e. 'seqlevels') are "1", "2", ..., "X", "Y". >> Bioconductor's TxDb.Hsapiens.UCSC.hg19.knownGene object, however, uses >> the UCSC standard prefix for chromosome names: "chr1", "chr2", etc. >> >> in trying to subset(...) or predictCoding(...) the VCF data against >> the genome objects (including BSgenome.Hsapiens.UCSC.hg19) this causes >> an obvious failure. >> >> i tried re-setting the seqlevels of the VCF 'info' object like so >> (thinking the seqnames factor just indexes back on the seqlevels as a >> key): >> >> seqlevels(vcf at info)<- sprintf("chr%s", seqlevels(vcf at info)) >> >> but this doesn't seem to have any effect. >> >> any idea on how to make this bulk change of seqnames for data in VCF >> objects? > > > > There are examples of how to do this in the vignette and on the ?VCF and > ?predictCoding man pages. > In ?VCF there is a section in the examples dedicated to "Renaming seqlevels > and subsetting". > Renaming all seqlevels in a VCF can be done with renameSeqlevels() or > seqlevels() and subsetting a VCF on > seqlevels can be done with keepSeqlevels(). > >> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation") >> vcf <- readVcf(fl, "hg19") >> seqlevels(vcf) > [1] "20" >> seqlevels(vcf) <- "chr20" >> seqlevels(vcf) > [1] "chr20" > > > Valerie > > >> >> cheers, >> >> -m >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > >
ADD REPLY
0
Entering edit mode
Paul Leo ▴ 970
@paul-leo-2092
Last seen 9.6 years ago
There is a discussion of something similar in the "GenomicRanges Use Cases.pdf" (in the genomicRanges library) see section 3.1 of the pdf: You can change the seqleveles directly: seqlevels(exonRanges) <- newlvls # reorder the levels names(newlvls) <- seqlevels(aligns) Dr Paul Leo Senior Bioinformatician The University of Queensland Diamantina Institute --------------------------------------------------------------------- TRI, level , 37 Kent Street, Woolloongabba QLD 4102 Tel: +61 7 3443 7072 Mob: 041 303 8691 Fax: +61 7 3443 6966 -----Original Message----- From: Murat Tasan <mmuurr@gmail.com> To: bioconductor at r-project.org Subject: [BioC] seqlevels in VCF objects Date: Mon, 14 Jan 2013 18:40:46 -0500 hi all - i've encountered an annoying problem, and i'd like to avoid read/writing the many GBs required for the blunt-force solution... the 1000 Genomes project provides a collection of VCF files providing the genotypes for all found variants. after reading in the VCF files (via vcf <- readVcf(...)), i have an VCF object, but the info(vcf) object reveals the chromosome names (i.e. 'seqlevels') are "1", "2", ..., "X", "Y". Bioconductor's TxDb.Hsapiens.UCSC.hg19.knownGene object, however, uses the UCSC standard prefix for chromosome names: "chr1", "chr2", etc. in trying to subset(...) or predictCoding(...) the VCF data against the genome objects (including BSgenome.Hsapiens.UCSC.hg19) this causes an obvious failure. i tried re-setting the seqlevels of the VCF 'info' object like so (thinking the seqnames factor just indexes back on the seqlevels as a key): seqlevels(vcf at info) <- sprintf("chr%s", seqlevels(vcf at info)) but this doesn't seem to have any effect. any idea on how to make this bulk change of seqnames for data in VCF objects? cheers, -m _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT

Login before adding your answer.

Traffic: 874 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6