Renaming the seqlevels in a transcript database from GenomicFeatures

4

Entering edit mode

Fong Chun Chan ▴ 320

@fong-chun-chan-5706

Last seen 10.2 years ago

Hi, I am using the GenomicFeatures package to extract exons from a transcript database file. I am using the ensembl transcript database which has no chr, yet my bam files that I am working have the chr prefix. I was thinking one could append chr to the seqlevels in the transcript database like this: library('GenomicFeatures') txdb <- loadDb(txdbFile) newSeqNames <- paste('chr', seqlevels(txdb), sep = '') names(newSeqNames) <- seqlevels(txdb) txdb <- renameSeqlevels( txdb, newSeqNames ) seqlevels(txdb) [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" "chr11" "chr12" [13] "chr13" "chr14" This appears to work. But when I run: transcripts(txdb, vals = list( tx_chrom = "chr1") ) GRanges with 0 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> --- seqlengths: It suggests that the chromosome renaming is not working...because when I run: transcripts(txdb, vals = list( tx_chrom = '1') ) GRanges with 17239 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> [1] chr1 [11869, 14409] + | 126213 ENST00000456328 Has anyone had any luck in getting this to work? Thanks, [[alternative HTML version deleted]]

GenomicFeatures GenomicFeatures • 6.5k views

ADD COMMENT • link updated 10.4 years ago by Sonali Arora ▴ 390 • written 10.4 years ago by Fong Chun Chan ▴ 320

4

Entering edit mode

Sonali Arora ▴ 390

@sonali-arora-6563

Last seen 8.7 years ago

United States

Hi Fong, A TranscriptDb cannot be directly subset with 'keepSeqlevels' and 'dropSeqlevels'. But the GRanges or GRangesLists extracted from the TranscriptDb can be subset. Below I show an example for how to extract GRanges, subset and change the chromosome name style from "UCSC" to "NCBI" ( ie from "chr1" to "1") ## load packages library(GenomicFeatures) library(TxDb.Hsapiens.UCSC.hg18.knownGene) ## extract GRanges txdb <- TxDb.Hsapiens.UCSC.hg18.knownGene seqlevels(txdb) txbygene <- transcriptsBy(txdb, "gene") seqlevels(txbygene) ## subset to get only ranges from chr 1 to 22 auto <- extractSeqlevelsByGroup(species="Homo sapiens", style="UCSC", group="auto") txbygene <- keepSeqlevels(txbygene,auto) seqlevels(txbygene) ##convert GRangesList to GRanges ucsc_gr <- unlist(txbygene) ## renaming styles. newStyle <- mapSeqlevels(seqlevels(ucsc_gr),"NCBI") ncbi_gr <- renameSeqlevels(ucsc_gr, newStyle) ##check that naming was successful ncbi_gr[seqnames(ncbi_gr)=="1"] ucsc_gr[seqnames(ucsc_gr)=="chr1"] You can easily adapt this example for your scenario - to convert from "1" to "chr1". For more details : Please look at examples section of: http://www.bioconductor.org/packages/devel/bioc/vignettes/GenomeInfoDb /inst/doc/GenomeInfoDb.pdf and the man page for ?keepSeqlevels. Thanks and Regards, Sonali. On 6/26/2014 2:02 PM, Fong Chun Chan wrote: > Hi, > > I am using the GenomicFeatures package to extract exons from a transcript > database file. I am using the ensembl transcript database which has no chr, > yet my bam files that I am working have the chr prefix. > > I was thinking one could append chr to the seqlevels in the transcript > database like this: > > library('GenomicFeatures') > txdb <- loadDb(txdbFile) > newSeqNames <- paste('chr', seqlevels(txdb), sep = '') > names(newSeqNames) <- seqlevels(txdb) > txdb <- renameSeqlevels( txdb, newSeqNames ) > seqlevels(txdb) > > [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" > "chr11" "chr12" [13] "chr13" "chr14" > > This appears to work. > > But when I run: > > transcripts(txdb, vals = list( tx_chrom = "chr1") ) > > GRanges with 0 ranges and 2 metadata columns: > seqnames ranges strand | tx_id tx_name > <rle> <iranges> <rle> | <integer> <character> > --- > seqlengths: > > It suggests that the chromosome renaming is not working...because when I > run: > > transcripts(txdb, vals = list( tx_chrom = '1') ) > > GRanges with 17239 ranges and 2 metadata columns: > seqnames ranges strand | tx_id > tx_name > <rle> <iranges> <rle> | <integer> > <character> > [1] chr1 [11869, 14409] + | 126213 > ENST00000456328 > > Has anyone had any luck in getting this to work? > > Thanks, > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Thanks and Regards, Sonali

ADD COMMENT • link 10.4 years ago Sonali Arora ▴ 390

0

Entering edit mode

Hi Sonali, I'm pretty sure what Fong has reported is a legitimate bug. The seqlevels are mapped at some level, but the values for tx_chrom have not been updated. Also, what you say about keepSeqlevels and TranscriptDb is not true, at least for devel (and this stuff was written a year ago): > seqlevels(keepSeqlevels(txdb, "1")) [1] "1" Michael On Fri, Jun 27, 2014 at 12:18 PM, Sonali Arora <sarora@fhcrc.org> wrote: > Hi Fong, > > A TranscriptDb cannot be directly subset with 'keepSeqlevels' and > 'dropSeqlevels'. > But the GRanges or GRangesLists extracted from the TranscriptDb can be > subset. > Below I show an example for how to extract GRanges, subset and change the > chromosome name style from "UCSC" to "NCBI" ( ie from "chr1" to "1") > > ## load packages > library(GenomicFeatures) > library(TxDb.Hsapiens.UCSC.hg18.knownGene) > > ## extract GRanges > txdb <- TxDb.Hsapiens.UCSC.hg18.knownGene > seqlevels(txdb) > txbygene <- transcriptsBy(txdb, "gene") > seqlevels(txbygene) > > ## subset to get only ranges from chr 1 to 22 > auto <- extractSeqlevelsByGroup(species="Homo sapiens", style="UCSC", > group="auto") > txbygene <- keepSeqlevels(txbygene,auto) > seqlevels(txbygene) > > ##convert GRangesList to GRanges > ucsc_gr <- unlist(txbygene) > > ## renaming styles. > newStyle <- mapSeqlevels(seqlevels(ucsc_gr),"NCBI") > ncbi_gr <- renameSeqlevels(ucsc_gr, newStyle) > > ##check that naming was successful > ncbi_gr[seqnames(ncbi_gr)=="1"] > ucsc_gr[seqnames(ucsc_gr)=="chr1"] > > You can easily adapt this example for your scenario - to convert from > "1" to "chr1". For more details : Please look at examples section of: > http://www.bioconductor.org/packages/devel/bioc/vignettes/ > GenomeInfoDb/inst/doc/GenomeInfoDb.pdf > and the man page for ?keepSeqlevels. > > Thanks and Regards, > Sonali. > > > > > On 6/26/2014 2:02 PM, Fong Chun Chan wrote: > >> Hi, >> >> I am using the GenomicFeatures package to extract exons from a transcript >> database file. I am using the ensembl transcript database which has no >> chr, >> yet my bam files that I am working have the chr prefix. >> >> I was thinking one could append chr to the seqlevels in the transcript >> database like this: >> >> library('GenomicFeatures') >> txdb <- loadDb(txdbFile) >> newSeqNames <- paste('chr', seqlevels(txdb), sep = '') >> names(newSeqNames) <- seqlevels(txdb) >> txdb <- renameSeqlevels( txdb, newSeqNames ) >> seqlevels(txdb) >> >> [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" >> "chr11" "chr12" [13] "chr13" "chr14" >> >> This appears to work. >> >> But when I run: >> >> transcripts(txdb, vals = list( tx_chrom = "chr1") ) >> >> GRanges with 0 ranges and 2 metadata columns: >> seqnames ranges strand | tx_id tx_name >> <rle> <iranges> <rle> | <integer> <character> >> --- >> seqlengths: >> >> It suggests that the chromosome renaming is not working...because when I >> run: >> >> transcripts(txdb, vals = list( tx_chrom = '1') ) >> >> GRanges with 17239 ranges and 2 metadata columns: >> seqnames ranges strand | tx_id >> tx_name >> <rle> <iranges> <rle> | <integer> >> <character> >> [1] chr1 [11869, 14409] + | 126213 >> ENST00000456328 >> >> Has anyone had any luck in getting this to work? >> >> Thanks, >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> > > -- > Thanks and Regards, > Sonali > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane. > science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD REPLY • link 10.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi guys, Indeed, 'transcripts(txdb, vals=list(tx_chrom="some_chrom"))' is broken when the "effective" seqlevels are different from the "native" ones. More precisely, the transcripts(), exons(), and cds() extractors are not translating the user-supplied seqlevels to native seqlevels before injecting them in the SQL query they generate. We need to fix that. We will post again here when it's done. Thanks for the report. H. On 06/27/2014 07:53 PM, Michael Lawrence wrote: > Hi Sonali, > > I'm pretty sure what Fong has reported is a legitimate bug. The seqlevels > are mapped at some level, but the values for tx_chrom have not been > updated. > > Also, what you say about keepSeqlevels and TranscriptDb is not true, at > least for devel (and this stuff was written a year ago): > >> seqlevels(keepSeqlevels(txdb, "1")) > [1] "1" > > Michael > > > On Fri, Jun 27, 2014 at 12:18 PM, Sonali Arora <sarora at="" fhcrc.org=""> wrote: > >> Hi Fong, >> >> A TranscriptDb cannot be directly subset with 'keepSeqlevels' and >> 'dropSeqlevels'. >> But the GRanges or GRangesLists extracted from the TranscriptDb can be >> subset. >> Below I show an example for how to extract GRanges, subset and change the >> chromosome name style from "UCSC" to "NCBI" ( ie from "chr1" to "1") >> >> ## load packages >> library(GenomicFeatures) >> library(TxDb.Hsapiens.UCSC.hg18.knownGene) >> >> ## extract GRanges >> txdb <- TxDb.Hsapiens.UCSC.hg18.knownGene >> seqlevels(txdb) >> txbygene <- transcriptsBy(txdb, "gene") >> seqlevels(txbygene) >> >> ## subset to get only ranges from chr 1 to 22 >> auto <- extractSeqlevelsByGroup(species="Homo sapiens", style="UCSC", >> group="auto") >> txbygene <- keepSeqlevels(txbygene,auto) >> seqlevels(txbygene) >> >> ##convert GRangesList to GRanges >> ucsc_gr <- unlist(txbygene) >> >> ## renaming styles. >> newStyle <- mapSeqlevels(seqlevels(ucsc_gr),"NCBI") >> ncbi_gr <- renameSeqlevels(ucsc_gr, newStyle) >> >> ##check that naming was successful >> ncbi_gr[seqnames(ncbi_gr)=="1"] >> ucsc_gr[seqnames(ucsc_gr)=="chr1"] >> >> You can easily adapt this example for your scenario - to convert from >> "1" to "chr1". For more details : Please look at examples section of: >> http://www.bioconductor.org/packages/devel/bioc/vignettes/ >> GenomeInfoDb/inst/doc/GenomeInfoDb.pdf >> and the man page for ?keepSeqlevels. >> >> Thanks and Regards, >> Sonali. >> >> >> >> >> On 6/26/2014 2:02 PM, Fong Chun Chan wrote: >> >>> Hi, >>> >>> I am using the GenomicFeatures package to extract exons from a transcript >>> database file. I am using the ensembl transcript database which has no >>> chr, >>> yet my bam files that I am working have the chr prefix. >>> >>> I was thinking one could append chr to the seqlevels in the transcript >>> database like this: >>> >>> library('GenomicFeatures') >>> txdb <- loadDb(txdbFile) >>> newSeqNames <- paste('chr', seqlevels(txdb), sep = '') >>> names(newSeqNames) <- seqlevels(txdb) >>> txdb <- renameSeqlevels( txdb, newSeqNames ) >>> seqlevels(txdb) >>> >>> [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" "chr10" >>> "chr11" "chr12" [13] "chr13" "chr14" >>> >>> This appears to work. >>> >>> But when I run: >>> >>> transcripts(txdb, vals = list( tx_chrom = "chr1") ) >>> >>> GRanges with 0 ranges and 2 metadata columns: >>> seqnames ranges strand | tx_id tx_name >>> <rle> <iranges> <rle> | <integer> <character> >>> --- >>> seqlengths: >>> >>> It suggests that the chromosome renaming is not working...because when I >>> run: >>> >>> transcripts(txdb, vals = list( tx_chrom = '1') ) >>> >>> GRanges with 17239 ranges and 2 metadata columns: >>> seqnames ranges strand | tx_id >>> tx_name >>> <rle> <iranges> <rle> | <integer> >>> <character> >>> [1] chr1 [11869, 14409] + | 126213 >>> ENST00000456328 >>> >>> Has anyone had any luck in getting this to work? >>> >>> Thanks, >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane. >>> science.biology.informatics.conductor >>> >> >> -- >> Thanks and Regards, >> Sonali >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane. >> science.biology.informatics.conductor >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 10.4 years ago Hervé Pagès 16k

Login before adding your answer.