GenomicFeatures: makeTranscriptDbFromUCSC on "refGene" supported?

0

Entering edit mode

Erik van den Akker ▴ 50

@erik-van-den-akker-4165

Last seen 9.6 years ago

Hi all, I'm a PhD student in bioinformatics working at the Leiden University Medical Center and at the Delft University of Technlogy in the Netherlands. Currently I'm working on the vizualization of genome wide data sources, such as Linkage, GWAS & Expression data. In order to be able to quickely access information on gene locations (along with the UTR, CDS, exons etc), I thought it would be a good idea to make use of the GenomicFeatures package. This package works perfectly and very quickely for the example provided in the vignette (good job!): > library(GenomicFeatures) > system.time(mm9KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "knownGene")) user system elapsed 49.50 0.69 100.05 > mm9KG TranscriptDb object: | Db type: TranscriptDb | Data source: UCSC | Genome: mm9 | UCSC Table: knownGene | Type of Gene ID: Entrez Gene ID | Full dataset: yes | transcript_nrow: 49409 | exon_nrow: 237551 | cds_nrow: 204831 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2010-07-14 14:07:54 +0200 (Wed, 14 Jul 2010) | GenomicFeatures version at creation time: 1.0.3 | RSQLite version at creation time: 0.9-1 And even for larger databases(humans), this works perfectly: > system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "knownGene")) user system elapsed 82.09 1.11 162.53 > hg19KG TranscriptDb object: | Db type: TranscriptDb | Data source: UCSC | Genome: hg19 | UCSC Table: knownGene | Type of Gene ID: Entrez Gene ID | Full dataset: yes | transcript_nrow: 77614 | exon_nrow: 281605 | cds_nrow: 236664 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2010-07-14 14:11:03 +0200 (Wed, 14 Jul 2010) | GenomicFeatures version at creation time: 1.0.3 | RSQLite version at creation time: 0.9-1 However, for tablename = "refGene" I had to shoot down my R session after half an hour for both the settings genome = "mm9" & genome = "hg19" > system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "refGene")) > system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "refGene")) As this package makes use of functionalities provided by rtracklayer, before the actual SQLite db is stored, I verified whether this was working correctly: > library(rtracklayer) > session <- browserSession() > genome(session) <- "hg19" > query <- ucscTableQuery(session,"refGene") > system.time(Table <- getTable(query)) user system elapsed 7.70 0.39 61.73 Typing "head(Table)" gave the expected results, suggesting that something is not working correctly in creating the SQLite databases. So, my question: Given that refGene pops up when using supportedUCSCtables(), I wondered: 1) Did I do something wrong?; 2) should I just have more patience & 3) could anyone confirm these problems? And @PackageMaintainers: If this is a genuine bug, are you planning to fix this or speed things up? As I work with gene expression data, which are commonly annotated to either RefSeqIDs or Ensembl Transcript IDs, I would prefer to work with TranscriptDBs based on these features. Although I can think of many work around solutions using "knownGene" I would prefer to work with the package as originally intended and hence this post. Thanks for the work already done on this great package! Cheerz, Erik van den Akker > sessionInfo() R version 2.11.1 (2010-05-31) i386-pc-mingw32 locale: [1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C [5] LC_TIME=Dutch_Netherlands.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] rtracklayer_1.8.1 RCurl_1.4-2 bitops_1.0-4.1 GenomicFeatures_1.0.3 GenomicRanges_1.0.5 IRanges_1.6.8 loaded via a namespace (and not attached): [1] Biobase_2.8.0 biomaRt_2.4.0 Biostrings_2.16.7 BSgenome_1.16.5 DBI_0.2-5 RSQLite_0.9-1 tools_2.11.1 XML_3.1-0 [[alternative HTML version deleted]]

rtracklayer GenomicFeatures rtracklayer GenomicFeatures • 1.7k views

ADD COMMENT • link updated 13.8 years ago by Vincent J. Carey, Jr. 6.7k • written 13.8 years ago by Erik van den Akker ▴ 50

0

Entering edit mode

Vincent J. Carey, Jr. 6.7k

@vincent-j-carey-jr-4

Last seen 6 weeks ago

United States

it seems to me that the speed issues are probably related to UCSC server availability at the time of your session, which might depend on external factors. however i can confirm a problem with the refGene request. I got further than a timeout first i get a timing similar to yours for knownGene-- > system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename + = "knownGene")) user system elapsed 65.585 0.675 200.614 then (and this event at line 26744 is literally reproducible) > system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename + = "refGene" ) + ) Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 26744 did not have 8 elements Timing stopped at: 1.597 0.067 207.762 The GenomicFeatures developers will comment later. > sessionInfo() R version 2.12.0 Under development (unstable) (2010-06-30 r52417) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices datasets utils methods base other attached packages: [1] GenomicFeatures_1.1.5 GenomicRanges_1.1.15 IRanges_1.7.9 loaded via a namespace (and not attached): [1] BSgenome_1.17.5 Biobase_2.9.0 Biostrings_2.17.12 DBI_0.2-5 [5] RCurl_1.4-2 RSQLite_0.9-1 XML_3.1-0 biomaRt_2.5.1 [9] rtracklayer_1.9.3 tools_2.12.0 On Wed, Jul 14, 2010 at 8:51 AM, Erik van den Akker <erikvandenakker at="" gmail.com=""> wrote: > Hi all, > > I'm a PhD student in bioinformatics working at the Leiden University Medical > > Center and at the Delft University of Technlogy in the Netherlands. > Currently > I'm working on the vizualization of genome wide data sources, such as > Linkage, > GWAS & Expression data. > In order to be able to quickely access information on gene locations (along > with the UTR, CDS, exons etc), I thought it would be a good idea to make use > > of the GenomicFeatures package. This package works perfectly and very > quickely > for the example provided in the vignette (good job!): > >> library(GenomicFeatures) >> system.time(mm9KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = > "knownGene")) > ? user ?system elapsed > ?49.50 ? ?0.69 ?100.05 > >> mm9KG > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: mm9 > | UCSC Table: knownGene > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 49409 > | exon_nrow: 237551 > | cds_nrow: 204831 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2010-07-14 14:07:54 +0200 (Wed, 14 Jul 2010) > | GenomicFeatures version at creation time: 1.0.3 > | RSQLite version at creation time: 0.9-1 > > > And even for larger databases(humans), this works perfectly: > >> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename > = "knownGene")) > ? user ?system elapsed > ?82.09 ? ?1.11 ?162.53 > >> hg19KG > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: hg19 > | UCSC Table: knownGene > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 77614 > | exon_nrow: 281605 > | cds_nrow: 236664 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2010-07-14 14:11:03 +0200 (Wed, 14 Jul 2010) > | GenomicFeatures version at creation time: 1.0.3 > | RSQLite version at creation time: 0.9-1 > > However, for tablename = "refGene" I had to shoot down my R session after > half an hour for both the settings genome = "mm9" & genome = "hg19" > >> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = > "refGene")) > >> system.time(hg19KG <- makeTranscriptDbFromUCSC(genome = "hg19", tablename > = "refGene")) > > As this package makes use of functionalities provided by rtracklayer, before > > the actual SQLite db is stored, I verified whether this was working > correctly: > >> library(rtracklayer) >> session ?<- browserSession() >> genome(session) <- "hg19" >> query <- ucscTableQuery(session,"refGene") >> system.time(Table <- getTable(query)) > ?user ?system elapsed > ? 7.70 ? ?0.39 ? 61.73 > > Typing "head(Table)" gave the expected results, suggesting that something > is not working correctly in creating the SQLite databases. > > So, my question: > Given that refGene pops up when using supportedUCSCtables(), > I wondered: > 1) Did I do something wrong?; 2) should I just have more patience & 3) could > anyone > confirm these problems? > And > @PackageMaintainers: If this is a genuine bug, are you planning to fix this > or speed things up? > > As I work with gene expression data, which are commonly annotated to either > RefSeqIDs or Ensembl Transcript IDs, I would prefer to work with > TranscriptDBs > based on these features. Although I can think of many work around solutions > using "knownGene" I would prefer to work with the package as originally > intended > and hence this post. > > Thanks for the work already done on this great package! > > Cheerz, > > Erik van den Akker > > >> sessionInfo() > R version 2.11.1 (2010-05-31) > i386-pc-mingw32 > > locale: > [1] LC_COLLATE=Dutch_Netherlands.1252 ?LC_CTYPE=Dutch_Netherlands.1252 > LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C > [5] LC_TIME=Dutch_Netherlands.1252 > > attached base packages: > [1] stats ? ? graphics ?grDevices utils ? ? datasets ?methods ? base > > other attached packages: > [1] rtracklayer_1.8.1 ? ? RCurl_1.4-2 ? ? ? ? ? bitops_1.0-4.1 > GenomicFeatures_1.0.3 GenomicRanges_1.0.5 ? IRanges_1.6.8 > > loaded via a namespace (and not attached): > [1] Biobase_2.8.0 ? ? biomaRt_2.4.0 ? ? Biostrings_2.16.7 BSgenome_1.16.5 > DBI_0.2-5 ? ? ? ? RSQLite_0.9-1 ? ? tools_2.11.1 ? ? ?XML_3.1-0 > > ? ? ? ?[[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 13.8 years ago Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

Hi Erik, Vince, I'm puzzled by this. Populating the db is made using prepared statements which are usually very fast. I'm investigating and will let you know. Thanks for the report. H. On 07/14/2010 06:30 AM, Vincent Carey wrote: > it seems to me that the speed issues are probably related to UCSC > server availability at the time of your session, which might depend on > external factors. > > however i can confirm a problem with the refGene request. I got > further than a timeout > > first i get a timing similar to yours for knownGene-- > >> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = "hg19", tablename > + = "knownGene")) > user system elapsed > 65.585 0.675 200.614 > > then (and this event at line 26744 is literally reproducible) > >> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = "hg19", tablename > + = "refGene" ) > + ) > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : > line 26744 did not have 8 elements > Timing stopped at: 1.597 0.067 207.762 > > The GenomicFeatures developers will comment later. > >> sessionInfo() > R version 2.12.0 Under development (unstable) (2010-06-30 r52417) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices datasets utils methods base > > other attached packages: > [1] GenomicFeatures_1.1.5 GenomicRanges_1.1.15 IRanges_1.7.9 > > loaded via a namespace (and not attached): > [1] BSgenome_1.17.5 Biobase_2.9.0 Biostrings_2.17.12 DBI_0.2-5 > [5] RCurl_1.4-2 RSQLite_0.9-1 XML_3.1-0 biomaRt_2.5.1 > [9] rtracklayer_1.9.3 tools_2.12.0 > > > On Wed, Jul 14, 2010 at 8:51 AM, Erik van den Akker > <erikvandenakker at="" gmail.com=""> wrote: >> Hi all, >> >> I'm a PhD student in bioinformatics working at the Leiden University Medical >> >> Center and at the Delft University of Technlogy in the Netherlands. >> Currently >> I'm working on the vizualization of genome wide data sources, such as >> Linkage, >> GWAS& Expression data. >> In order to be able to quickely access information on gene locations (along >> with the UTR, CDS, exons etc), I thought it would be a good idea to make use >> >> of the GenomicFeatures package. This package works perfectly and very >> quickely >> for the example provided in the vignette (good job!): >> >>> library(GenomicFeatures) >>> system.time(mm9KG<- makeTranscriptDbFromUCSC(genome = "mm9", tablename = >> "knownGene")) >> user system elapsed >> 49.50 0.69 100.05 >> >>> mm9KG >> TranscriptDb object: >> | Db type: TranscriptDb >> | Data source: UCSC >> | Genome: mm9 >> | UCSC Table: knownGene >> | Type of Gene ID: Entrez Gene ID >> | Full dataset: yes >> | transcript_nrow: 49409 >> | exon_nrow: 237551 >> | cds_nrow: 204831 >> | Db created by: GenomicFeatures package from Bioconductor >> | Creation time: 2010-07-14 14:07:54 +0200 (Wed, 14 Jul 2010) >> | GenomicFeatures version at creation time: 1.0.3 >> | RSQLite version at creation time: 0.9-1 >> >> >> And even for larger databases(humans), this works perfectly: >> >>> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = "hg19", tablename >> = "knownGene")) >> user system elapsed >> 82.09 1.11 162.53 >> >>> hg19KG >> TranscriptDb object: >> | Db type: TranscriptDb >> | Data source: UCSC >> | Genome: hg19 >> | UCSC Table: knownGene >> | Type of Gene ID: Entrez Gene ID >> | Full dataset: yes >> | transcript_nrow: 77614 >> | exon_nrow: 281605 >> | cds_nrow: 236664 >> | Db created by: GenomicFeatures package from Bioconductor >> | Creation time: 2010-07-14 14:11:03 +0200 (Wed, 14 Jul 2010) >> | GenomicFeatures version at creation time: 1.0.3 >> | RSQLite version at creation time: 0.9-1 >> >> However, for tablename = "refGene" I had to shoot down my R session after >> half an hour for both the settings genome = "mm9"& genome = "hg19" >> >>> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = "mm9", tablename = >> "refGene")) >> >>> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = "hg19", tablename >> = "refGene")) >> >> As this package makes use of functionalities provided by rtracklayer, before >> >> the actual SQLite db is stored, I verified whether this was working >> correctly: >> >>> library(rtracklayer) >>> session<- browserSession() >>> genome(session)<- "hg19" >>> query<- ucscTableQuery(session,"refGene") >>> system.time(Table<- getTable(query)) >> user system elapsed >> 7.70 0.39 61.73 >> >> Typing "head(Table)" gave the expected results, suggesting that something >> is not working correctly in creating the SQLite databases. >> >> So, my question: >> Given that refGene pops up when using supportedUCSCtables(), >> I wondered: >> 1) Did I do something wrong?; 2) should I just have more patience& 3) could >> anyone >> confirm these problems? >> And >> @PackageMaintainers: If this is a genuine bug, are you planning to fix this >> or speed things up? >> >> As I work with gene expression data, which are commonly annotated to either >> RefSeqIDs or Ensembl Transcript IDs, I would prefer to work with >> TranscriptDBs >> based on these features. Although I can think of many work around solutions >> using "knownGene" I would prefer to work with the package as originally >> intended >> and hence this post. >> >> Thanks for the work already done on this great package! >> >> Cheerz, >> >> Erik van den Akker >> >> >>> sessionInfo() >> R version 2.11.1 (2010-05-31) >> i386-pc-mingw32 >> >> locale: >> [1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 >> LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C >> [5] LC_TIME=Dutch_Netherlands.1252 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] rtracklayer_1.8.1 RCurl_1.4-2 bitops_1.0-4.1 >> GenomicFeatures_1.0.3 GenomicRanges_1.0.5 IRanges_1.6.8 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.8.0 biomaRt_2.4.0 Biostrings_2.16.7 BSgenome_1.16.5 >> DBI_0.2-5 RSQLite_0.9-1 tools_2.11.1 XML_3.1-0 >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.8 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Vince & Hervé, I'm afraid i've run into another (small) problem with using the GenomicFeatures package: For the sake of clarity, I've used the example as described in the vignette again for making a TranscriptDb object: > library(GenomicFeatures) > mm9KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "knownGene") > GR <- transcripts(mm9KG) > GR[1] GRanges with 1 range and 2 elementMetadata values seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> [1] chr1 [4797974, 4832908] + | 14 *uc007afg.1* seqlengths chr1 chr2 chr3 chr4 chr5 chr6 ... chr7_random chr8_random chr9_random chrUn_random chrX_random chrY_random 197195432 181748087 159599783 155630120 152537259 149517037 ... 362490 849593 449403 5900358 1785075 58682461 Next I wanted to selectively extract features from the database based on the criteria which can be specified with the "vals" argument: > transcripts(mm9KG,vals=list(tx_name="*uc007afg.1*")) which returns the same information as GR[1] as expected. By setting the "columns" argument to c("tx_id","tx_name","exon_id","gene_id"), I found that "18777" was the associated gene_id: > transcripts(mm9KG,vals=list(tx_name="*uc007afg.1* "),columns=c("tx_id","tx_name","gene_id")) GRanges with 1 range and 3 elementMetadata values seqnames ranges strand | tx_id tx_name gene_id <rle> <iranges> <rle> | <integer> <character> <compressedcharacterlist> [1] chr1 [4797974, 4832908] + | 14 *uc007afg.1* *18777* seqlengths chr1 chr2 chr3 chr4 chr5 chr6 ... chr7_random chr8_random chr9_random chrUn_random chrX_random chrY_random 197195432 181748087 159599783 155630120 152537259 149517037 ... 362490 849593 449403 5900358 1785075 58682461 Besides tx_name, subselections can be specified by tx_id, tx_chrom, tx_name, tx_strand & gene_id. All these worked fine except for the latter, which gave the following error: > transcripts(mm9KG,vals=list(gene_id="*18777* "),columns=c("tx_id","tx_name","gene_id")) Error in sqliteExecStatement(con, statement, bind.data) : RS-DBI driver: (error in statement: no such column: gene_id) As gene_id is among ValidValNames (as specified in the function transcripts) and all other criteria gave the expected results, I'm affraid this is a bug. I've noticed this was the only one selecting in a <compressedcharacterlist>, so this might perhaps be the reason. Except for these minor issues, I do think this package is very relevant and I see many applications for my current work, therefore, thanks again, Cheerz, Erik van den Akker > sessionInfo() R version 2.11.1 (2010-05-31) i386-pc-mingw32 locale: [1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C [5] LC_TIME=Dutch_Netherlands.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.0.3 GenomicRanges_1.0.5 IRanges_1.6.8 loaded via a namespace (and not attached): [1] Biobase_2.8.0 biomaRt_2.4.0 Biostrings_2.16.7 BSgenome_1.16.5 DBI_0.2-5 RCurl_1.4-2 RSQLite_0.9-1 rtracklayer_1.8.1 tools_2.11.1 [10] XML_3.1-0 2010/7/14 Hervé Pagès <hpages@fhcrc.org> > Hi Erik, Vince, > > I'm puzzled by this. Populating the db is made using prepared > statements which are usually very fast. I'm investigating and > will let you know. Thanks for the report. > > H. > > > > On 07/14/2010 06:30 AM, Vincent Carey wrote: > >> it seems to me that the speed issues are probably related to UCSC >> server availability at the time of your session, which might depend on >> external factors. >> >> however i can confirm a problem with the refGene request. I got >> further than a timeout >> >> first i get a timing similar to yours for knownGene-- >> >> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = "hg19", tablename >>> >> + = "knownGene")) >> user system elapsed >> 65.585 0.675 200.614 >> >> then (and this event at line 26744 is literally reproducible) >> >> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = "hg19", tablename >>> >> + = "refGene" ) >> + ) >> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, >> : >> line 26744 did not have 8 elements >> Timing stopped at: 1.597 0.067 207.762 >> >> The GenomicFeatures developers will comment later. >> >> sessionInfo() >>> >> R version 2.12.0 Under development (unstable) (2010-06-30 r52417) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] C >> >> attached base packages: >> [1] stats graphics grDevices datasets utils methods base >> >> other attached packages: >> [1] GenomicFeatures_1.1.5 GenomicRanges_1.1.15 IRanges_1.7.9 >> >> loaded via a namespace (and not attached): >> [1] BSgenome_1.17.5 Biobase_2.9.0 Biostrings_2.17.12 DBI_0.2-5 >> [5] RCurl_1.4-2 RSQLite_0.9-1 XML_3.1-0 >> biomaRt_2.5.1 >> [9] rtracklayer_1.9.3 tools_2.12.0 >> >> >> On Wed, Jul 14, 2010 at 8:51 AM, Erik van den Akker >> <erikvandenakker@gmail.com> wrote: >> >>> Hi all, >>> >>> I'm a PhD student in bioinformatics working at the Leiden University >>> Medical >>> >>> Center and at the Delft University of Technlogy in the Netherlands. >>> Currently >>> I'm working on the vizualization of genome wide data sources, such as >>> Linkage, >>> GWAS& Expression data. >>> In order to be able to quickely access information on gene locations >>> (along >>> with the UTR, CDS, exons etc), I thought it would be a good idea to make >>> use >>> >>> of the GenomicFeatures package. This package works perfectly and very >>> quickely >>> for the example provided in the vignette (good job!): >>> >>> library(GenomicFeatures) >>>> system.time(mm9KG<- makeTranscriptDbFromUCSC(genome = "mm9", tablename = >>>> >>> "knownGene")) >>> user system elapsed >>> 49.50 0.69 100.05 >>> >>> mm9KG >>>> >>> TranscriptDb object: >>> | Db type: TranscriptDb >>> | Data source: UCSC >>> | Genome: mm9 >>> | UCSC Table: knownGene >>> | Type of Gene ID: Entrez Gene ID >>> | Full dataset: yes >>> | transcript_nrow: 49409 >>> | exon_nrow: 237551 >>> | cds_nrow: 204831 >>> | Db created by: GenomicFeatures package from Bioconductor >>> | Creation time: 2010-07-14 14:07:54 +0200 (Wed, 14 Jul 2010) >>> | GenomicFeatures version at creation time: 1.0.3 >>> | RSQLite version at creation time: 0.9-1 >>> >>> >>> And even for larger databases(humans), this works perfectly: >>> >>> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = "hg19", tablename >>>> >>> = "knownGene")) >>> user system elapsed >>> 82.09 1.11 162.53 >>> >>> hg19KG >>>> >>> TranscriptDb object: >>> | Db type: TranscriptDb >>> | Data source: UCSC >>> | Genome: hg19 >>> | UCSC Table: knownGene >>> | Type of Gene ID: Entrez Gene ID >>> | Full dataset: yes >>> | transcript_nrow: 77614 >>> | exon_nrow: 281605 >>> | cds_nrow: 236664 >>> | Db created by: GenomicFeatures package from Bioconductor >>> | Creation time: 2010-07-14 14:11:03 +0200 (Wed, 14 Jul 2010) >>> | GenomicFeatures version at creation time: 1.0.3 >>> | RSQLite version at creation time: 0.9-1 >>> >>> However, for tablename = "refGene" I had to shoot down my R session after >>> half an hour for both the settings genome = "mm9"& genome = "hg19" >>> >>> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = "mm9", tablename >>>> = >>>> >>> "refGene")) >>> >>> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = "hg19", tablename >>>> >>> = "refGene")) >>> >>> As this package makes use of functionalities provided by rtracklayer, >>> before >>> >>> the actual SQLite db is stored, I verified whether this was working >>> correctly: >>> >>> library(rtracklayer) >>>> session<- browserSession() >>>> genome(session)<- "hg19" >>>> query<- ucscTableQuery(session,"refGene") >>>> system.time(Table<- getTable(query)) >>>> >>> user system elapsed >>> 7.70 0.39 61.73 >>> >>> Typing "head(Table)" gave the expected results, suggesting that something >>> is not working correctly in creating the SQLite databases. >>> >>> So, my question: >>> Given that refGene pops up when using supportedUCSCtables(), >>> I wondered: >>> 1) Did I do something wrong?; 2) should I just have more patience& 3) >>> could >>> anyone >>> confirm these problems? >>> And >>> @PackageMaintainers: If this is a genuine bug, are you planning to fix >>> this >>> or speed things up? >>> >>> As I work with gene expression data, which are commonly annotated to >>> either >>> RefSeqIDs or Ensembl Transcript IDs, I would prefer to work with >>> TranscriptDBs >>> based on these features. Although I can think of many work around >>> solutions >>> using "knownGene" I would prefer to work with the package as originally >>> intended >>> and hence this post. >>> >>> Thanks for the work already done on this great package! >>> >>> Cheerz, >>> >>> Erik van den Akker >>> >>> >>> sessionInfo() >>>> >>> R version 2.11.1 (2010-05-31) >>> i386-pc-mingw32 >>> >>> locale: >>> [1] LC_COLLATE=Dutch_Netherlands.1252 LC_CTYPE=Dutch_Netherlands.1252 >>> LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C >>> [5] LC_TIME=Dutch_Netherlands.1252 >>> >>> attached base packages: >>> [1] stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] rtracklayer_1.8.1 RCurl_1.4-2 bitops_1.0-4.1 >>> GenomicFeatures_1.0.3 GenomicRanges_1.0.5 IRanges_1.6.8 >>> >>> loaded via a namespace (and not attached): >>> [1] Biobase_2.8.0 biomaRt_2.4.0 Biostrings_2.16.7 BSgenome_1.16.5 >>> DBI_0.2-5 RSQLite_0.9-1 tools_2.11.1 XML_3.1-0 >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor@stat.math.ethz.ch >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.8 years ago Erik van den Akker ▴ 50

0

Entering edit mode

Erik, The first problem you reported is fixed in svn. Nothing to do with SQLite performance (all the create and insert statements used to make the db take less than 1s or 2s in total). The problem was earlier with an internal helper function in charge of doing some cleaning of the downloaded data (before they are actually sent to the db). makeTranscriptDbFromUCSC(genome="mm9", tablename="refGene") now takes 80s (vs 45 min before the fix). I still need to look at this other problem you found. I'll let you know. H. On 07/14/2010 02:18 PM, Erik van den Akker wrote: > > Hi Vince & Herv?, > > I'm afraid i've run into another (small) problem with using the > GenomicFeatures package: > > For the sake of clarity, I've used the example as described in the > vignette again for making a > TranscriptDb object: > > > library(GenomicFeatures) > > mm9KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = > "knownGene") > > GR <- transcripts(mm9KG) > > GR[1] > GRanges with 1 range and 2 elementMetadata values > seqnames ranges strand | tx_id tx_name > <rle> <iranges> <rle> | <integer> <character> > [1] chr1 [4797974, 4832908] + | 14 *uc007afg.1* > > seqlengths > chr1 chr2 chr3 chr4 > chr5 chr6 ... chr7_random chr8_random chr9_random > chrUn_random chrX_random chrY_random > 197195432 181748087 159599783 155630120 152537259 > 149517037 ... 362490 849593 449403 5900358 > 1785075 58682461 > > Next I wanted to selectively extract features from the database based on > the criteria which can be > specified with the "vals" argument: > > > transcripts(mm9KG,vals=list(tx_name="*uc007afg.1*")) > > which returns the same information as GR[1] as expected. By setting the > "columns" argument to > c("tx_id","tx_name","exon_id","gene_id"), I found that "18777" was the > associated gene_id: > > > > transcripts(mm9KG,vals=list(tx_name="*uc007afg.1*"),columns=c("tx_id ","tx_name","gene_id")) > GRanges with 1 range and 3 elementMetadata values > seqnames ranges strand | tx_id > tx_name gene_id > <rle> <iranges> <rle> | <integer> <character> <compressedcharacterlist> > [1] chr1 [4797974, 4832908] + | 14 *uc007afg.1* *18777* > > seqlengths > chr1 chr2 chr3 chr4 > chr5 chr6 ... chr7_random chr8_random chr9_random > chrUn_random chrX_random chrY_random > 197195432 181748087 159599783 155630120 152537259 > 149517037 ... 362490 849593 449403 5900358 > 1785075 58682461 > > Besides tx_name, subselections can be specified by tx_id, tx_chrom, > tx_name, tx_strand & gene_id. > All these worked fine except for the latter, which gave the following error: > > > > transcripts(mm9KG,vals=list(gene_id="*18777*"),columns=c("tx_id","tx _name","gene_id")) > Error in sqliteExecStatement(con, statement, bind.data) : > RS-DBI driver: (error in statement: no such column: gene_id) > > As gene_id is among ValidValNames (as specified in the function > transcripts) and all other > criteria gave the expected results, I'm affraid this is a bug. I've > noticed this was the only > one selecting in a <compressedcharacterlist>, so this might perhaps be > the reason. > > Except for these minor issues, I do think this package is very relevant > and I see many applications for my > current work, therefore, thanks again, > > Cheerz, > > Erik van den Akker > > > sessionInfo() > R version 2.11.1 (2010-05-31) > i386-pc-mingw32 > > locale: > [1] LC_COLLATE=Dutch_Netherlands.1252 > LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252 > LC_NUMERIC=C > [5] LC_TIME=Dutch_Netherlands.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.0.3 GenomicRanges_1.0.5 IRanges_1.6.8 > > loaded via a namespace (and not attached): > [1] Biobase_2.8.0 biomaRt_2.4.0 Biostrings_2.16.7 > BSgenome_1.16.5 DBI_0.2-5 RCurl_1.4-2 RSQLite_0.9-1 > rtracklayer_1.8.1 tools_2.11.1 > [10] XML_3.1-0 > > > > 2010/7/14 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Hi Erik, Vince, > > I'm puzzled by this. Populating the db is made using prepared > statements which are usually very fast. I'm investigating and > will let you know. Thanks for the report. > > H. > > > > On 07/14/2010 06:30 AM, Vincent Carey wrote: > > it seems to me that the speed issues are probably related to UCSC > server availability at the time of your session, which might > depend on > external factors. > > however i can confirm a problem with the refGene request. I got > further than a timeout > > first i get a timing similar to yours for knownGene-- > > system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = > "hg19", tablename > > + = "knownGene")) > user system elapsed > 65.585 0.675 200.614 > > then (and this event at line 26744 is literally reproducible) > > system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = > "hg19", tablename > > + = "refGene" ) > + ) > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, > na.strings, : > line 26744 did not have 8 elements > Timing stopped at: 1.597 0.067 207.762 > > The GenomicFeatures developers will comment later. > > sessionInfo() > > R version 2.12.0 Under development (unstable) (2010-06-30 r52417) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices datasets utils methods base > > other attached packages: > [1] GenomicFeatures_1.1.5 GenomicRanges_1.1.15 IRanges_1.7.9 > > loaded via a namespace (and not attached): > [1] BSgenome_1.17.5 Biobase_2.9.0 Biostrings_2.17.12 > DBI_0.2-5 > [5] RCurl_1.4-2 RSQLite_0.9-1 XML_3.1-0 > biomaRt_2.5.1 > [9] rtracklayer_1.9.3 tools_2.12.0 > > > On Wed, Jul 14, 2010 at 8:51 AM, Erik van den Akker > <erikvandenakker at="" gmail.com="" <mailto:erikvandenakker="" at="" gmail.com="">> > wrote: > > Hi all, > > I'm a PhD student in bioinformatics working at the Leiden > University Medical > > Center and at the Delft University of Technlogy in the > Netherlands. > Currently > I'm working on the vizualization of genome wide data > sources, such as > Linkage, > GWAS& Expression data. > In order to be able to quickely access information on gene > locations (along > with the UTR, CDS, exons etc), I thought it would be a good > idea to make use > > of the GenomicFeatures package. This package works perfectly > and very > quickely > for the example provided in the vignette (good job!): > > library(GenomicFeatures) > system.time(mm9KG<- makeTranscriptDbFromUCSC(genome = > "mm9", tablename = > > "knownGene")) > user system elapsed > 49.50 0.69 100.05 > > mm9KG > > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: mm9 > | UCSC Table: knownGene > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 49409 > | exon_nrow: 237551 > | cds_nrow: 204831 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2010-07-14 14:07:54 +0200 (Wed, 14 Jul 2010) > | GenomicFeatures version at creation time: 1.0.3 > | RSQLite version at creation time: 0.9-1 > > > And even for larger databases(humans), this works perfectly: > > system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = > "hg19", tablename > > = "knownGene")) > user system elapsed > 82.09 1.11 162.53 > > hg19KG > > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: hg19 > | UCSC Table: knownGene > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 77614 > | exon_nrow: 281605 > | cds_nrow: 236664 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2010-07-14 14:11:03 +0200 (Wed, 14 Jul 2010) > | GenomicFeatures version at creation time: 1.0.3 > | RSQLite version at creation time: 0.9-1 > > However, for tablename = "refGene" I had to shoot down my R > session after > half an hour for both the settings genome = "mm9"& genome = > "hg19" > > system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = > "mm9", tablename = > > "refGene")) > > system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = > "hg19", tablename > > = "refGene")) > > As this package makes use of functionalities provided by > rtracklayer, before > > the actual SQLite db is stored, I verified whether this was > working > correctly: > > library(rtracklayer) > session<- browserSession() > genome(session)<- "hg19" > query<- ucscTableQuery(session,"refGene") > system.time(Table<- getTable(query)) > > user system elapsed > 7.70 0.39 61.73 > > Typing "head(Table)" gave the expected results, suggesting > that something > is not working correctly in creating the SQLite databases. > > So, my question: > Given that refGene pops up when using supportedUCSCtables(), > I wondered: > 1) Did I do something wrong?; 2) should I just have more > patience& 3) could > anyone > confirm these problems? > And > @PackageMaintainers: If this is a genuine bug, are you > planning to fix this > or speed things up? > > As I work with gene expression data, which are commonly > annotated to either > RefSeqIDs or Ensembl Transcript IDs, I would prefer to work with > TranscriptDBs > based on these features. Although I can think of many work > around solutions > using "knownGene" I would prefer to work with the package as > originally > intended > and hence this post. > > Thanks for the work already done on this great package! > > Cheerz, > > Erik van den Akker > > > sessionInfo() > > R version 2.11.1 (2010-05-31) > i386-pc-mingw32 > > locale: > [1] LC_COLLATE=Dutch_Netherlands.1252 > LC_CTYPE=Dutch_Netherlands.1252 > LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C > [5] LC_TIME=Dutch_Netherlands.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] rtracklayer_1.8.1 RCurl_1.4-2 bitops_1.0-4.1 > GenomicFeatures_1.0.3 GenomicRanges_1.0.5 IRanges_1.6.8 > > loaded via a namespace (and not attached): > [1] Biobase_2.8.0 biomaRt_2.4.0 Biostrings_2.16.7 > BSgenome_1.16.5 > DBI_0.2-5 RSQLite_0.9-1 tools_2.11.1 XML_3.1-0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > <mailto:bioconductor at="" stat.math.ethz.ch=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > <mailto:bioconductor at="" stat.math.ethz.ch=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.8 years ago Hervé Pagès 16k

0

Entering edit mode

This is fixed in GenomicFeatures release (1.0.5) and devel (1.1.6). Both should become available via biocLite() in the next 24 or 36 hours. Thanks again for the report. Cheers, H. On 07/14/2010 02:18 PM, Erik van den Akker wrote: > > Hi Vince & Herv?, > > I'm afraid i've run into another (small) problem with using the > GenomicFeatures package: > > For the sake of clarity, I've used the example as described in the > vignette again for making a > TranscriptDb object: > > > library(GenomicFeatures) > > mm9KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = > "knownGene") > > GR <- transcripts(mm9KG) > > GR[1] > GRanges with 1 range and 2 elementMetadata values > seqnames ranges strand | tx_id tx_name > <rle> <iranges> <rle> | <integer> <character> > [1] chr1 [4797974, 4832908] + | 14 *uc007afg.1* > > seqlengths > chr1 chr2 chr3 chr4 > chr5 chr6 ... chr7_random chr8_random chr9_random > chrUn_random chrX_random chrY_random > 197195432 181748087 159599783 155630120 152537259 > 149517037 ... 362490 849593 449403 5900358 > 1785075 58682461 > > Next I wanted to selectively extract features from the database based on > the criteria which can be > specified with the "vals" argument: > > > transcripts(mm9KG,vals=list(tx_name="*uc007afg.1*")) > > which returns the same information as GR[1] as expected. By setting the > "columns" argument to > c("tx_id","tx_name","exon_id","gene_id"), I found that "18777" was the > associated gene_id: > > > > transcripts(mm9KG,vals=list(tx_name="*uc007afg.1*"),columns=c("tx_id ","tx_name","gene_id")) > GRanges with 1 range and 3 elementMetadata values > seqnames ranges strand | tx_id > tx_name gene_id > <rle> <iranges> <rle> | <integer> <character> <compressedcharacterlist> > [1] chr1 [4797974, 4832908] + | 14 *uc007afg.1* *18777* > > seqlengths > chr1 chr2 chr3 chr4 > chr5 chr6 ... chr7_random chr8_random chr9_random > chrUn_random chrX_random chrY_random > 197195432 181748087 159599783 155630120 152537259 > 149517037 ... 362490 849593 449403 5900358 > 1785075 58682461 > > Besides tx_name, subselections can be specified by tx_id, tx_chrom, > tx_name, tx_strand & gene_id. > All these worked fine except for the latter, which gave the following error: > > > > transcripts(mm9KG,vals=list(gene_id="*18777*"),columns=c("tx_id","tx _name","gene_id")) > Error in sqliteExecStatement(con, statement, bind.data) : > RS-DBI driver: (error in statement: no such column: gene_id) > > As gene_id is among ValidValNames (as specified in the function > transcripts) and all other > criteria gave the expected results, I'm affraid this is a bug. I've > noticed this was the only > one selecting in a <compressedcharacterlist>, so this might perhaps be > the reason. > > Except for these minor issues, I do think this package is very relevant > and I see many applications for my > current work, therefore, thanks again, > > Cheerz, > > Erik van den Akker > > > sessionInfo() > R version 2.11.1 (2010-05-31) > i386-pc-mingw32 > > locale: > [1] LC_COLLATE=Dutch_Netherlands.1252 > LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252 > LC_NUMERIC=C > [5] LC_TIME=Dutch_Netherlands.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] GenomicFeatures_1.0.3 GenomicRanges_1.0.5 IRanges_1.6.8 > > loaded via a namespace (and not attached): > [1] Biobase_2.8.0 biomaRt_2.4.0 Biostrings_2.16.7 > BSgenome_1.16.5 DBI_0.2-5 RCurl_1.4-2 RSQLite_0.9-1 > rtracklayer_1.8.1 tools_2.11.1 > [10] XML_3.1-0 > > > > 2010/7/14 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Hi Erik, Vince, > > I'm puzzled by this. Populating the db is made using prepared > statements which are usually very fast. I'm investigating and > will let you know. Thanks for the report. > > H. > > > > On 07/14/2010 06:30 AM, Vincent Carey wrote: > > it seems to me that the speed issues are probably related to UCSC > server availability at the time of your session, which might > depend on > external factors. > > however i can confirm a problem with the refGene request. I got > further than a timeout > > first i get a timing similar to yours for knownGene-- > > system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = > "hg19", tablename > > + = "knownGene")) > user system elapsed > 65.585 0.675 200.614 > > then (and this event at line 26744 is literally reproducible) > > system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = > "hg19", tablename > > + = "refGene" ) > + ) > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, > na.strings, : > line 26744 did not have 8 elements > Timing stopped at: 1.597 0.067 207.762 > > The GenomicFeatures developers will comment later. > > sessionInfo() > > R version 2.12.0 Under development (unstable) (2010-06-30 r52417) > Platform: x86_64-unknown-linux-gnu (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices datasets utils methods base > > other attached packages: > [1] GenomicFeatures_1.1.5 GenomicRanges_1.1.15 IRanges_1.7.9 > > loaded via a namespace (and not attached): > [1] BSgenome_1.17.5 Biobase_2.9.0 Biostrings_2.17.12 > DBI_0.2-5 > [5] RCurl_1.4-2 RSQLite_0.9-1 XML_3.1-0 > biomaRt_2.5.1 > [9] rtracklayer_1.9.3 tools_2.12.0 > > > On Wed, Jul 14, 2010 at 8:51 AM, Erik van den Akker > <erikvandenakker at="" gmail.com="" <mailto:erikvandenakker="" at="" gmail.com="">> > wrote: > > Hi all, > > I'm a PhD student in bioinformatics working at the Leiden > University Medical > > Center and at the Delft University of Technlogy in the > Netherlands. > Currently > I'm working on the vizualization of genome wide data > sources, such as > Linkage, > GWAS& Expression data. > In order to be able to quickely access information on gene > locations (along > with the UTR, CDS, exons etc), I thought it would be a good > idea to make use > > of the GenomicFeatures package. This package works perfectly > and very > quickely > for the example provided in the vignette (good job!): > > library(GenomicFeatures) > system.time(mm9KG<- makeTranscriptDbFromUCSC(genome = > "mm9", tablename = > > "knownGene")) > user system elapsed > 49.50 0.69 100.05 > > mm9KG > > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: mm9 > | UCSC Table: knownGene > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 49409 > | exon_nrow: 237551 > | cds_nrow: 204831 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2010-07-14 14:07:54 +0200 (Wed, 14 Jul 2010) > | GenomicFeatures version at creation time: 1.0.3 > | RSQLite version at creation time: 0.9-1 > > > And even for larger databases(humans), this works perfectly: > > system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = > "hg19", tablename > > = "knownGene")) > user system elapsed > 82.09 1.11 162.53 > > hg19KG > > TranscriptDb object: > | Db type: TranscriptDb > | Data source: UCSC > | Genome: hg19 > | UCSC Table: knownGene > | Type of Gene ID: Entrez Gene ID > | Full dataset: yes > | transcript_nrow: 77614 > | exon_nrow: 281605 > | cds_nrow: 236664 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2010-07-14 14:11:03 +0200 (Wed, 14 Jul 2010) > | GenomicFeatures version at creation time: 1.0.3 > | RSQLite version at creation time: 0.9-1 > > However, for tablename = "refGene" I had to shoot down my R > session after > half an hour for both the settings genome = "mm9"& genome = > "hg19" > > system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = > "mm9", tablename = > > "refGene")) > > system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = > "hg19", tablename > > = "refGene")) > > As this package makes use of functionalities provided by > rtracklayer, before > > the actual SQLite db is stored, I verified whether this was > working > correctly: > > library(rtracklayer) > session<- browserSession() > genome(session)<- "hg19" > query<- ucscTableQuery(session,"refGene") > system.time(Table<- getTable(query)) > > user system elapsed > 7.70 0.39 61.73 > > Typing "head(Table)" gave the expected results, suggesting > that something > is not working correctly in creating the SQLite databases. > > So, my question: > Given that refGene pops up when using supportedUCSCtables(), > I wondered: > 1) Did I do something wrong?; 2) should I just have more > patience& 3) could > anyone > confirm these problems? > And > @PackageMaintainers: If this is a genuine bug, are you > planning to fix this > or speed things up? > > As I work with gene expression data, which are commonly > annotated to either > RefSeqIDs or Ensembl Transcript IDs, I would prefer to work with > TranscriptDBs > based on these features. Although I can think of many work > around solutions > using "knownGene" I would prefer to work with the package as > originally > intended > and hence this post. > > Thanks for the work already done on this great package! > > Cheerz, > > Erik van den Akker > > > sessionInfo() > > R version 2.11.1 (2010-05-31) > i386-pc-mingw32 > > locale: > [1] LC_COLLATE=Dutch_Netherlands.1252 > LC_CTYPE=Dutch_Netherlands.1252 > LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C > [5] LC_TIME=Dutch_Netherlands.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] rtracklayer_1.8.1 RCurl_1.4-2 bitops_1.0-4.1 > GenomicFeatures_1.0.3 GenomicRanges_1.0.5 IRanges_1.6.8 > > loaded via a namespace (and not attached): > [1] Biobase_2.8.0 biomaRt_2.4.0 Biostrings_2.16.7 > BSgenome_1.16.5 > DBI_0.2-5 RSQLite_0.9-1 tools_2.11.1 XML_3.1-0 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > <mailto:bioconductor at="" stat.math.ethz.ch=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > <mailto:bioconductor at="" stat.math.ethz.ch=""> > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.8 years ago Hervé Pagès 16k

0

Entering edit mode

Wow! Impressive reaction time! Thanks for this quick fix! Cheerz, Erik 2010/7/15 Hervé Pagès <hpages@fhcrc.org> > This is fixed in GenomicFeatures release (1.0.5) and devel > (1.1.6). Both should become available via biocLite() in the > next 24 or 36 hours. > > Thanks again for the report. > > Cheers, > H. > > On 07/14/2010 02:18 PM, Erik van den Akker wrote: > >> >> Hi Vince & Hervé, >> >> I'm afraid i've run into another (small) problem with using the >> GenomicFeatures package: >> >> For the sake of clarity, I've used the example as described in the >> vignette again for making a >> TranscriptDb object: >> >> > library(GenomicFeatures) >> > mm9KG <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = >> "knownGene") >> > GR <- transcripts(mm9KG) >> > GR[1] >> GRanges with 1 range and 2 elementMetadata values >> seqnames ranges strand | tx_id tx_name >> <rle> <iranges> <rle> | <integer> <character> >> [1] chr1 [4797974, 4832908] + | 14 *uc007afg.1* >> >> seqlengths >> chr1 chr2 chr3 chr4 >> chr5 chr6 ... chr7_random chr8_random chr9_random >> chrUn_random chrX_random chrY_random >> 197195432 181748087 159599783 155630120 152537259 >> 149517037 ... 362490 849593 449403 5900358 >> 1785075 58682461 >> >> Next I wanted to selectively extract features from the database based on >> the criteria which can be >> specified with the "vals" argument: >> >> > transcripts(mm9KG,vals=list(tx_name="*uc007afg.1*")) >> >> which returns the same information as GR[1] as expected. By setting the >> "columns" argument to >> c("tx_id","tx_name","exon_id","gene_id"), I found that "18777" was the >> associated gene_id: >> >> > >> >> transcripts(mm9KG,vals=list(tx_name="*uc007afg.1*"),columns=c("tx_i d","tx_name","gene_id")) >> GRanges with 1 range and 3 elementMetadata values >> seqnames ranges strand | tx_id >> tx_name gene_id >> <rle> <iranges> <rle> | <integer> <character> <compressedcharacterlist> >> [1] chr1 [4797974, 4832908] + | 14 *uc007afg.1* *18777* >> >> seqlengths >> chr1 chr2 chr3 chr4 >> chr5 chr6 ... chr7_random chr8_random chr9_random >> chrUn_random chrX_random chrY_random >> 197195432 181748087 159599783 155630120 152537259 >> 149517037 ... 362490 849593 449403 5900358 >> 1785075 58682461 >> >> Besides tx_name, subselections can be specified by tx_id, tx_chrom, >> tx_name, tx_strand & gene_id. >> All these worked fine except for the latter, which gave the following >> error: >> >> > >> >> transcripts(mm9KG,vals=list(gene_id="*18777*"),columns=c("tx_id","t x_name","gene_id")) >> Error in sqliteExecStatement(con, statement, bind.data) : >> RS-DBI driver: (error in statement: no such column: gene_id) >> >> As gene_id is among ValidValNames (as specified in the function >> transcripts) and all other >> criteria gave the expected results, I'm affraid this is a bug. I've >> noticed this was the only >> one selecting in a <compressedcharacterlist>, so this might perhaps be >> the reason. >> >> Except for these minor issues, I do think this package is very relevant >> and I see many applications for my >> current work, therefore, thanks again, >> >> Cheerz, >> >> Erik van den Akker >> >> > sessionInfo() >> R version 2.11.1 (2010-05-31) >> i386-pc-mingw32 >> >> locale: >> [1] LC_COLLATE=Dutch_Netherlands.1252 >> LC_CTYPE=Dutch_Netherlands.1252 LC_MONETARY=Dutch_Netherlands.1252 >> LC_NUMERIC=C >> [5] LC_TIME=Dutch_Netherlands.1252 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] GenomicFeatures_1.0.3 GenomicRanges_1.0.5 IRanges_1.6.8 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.8.0 biomaRt_2.4.0 Biostrings_2.16.7 >> BSgenome_1.16.5 DBI_0.2-5 RCurl_1.4-2 RSQLite_0.9-1 >> rtracklayer_1.8.1 tools_2.11.1 >> [10] XML_3.1-0 >> >> >> >> 2010/7/14 Hervé Pagès <hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> >> >> Hi Erik, Vince, >> >> I'm puzzled by this. Populating the db is made using prepared >> statements which are usually very fast. I'm investigating and >> will let you know. Thanks for the report. >> >> H. >> >> >> >> On 07/14/2010 06:30 AM, Vincent Carey wrote: >> >> it seems to me that the speed issues are probably related to UCSC >> server availability at the time of your session, which might >> depend on >> external factors. >> >> however i can confirm a problem with the refGene request. I got >> further than a timeout >> >> first i get a timing similar to yours for knownGene-- >> >> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = >> "hg19", tablename >> >> + = "knownGene")) >> user system elapsed >> 65.585 0.675 200.614 >> >> then (and this event at line 26744 is literally reproducible) >> >> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = >> "hg19", tablename >> >> + = "refGene" ) >> + ) >> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, >> na.strings, : >> line 26744 did not have 8 elements >> Timing stopped at: 1.597 0.067 207.762 >> >> The GenomicFeatures developers will comment later. >> >> sessionInfo() >> >> R version 2.12.0 Under development (unstable) (2010-06-30 r52417) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> >> locale: >> [1] C >> >> attached base packages: >> [1] stats graphics grDevices datasets utils methods >> base >> >> other attached packages: >> [1] GenomicFeatures_1.1.5 GenomicRanges_1.1.15 IRanges_1.7.9 >> >> loaded via a namespace (and not attached): >> [1] BSgenome_1.17.5 Biobase_2.9.0 Biostrings_2.17.12 >> DBI_0.2-5 >> [5] RCurl_1.4-2 RSQLite_0.9-1 XML_3.1-0 >> biomaRt_2.5.1 >> [9] rtracklayer_1.9.3 tools_2.12.0 >> >> >> On Wed, Jul 14, 2010 at 8:51 AM, Erik van den Akker >> <erikvandenakker@gmail.com <mailto:erikvandenakker@gmail.com="">> >> >> wrote: >> >> Hi all, >> >> I'm a PhD student in bioinformatics working at the Leiden >> University Medical >> >> Center and at the Delft University of Technlogy in the >> Netherlands. >> Currently >> I'm working on the vizualization of genome wide data >> sources, such as >> Linkage, >> GWAS& Expression data. >> In order to be able to quickely access information on gene >> locations (along >> with the UTR, CDS, exons etc), I thought it would be a good >> idea to make use >> >> of the GenomicFeatures package. This package works perfectly >> and very >> quickely >> for the example provided in the vignette (good job!): >> >> library(GenomicFeatures) >> system.time(mm9KG<- makeTranscriptDbFromUCSC(genome = >> "mm9", tablename = >> >> "knownGene")) >> user system elapsed >> 49.50 0.69 100.05 >> >> mm9KG >> >> TranscriptDb object: >> | Db type: TranscriptDb >> | Data source: UCSC >> | Genome: mm9 >> | UCSC Table: knownGene >> | Type of Gene ID: Entrez Gene ID >> | Full dataset: yes >> | transcript_nrow: 49409 >> | exon_nrow: 237551 >> | cds_nrow: 204831 >> | Db created by: GenomicFeatures package from Bioconductor >> | Creation time: 2010-07-14 14:07:54 +0200 (Wed, 14 Jul 2010) >> | GenomicFeatures version at creation time: 1.0.3 >> | RSQLite version at creation time: 0.9-1 >> >> >> And even for larger databases(humans), this works perfectly: >> >> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = >> "hg19", tablename >> >> = "knownGene")) >> user system elapsed >> 82.09 1.11 162.53 >> >> hg19KG >> >> TranscriptDb object: >> | Db type: TranscriptDb >> | Data source: UCSC >> | Genome: hg19 >> | UCSC Table: knownGene >> | Type of Gene ID: Entrez Gene ID >> | Full dataset: yes >> | transcript_nrow: 77614 >> | exon_nrow: 281605 >> | cds_nrow: 236664 >> | Db created by: GenomicFeatures package from Bioconductor >> | Creation time: 2010-07-14 14:11:03 +0200 (Wed, 14 Jul 2010) >> | GenomicFeatures version at creation time: 1.0.3 >> | RSQLite version at creation time: 0.9-1 >> >> However, for tablename = "refGene" I had to shoot down my R >> session after >> half an hour for both the settings genome = "mm9"& genome = >> "hg19" >> >> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = >> "mm9", tablename = >> >> "refGene")) >> >> system.time(hg19KG<- makeTranscriptDbFromUCSC(genome = >> "hg19", tablename >> >> = "refGene")) >> >> As this package makes use of functionalities provided by >> rtracklayer, before >> >> the actual SQLite db is stored, I verified whether this was >> working >> correctly: >> >> library(rtracklayer) >> session<- browserSession() >> genome(session)<- "hg19" >> query<- ucscTableQuery(session,"refGene") >> system.time(Table<- getTable(query)) >> >> user system elapsed >> 7.70 0.39 61.73 >> >> Typing "head(Table)" gave the expected results, suggesting >> that something >> is not working correctly in creating the SQLite databases. >> >> So, my question: >> Given that refGene pops up when using supportedUCSCtables(), >> I wondered: >> 1) Did I do something wrong?; 2) should I just have more >> patience& 3) could >> anyone >> confirm these problems? >> And >> @PackageMaintainers: If this is a genuine bug, are you >> planning to fix this >> or speed things up? >> >> As I work with gene expression data, which are commonly >> annotated to either >> RefSeqIDs or Ensembl Transcript IDs, I would prefer to work >> with >> TranscriptDBs >> based on these features. Although I can think of many work >> around solutions >> using "knownGene" I would prefer to work with the package as >> originally >> intended >> and hence this post. >> >> Thanks for the work already done on this great package! >> >> Cheerz, >> >> Erik van den Akker >> >> >> sessionInfo() >> >> R version 2.11.1 (2010-05-31) >> i386-pc-mingw32 >> >> locale: >> [1] LC_COLLATE=Dutch_Netherlands.1252 >> LC_CTYPE=Dutch_Netherlands.1252 >> LC_MONETARY=Dutch_Netherlands.1252 LC_NUMERIC=C >> [5] LC_TIME=Dutch_Netherlands.1252 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets >> methods base >> >> other attached packages: >> [1] rtracklayer_1.8.1 RCurl_1.4-2 bitops_1.0-4.1 >> GenomicFeatures_1.0.3 GenomicRanges_1.0.5 IRanges_1.6.8 >> >> loaded via a namespace (and not attached): >> [1] Biobase_2.8.0 biomaRt_2.4.0 Biostrings_2.16.7 >> BSgenome_1.16.5 >> DBI_0.2-5 RSQLite_0.9-1 tools_2.11.1 XML_3.1-0 >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> <mailto:bioconductor@stat.math.ethz.ch> >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> <mailto:bioconductor@stat.math.ethz.ch> >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M2-B876 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.8 years ago Erik van den Akker ▴ 50

Login before adding your answer.