Question

Extracting strand information from GenomicFeatures transcript db objects

0

Entering edit mode

Veerendra GP ▴ 100

@veerendra-gp-4214

Last seen 11.4 years ago

Dear Members, Greetings! I am trying to create a Transcript db object, from the GTF file for Musmusculus transcriptome data available at ensembl FTP site (version 75). ftp://ftp.ensembl.org/pub/release-75/gtf/mus_musculus I am using the function, "makeTranscriptDbFromGFF()" from the " GenomicFeatures" bioconductor library. *Command used:* > library("GenomicFeatures") > txdb_75 = makeTranscriptDbFromGFF (file = '/Mus_musculus.GRCm38.75.gtf', exonRankAttributeName='exon_number', format='gtf', dataSource='ENSEMBL', species='Mus musculus') *Error message:* extracting transcript information Estimating transcript ranges. Error in unique(sub$transcript_id) : error in evaluating the argument 'x' in selecting a method for function 'unique': Error in subs[[i]] : subscript out of bounds I am not able to understand the above error message. However, I could create the transcript db object successfully for the GTF file downloaded from ensembl version 74. ftp://ftp.ensembl.org/pub/release-74/gtf/mus_musculus/ *Session information:* > sessionInfo() R version 3.0.2 (2013-09-25) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] data.table_1.9.2 biomaRt_2.16.0 GenomicFeatures_1.12.4 [4] AnnotationDbi_1.22.6 Biobase_2.20.1 GenomicRanges_1.12.5 [7] IRanges_1.18.4 BiocGenerics_0.6.0 BiocInstaller_1.10.4 loaded via a namespace (and not attached): [1] Biostrings_2.28.0 bitops_1.0-6 BSgenome_1.28.0 DBI_0.2-7 [5] plyr_1.8.1 Rcpp_0.11.0 RCurl_1.95-4.1 reshape2_1.2.2 [9] Rsamtools_1.12.4 RSQLite_0.11.4 rtracklayer_1.20.4 stats4_3.0.2 [13] stringr_0.6.2 tools_3.0.2 XML_3.98-1.1 zlibbioc_1.6.0 Your assistance in this regard would be very much appreciated. Thank you in advance. Best Regards, Veerendra On Wed, Feb 13, 2013 at 1:32 PM, Veerendra GP <gpveerendra09@gmail.com>wrote: > Dear Valerie, > > Thank you very much for the descriptive reply and your suggestion, it > indeed was a cryptic error for me but now I understand the cause. > > Thanks, > Veerendra > > On Wed, Feb 13, 2013 at 12:19 AM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > >> Hello, >> >> Thanks for sending a reproducible example. Sorry for the cryptic error >> message. That is something we could improve. >> >> The problem is that for the UCSC data, some transcripts in a single gene >> are from different strands. In this case the strand Rle will have multiple >> runValues (i.e. runValue longer than 1). >> >> When we summarize strand runValues by elementLength we see the UCSC data >> has runValues of 1, 2, 3, 4 and 5. >> >> table(elementLengths(runValue(strand(GenegroupedTranscripts)))) >>> >> >> 1 >> 37315 >> >>> table(elementLengths(runValue(strand(UCSCGenegroupedTranscripts)))) >>> >> >> 1 2 3 4 5 >> 21686 72 1 1 1 >> >> Here is a closer look at the entry that has an elementLength of 3: >> >> UCSC_elts <- elementLengths(runValue(strand( >>> UCSCGenegroupedTranscripts))) >>> UCSCGenegroupedTranscripts[UCSC_elts == 3] >>> >> >> GRangesList of length 1: >> $108816 >> GRanges with 5 ranges and 2 metadata columns: >> seqnames ranges strand | tx_id tx_name >> <rle> <iranges> <rle> | <integer> <character> >> [1] chr4 [42452733, 42476567] + | 10641 uc008smu.1 >> [2] chr4 [42471253, 42471291] + | 10642 uc008smv.1 >> [3] chr4 [41902019, 41925853] - | 12216 uc008slb.1 >> [4] chr4 [41907295, 41907333] - | 12217 uc008slg.1 >> [5] chrUn_random [ 588639, 612480] + | 55359 uc012hdn.1 >> >> >> The strand Rle for this gene is, >> >> strand(UCSCGenegroupedTranscripts[UCSC_elts == 3]) >>> >> CompressedRleList of length 1 >> $`108816` >> factor-Rle of length 5 with 3 runs >> Lengths: 2 2 1 >> Values : + - + >> Levels(3): + - * >> >> In the case of multiple stands per list element (per gene in this case) >> you can't use as.integer(). It looks like you are after a vector of a >> single strand per gene. Depending on your use case you could remove these >> genes or set the strand of these genes to '*'. >> >> >> Valerie >> >> >> >> >> On 02/12/2013 08:11 AM, Veerendra GP wrote: >> >>> Hello Everyone! >>> >>> Greetings! >>> >>> I am using "GenomicFeatures" library to create Transcript db objects for >>> the mouse genome. >>> I did it by using biomart, *makeTranscriptDbFromBiomart()* and also >>> >>> UCSC, makeTranscriptDbFromUCSC() >>> functions. >>> >>> As I am interested in the strand information, I tried fetching the it >>> using >>> *as.integer(),* *runValue**()* & *strand()* functions. I am able to get >>> it >>> from the transcriptdb object, created using >>> *makeTranscriptDbFromBiomart()* but >>> >>> ending up with an error when I am trying to do the same with >>> transcriptdb object created using makeTranscriptDbFromUCSC() function*.* >>> >>> >>> working code: >>> >>> mouseGNM <- makeTranscriptDbFromBiomart (biomart = "ensembl", dataset= >>> "mmusculus_gene_ensembl"); >>> saveDb(mouseGNM, file="MouseGNM.sqlite"); >>> library("GenomicFeatures"); >>> MouseGNM<- loadDb("MouseGNM.sqlite"); >>> GenegroupedTranscripts<- transcriptsBy(MouseGNM, by = "gene"); >>> strand.info <-as.integer(runValue(strand(GenegroupedTranscripts))); >>> >>> runValue(strand(GenegroupedTranscripts)) CompressedIntegerList of >>>> length >>>> >>> 37315 [["ENSMUSG00000000001"]] 2 [["ENSMUSG00000000003"]] 2 >>> [["ENSMUSG00000000028"]] 2 [["ENSMUSG00000000031"]] 2 >>> [["ENSMUSG00000000037"]] 1 [["ENSMUSG00000000049"]] 1 >>> [["ENSMUSG00000000056"]] 1 [["ENSMUSG00000000058"]] 1 >>> [["ENSMUSG00000000078"]] 1 [["ENSMUSG00000000085"]] 1 ... <37305 more >>> elements> >>> >>> strand.info[1:100] >>>> >>> [1] 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 2 2 1 1 2 1 >>> 2 2 >>> 1 2 [38] 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 2 1 2 1 2 2 1 1 1 1 1 2 >>> 1 >>> 2 2 2 2 [75] 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 1 1 2 1 2 2 2 1 1 2 1 >>> >>> here 2 is the antisense strand and 1 the sense strand >>> >>> >>> UCSC.mouseGNM <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = >>> "knownGene"); >>> saveDb(UCSC.mouseGNM, file="UCSCMouseGNM.sqlite"); >>> UCSCMouseGNM<- loadDb("UCSCMouseGNM.sqlite"); >>> UCSCGenegroupedTranscripts<- transcriptsBy(UCSCMouseGNM, by = "gene"); >>> strand.info2<- as.integer(runValue(strand(UCSCGenegroupedTranscripts))); >>> >>> runValue(strand(UCSCGenegroupedTranscripts)) >>>> >>> CompressedIntegerList of length 21761 >>> [["100009600"]] 2 >>> [["100009609"]] 2 >>> [["100009614"]] 1 >>> [["100012"]] 2 >>> [["100017"]] 2 >>> [["100019"]] 1 >>> [["100033459"]] 1 >>> [["100034251"]] 1 >>> [["100034361"]] 2 >>> [["100034684"]] 2 >>> ... >>> <21751 more elements> >>> >>> *ERROR:* >>> >>> strand.info2<- as.integer(runValue(strand( >>>> UCSCGenegroupedTranscripts))); >>>> >>> Error in as.vector(x, mode = "integer") : coercing an AtomicList object >>> to >>> an atomic vector is supported only for objects with top-level elements of >>> length <= 1 >>> >>> I am not able to understand this error message, could anyone let me know >>> what is going wrong in the code. >>> Here is the session information. >>> >>> sessionInfo() R version 2.15.2 (2012-10-26) Platform: >>>> x86_64-pc-linux-gnu >>>> >>> (64-bit) locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C [3] >>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_AU.UTF-8 [5] LC_MONETARY=en_US.UTF-8 >>> LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C >>> LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> attached >>> base packages: [1] stats graphics grDevices utils datasets methods base >>> other attached packages: [1] GenomicFeatures_1.10.1 AnnotationDbi_1.20.3 >>> Biobase_2.18.0 [4] GenomicRanges_1.10.5 IRanges_1.16.4 BiocGenerics_0.4.0 >>> loaded via a namespace (and not attached): [1] biomaRt_2.14.0 >>> Biostrings_2.26.2 bitops_1.0-5 BSgenome_1.26.1 [5] DBI_0.2-5 >>> parallel_2.15.2 RCurl_1.95-3 Rsamtools_1.10.2 [9] RSQLite_0.11.2 >>> rtracklayer_1.18.2 stats4_2.15.2 tools_2.15.2 [13] XML_3.95-0.1 >>> zlibbioc_1.4.0 >>> >>> >>> Thank you in advance. >>> Regards, >>> >>> Veerendra. >>> >>> >>> >>> >>> >> > > > -- > GP.Veerendra > PhD Student (Bioinformatics) > Stazione Zoologica Anton Dohrn > Naples, Italy > M:+393663915221 > -- Veerendra Gadekar PhD Student (Bioinformatics) Animal Physiology and Evolution Stazione Zoologica Anton Dohrn Villa Comunale, 80121 Naples - Italy M:+393663915221 [[alternative HTML version deleted]]

TranscriptDb biomaRt TranscriptDb biomaRt • 1.8k views

ADD COMMENT • link updated 11.9 years ago by Hervé Pagès 16k • written 11.9 years ago by Veerendra GP ▴ 100

score 0 · Answer 1 · 2014-03-07

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 6 hours ago

Seattle, WA, United States

Hi Veerendra, Don't know about the error you get with makeTranscriptDbFromGFF(). However, please note that you should also be able to create that TranscriptDb object thru the Ensembl Mart: library(GenomicFeatures) txdb <- makeTranscriptDbFromBiomart("ensembl", "mmusculus_gene_ensembl") Then: > txdb TranscriptDb object: | Db type: TranscriptDb | Supporting package: GenomicFeatures | Data source: BioMart | Organism: Mus musculus | Resource URL: www.biomart.org:80 | BioMart database: ensembl | BioMart database version: ENSEMBL GENES 75 (SANGER UK) | BioMart dataset: mmusculus_gene_ensembl | BioMart dataset description: Mus musculus genes (GRCm38.p2) | BioMart dataset version: GRCm38.p2 | Full dataset: yes | miRBase build ID: NA | transcript_nrow: 94929 | exon_nrow: 404385 | cds_nrow: 313584 | Db created by: GenomicFeatures package from Bioconductor | Creation time: 2014-03-07 15:20:13 -0800 (Fri, 07 Mar 2014) | GenomicFeatures version at creation time: 1.14.3 | RSQLite version at creation time: 0.11.4 | DBSCHEMAVERSION: 1.0 For whatever reason, I got a warning that the chromosome lengths and circularity flags are not available for this TranscriptDb object (I'll look into this). Not a big deal though, the TranscriptDb object should still be fully functional. Cheers, H. On 03/07/2014 01:59 PM, Veerendra Gadekar wrote: > Dear Members, > > Greetings! > > I am trying to create a Transcript db object, from the GTF file for > Musmusculus transcriptome data available at ensembl FTP site > > (version 75). ftp://ftp.ensembl.org/pub/release-75/gtf/mus_musculus > > I am using the function, "makeTranscriptDbFromGFF()" from the " > GenomicFeatures" bioconductor library. > > *Command used:* > >> library("GenomicFeatures") >> txdb_75 = makeTranscriptDbFromGFF (file = '/Mus_musculus.GRCm38.75.gtf', > exonRankAttributeName='exon_number', format='gtf', dataSource='ENSEMBL', > species='Mus musculus') > > *Error message:* > > extracting transcript information > Estimating transcript ranges. > Error in unique(sub$transcript_id) : > error in evaluating the argument 'x' in selecting a method for function > 'unique': Error in subs[[i]] : subscript out of bounds > > I am not able to understand the above error message. However, I could > create the transcript db object successfully for the GTF file > > downloaded from ensembl version 74. > ftp://ftp.ensembl.org/pub/release-74/gtf/mus_musculus/ > > *Session information:* > >> sessionInfo() > R version 3.0.2 (2013-09-25) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] data.table_1.9.2 biomaRt_2.16.0 GenomicFeatures_1.12.4 > [4] AnnotationDbi_1.22.6 Biobase_2.20.1 GenomicRanges_1.12.5 > [7] IRanges_1.18.4 BiocGenerics_0.6.0 BiocInstaller_1.10.4 > > loaded via a namespace (and not attached): > [1] Biostrings_2.28.0 bitops_1.0-6 BSgenome_1.28.0 DBI_0.2-7 > > [5] plyr_1.8.1 Rcpp_0.11.0 RCurl_1.95-4.1 > reshape2_1.2.2 > [9] Rsamtools_1.12.4 RSQLite_0.11.4 rtracklayer_1.20.4 stats4_3.0.2 > > [13] stringr_0.6.2 tools_3.0.2 XML_3.98-1.1 > zlibbioc_1.6.0 > > Your assistance in this regard would be very much appreciated. > > Thank you in advance. > > Best Regards, > > Veerendra > > > > > > > On Wed, Feb 13, 2013 at 1:32 PM, Veerendra GP <gpveerendra09 at="" gmail.com="">wrote: > >> Dear Valerie, >> >> Thank you very much for the descriptive reply and your suggestion, it >> indeed was a cryptic error for me but now I understand the cause. >> >> Thanks, >> Veerendra >> >> On Wed, Feb 13, 2013 at 12:19 AM, Valerie Obenchain <vobencha at="" fhcrc.org="">wrote: >> >>> Hello, >>> >>> Thanks for sending a reproducible example. Sorry for the cryptic error >>> message. That is something we could improve. >>> >>> The problem is that for the UCSC data, some transcripts in a single gene >>> are from different strands. In this case the strand Rle will have multiple >>> runValues (i.e. runValue longer than 1). >>> >>> When we summarize strand runValues by elementLength we see the UCSC data >>> has runValues of 1, 2, 3, 4 and 5. >>> >>> table(elementLengths(runValue(strand(GenegroupedTranscripts)))) >>>> >>> >>> 1 >>> 37315 >>> >>>> table(elementLengths(runValue(strand(UCSCGenegroupedTranscripts)))) >>>> >>> >>> 1 2 3 4 5 >>> 21686 72 1 1 1 >>> >>> Here is a closer look at the entry that has an elementLength of 3: >>> >>> UCSC_elts <- elementLengths(runValue(strand( >>>> UCSCGenegroupedTranscripts))) >>>> UCSCGenegroupedTranscripts[UCSC_elts == 3] >>>> >>> >>> GRangesList of length 1: >>> $108816 >>> GRanges with 5 ranges and 2 metadata columns: >>> seqnames ranges strand | tx_id tx_name >>> <rle> <iranges> <rle> | <integer> <character> >>> [1] chr4 [42452733, 42476567] + | 10641 uc008smu.1 >>> [2] chr4 [42471253, 42471291] + | 10642 uc008smv.1 >>> [3] chr4 [41902019, 41925853] - | 12216 uc008slb.1 >>> [4] chr4 [41907295, 41907333] - | 12217 uc008slg.1 >>> [5] chrUn_random [ 588639, 612480] + | 55359 uc012hdn.1 >>> >>> >>> The strand Rle for this gene is, >>> >>> strand(UCSCGenegroupedTranscripts[UCSC_elts == 3]) >>>> >>> CompressedRleList of length 1 >>> $`108816` >>> factor-Rle of length 5 with 3 runs >>> Lengths: 2 2 1 >>> Values : + - + >>> Levels(3): + - * >>> >>> In the case of multiple stands per list element (per gene in this case) >>> you can't use as.integer(). It looks like you are after a vector of a >>> single strand per gene. Depending on your use case you could remove these >>> genes or set the strand of these genes to '*'. >>> >>> >>> Valerie >>> >>> >>> >>> >>> On 02/12/2013 08:11 AM, Veerendra GP wrote: >>> >>>> Hello Everyone! >>>> >>>> Greetings! >>>> >>>> I am using "GenomicFeatures" library to create Transcript db objects for >>>> the mouse genome. >>>> I did it by using biomart, *makeTranscriptDbFromBiomart()* and also >>>> >>>> UCSC, makeTranscriptDbFromUCSC() >>>> functions. >>>> >>>> As I am interested in the strand information, I tried fetching the it >>>> using >>>> *as.integer(),* *runValue**()* & *strand()* functions. I am able to get >>>> it >>>> from the transcriptdb object, created using >>>> *makeTranscriptDbFromBiomart()* but >>>> >>>> ending up with an error when I am trying to do the same with >>>> transcriptdb object created using makeTranscriptDbFromUCSC() function*.* >>>> >>>> >>>> working code: >>>> >>>> mouseGNM <- makeTranscriptDbFromBiomart (biomart = "ensembl", dataset= >>>> "mmusculus_gene_ensembl"); >>>> saveDb(mouseGNM, file="MouseGNM.sqlite"); >>>> library("GenomicFeatures"); >>>> MouseGNM<- loadDb("MouseGNM.sqlite"); >>>> GenegroupedTranscripts<- transcriptsBy(MouseGNM, by = "gene"); >>>> strand.info <-as.integer(runValue(strand(GenegroupedTranscripts))); >>>> >>>> runValue(strand(GenegroupedTranscripts)) CompressedIntegerList of >>>>> length >>>>> >>>> 37315 [["ENSMUSG00000000001"]] 2 [["ENSMUSG00000000003"]] 2 >>>> [["ENSMUSG00000000028"]] 2 [["ENSMUSG00000000031"]] 2 >>>> [["ENSMUSG00000000037"]] 1 [["ENSMUSG00000000049"]] 1 >>>> [["ENSMUSG00000000056"]] 1 [["ENSMUSG00000000058"]] 1 >>>> [["ENSMUSG00000000078"]] 1 [["ENSMUSG00000000085"]] 1 ... <37305 more >>>> elements> >>>> >>>> strand.info[1:100] >>>>> >>>> [1] 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 2 2 1 1 2 1 >>>> 2 2 >>>> 1 2 [38] 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 2 1 2 1 2 2 1 1 1 1 1 2 >>>> 1 >>>> 2 2 2 2 [75] 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 1 1 2 1 2 2 2 1 1 2 1 >>>> >>>> here 2 is the antisense strand and 1 the sense strand >>>> >>>> >>>> UCSC.mouseGNM <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = >>>> "knownGene"); >>>> saveDb(UCSC.mouseGNM, file="UCSCMouseGNM.sqlite"); >>>> UCSCMouseGNM<- loadDb("UCSCMouseGNM.sqlite"); >>>> UCSCGenegroupedTranscripts<- transcriptsBy(UCSCMouseGNM, by = "gene"); >>>> strand.info2<- as.integer(runValue(strand(UCSCGenegroupedTranscripts))); >>>> >>>> runValue(strand(UCSCGenegroupedTranscripts)) >>>>> >>>> CompressedIntegerList of length 21761 >>>> [["100009600"]] 2 >>>> [["100009609"]] 2 >>>> [["100009614"]] 1 >>>> [["100012"]] 2 >>>> [["100017"]] 2 >>>> [["100019"]] 1 >>>> [["100033459"]] 1 >>>> [["100034251"]] 1 >>>> [["100034361"]] 2 >>>> [["100034684"]] 2 >>>> ... >>>> <21751 more elements> >>>> >>>> *ERROR:* >>>> >>>> strand.info2<- as.integer(runValue(strand( >>>>> UCSCGenegroupedTranscripts))); >>>>> >>>> Error in as.vector(x, mode = "integer") : coercing an AtomicList object >>>> to >>>> an atomic vector is supported only for objects with top-level elements of >>>> length <= 1 >>>> >>>> I am not able to understand this error message, could anyone let me know >>>> what is going wrong in the code. >>>> Here is the session information. >>>> >>>> sessionInfo() R version 2.15.2 (2012-10-26) Platform: >>>>> x86_64-pc-linux-gnu >>>>> >>>> (64-bit) locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C [3] >>>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_AU.UTF-8 [5] LC_MONETARY=en_US.UTF-8 >>>> LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C >>>> LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>> attached >>>> base packages: [1] stats graphics grDevices utils datasets methods base >>>> other attached packages: [1] GenomicFeatures_1.10.1 AnnotationDbi_1.20.3 >>>> Biobase_2.18.0 [4] GenomicRanges_1.10.5 IRanges_1.16.4 BiocGenerics_0.4.0 >>>> loaded via a namespace (and not attached): [1] biomaRt_2.14.0 >>>> Biostrings_2.26.2 bitops_1.0-5 BSgenome_1.26.1 [5] DBI_0.2-5 >>>> parallel_2.15.2 RCurl_1.95-3 Rsamtools_1.10.2 [9] RSQLite_0.11.2 >>>> rtracklayer_1.18.2 stats4_2.15.2 tools_2.15.2 [13] XML_3.95-0.1 >>>> zlibbioc_1.4.0 >>>> >>>> >>>> Thank you in advance. >>>> Regards, >>>> >>>> Veerendra. >>>> >>>> >>>> >>>> >>>> >>> >> >> >> -- >> GP.Veerendra >> PhD Student (Bioinformatics) >> Stazione Zoologica Anton Dohrn >> Naples, Italy >> M:+393663915221 >> > > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 11.9 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Veerendra, Just to let you know that I've fixed the warning you get with makeTranscriptDbFromGFF() that "the chromosome lengths and circularity flags are not available for this TranscriptDb object". The fix is in GenomicFeatures 1.14.5 (release) and 1.15.11 (devel), both of which will become available via biocLite() in the next hour or so. Cheers, H. On 03/07/2014 03:24 PM, Hervé Pagès wrote: > Hi Veerendra, > > Don't know about the error you get with makeTranscriptDbFromGFF(). > > However, please note that you should also be able to create that > TranscriptDb object thru the Ensembl Mart: > > library(GenomicFeatures) > txdb <- makeTranscriptDbFromBiomart("ensembl", "mmusculus_gene_ensembl") > > Then: > > > txdb > TranscriptDb object: > | Db type: TranscriptDb > | Supporting package: GenomicFeatures > | Data source: BioMart > | Organism: Mus musculus > | Resource URL: www.biomart.org:80 > | BioMart database: ensembl > | BioMart database version: ENSEMBL GENES 75 (SANGER UK) > | BioMart dataset: mmusculus_gene_ensembl > | BioMart dataset description: Mus musculus genes (GRCm38.p2) > | BioMart dataset version: GRCm38.p2 > | Full dataset: yes > | miRBase build ID: NA > | transcript_nrow: 94929 > | exon_nrow: 404385 > | cds_nrow: 313584 > | Db created by: GenomicFeatures package from Bioconductor > | Creation time: 2014-03-07 15:20:13 -0800 (Fri, 07 Mar 2014) > | GenomicFeatures version at creation time: 1.14.3 > | RSQLite version at creation time: 0.11.4 > | DBSCHEMAVERSION: 1.0 > > For whatever reason, I got a warning that the chromosome lengths and > circularity flags are not available for this TranscriptDb object (I'll > look into this). Not a big deal though, the TranscriptDb object should > still be fully functional. > > Cheers, > H. > > > On 03/07/2014 01:59 PM, Veerendra Gadekar wrote: >> Dear Members, >> >> Greetings! >> >> I am trying to create a Transcript db object, from the GTF file for >> Musmusculus transcriptome data available at ensembl FTP site >> >> (version 75). ftp://ftp.ensembl.org/pub/release-75/gtf/mus_musculus >> >> I am using the function, "makeTranscriptDbFromGFF()" from the " >> GenomicFeatures" bioconductor library. >> >> *Command used:* >> >>> library("GenomicFeatures") >>> txdb_75 = makeTranscriptDbFromGFF (file = '/Mus_musculus.GRCm38.75.gtf', >> exonRankAttributeName='exon_number', format='gtf', dataSource='ENSEMBL', >> species='Mus musculus') >> >> *Error message:* >> >> extracting transcript information >> Estimating transcript ranges. >> Error in unique(sub$transcript_id) : >> error in evaluating the argument 'x' in selecting a method for function >> 'unique': Error in subs[[i]] : subscript out of bounds >> >> I am not able to understand the above error message. However, I could >> create the transcript db object successfully for the GTF file >> >> downloaded from ensembl version 74. >> ftp://ftp.ensembl.org/pub/release-74/gtf/mus_musculus/ >> >> *Session information:* >> >>> sessionInfo() >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> [8] base >> >> other attached packages: >> [1] data.table_1.9.2 biomaRt_2.16.0 GenomicFeatures_1.12.4 >> [4] AnnotationDbi_1.22.6 Biobase_2.20.1 GenomicRanges_1.12.5 >> [7] IRanges_1.18.4 BiocGenerics_0.6.0 BiocInstaller_1.10.4 >> >> loaded via a namespace (and not attached): >> [1] Biostrings_2.28.0 bitops_1.0-6 BSgenome_1.28.0 DBI_0.2-7 >> >> [5] plyr_1.8.1 Rcpp_0.11.0 RCurl_1.95-4.1 >> reshape2_1.2.2 >> [9] Rsamtools_1.12.4 RSQLite_0.11.4 rtracklayer_1.20.4 >> stats4_3.0.2 >> >> [13] stringr_0.6.2 tools_3.0.2 XML_3.98-1.1 >> zlibbioc_1.6.0 >> >> Your assistance in this regard would be very much appreciated. >> >> Thank you in advance. >> >> Best Regards, >> >> Veerendra >> >> >> >> >> >> >> On Wed, Feb 13, 2013 at 1:32 PM, Veerendra GP >> <gpveerendra09 at="" gmail.com="">wrote: >> >>> Dear Valerie, >>> >>> Thank you very much for the descriptive reply and your suggestion, it >>> indeed was a cryptic error for me but now I understand the cause. >>> >>> Thanks, >>> Veerendra >>> >>> On Wed, Feb 13, 2013 at 12:19 AM, Valerie Obenchain >>> <vobencha at="" fhcrc.org="">wrote: >>> >>>> Hello, >>>> >>>> Thanks for sending a reproducible example. Sorry for the cryptic error >>>> message. That is something we could improve. >>>> >>>> The problem is that for the UCSC data, some transcripts in a single >>>> gene >>>> are from different strands. In this case the strand Rle will have >>>> multiple >>>> runValues (i.e. runValue longer than 1). >>>> >>>> When we summarize strand runValues by elementLength we see the UCSC >>>> data >>>> has runValues of 1, 2, 3, 4 and 5. >>>> >>>> table(elementLengths(runValue(strand(GenegroupedTranscripts)))) >>>>> >>>> >>>> 1 >>>> 37315 >>>> >>>>> table(elementLengths(runValue(strand(UCSCGenegroupedTranscripts)))) >>>>> >>>> >>>> 1 2 3 4 5 >>>> 21686 72 1 1 1 >>>> >>>> Here is a closer look at the entry that has an elementLength of 3: >>>> >>>> UCSC_elts <- elementLengths(runValue(strand( >>>>> UCSCGenegroupedTranscripts))) >>>>> UCSCGenegroupedTranscripts[UCSC_elts == 3] >>>>> >>>> >>>> GRangesList of length 1: >>>> $108816 >>>> GRanges with 5 ranges and 2 metadata columns: >>>> seqnames ranges strand | tx_id tx_name >>>> <rle> <iranges> <rle> | <integer> <character> >>>> [1] chr4 [42452733, 42476567] + | 10641 uc008smu.1 >>>> [2] chr4 [42471253, 42471291] + | 10642 uc008smv.1 >>>> [3] chr4 [41902019, 41925853] - | 12216 uc008slb.1 >>>> [4] chr4 [41907295, 41907333] - | 12217 uc008slg.1 >>>> [5] chrUn_random [ 588639, 612480] + | 55359 uc012hdn.1 >>>> >>>> >>>> The strand Rle for this gene is, >>>> >>>> strand(UCSCGenegroupedTranscripts[UCSC_elts == 3]) >>>>> >>>> CompressedRleList of length 1 >>>> $`108816` >>>> factor-Rle of length 5 with 3 runs >>>> Lengths: 2 2 1 >>>> Values : + - + >>>> Levels(3): + - * >>>> >>>> In the case of multiple stands per list element (per gene in this case) >>>> you can't use as.integer(). It looks like you are after a vector of a >>>> single strand per gene. Depending on your use case you could remove >>>> these >>>> genes or set the strand of these genes to '*'. >>>> >>>> >>>> Valerie >>>> >>>> >>>> >>>> >>>> On 02/12/2013 08:11 AM, Veerendra GP wrote: >>>> >>>>> Hello Everyone! >>>>> >>>>> Greetings! >>>>> >>>>> I am using "GenomicFeatures" library to create Transcript db >>>>> objects for >>>>> the mouse genome. >>>>> I did it by using biomart, *makeTranscriptDbFromBiomart()* and also >>>>> >>>>> UCSC, makeTranscriptDbFromUCSC() >>>>> functions. >>>>> >>>>> As I am interested in the strand information, I tried fetching the it >>>>> using >>>>> *as.integer(),* *runValue**()* & *strand()* functions. I am able to >>>>> get >>>>> it >>>>> from the transcriptdb object, created using >>>>> *makeTranscriptDbFromBiomart()* but >>>>> >>>>> ending up with an error when I am trying to do the same with >>>>> transcriptdb object created using makeTranscriptDbFromUCSC() >>>>> function*.* >>>>> >>>>> >>>>> working code: >>>>> >>>>> mouseGNM <- makeTranscriptDbFromBiomart (biomart = "ensembl", dataset= >>>>> "mmusculus_gene_ensembl"); >>>>> saveDb(mouseGNM, file="MouseGNM.sqlite"); >>>>> library("GenomicFeatures"); >>>>> MouseGNM<- loadDb("MouseGNM.sqlite"); >>>>> GenegroupedTranscripts<- transcriptsBy(MouseGNM, by = "gene"); >>>>> strand.info <-as.integer(runValue(strand(GenegroupedTranscripts))); >>>>> >>>>> runValue(strand(GenegroupedTranscripts)) CompressedIntegerList of >>>>>> length >>>>>> >>>>> 37315 [["ENSMUSG00000000001"]] 2 [["ENSMUSG00000000003"]] 2 >>>>> [["ENSMUSG00000000028"]] 2 [["ENSMUSG00000000031"]] 2 >>>>> [["ENSMUSG00000000037"]] 1 [["ENSMUSG00000000049"]] 1 >>>>> [["ENSMUSG00000000056"]] 1 [["ENSMUSG00000000058"]] 1 >>>>> [["ENSMUSG00000000078"]] 1 [["ENSMUSG00000000085"]] 1 ... <37305 more >>>>> elements> >>>>> >>>>> strand.info[1:100] >>>>>> >>>>> [1] 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 2 2 1 >>>>> 1 2 1 >>>>> 2 2 >>>>> 1 2 [38] 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 2 1 2 1 2 2 1 1 1 >>>>> 1 1 2 >>>>> 1 >>>>> 2 2 2 2 [75] 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 1 1 2 1 2 2 2 1 1 2 1 >>>>> >>>>> here 2 is the antisense strand and 1 the sense strand >>>>> >>>>> >>>>> UCSC.mouseGNM <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = >>>>> "knownGene"); >>>>> saveDb(UCSC.mouseGNM, file="UCSCMouseGNM.sqlite"); >>>>> UCSCMouseGNM<- loadDb("UCSCMouseGNM.sqlite"); >>>>> UCSCGenegroupedTranscripts<- transcriptsBy(UCSCMouseGNM, by = "gene"); >>>>> strand.info2<- >>>>> as.integer(runValue(strand(UCSCGenegroupedTranscripts))); >>>>> >>>>> runValue(strand(UCSCGenegroupedTranscripts)) >>>>>> >>>>> CompressedIntegerList of length 21761 >>>>> [["100009600"]] 2 >>>>> [["100009609"]] 2 >>>>> [["100009614"]] 1 >>>>> [["100012"]] 2 >>>>> [["100017"]] 2 >>>>> [["100019"]] 1 >>>>> [["100033459"]] 1 >>>>> [["100034251"]] 1 >>>>> [["100034361"]] 2 >>>>> [["100034684"]] 2 >>>>> ... >>>>> <21751 more elements> >>>>> >>>>> *ERROR:* >>>>> >>>>> strand.info2<- as.integer(runValue(strand( >>>>>> UCSCGenegroupedTranscripts))); >>>>>> >>>>> Error in as.vector(x, mode = "integer") : coercing an AtomicList >>>>> object >>>>> to >>>>> an atomic vector is supported only for objects with top-level >>>>> elements of >>>>> length <= 1 >>>>> >>>>> I am not able to understand this error message, could anyone let me >>>>> know >>>>> what is going wrong in the code. >>>>> Here is the session information. >>>>> >>>>> sessionInfo() R version 2.15.2 (2012-10-26) Platform: >>>>>> x86_64-pc-linux-gnu >>>>>> >>>>> (64-bit) locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C [3] >>>>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_AU.UTF-8 [5] LC_MONETARY=en_US.UTF-8 >>>>> LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C >>>>> LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>> attached >>>>> base packages: [1] stats graphics grDevices utils datasets methods >>>>> base >>>>> other attached packages: [1] GenomicFeatures_1.10.1 >>>>> AnnotationDbi_1.20.3 >>>>> Biobase_2.18.0 [4] GenomicRanges_1.10.5 IRanges_1.16.4 >>>>> BiocGenerics_0.4.0 >>>>> loaded via a namespace (and not attached): [1] biomaRt_2.14.0 >>>>> Biostrings_2.26.2 bitops_1.0-5 BSgenome_1.26.1 [5] DBI_0.2-5 >>>>> parallel_2.15.2 RCurl_1.95-3 Rsamtools_1.10.2 [9] RSQLite_0.11.2 >>>>> rtracklayer_1.18.2 stats4_2.15.2 tools_2.15.2 [13] XML_3.95-0.1 >>>>> zlibbioc_1.4.0 >>>>> >>>>> >>>>> Thank you in advance. >>>>> Regards, >>>>> >>>>> Veerendra. >>>>> >>>>> >>>>> >>>>> >>>>> >>>> >>> >>> >>> -- >>> GP.Veerendra >>> PhD Student (Bioinformatics) >>> Stazione Zoologica Anton Dohrn >>> Naples, Italy >>> M:+393663915221 >>> >> >> >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 11.9 years ago Hervé Pagès 16k

0

Entering edit mode

On 03/14/2014 10:46 AM, Hervé Pagès wrote: > Hi Veerendra, > > Just to let you know that I've fixed the warning you get with > makeTranscriptDbFromGFF() that "the chromosome lengths and > circularity flags are not available for this TranscriptDb object". I mean, I fixed the warning you get with makeTranscriptDbFromBiomart(), not with makeTranscriptDbFromGFF() (which I didn't touch). H. > The fix is in GenomicFeatures 1.14.5 (release) and 1.15.11 (devel), > both of which will become available via biocLite() in the next hour > or so. > > Cheers, > H. > > > On 03/07/2014 03:24 PM, Hervé Pagès wrote: >> Hi Veerendra, >> >> Don't know about the error you get with makeTranscriptDbFromGFF(). >> >> However, please note that you should also be able to create that >> TranscriptDb object thru the Ensembl Mart: >> >> library(GenomicFeatures) >> txdb <- makeTranscriptDbFromBiomart("ensembl", >> "mmusculus_gene_ensembl") >> >> Then: >> >> > txdb >> TranscriptDb object: >> | Db type: TranscriptDb >> | Supporting package: GenomicFeatures >> | Data source: BioMart >> | Organism: Mus musculus >> | Resource URL: www.biomart.org:80 >> | BioMart database: ensembl >> | BioMart database version: ENSEMBL GENES 75 (SANGER UK) >> | BioMart dataset: mmusculus_gene_ensembl >> | BioMart dataset description: Mus musculus genes (GRCm38.p2) >> | BioMart dataset version: GRCm38.p2 >> | Full dataset: yes >> | miRBase build ID: NA >> | transcript_nrow: 94929 >> | exon_nrow: 404385 >> | cds_nrow: 313584 >> | Db created by: GenomicFeatures package from Bioconductor >> | Creation time: 2014-03-07 15:20:13 -0800 (Fri, 07 Mar 2014) >> | GenomicFeatures version at creation time: 1.14.3 >> | RSQLite version at creation time: 0.11.4 >> | DBSCHEMAVERSION: 1.0 >> >> For whatever reason, I got a warning that the chromosome lengths and >> circularity flags are not available for this TranscriptDb object (I'll >> look into this). Not a big deal though, the TranscriptDb object should >> still be fully functional. >> >> Cheers, >> H. >> >> >> On 03/07/2014 01:59 PM, Veerendra Gadekar wrote: >>> Dear Members, >>> >>> Greetings! >>> >>> I am trying to create a Transcript db object, from the GTF file for >>> Musmusculus transcriptome data available at ensembl FTP site >>> >>> (version 75). ftp://ftp.ensembl.org/pub/release-75/gtf/mus_musculus >>> >>> I am using the function, "makeTranscriptDbFromGFF()" from the " >>> GenomicFeatures" bioconductor library. >>> >>> *Command used:* >>> >>>> library("GenomicFeatures") >>>> txdb_75 = makeTranscriptDbFromGFF (file = >>>> '/Mus_musculus.GRCm38.75.gtf', >>> exonRankAttributeName='exon_number', format='gtf', dataSource='ENSEMBL', >>> species='Mus musculus') >>> >>> *Error message:* >>> >>> extracting transcript information >>> Estimating transcript ranges. >>> Error in unique(sub$transcript_id) : >>> error in evaluating the argument 'x' in selecting a method for function >>> 'unique': Error in subs[[i]] : subscript out of bounds >>> >>> I am not able to understand the above error message. However, I could >>> create the transcript db object successfully for the GTF file >>> >>> downloaded from ensembl version 74. >>> ftp://ftp.ensembl.org/pub/release-74/gtf/mus_musculus/ >>> >>> *Session information:* >>> >>>> sessionInfo() >>> R version 3.0.2 (2013-09-25) >>> Platform: x86_64-pc-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] data.table_1.9.2 biomaRt_2.16.0 GenomicFeatures_1.12.4 >>> [4] AnnotationDbi_1.22.6 Biobase_2.20.1 GenomicRanges_1.12.5 >>> [7] IRanges_1.18.4 BiocGenerics_0.6.0 BiocInstaller_1.10.4 >>> >>> loaded via a namespace (and not attached): >>> [1] Biostrings_2.28.0 bitops_1.0-6 BSgenome_1.28.0 DBI_0.2-7 >>> >>> [5] plyr_1.8.1 Rcpp_0.11.0 RCurl_1.95-4.1 >>> reshape2_1.2.2 >>> [9] Rsamtools_1.12.4 RSQLite_0.11.4 rtracklayer_1.20.4 >>> stats4_3.0.2 >>> >>> [13] stringr_0.6.2 tools_3.0.2 XML_3.98-1.1 >>> zlibbioc_1.6.0 >>> >>> Your assistance in this regard would be very much appreciated. >>> >>> Thank you in advance. >>> >>> Best Regards, >>> >>> Veerendra >>> >>> >>> >>> >>> >>> >>> On Wed, Feb 13, 2013 at 1:32 PM, Veerendra GP >>> <gpveerendra09 at="" gmail.com="">wrote: >>> >>>> Dear Valerie, >>>> >>>> Thank you very much for the descriptive reply and your suggestion, it >>>> indeed was a cryptic error for me but now I understand the cause. >>>> >>>> Thanks, >>>> Veerendra >>>> >>>> On Wed, Feb 13, 2013 at 12:19 AM, Valerie Obenchain >>>> <vobencha at="" fhcrc.org="">wrote: >>>> >>>>> Hello, >>>>> >>>>> Thanks for sending a reproducible example. Sorry for the cryptic error >>>>> message. That is something we could improve. >>>>> >>>>> The problem is that for the UCSC data, some transcripts in a single >>>>> gene >>>>> are from different strands. In this case the strand Rle will have >>>>> multiple >>>>> runValues (i.e. runValue longer than 1). >>>>> >>>>> When we summarize strand runValues by elementLength we see the UCSC >>>>> data >>>>> has runValues of 1, 2, 3, 4 and 5. >>>>> >>>>> table(elementLengths(runValue(strand(GenegroupedTranscripts)))) >>>>>> >>>>> >>>>> 1 >>>>> 37315 >>>>> >>>>>> table(elementLengths(runValue(strand(UCSCGenegroupedTranscripts)))) >>>>>> >>>>> >>>>> 1 2 3 4 5 >>>>> 21686 72 1 1 1 >>>>> >>>>> Here is a closer look at the entry that has an elementLength of 3: >>>>> >>>>> UCSC_elts <- elementLengths(runValue(strand( >>>>>> UCSCGenegroupedTranscripts))) >>>>>> UCSCGenegroupedTranscripts[UCSC_elts == 3] >>>>>> >>>>> >>>>> GRangesList of length 1: >>>>> $108816 >>>>> GRanges with 5 ranges and 2 metadata columns: >>>>> seqnames ranges strand | tx_id >>>>> tx_name >>>>> <rle> <iranges> <rle> | <integer> >>>>> <character> >>>>> [1] chr4 [42452733, 42476567] + | 10641 >>>>> uc008smu.1 >>>>> [2] chr4 [42471253, 42471291] + | 10642 >>>>> uc008smv.1 >>>>> [3] chr4 [41902019, 41925853] - | 12216 >>>>> uc008slb.1 >>>>> [4] chr4 [41907295, 41907333] - | 12217 >>>>> uc008slg.1 >>>>> [5] chrUn_random [ 588639, 612480] + | 55359 >>>>> uc012hdn.1 >>>>> >>>>> >>>>> The strand Rle for this gene is, >>>>> >>>>> strand(UCSCGenegroupedTranscripts[UCSC_elts == 3]) >>>>>> >>>>> CompressedRleList of length 1 >>>>> $`108816` >>>>> factor-Rle of length 5 with 3 runs >>>>> Lengths: 2 2 1 >>>>> Values : + - + >>>>> Levels(3): + - * >>>>> >>>>> In the case of multiple stands per list element (per gene in this >>>>> case) >>>>> you can't use as.integer(). It looks like you are after a vector of a >>>>> single strand per gene. Depending on your use case you could remove >>>>> these >>>>> genes or set the strand of these genes to '*'. >>>>> >>>>> >>>>> Valerie >>>>> >>>>> >>>>> >>>>> >>>>> On 02/12/2013 08:11 AM, Veerendra GP wrote: >>>>> >>>>>> Hello Everyone! >>>>>> >>>>>> Greetings! >>>>>> >>>>>> I am using "GenomicFeatures" library to create Transcript db >>>>>> objects for >>>>>> the mouse genome. >>>>>> I did it by using biomart, *makeTranscriptDbFromBiomart()* and also >>>>>> >>>>>> UCSC, makeTranscriptDbFromUCSC() >>>>>> functions. >>>>>> >>>>>> As I am interested in the strand information, I tried fetching the it >>>>>> using >>>>>> *as.integer(),* *runValue**()* & *strand()* functions. I am able to >>>>>> get >>>>>> it >>>>>> from the transcriptdb object, created using >>>>>> *makeTranscriptDbFromBiomart()* but >>>>>> >>>>>> ending up with an error when I am trying to do the same with >>>>>> transcriptdb object created using makeTranscriptDbFromUCSC() >>>>>> function*.* >>>>>> >>>>>> >>>>>> working code: >>>>>> >>>>>> mouseGNM <- makeTranscriptDbFromBiomart (biomart = "ensembl", >>>>>> dataset= >>>>>> "mmusculus_gene_ensembl"); >>>>>> saveDb(mouseGNM, file="MouseGNM.sqlite"); >>>>>> library("GenomicFeatures"); >>>>>> MouseGNM<- loadDb("MouseGNM.sqlite"); >>>>>> GenegroupedTranscripts<- transcriptsBy(MouseGNM, by = "gene"); >>>>>> strand.info <-as.integer(runValue(strand(GenegroupedTranscripts))); >>>>>> >>>>>> runValue(strand(GenegroupedTranscripts)) CompressedIntegerList of >>>>>>> length >>>>>>> >>>>>> 37315 [["ENSMUSG00000000001"]] 2 [["ENSMUSG00000000003"]] 2 >>>>>> [["ENSMUSG00000000028"]] 2 [["ENSMUSG00000000031"]] 2 >>>>>> [["ENSMUSG00000000037"]] 1 [["ENSMUSG00000000049"]] 1 >>>>>> [["ENSMUSG00000000056"]] 1 [["ENSMUSG00000000058"]] 1 >>>>>> [["ENSMUSG00000000078"]] 1 [["ENSMUSG00000000085"]] 1 ... <37305 more >>>>>> elements> >>>>>> >>>>>> strand.info[1:100] >>>>>>> >>>>>> [1] 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 2 2 1 >>>>>> 1 2 1 >>>>>> 2 2 >>>>>> 1 2 [38] 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 2 1 2 1 2 2 1 1 1 >>>>>> 1 1 2 >>>>>> 1 >>>>>> 2 2 2 2 [75] 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 1 1 2 1 2 2 2 1 1 2 1 >>>>>> >>>>>> here 2 is the antisense strand and 1 the sense strand >>>>>> >>>>>> >>>>>> UCSC.mouseGNM <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = >>>>>> "knownGene"); >>>>>> saveDb(UCSC.mouseGNM, file="UCSCMouseGNM.sqlite"); >>>>>> UCSCMouseGNM<- loadDb("UCSCMouseGNM.sqlite"); >>>>>> UCSCGenegroupedTranscripts<- transcriptsBy(UCSCMouseGNM, by = >>>>>> "gene"); >>>>>> strand.info2<- >>>>>> as.integer(runValue(strand(UCSCGenegroupedTranscripts))); >>>>>> >>>>>> runValue(strand(UCSCGenegroupedTranscripts)) >>>>>>> >>>>>> CompressedIntegerList of length 21761 >>>>>> [["100009600"]] 2 >>>>>> [["100009609"]] 2 >>>>>> [["100009614"]] 1 >>>>>> [["100012"]] 2 >>>>>> [["100017"]] 2 >>>>>> [["100019"]] 1 >>>>>> [["100033459"]] 1 >>>>>> [["100034251"]] 1 >>>>>> [["100034361"]] 2 >>>>>> [["100034684"]] 2 >>>>>> ... >>>>>> <21751 more elements> >>>>>> >>>>>> *ERROR:* >>>>>> >>>>>> strand.info2<- as.integer(runValue(strand( >>>>>>> UCSCGenegroupedTranscripts))); >>>>>>> >>>>>> Error in as.vector(x, mode = "integer") : coercing an AtomicList >>>>>> object >>>>>> to >>>>>> an atomic vector is supported only for objects with top-level >>>>>> elements of >>>>>> length <= 1 >>>>>> >>>>>> I am not able to understand this error message, could anyone let me >>>>>> know >>>>>> what is going wrong in the code. >>>>>> Here is the session information. >>>>>> >>>>>> sessionInfo() R version 2.15.2 (2012-10-26) Platform: >>>>>>> x86_64-pc-linux-gnu >>>>>>> >>>>>> (64-bit) locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C [3] >>>>>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_AU.UTF-8 [5] >>>>>> LC_MONETARY=en_US.UTF-8 >>>>>> LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C >>>>>> LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>> attached >>>>>> base packages: [1] stats graphics grDevices utils datasets methods >>>>>> base >>>>>> other attached packages: [1] GenomicFeatures_1.10.1 >>>>>> AnnotationDbi_1.20.3 >>>>>> Biobase_2.18.0 [4] GenomicRanges_1.10.5 IRanges_1.16.4 >>>>>> BiocGenerics_0.4.0 >>>>>> loaded via a namespace (and not attached): [1] biomaRt_2.14.0 >>>>>> Biostrings_2.26.2 bitops_1.0-5 BSgenome_1.26.1 [5] DBI_0.2-5 >>>>>> parallel_2.15.2 RCurl_1.95-3 Rsamtools_1.10.2 [9] RSQLite_0.11.2 >>>>>> rtracklayer_1.18.2 stats4_2.15.2 tools_2.15.2 [13] XML_3.95-0.1 >>>>>> zlibbioc_1.4.0 >>>>>> >>>>>> >>>>>> Thank you in advance. >>>>>> Regards, >>>>>> >>>>>> Veerendra. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>>> -- >>>> GP.Veerendra >>>> PhD Student (Bioinformatics) >>>> Stazione Zoologica Anton Dohrn >>>> Naples, Italy >>>> M:+393663915221 >>>> >>> >>> >>> >> > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 11.9 years ago Hervé Pagès 16k

0

Entering edit mode

Dear Hervé, Thank you very much for your email. I am going to update my "GenomicFeatures" right now! Just for your information, I tried using "makeTranscriptDbFromBiomart()", as instructed by you in your previous email. It worked fine! Cheers, Veerendra On Fri, Mar 14, 2014 at 6:46 PM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Veerendra, > > Just to let you know that I've fixed the warning you get with > makeTranscriptDbFromGFF() that "the chromosome lengths and > circularity flags are not available for this TranscriptDb object". > The fix is in GenomicFeatures 1.14.5 (release) and 1.15.11 (devel), > both of which will become available via biocLite() in the next hour > or so. > > Cheers, > H. > > > > On 03/07/2014 03:24 PM, Hervé Pagès wrote: > >> Hi Veerendra, >> >> Don't know about the error you get with makeTranscriptDbFromGFF(). >> >> However, please note that you should also be able to create that >> TranscriptDb object thru the Ensembl Mart: >> >> library(GenomicFeatures) >> txdb <- makeTranscriptDbFromBiomart("ensembl", >> "mmusculus_gene_ensembl") >> >> Then: >> >> > txdb >> TranscriptDb object: >> | Db type: TranscriptDb >> | Supporting package: GenomicFeatures >> | Data source: BioMart >> | Organism: Mus musculus >> | Resource URL: www.biomart.org:80 >> | BioMart database: ensembl >> | BioMart database version: ENSEMBL GENES 75 (SANGER UK) >> | BioMart dataset: mmusculus_gene_ensembl >> | BioMart dataset description: Mus musculus genes (GRCm38.p2) >> | BioMart dataset version: GRCm38.p2 >> | Full dataset: yes >> | miRBase build ID: NA >> | transcript_nrow: 94929 >> | exon_nrow: 404385 >> | cds_nrow: 313584 >> | Db created by: GenomicFeatures package from Bioconductor >> | Creation time: 2014-03-07 15:20:13 -0800 (Fri, 07 Mar 2014) >> | GenomicFeatures version at creation time: 1.14.3 >> | RSQLite version at creation time: 0.11.4 >> | DBSCHEMAVERSION: 1.0 >> >> For whatever reason, I got a warning that the chromosome lengths and >> circularity flags are not available for this TranscriptDb object (I'll >> look into this). Not a big deal though, the TranscriptDb object should >> still be fully functional. >> >> Cheers, >> H. >> >> >> On 03/07/2014 01:59 PM, Veerendra Gadekar wrote: >> >>> Dear Members, >>> >>> Greetings! >>> >>> I am trying to create a Transcript db object, from the GTF file for >>> Musmusculus transcriptome data available at ensembl FTP site >>> >>> (version 75). ftp://ftp.ensembl.org/pub/release-75/gtf/mus_musculus >>> >>> I am using the function, "makeTranscriptDbFromGFF()" from the " >>> GenomicFeatures" bioconductor library. >>> >>> *Command used:* >>> >>> library("GenomicFeatures") >>>> txdb_75 = makeTranscriptDbFromGFF (file = '/Mus_musculus.GRCm38.75.gtf', >>>> >>> exonRankAttributeName='exon_number', format='gtf', dataSource='ENSEMBL', >>> species='Mus musculus') >>> >>> *Error message:* >>> >>> extracting transcript information >>> Estimating transcript ranges. >>> Error in unique(sub$transcript_id) : >>> error in evaluating the argument 'x' in selecting a method for function >>> 'unique': Error in subs[[i]] : subscript out of bounds >>> >>> I am not able to understand the above error message. However, I could >>> create the transcript db object successfully for the GTF file >>> >>> downloaded from ensembl version 74. >>> ftp://ftp.ensembl.org/pub/release-74/gtf/mus_musculus/ >>> >>> *Session information:* >>> >>> sessionInfo() >>>> >>> R version 3.0.2 (2013-09-25) >>> Platform: x86_64-pc-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> [8] base >>> >>> other attached packages: >>> [1] data.table_1.9.2 biomaRt_2.16.0 GenomicFeatures_1.12.4 >>> [4] AnnotationDbi_1.22.6 Biobase_2.20.1 GenomicRanges_1.12.5 >>> [7] IRanges_1.18.4 BiocGenerics_0.6.0 BiocInstaller_1.10.4 >>> >>> loaded via a namespace (and not attached): >>> [1] Biostrings_2.28.0 bitops_1.0-6 BSgenome_1.28.0 DBI_0.2-7 >>> >>> [5] plyr_1.8.1 Rcpp_0.11.0 RCurl_1.95-4.1 >>> reshape2_1.2.2 >>> [9] Rsamtools_1.12.4 RSQLite_0.11.4 rtracklayer_1.20.4 >>> stats4_3.0.2 >>> >>> [13] stringr_0.6.2 tools_3.0.2 XML_3.98-1.1 >>> zlibbioc_1.6.0 >>> >>> Your assistance in this regard would be very much appreciated. >>> >>> Thank you in advance. >>> >>> Best Regards, >>> >>> Veerendra >>> >>> >>> >>> >>> >>> >>> On Wed, Feb 13, 2013 at 1:32 PM, Veerendra GP >>> <gpveerendra09@gmail.com>wrote: >>> >>> Dear Valerie, >>>> >>>> Thank you very much for the descriptive reply and your suggestion, it >>>> indeed was a cryptic error for me but now I understand the cause. >>>> >>>> Thanks, >>>> Veerendra >>>> >>>> On Wed, Feb 13, 2013 at 12:19 AM, Valerie Obenchain >>>> <vobencha@fhcrc.org>wrote: >>>> >>>> Hello, >>>>> >>>>> Thanks for sending a reproducible example. Sorry for the cryptic error >>>>> message. That is something we could improve. >>>>> >>>>> The problem is that for the UCSC data, some transcripts in a single >>>>> gene >>>>> are from different strands. In this case the strand Rle will have >>>>> multiple >>>>> runValues (i.e. runValue longer than 1). >>>>> >>>>> When we summarize strand runValues by elementLength we see the UCSC >>>>> data >>>>> has runValues of 1, 2, 3, 4 and 5. >>>>> >>>>> table(elementLengths(runValue(strand(GenegroupedTranscripts)))) >>>>> >>>>>> >>>>>> >>>>> 1 >>>>> 37315 >>>>> >>>>> table(elementLengths(runValue(strand(UCSCGenegroupedTranscripts)))) >>>>>> >>>>>> >>>>> 1 2 3 4 5 >>>>> 21686 72 1 1 1 >>>>> >>>>> Here is a closer look at the entry that has an elementLength of 3: >>>>> >>>>> UCSC_elts <- elementLengths(runValue(strand( >>>>> >>>>>> UCSCGenegroupedTranscripts))) >>>>>> UCSCGenegroupedTranscripts[UCSC_elts == 3] >>>>>> >>>>>> >>>>> GRangesList of length 1: >>>>> $108816 >>>>> GRanges with 5 ranges and 2 metadata columns: >>>>> seqnames ranges strand | tx_id tx_name >>>>> <rle> <iranges> <rle> | <integer> <character> >>>>> [1] chr4 [42452733, 42476567] + | 10641 uc008smu.1 >>>>> [2] chr4 [42471253, 42471291] + | 10642 uc008smv.1 >>>>> [3] chr4 [41902019, 41925853] - | 12216 uc008slb.1 >>>>> [4] chr4 [41907295, 41907333] - | 12217 uc008slg.1 >>>>> [5] chrUn_random [ 588639, 612480] + | 55359 uc012hdn.1 >>>>> >>>>> >>>>> The strand Rle for this gene is, >>>>> >>>>> strand(UCSCGenegroupedTranscripts[UCSC_elts == 3]) >>>>> >>>>>> >>>>>> CompressedRleList of length 1 >>>>> $`108816` >>>>> factor-Rle of length 5 with 3 runs >>>>> Lengths: 2 2 1 >>>>> Values : + - + >>>>> Levels(3): + - * >>>>> >>>>> In the case of multiple stands per list element (per gene in this case) >>>>> you can't use as.integer(). It looks like you are after a vector of a >>>>> single strand per gene. Depending on your use case you could remove >>>>> these >>>>> genes or set the strand of these genes to '*'. >>>>> >>>>> >>>>> Valerie >>>>> >>>>> >>>>> >>>>> >>>>> On 02/12/2013 08:11 AM, Veerendra GP wrote: >>>>> >>>>> Hello Everyone! >>>>>> >>>>>> Greetings! >>>>>> >>>>>> I am using "GenomicFeatures" library to create Transcript db >>>>>> objects for >>>>>> the mouse genome. >>>>>> I did it by using biomart, *makeTranscriptDbFromBiomart()* and also >>>>>> >>>>>> UCSC, makeTranscriptDbFromUCSC() >>>>>> functions. >>>>>> >>>>>> As I am interested in the strand information, I tried fetching the it >>>>>> using >>>>>> *as.integer(),* *runValue**()* & *strand()* functions. I am able to >>>>>> get >>>>>> it >>>>>> from the transcriptdb object, created using >>>>>> *makeTranscriptDbFromBiomart()* but >>>>>> >>>>>> ending up with an error when I am trying to do the same with >>>>>> transcriptdb object created using makeTranscriptDbFromUCSC() >>>>>> function*.* >>>>>> >>>>>> >>>>>> working code: >>>>>> >>>>>> mouseGNM <- makeTranscriptDbFromBiomart (biomart = "ensembl", dataset= >>>>>> "mmusculus_gene_ensembl"); >>>>>> saveDb(mouseGNM, file="MouseGNM.sqlite"); >>>>>> library("GenomicFeatures"); >>>>>> MouseGNM<- loadDb("MouseGNM.sqlite"); >>>>>> GenegroupedTranscripts<- transcriptsBy(MouseGNM, by = "gene"); >>>>>> strand.info <-as.integer(runValue(strand(GenegroupedTranscripts))); >>>>>> >>>>>> runValue(strand(GenegroupedTranscripts)) CompressedIntegerList of >>>>>> >>>>>>> length >>>>>>> >>>>>>> 37315 [["ENSMUSG00000000001"]] 2 [["ENSMUSG00000000003"]] 2 >>>>>> [["ENSMUSG00000000028"]] 2 [["ENSMUSG00000000031"]] 2 >>>>>> [["ENSMUSG00000000037"]] 1 [["ENSMUSG00000000049"]] 1 >>>>>> [["ENSMUSG00000000056"]] 1 [["ENSMUSG00000000058"]] 1 >>>>>> [["ENSMUSG00000000078"]] 1 [["ENSMUSG00000000085"]] 1 ... <37305 more >>>>>> elements> >>>>>> >>>>>> strand.info[1:100] >>>>>> >>>>>>> >>>>>>> [1] 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 2 2 1 >>>>>> 1 2 1 >>>>>> 2 2 >>>>>> 1 2 [38] 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 2 1 2 1 2 2 1 1 1 >>>>>> 1 1 2 >>>>>> 1 >>>>>> 2 2 2 2 [75] 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 1 1 2 1 2 2 2 1 1 2 1 >>>>>> >>>>>> here 2 is the antisense strand and 1 the sense strand >>>>>> >>>>>> >>>>>> UCSC.mouseGNM <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = >>>>>> "knownGene"); >>>>>> saveDb(UCSC.mouseGNM, file="UCSCMouseGNM.sqlite"); >>>>>> UCSCMouseGNM<- loadDb("UCSCMouseGNM.sqlite"); >>>>>> UCSCGenegroupedTranscripts<- transcriptsBy(UCSCMouseGNM, by = "gene"); >>>>>> strand.info2<- >>>>>> as.integer(runValue(strand(UCSCGenegroupedTranscripts))); >>>>>> >>>>>> runValue(strand(UCSCGenegroupedTranscripts)) >>>>>> >>>>>>> >>>>>>> CompressedIntegerList of length 21761 >>>>>> [["100009600"]] 2 >>>>>> [["100009609"]] 2 >>>>>> [["100009614"]] 1 >>>>>> [["100012"]] 2 >>>>>> [["100017"]] 2 >>>>>> [["100019"]] 1 >>>>>> [["100033459"]] 1 >>>>>> [["100034251"]] 1 >>>>>> [["100034361"]] 2 >>>>>> [["100034684"]] 2 >>>>>> ... >>>>>> <21751 more elements> >>>>>> >>>>>> *ERROR:* >>>>>> >>>>>> strand.info2<- as.integer(runValue(strand( >>>>>> >>>>>>> UCSCGenegroupedTranscripts))); >>>>>>> >>>>>>> Error in as.vector(x, mode = "integer") : coercing an AtomicList >>>>>> object >>>>>> to >>>>>> an atomic vector is supported only for objects with top-level >>>>>> elements of >>>>>> length <= 1 >>>>>> >>>>>> I am not able to understand this error message, could anyone let me >>>>>> know >>>>>> what is going wrong in the code. >>>>>> Here is the session information. >>>>>> >>>>>> sessionInfo() R version 2.15.2 (2012-10-26) Platform: >>>>>> >>>>>>> x86_64-pc-linux-gnu >>>>>>> >>>>>>> (64-bit) locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C [3] >>>>>> LC_TIME=en_US.UTF-8 LC_COLLATE=en_AU.UTF-8 [5] LC_MONETARY=en_US.UTF-8 >>>>>> LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C >>>>>> LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>>>>> attached >>>>>> base packages: [1] stats graphics grDevices utils datasets methods >>>>>> base >>>>>> other attached packages: [1] GenomicFeatures_1.10.1 >>>>>> AnnotationDbi_1.20.3 >>>>>> Biobase_2.18.0 [4] GenomicRanges_1.10.5 IRanges_1.16.4 >>>>>> BiocGenerics_0.4.0 >>>>>> loaded via a namespace (and not attached): [1] biomaRt_2.14.0 >>>>>> Biostrings_2.26.2 bitops_1.0-5 BSgenome_1.26.1 [5] DBI_0.2-5 >>>>>> parallel_2.15.2 RCurl_1.95-3 Rsamtools_1.10.2 [9] RSQLite_0.11.2 >>>>>> rtracklayer_1.18.2 stats4_2.15.2 tools_2.15.2 [13] XML_3.95-0.1 >>>>>> zlibbioc_1.4.0 >>>>>> >>>>>> >>>>>> Thank you in advance. >>>>>> Regards, >>>>>> >>>>>> Veerendra. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> -- >>>> GP.Veerendra >>>> PhD Student (Bioinformatics) >>>> Stazione Zoologica Anton Dohrn >>>> Naples, Italy >>>> M:+393663915221 >>>> >>>> >>> >>> >>> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > -- Veerendra Gadekar PhD Student (Bioinformatics) Animal Physiology and Evolution Stazione Zoologica Anton Dohrn Villa Comunale, 80121 Naples - Italy M:+393663915221 [[alternative HTML version deleted]]

ADD REPLY • link 11.9 years ago Veerendra GP ▴ 100