Question: Extracting strand information from GenomicFeatures transcript db objects
0
6.5 years ago by
Veerendra GP100
Veerendra GP100 wrote:
Hello Everyone! Greetings! I am using "GenomicFeatures" library to create Transcript db objects for the mouse genome. I did it by using biomart, *makeTranscriptDbFromBiomart()* and also UCSC, makeTranscriptDbFromUCSC() functions. As I am interested in the strand information, I tried fetching the it using *as.integer(),* *runValue**()* & *strand()* functions. I am able to get it from the transcriptdb object, created using *makeTranscriptDbFromBiomart()* but ending up with an error when I am trying to do the same with transcriptdb object created using makeTranscriptDbFromUCSC() function*.* working code: mouseGNM <- makeTranscriptDbFromBiomart (biomart = "ensembl", dataset= "mmusculus_gene_ensembl"); saveDb(mouseGNM, file="MouseGNM.sqlite"); library("GenomicFeatures"); MouseGNM<- loadDb("MouseGNM.sqlite"); GenegroupedTranscripts<- transcriptsBy(MouseGNM, by = "gene"); strand.info <-as.integer(runValue(strand(GenegroupedTranscripts))); > runValue(strand(GenegroupedTranscripts)) CompressedIntegerList of length 37315 [["ENSMUSG00000000001"]] 2 [["ENSMUSG00000000003"]] 2 [["ENSMUSG00000000028"]] 2 [["ENSMUSG00000000031"]] 2 [["ENSMUSG00000000037"]] 1 [["ENSMUSG00000000049"]] 1 [["ENSMUSG00000000056"]] 1 [["ENSMUSG00000000058"]] 1 [["ENSMUSG00000000078"]] 1 [["ENSMUSG00000000085"]] 1 ... <37305 more elements> > strand.info[1:100] [1] 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 2 2 1 1 2 1 2 2 1 2 [38] 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 2 1 2 1 2 2 1 1 1 1 1 2 1 2 2 2 2 [75] 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 1 1 2 1 2 2 2 1 1 2 1 here 2 is the antisense strand and 1 the sense strand UCSC.mouseGNM <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "knownGene"); saveDb(UCSC.mouseGNM, file="UCSCMouseGNM.sqlite"); UCSCMouseGNM<- loadDb("UCSCMouseGNM.sqlite"); UCSCGenegroupedTranscripts<- transcriptsBy(UCSCMouseGNM, by = "gene"); strand.info2<- as.integer(runValue(strand(UCSCGenegroupedTranscripts))); > runValue(strand(UCSCGenegroupedTranscripts)) CompressedIntegerList of length 21761 [["100009600"]] 2 [["100009609"]] 2 [["100009614"]] 1 [["100012"]] 2 [["100017"]] 2 [["100019"]] 1 [["100033459"]] 1 [["100034251"]] 1 [["100034361"]] 2 [["100034684"]] 2 ... <21751 more elements> *ERROR:* > strand.info2<- as.integer(runValue(strand(UCSCGenegroupedTranscripts))); Error in as.vector(x, mode = "integer") : coercing an AtomicList object to an atomic vector is supported only for objects with top-level elements of length <= 1 I am not able to understand this error message, could anyone let me know what is going wrong in the code. Here is the session information. > sessionInfo() R version 2.15.2 (2012-10-26) Platform: x86_64-pc- linux-gnu (64-bit) locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_AU.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] GenomicFeatures_1.10.1 AnnotationDbi_1.20.3 Biobase_2.18.0 [4] GenomicRanges_1.10.5 IRanges_1.16.4 BiocGenerics_0.4.0 loaded via a namespace (and not attached): [1] biomaRt_2.14.0 Biostrings_2.26.2 bitops_1.0-5 BSgenome_1.26.1 [5] DBI_0.2-5 parallel_2.15.2 RCurl_1.95-3 Rsamtools_1.10.2 [9] RSQLite_0.11.2 rtracklayer_1.18.2 stats4_2.15.2 tools_2.15.2 [13] XML_3.95-0.1 zlibbioc_1.4.0 Thank you in advance. Regards, Veerendra. -- GP.Veerendra PhD Student (Bioinformatics) Stazione Zoologica Anton Dohrn Naples, Italy M:+393663915221 [[alternative HTML version deleted]]
transcriptdb biomart • 896 views
modified 6.5 years ago by Valerie Obenchain6.7k • written 6.5 years ago by Veerendra GP100
Answer: Extracting strand information from GenomicFeatures transcript db objects
0
6.5 years ago by
United States
Valerie Obenchain6.7k wrote:
Hello, Thanks for sending a reproducible example. Sorry for the cryptic error message. That is something we could improve. The problem is that for the UCSC data, some transcripts in a single gene are from different strands. In this case the strand Rle will have multiple runValues (i.e. runValue longer than 1). When we summarize strand runValues by elementLength we see the UCSC data has runValues of 1, 2, 3, 4 and 5. > table(elementLengths(runValue(strand(GenegroupedTranscripts)))) 1 37315 > table(elementLengths(runValue(strand(UCSCGenegroupedTranscripts)))) 1 2 3 4 5 21686 72 1 1 1 Here is a closer look at the entry that has an elementLength of 3: > UCSC_elts <- elementLengths(runValue(strand(UCSCGenegroupedTranscripts))) > UCSCGenegroupedTranscripts[UCSC_elts == 3] GRangesList of length 1: $108816 GRanges with 5 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <rle> <iranges> <rle> | <integer> <character> [1] chr4 [42452733, 42476567] + | 10641 uc008smu.1 [2] chr4 [42471253, 42471291] + | 10642 uc008smv.1 [3] chr4 [41902019, 41925853] - | 12216 uc008slb.1 [4] chr4 [41907295, 41907333] - | 12217 uc008slg.1 [5] chrUn_random [ 588639, 612480] + | 55359 uc012hdn.1 The strand Rle for this gene is, > strand(UCSCGenegroupedTranscripts[UCSC_elts == 3]) CompressedRleList of length 1$108816 factor-Rle of length 5 with 3 runs Lengths: 2 2 1 Values : + - + Levels(3): + - * In the case of multiple stands per list element (per gene in this case) you can't use as.integer(). It looks like you are after a vector of a single strand per gene. Depending on your use case you could remove these genes or set the strand of these genes to '*'. Valerie On 02/12/2013 08:11 AM, Veerendra GP wrote: > Hello Everyone! > > Greetings! > > I am using "GenomicFeatures" library to create Transcript db objects for > the mouse genome. > I did it by using biomart, *makeTranscriptDbFromBiomart()* and also > UCSC, makeTranscriptDbFromUCSC() > functions. > > As I am interested in the strand information, I tried fetching the it using > *as.integer(),* *runValue**()* & *strand()* functions. I am able to get it > from the transcriptdb object, created using *makeTranscriptDbFromBiomart()* but > ending up with an error when I am trying to do the same with > transcriptdb object created using makeTranscriptDbFromUCSC() function*.* > > working code: > > mouseGNM <- makeTranscriptDbFromBiomart (biomart = "ensembl", dataset= > "mmusculus_gene_ensembl"); > saveDb(mouseGNM, file="MouseGNM.sqlite"); > library("GenomicFeatures"); > MouseGNM<- loadDb("MouseGNM.sqlite"); > GenegroupedTranscripts<- transcriptsBy(MouseGNM, by = "gene"); > strand.info <-as.integer(runValue(strand(GenegroupedTranscripts))); > >> runValue(strand(GenegroupedTranscripts)) CompressedIntegerList of length > 37315 [["ENSMUSG00000000001"]] 2 [["ENSMUSG00000000003"]] 2 > [["ENSMUSG00000000028"]] 2 [["ENSMUSG00000000031"]] 2 > [["ENSMUSG00000000037"]] 1 [["ENSMUSG00000000049"]] 1 > [["ENSMUSG00000000056"]] 1 [["ENSMUSG00000000058"]] 1 > [["ENSMUSG00000000078"]] 1 [["ENSMUSG00000000085"]] 1 ... <37305 more > elements> > >> strand.info[1:100] > [1] 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 2 2 1 1 2 1 2 2 > 1 2 [38] 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 2 1 2 1 2 2 1 1 1 1 1 2 1 > 2 2 2 2 [75] 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 1 1 2 1 2 2 2 1 1 2 1 > > here 2 is the antisense strand and 1 the sense strand > > > UCSC.mouseGNM <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = > "knownGene"); > saveDb(UCSC.mouseGNM, file="UCSCMouseGNM.sqlite"); > UCSCMouseGNM<- loadDb("UCSCMouseGNM.sqlite"); > UCSCGenegroupedTranscripts<- transcriptsBy(UCSCMouseGNM, by = "gene"); > strand.info2<- as.integer(runValue(strand(UCSCGenegroupedTranscripts))); > >> runValue(strand(UCSCGenegroupedTranscripts)) > CompressedIntegerList of length 21761 > [["100009600"]] 2 > [["100009609"]] 2 > [["100009614"]] 1 > [["100012"]] 2 > [["100017"]] 2 > [["100019"]] 1 > [["100033459"]] 1 > [["100034251"]] 1 > [["100034361"]] 2 > [["100034684"]] 2 > ... > <21751 more elements> > > *ERROR:* >> strand.info2<- as.integer(runValue(strand(UCSCGenegroupedTranscripts))); > Error in as.vector(x, mode = "integer") : coercing an AtomicList object to > an atomic vector is supported only for objects with top-level elements of > length <= 1 > > I am not able to understand this error message, could anyone let me know > what is going wrong in the code. > Here is the session information. > >> sessionInfo() R version 2.15.2 (2012-10-26) Platform: x86_64-pc- linux-gnu > (64-bit) locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C [3] > LC_TIME=en_US.UTF-8 LC_COLLATE=en_AU.UTF-8 [5] LC_MONETARY=en_US.UTF-8 > LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C > LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached > base packages: [1] stats graphics grDevices utils datasets methods base > other attached packages: [1] GenomicFeatures_1.10.1 AnnotationDbi_1.20.3 > Biobase_2.18.0 [4] GenomicRanges_1.10.5 IRanges_1.16.4 BiocGenerics_0.4.0 > loaded via a namespace (and not attached): [1] biomaRt_2.14.0 > Biostrings_2.26.2 bitops_1.0-5 BSgenome_1.26.1 [5] DBI_0.2-5 > parallel_2.15.2 RCurl_1.95-3 Rsamtools_1.10.2 [9] RSQLite_0.11.2 > rtracklayer_1.18.2 stats4_2.15.2 tools_2.15.2 [13] XML_3.95-0.1 > zlibbioc_1.4.0 > > > Thank you in advance. > Regards, > > Veerendra. > > > >
Dear Valerie, Thank you very much for the descriptive reply and your suggestion, it indeed was a cryptic error for me but now I understand the cause. Thanks, Veerendra On Wed, Feb 13, 2013 at 12:19 AM, Valerie Obenchain <vobencha@fhcrc.org>wrote: > Hello, > > Thanks for sending a reproducible example. Sorry for the cryptic error > message. That is something we could improve. > > The problem is that for the UCSC data, some transcripts in a single gene > are from different strands. In this case the strand Rle will have multiple > runValues (i.e. runValue longer than 1). > > When we summarize strand runValues by elementLength we see the UCSC data > has runValues of 1, 2, 3, 4 and 5. > > table(elementLengths(runValue(**strand(GenegroupedTranscripts)**))) >> > > 1 > 37315 > >> table(elementLengths(runValue(**strand(**UCSCGenegroupedTranscripts )))) >> > > 1 2 3 4 5 > 21686 72 1 1 1 > > Here is a closer look at the entry that has an elementLength of 3: > > UCSC_elts <- elementLengths(runValue(**strand(** >> UCSCGenegroupedTranscripts))) >> UCSCGenegroupedTranscripts[**UCSC_elts == 3] >> > > GRangesList of length 1: > $108816 > GRanges with 5 ranges and 2 metadata columns: > seqnames ranges strand | tx_id tx_name > <rle> <iranges> <rle> | <integer> <character> > [1] chr4 [42452733, 42476567] + | 10641 uc008smu.1 > [2] chr4 [42471253, 42471291] + | 10642 uc008smv.1 > [3] chr4 [41902019, 41925853] - | 12216 uc008slb.1 > [4] chr4 [41907295, 41907333] - | 12217 uc008slg.1 > [5] chrUn_random [ 588639, 612480] + | 55359 uc012hdn.1 > > > The strand Rle for this gene is, > > strand(**UCSCGenegroupedTranscripts[**UCSC_elts == 3]) >> > CompressedRleList of length 1 >$108816 > factor-Rle of length 5 with 3 runs > Lengths: 2 2 1 > Values : + - + > Levels(3): + - * > > In the case of multiple stands per list element (per gene in this case) > you can't use as.integer(). It looks like you are after a vector of a > single strand per gene. Depending on your use case you could remove these > genes or set the strand of these genes to '*'. > > > Valerie > > > > > On 02/12/2013 08:11 AM, Veerendra GP wrote: > >> Hello Everyone! >> >> Greetings! >> >> I am using "GenomicFeatures" library to create Transcript db objects for >> the mouse genome. >> I did it by using biomart, *makeTranscriptDbFromBiomart()*** and also >> >> UCSC, makeTranscriptDbFromUCSC() >> functions. >> >> As I am interested in the strand information, I tried fetching the it >> using >> *as.integer(),* *runValue**()* & *strand()* functions. I am able to get it >> from the transcriptdb object, created using *makeTranscriptDbFromBiomart() >> *** but >> >> ending up with an error when I am trying to do the same with >> transcriptdb object created using makeTranscriptDbFromUCSC() function*.* >> >> >> working code: >> >> mouseGNM <- makeTranscriptDbFromBiomart (biomart = "ensembl", dataset= >> "mmusculus_gene_ensembl"); >> saveDb(mouseGNM, file="MouseGNM.sqlite"); >> library("GenomicFeatures"); >> MouseGNM<- loadDb("MouseGNM.sqlite"); >> GenegroupedTranscripts<- transcriptsBy(MouseGNM, by = "gene"); >> strand.info <-as.integer(runValue(strand(**GenegroupedTranscripts))); >> >> runValue(strand(**GenegroupedTranscripts)) CompressedIntegerList of >>> length >>> >> 37315 [["ENSMUSG00000000001"]] 2 [["ENSMUSG00000000003"]] 2 >> [["ENSMUSG00000000028"]] 2 [["ENSMUSG00000000031"]] 2 >> [["ENSMUSG00000000037"]] 1 [["ENSMUSG00000000049"]] 1 >> [["ENSMUSG00000000056"]] 1 [["ENSMUSG00000000058"]] 1 >> [["ENSMUSG00000000078"]] 1 [["ENSMUSG00000000085"]] 1 ... <37305 more >> elements> >> >> strand.info[1:100] >>> >> [1] 2 2 2 2 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 2 2 1 1 2 1 2 >> 2 >> 1 2 [38] 2 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 2 1 1 1 2 1 2 1 2 2 1 1 1 1 1 2 1 >> 2 2 2 2 [75] 2 2 2 2 2 2 2 1 2 1 1 2 2 2 1 1 1 2 1 2 2 2 1 1 2 1 >> >> here 2 is the antisense strand and 1 the sense strand >> >> >> UCSC.mouseGNM <- makeTranscriptDbFromUCSC(**genome = "mm9", tablename = >> "knownGene"); >> saveDb(UCSC.mouseGNM, file="UCSCMouseGNM.sqlite"); >> UCSCMouseGNM<- loadDb("UCSCMouseGNM.sqlite"); >> UCSCGenegroupedTranscripts<- transcriptsBy(UCSCMouseGNM, by = "gene"); >> strand.info2<- as.integer(runValue(strand(** >> UCSCGenegroupedTranscripts))); >> >> runValue(strand(**UCSCGenegroupedTranscripts)) >>> >> CompressedIntegerList of length 21761 >> [["100009600"]] 2 >> [["100009609"]] 2 >> [["100009614"]] 1 >> [["100012"]] 2 >> [["100017"]] 2 >> [["100019"]] 1 >> [["100033459"]] 1 >> [["100034251"]] 1 >> [["100034361"]] 2 >> [["100034684"]] 2 >> ... >> <21751 more elements> >> >> *ERROR:* >> >> strand.info2<- as.integer(runValue(strand(** >>> UCSCGenegroupedTranscripts))); >>> >> Error in as.vector(x, mode = "integer") : coercing an AtomicList object to >> an atomic vector is supported only for objects with top-level elements of >> length <= 1 >> >> I am not able to understand this error message, could anyone let me know >> what is going wrong in the code. >> Here is the session information. >> >> sessionInfo() R version 2.15.2 (2012-10-26) Platform: x86_64-pc- linux-gnu >>> >> (64-bit) locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C [3] >> LC_TIME=en_US.UTF-8 LC_COLLATE=en_AU.UTF-8 [5] LC_MONETARY=en_US.UTF-8 >> LC_MESSAGES=en_AU.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C >> LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> attached >> base packages: [1] stats graphics grDevices utils datasets methods base >> other attached packages: [1] GenomicFeatures_1.10.1 AnnotationDbi_1.20.3 >> Biobase_2.18.0 [4] GenomicRanges_1.10.5 IRanges_1.16.4 BiocGenerics_0.4.0 >> loaded via a namespace (and not attached): [1] biomaRt_2.14.0 >> Biostrings_2.26.2 bitops_1.0-5 BSgenome_1.26.1 [5] DBI_0.2-5 >> parallel_2.15.2 RCurl_1.95-3 Rsamtools_1.10.2 [9] RSQLite_0.11.2 >> rtracklayer_1.18.2 stats4_2.15.2 tools_2.15.2 [13] XML_3.95-0.1 >> zlibbioc_1.4.0 >> >> >> Thank you in advance. >> Regards, >> >> Veerendra. >> >> >> >> >> > -- GP.Veerendra PhD Student (Bioinformatics) Stazione Zoologica Anton Dohrn Naples, Italy M:+393663915221 [[alternative HTML version deleted]]