UCSC data anomaly in 50638 transcript(s): the cds cumulative length is
@herve-pages-1542
Seattle, WA, United States
Hi Adi,

Hope you don't mind that I'm cc'ing the list.

On 05/27/2014 04:17 PM, Tarca, Adi wrote:
> Dear Hervé,
>
> Should I worry about the warning below?
>
> I just want to overall some rna seq reads with know genes.

Do you mean "overlap"?

> Thanks,
>
> Adi
>
> > txdb2=makeTranscriptDbFromUCSC(
> +   genome="hg19",
> +   tablename="knownGene")

Note that we provide a few "TxDb" packages that contain pre-computed TranscriptDb objects for a few organisms and tracks:

http://bioconductor.org/packages/release/BiocViews.html#___TranscriptDb

There is one for hg19/knownGene: the TxDb.Hsapiens.UCSC.hg19.knownGene package.

> Download the knownGene table ... OK
> Download the knownToLocusLink table ... OK
> Extract the 'transcripts' data frame ... OK
> Extract the 'splicings' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TranscriptDb object ... OK
> Warning message:
> In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
>   UCSC data anomaly in 50638 transcript(s): the cds cumulative length is
>   not a multiple of 3 for transcripts ‘uc001aaa.3’ ‘uc010nxr.1’
>   ‘uc009vis.3’ ‘uc009vjc.1’ ‘uc009vjd.2’ ‘uc009vit.3’ ‘uc009viu.3’
>   ‘uc001aae.4’ ‘uc001aai.1’ ‘uc001aah.4’ ‘uc009vir.3’ ‘uc009viq.3’
>   ‘uc001aac.4’ ‘uc009viv.2’ ‘uc009viw.2’ ‘uc009vix.2’ ‘uc009viy.2’
>   ‘uc009viz.2’ ‘uc010nxs.1’ ‘uc009vje.2’ ‘uc009vjf.2’ ‘uc009vjb.1’
>   ‘uc001aak.3’ ‘uc021oeg.2’ ‘uc001aaq.2’ ‘uc001aar.2’ ‘uc021oeh.1’
>   ‘uc009vjk.2’ ‘uc001aau.3’ ‘uc001aax.1’ ‘uc021oej.1’ ‘uc021oek.1’
>   ‘uc021oel.1’ ‘uc001abb.3’ ‘uc001abe.4’ ‘uc001abi.2’ ‘uc001abj.3’
>   ‘uc009vjm.3’ ‘uc010nxw.2’ ‘uc001abl.3’ ‘uc002khh.3’ ‘uc001abm.2’
>   ‘uc001abo.3’ ‘uc031pjj.1’ ‘uc001abp.2’ ‘uc021oem.2’ ‘uc009vjn.2’
>   ‘uc009vjo.2’ ‘uc031pjk.1’ ‘uc001abt.4’ ‘uc001abu.1’ ‘u [... truncated]

This warning is wrong. It's actually easy to check that all the CDS have a cumulative length that is a multiple of 3:

  > cds_by_tx <- cdsBy(txdb2, by="tx")
  > table(sum(width(cds_by_tx)) %% 3L)

      0
  63691

Seems to be a regression introduced in BioC 2.14. Someone in Seattle will work on a fix and we will notify the list when the fix is available.

Otherwise, assuming the code in charge of issuing the warning is working properly, you can get a legitimate warning like this for some combination of UCSC organism/track (but AFAIK never for the knownGene track). If all you want to do is find/count overlaps between some rna seq reads and known genes, then you probably don't care about CDS at all.

Cheers,
H.

> > sessioninfo()
> Error: could not find function "sessioninfo"
> > sessionInfo()
> R version 3.0.3 (2014-03-06)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>  [1] gplots_2.13.0           RColorBrewer_1.0-5      PADOG_1.4.0
>  [4] GSA_1.03                nlme_3.1-117            KEGGdzPathwaysGEO_1.1.3
>  [7] Heatplus_2.8.0          marray_1.40.0           limma_3.18.13
> [10] org.Hs.eg.db_2.10.1     preprocessCore_1.24.0   GO.db_2.10.1
> [13] SPIA_2.14.0             KEGGgraph_1.20.0        graph_1.40.1
> [16] XML_3.98-1.1            KEGG.db_2.10.1          RSQLite_0.11.4
> [19] DBI_0.2-7               R2HTML_2.2.1            rtracklayer_1.22.7
> [22] Rsamtools_1.14.3        Biostrings_2.30.1       GenomicFeatures_1.14.5
> [25] AnnotationDbi_1.24.0    Biobase_2.22.0          GenomicRanges_1.14.4
> [28] XVector_0.2.0           IRanges_1.20.7          BiocGenerics_0.8.0
> [31] BiocInstaller_1.12.1    multicore_0.2
>
> loaded via a namespace (and not attached):
>  [1] biomaRt_2.18.0     bitops_1.0-6       BSgenome_1.30.0    caTools_1.17
>  [5] gdata_2.13.3       grid_3.0.3         gtools_3.4.0       KernSmooth_2.23-12
>  [9] lattice_0.20-29    RCurl_1.95-4.1     stats4_3.0.3       tools_3.0.3
>
> Adi Laurentiu TARCA, Ph.D.
> Assistant Professor (Research),
> Department of Computer Science & Center for Molecular Medicine and Genetics, Wayne State University,
> Director, Bioinformatics and Computational Biology Unit, Perinatology Research Branch (NICHD),
> 3990 John R., Office 4809,
> Detroit, Michigan 48201
> Tel: 1-313-5775305

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
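The check in the reply sums the CDS widths of each transcript and takes the result modulo 3. A minimal base-R sketch of the same arithmetic follows, run on a made-up table instead of a real TranscriptDb object. The data frame `cds`, its transcript names, and its widths are hypothetical; on real data the per-transcript sums come from `sum(width(cdsBy(txdb2, by="tx")))` as quoted in the thread.

```r
# Toy CDS table: one row per CDS fragment (hypothetical values, not UCSC data).
cds <- data.frame(
  tx_name = c("txA", "txA", "txA", "txB", "txB", "txC"),
  width   = c(120L, 45L, 30L, 100L, 100L, 9L),
  stringsAsFactors = FALSE
)

# Cumulative CDS length per transcript; analogous to sum(width(cds_by_tx)).
cum_len <- tapply(cds$width, cds$tx_name, sum)

# Remainder modulo 3; a well-formed CDS gives 0.
remainder <- cum_len %% 3L

# Tabulate the remainders, as table(sum(width(cds_by_tx)) %% 3L) does.
remainder_table <- table(remainder)

# Names of the transcripts whose CDS length is not a multiple of 3.
bad_tx <- names(remainder)[remainder != 0L]
```

If every count in the tabulation falls under remainder 0, as it did for the 63691 coding transcripts in the reply, a warning about non-multiple-of-3 CDS lengths cannot be describing the data.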
Genetics GO Regression Cancer TranscriptDb
Tarca, Adi ▴ 570
@tarca-adi-1500
United States
Dear Hervé,

I have seen this type of warning in Google search results, but usually for only one or a few transcripts. Seeing that the problem affected possibly all of the transcripts, I was not sure that the table had been downloaded properly.

Thank you for the clarification and for making others aware of the issue.

Best regards,
Adi

-----Original Message-----
From: Hervé Pagès [mailto:hpages@fhcrc.org]
Sent: Wednesday, May 28, 2014 1:59 AM
To: Tarca, Adi
Cc: bioconductor at r-project.org
Subject: Re: UCSC data anomaly in 50638 transcript(s): the cds cumulative length is
Hi Adi,

This issue was caused by some overly zealous warning code. It was throwing a warning whenever a CDS was absent, and not only when the CDS had a non-viable length, as the warning text says. I have fixed this so that the code is more reasonable about what it thinks you need to be warned about.

Marc

On 05/28/2014 08:52 AM, Tarca, Adi wrote:
> Dear Hervé,
> I have seen this type of warning in Google search results, but usually for only one or a few transcripts.
> Seeing that the problem affected possibly all of the transcripts, I was not sure that the table had been downloaded properly.
> Thank you for the clarification and for making others aware of the issue.
> Best regards,
> Adi
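The fix described in this reply separates two cases that the faulty check lumped together: a transcript with no CDS at all, and a transcript whose CDS length is not a multiple of 3. A rough base-R sketch of that distinction follows. It is an illustration under assumed inputs, not the GenomicFeatures source; `cds_len`, `flag_zealous`, and `flag_strict` are hypothetical names, and representing an absent CDS as `NA` is an assumption made for the example.

```r
# Hypothetical cumulative CDS lengths; NA marks a transcript with no CDS
# (for example a non-coding transcript).
cds_len <- c(txA = 195L, txB = NA, txC = 200L, txD = NA, txE = 9L)

# Overly zealous check: a missing CDS is reported as an anomaly too.
flag_zealous <- is.na(cds_len) | cds_len %% 3L != 0L

# Stricter check: flag only a CDS that exists and has a non-viable length.
flag_strict <- !is.na(cds_len) & cds_len %% 3L != 0L

zealous_tx <- names(cds_len)[flag_zealous]
strict_tx  <- names(cds_len)[flag_strict]
```

In this toy input the zealous check reports three transcripts while the strict one reports only `txC`, which is the kind of inflation that would make a warning appear to cover most of a track.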