Question: UCSC data anomaly in 50638 transcript(s): the cds cumulative length is

Hervé Pagès wrote:

Hi Adi,
Hope you don't mind that I'm cc'ing the list.
On 05/27/2014 04:17 PM, Tarca, Adi wrote:
> Dear Hervé,
>
> Should I worry about the warning below?
>
> I just want to overall some rna seq reads with know genes.
Do you mean "overlap"?
>
> Thanks,
>
> Adi
>
> > txdb2=makeTranscriptDbFromUCSC(
> + genome="hg19",
> + tablename="knownGene")
Note that we provide a few "TxDb" packages that contain pre-computed
TranscriptDb objects for a few organisms and tracks:
http://bioconductor.org/packages/release/BiocViews.html#___TranscriptDb
There is one for hg19/knownGene: the TxDb.Hsapiens.UCSC.hg19.knownGene
package.
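A minimal sketch of using the pre-built package instead of rebuilding the object from UCSC. It assumes the TxDb.Hsapiens.UCSC.hg19.knownGene package is already installed; the variable names `txdb` and `tx` are just illustrative:

```r
# Assumption: the annotation package has been installed beforehand
# through the usual Bioconductor installation mechanism.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

# The package exports a ready-made annotation object under its own name,
# so no download from UCSC is needed.
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

# The usual GenomicFeatures accessors work on it, e.g. transcripts().
tx <- transcripts(txdb)
length(tx)
```

The object can then be used wherever txdb2 is used in this thread.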
>
> Download the knownGene table ... OK
>
> Download the knownToLocusLink table ... OK
>
> Extract the 'transcripts' data frame ... OK
>
> Extract the 'splicings' data frame ... OK
>
> Download and preprocess the 'chrominfo' data frame ... OK
>
> Prepare the 'metadata' data frame ... OK
>
> Make the TranscriptDb object ... OK
>
> Warning message:
>
> In .extractCdsLocsFromUCSCTxTable(ucsc_txtable, exon_locs) :
>
> UCSC data anomaly in 50638 transcript(s): the cds cumulative length is
>
> not a multiple of 3 for transcripts ‘uc001aaa.3’ ‘uc010nxr.1’
>
> ‘uc009vis.3’ ‘uc009vjc.1’ ‘uc009vjd.2’ ‘uc009vit.3’ ‘uc009viu.3’
>
> ‘uc001aae.4’ ‘uc001aai.1’ ‘uc001aah.4’ ‘uc009vir.3’ ‘uc009viq.3’
>
> ‘uc001aac.4’ ‘uc009viv.2’ ‘uc009viw.2’ ‘uc009vix.2’ ‘uc009viy.2’
>
> ‘uc009viz.2’ ‘uc010nxs.1’ ‘uc009vje.2’ ‘uc009vjf.2’ ‘uc009vjb.1’
>
> ‘uc001aak.3’ ‘uc021oeg.2’ ‘uc001aaq.2’ ‘uc001aar.2’ ‘uc021oeh.1’
>
> ‘uc009vjk.2’ ‘uc001aau.3’ ‘uc001aax.1’ ‘uc021oej.1’ ‘uc021oek.1’
>
> ‘uc021oel.1’ ‘uc001abb.3’ ‘uc001abe.4’ ‘uc001abi.2’ ‘uc001abj.3’
>
> ‘uc009vjm.3’ ‘uc010nxw.2’ ‘uc001abl.3’ ‘uc002khh.3’ ‘uc001abm.2’
>
> ‘uc001abo.3’ ‘uc031pjj.1’ ‘uc001abp.2’ ‘uc021oem.2’ ‘uc009vjn.2’
>
> ‘uc009vjo.2’ ‘uc031pjk.1’ ‘uc001abt.4’ ‘uc001abu.1’
> ‘u [... truncated]
This warning is wrong. It's actually easy to check that all the CDS
have a cumulative length that is a multiple of 3:
> cds_by_tx <- cdsBy(txdb2, by="tx")
> table(sum(width(cds_by_tx)) %% 3L)
0
63691
Seems to be a regression introduced in BioC 2.14. Someone in Seattle
will work on a fix and we will notify the list when the fix is
available.
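For cases where the check above does report non-zero remainders, the offending transcripts can be listed by name. The following is only a sketch on made-up data: the toy `cds_by_tx` object stands in for the result of cdsBy(txdb2, by="tx"), and the transcript names and coordinates are invented for illustration.

```r
library(GenomicRanges)

# Toy stand-in for cdsBy(txdb2, by="tx"); names and coordinates are made up.
cds_by_tx <- GRangesList(
  txA = GRanges("chr1", IRanges(start = c(101, 301), width = c(60, 30))),  # 90 nt
  txB = GRanges("chr1", IRanges(start = c(501, 701), width = c(40, 30)))   # 70 nt
)

# Cumulative CDS length per transcript (a named integer vector).
cds_len <- sum(width(cds_by_tx))

# Names of the transcripts whose cumulative CDS length is not a multiple of 3.
bad_tx <- names(cds_len)[cds_len %% 3L != 0L]
bad_tx
```

On the real hg19/knownGene object the same two lines should return an empty result, consistent with the table shown above.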
Otherwise, assuming the code in charge of issuing the warning is
working properly, you can get a legitimate warning like this for
some combinations of UCSC organism/track (but AFAIK never for the
knownGene track). If all you want to do is find/count overlaps between
some RNA-seq reads and known genes, then you probably don't care about
CDS at all.
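To illustrate why the CDS is not needed for this, here is a small sketch that counts reads per gene using gene ranges only. Everything in it is made up: `gene_ranges` stands in for gene ranges extracted from the TranscriptDb object (for example with genes()), and `reads` stands in for alignments that would normally be loaded from a BAM file.

```r
library(GenomicRanges)

# Toy gene ranges (hypothetical IDs and coordinates).
gene_ranges <- GRanges("chr1",
                       IRanges(start = c(100, 1000), end = c(500, 1500)),
                       strand = "+",
                       gene_id = c("geneA", "geneB"))

# Toy reads, 50 nt each; the last one falls outside both genes.
reads <- GRanges("chr1",
                 IRanges(start = c(120, 450, 1200, 3000), width = 50),
                 strand = "+")

# Number of reads overlapping each gene; no CDS information is involved.
counts <- countOverlaps(gene_ranges, reads)
names(counts) <- gene_ranges$gene_id
counts
```

For real data, Bioconductor's summarizeOverlaps() handles details such as reads hitting several features, which this simple count ignores.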
Cheers,
H.
>
> > sessioninfo()
>
> Error: could not find function "sessioninfo"
>
> > sessionInfo()
>
> R version 3.0.3 (2014-03-06)
>
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>
> [9] LC_ADDRESS=C LC_TELEPHONE=C
>
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
>
> [1] parallel stats graphics grDevices utils datasets methods
>
> [8] base
>
> other attached packages:
>
> [1] gplots_2.13.0 RColorBrewer_1.0-5 PADOG_1.4.0
>
> [4] GSA_1.03 nlme_3.1-117 KEGGdzPathwaysGEO_1.1.3
>
> [7] Heatplus_2.8.0 marray_1.40.0 limma_3.18.13
>
> [10] org.Hs.eg.db_2.10.1 preprocessCore_1.24.0 GO.db_2.10.1
>
> [13] SPIA_2.14.0 KEGGgraph_1.20.0 graph_1.40.1
>
> [16] XML_3.98-1.1 KEGG.db_2.10.1 RSQLite_0.11.4
>
> [19] DBI_0.2-7 R2HTML_2.2.1 rtracklayer_1.22.7
>
> [22] Rsamtools_1.14.3 Biostrings_2.30.1 GenomicFeatures_1.14.5
>
> [25] AnnotationDbi_1.24.0 Biobase_2.22.0 GenomicRanges_1.14.4
>
> [28] XVector_0.2.0 IRanges_1.20.7 BiocGenerics_0.8.0
>
> [31] BiocInstaller_1.12.1 multicore_0.2
>
> loaded via a namespace (and not attached):
>
> [1] biomaRt_2.18.0 bitops_1.0-6 BSgenome_1.30.0 caTools_1.17
>
> [5] gdata_2.13.3 grid_3.0.3 gtools_3.4.0 KernSmooth_2.23-12
>
> [9] lattice_0.20-29 RCurl_1.95-4.1 stats4_3.0.3 tools_3.0.3
>
> Adi Laurentiu TARCA, Ph.D.
>
> Assistant Professor (Research),
> Department of Computer Science & Center for Molecular Medicine and
> Genetics, Wayne State University,
> Director, Bioinformatics and Computational Biology Unit, Perinatology
> Research Branch (NICHD),
>
> 3990 John R., Office 4809,
> Detroit, Michigan 48201
> Tel: 1-313-5775305
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319

modified 5.1 years ago by Tarca, Adi • written 5.1 years ago by Hervé Pagès