building a refseq-based transcriptDb: warnings of interest?

0

Entering edit mode

Vincent J. Carey, Jr. 6.7k

@vincent-j-carey-jr-4

Last seen 5 weeks ago

United States

> hg18r.txdb = makeTranscriptDbFromUCSC(tablename="refGene") Download the refGene table ... OK Download the refLink table ... OK Extract the 'transcripts' data frame ... OK Extract the 'splicings' data frame ... OK Download and preprocess the 'chrominfo' data frame ... OK Prepare the 'metadata' data frame ... OK Make the TranscriptDb object ... OK There were 50 or more warnings (use warnings() to see the first 50) > warnings() Warning messages: 1: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]], ... : UCSC data anomaly in transcript NM_017940: the cds cumulative length is not a multiple of 3 2: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]], ... : UCSC data anomaly in transcript NM_001037675: the cds cumulative length is not a multiple of 3 3: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]], ... : UCSC data anomaly in transcript NM_001039703: the cds cumulative length is not a multiple of 3 4: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], exon_locs$start[[i]], ... : and so on. Does this need to be reported to UCSC? > sessionInfo() R version 2.12.0 Under development (unstable) (2010-06-30 r52417) Platform: x86_64-apple-darwin10.3.0/x86_64 (64-bit) locale: [1] C attached base packages: [1] stats graphics grDevices datasets tools utils methods [8] base other attached packages: [1] GenomicFeatures_1.1.6 GenomicRanges_1.1.15 IRanges_1.7.13 [4] weaver_1.15.0 codetools_0.2-2 digest_0.4.2 loaded via a namespace (and not attached): [1] BSgenome_1.17.5 Biobase_2.9.0 Biostrings_2.17.26 DBI_0.2-5 [5] RCurl_1.4-2 RSQLite_0.9-1 XML_3.1-0 biomaRt_2.5.1 [9] rtracklayer_1.9.3

TranscriptDb TranscriptDb • 1.1k views

ADD COMMENT • link updated 13.7 years ago by Hervé Pagès 16k • written 13.7 years ago by Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 1 day ago

Seattle, WA, United States

Hi Vince, On 07/23/2010 02:50 AM, Vincent Carey wrote: >> hg18r.txdb = makeTranscriptDbFromUCSC(tablename="refGene") > Download the refGene table ... OK > Download the refLink table ... OK > Extract the 'transcripts' data frame ... OK > Extract the 'splicings' data frame ... OK > Download and preprocess the 'chrominfo' data frame ... OK > Prepare the 'metadata' data frame ... OK > Make the TranscriptDb object ... OK > There were 50 or more warnings (use warnings() to see the first 50) >> warnings() > Warning messages: > 1: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], > exon_locs$start[[i]], ... : > UCSC data anomaly in transcript NM_017940: the cds cumulative length > is not a multiple of 3 > 2: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], > exon_locs$start[[i]], ... : > UCSC data anomaly in transcript NM_001037675: the cds cumulative > length is not a multiple of 3 > 3: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], > exon_locs$start[[i]], ... : > UCSC data anomaly in transcript NM_001039703: the cds cumulative > length is not a multiple of 3 > 4: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], > exon_locs$start[[i]], ... : > > and so on. Does this need to be reported to UCSC? Glad you bring this in the discussion. If you look at the schema: http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgta_ track=refGene&hgta_table=refGene&hgta_doSchema=describe+table+schema the refGene table has cdsStartStat and cdsStartEnd cols (in addition to the cdsStart and cdsEnd cols) which describe the status of each CDS. My understanding is that only CDS with status 'cmpl' (complete) are guaranteed to have a length that is a multiple of 3. Currently makeTranscriptDbFromUCSC() imports all CDS in the TranscriptDb object, regardless of their status, and issues a warning for each CDS that doesn't look right. Maybe not the best approach. Should we allow the user to filter CDSs based on this status? Or should we import only complete CDSs? Or we import all the CDSs but we store in the metadata table of the TranscriptDb object (and then display this in the show method) the fact that not all the CDSs are complete? Then all TranscriptDb objects made with makeTranscriptDbFromUCSC() would be marked that way, except those obtained from the knownGene table where, AFAIK, all the CDSs are guaranteed to be complete. One difficulty with the design of TranscriptDb objects was to come up with a db schema that would accommodate data coming from very different places like UCSC and biomaRt, and then to implement methods for extracting features from the db that would not be specific to one source or another. This is why adding the cdsStartStat and cdsStartEnd cols to our own db was discarded because those cols are specific to UCSC (IIRC biomaRt/Ensembl doesn't provide this info). Not even all transcript-like tables at UCSC have them. And tables that have them don't necessarily use the same set of values for this col (they use a MySQL enum type). I guess it all depends what people want to do with those CDSs. Cheers, H. >> sessionInfo() > R version 2.12.0 Under development (unstable) (2010-06-30 r52417) > Platform: x86_64-apple-darwin10.3.0/x86_64 (64-bit) > > locale: > [1] C > > attached base packages: > [1] stats graphics grDevices datasets tools utils methods > [8] base > > other attached packages: > [1] GenomicFeatures_1.1.6 GenomicRanges_1.1.15 IRanges_1.7.13 > [4] weaver_1.15.0 codetools_0.2-2 digest_0.4.2 > > loaded via a namespace (and not attached): > [1] BSgenome_1.17.5 Biobase_2.9.0 Biostrings_2.17.26 DBI_0.2-5 > [5] RCurl_1.4-2 RSQLite_0.9-1 XML_3.1-0 biomaRt_2.5.1 > [9] rtracklayer_1.9.3 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 13.7 years ago Hervé Pagès 16k

0

Entering edit mode

2010/7/24 Hervé Pagès <hpages at="" fhcrc.org="">: > Hi Vince, > > On 07/23/2010 02:50 AM, Vincent Carey wrote: >>> >>> hg18r.txdb = makeTranscriptDbFromUCSC(tablename="refGene") >> >> Download the refGene table ... OK >> Download the refLink table ... OK >> Extract the 'transcripts' data frame ... OK >> Extract the 'splicings' data frame ... OK >> Download and preprocess the 'chrominfo' data frame ... OK >> Prepare the 'metadata' data frame ... OK >> Make the TranscriptDb object ... OK >> There were 50 or more warnings (use warnings() to see the first 50) >>> >>> warnings() >> >> Warning messages: >> 1: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], >> exon_locs$start[[i]], ?... : >> ? UCSC data anomaly in transcript NM_017940: the cds cumulative length >> is not a multiple of 3 >> 2: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], >> exon_locs$start[[i]], ?... : >> ? UCSC data anomaly in transcript NM_001037675: the cds cumulative >> length is not a multiple of 3 >> 3: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], >> exon_locs$start[[i]], ?... : >> ? UCSC data anomaly in transcript NM_001039703: the cds cumulative >> length is not a multiple of 3 >> 4: In .extractUCSCCdsStartEnd(cdsStart[i], cdsEnd[i], >> exon_locs$start[[i]], ?... : >> >> and so on. ?Does this need to be reported to UCSC? > > Glad you bring this in the discussion. > > If you look at the schema: > > > http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=genes&hgt a_track=refGene&hgta_table=refGene&hgta_doSchema=describe+table+schema > > the refGene table has cdsStartStat and cdsStartEnd cols (in addition > to the ?cdsStart and cdsEnd cols) which describe the status of each > CDS. My understanding is that only CDS with status 'cmpl' (complete) > are guaranteed to have a length that is a multiple of 3. > > Currently makeTranscriptDbFromUCSC() imports all CDS in the TranscriptDb > object, regardless of their status, and issues a warning for each CDS > that doesn't look right. Maybe not the best approach. Should we allow > the user to filter CDSs based on this status? Or should we import only > complete CDSs? Or we import all the CDSs but we store in the metadata > table of the TranscriptDb object (and then display this in the show > method) the fact that not all the CDSs are complete? Then all > TranscriptDb objects made with makeTranscriptDbFromUCSC() would > be marked that way, except those obtained from the knownGene > table where, AFAIK, all the CDSs are guaranteed to be complete. > > One difficulty with the design of TranscriptDb objects was to come > up with a db schema that would accommodate data coming from very > different places like UCSC and biomaRt, and then to implement > methods for extracting features from the db that would not be > specific to one source or another. This is why adding the cdsStartStat > and cdsStartEnd cols to our own db was discarded because those > cols are specific to UCSC (IIRC biomaRt/Ensembl doesn't provide > this info). Not even all transcript-like tables at UCSC have them. > And tables that have them don't necessarily use the same set of > values for this col (they use a MySQL enum type). > > I guess it all depends what people want to do with those CDSs. One suggestion -- add a utility or a vignette fragment that shows how one can acquire the cdsStartStat and cdsEndStat columns separately so that a user could filter on the basis of this information. It may be hard to learn the implications of this status variable without experimenting with such filters; if some readers have direct experience with this status marker please chime in. > > Cheers, > H. > >>> ?sessionInfo() >> >> R version 2.12.0 Under development (unstable) (2010-06-30 r52417) >> Platform: x86_64-apple-darwin10.3.0/x86_64 (64-bit) >> >> locale: >> [1] C >> >> attached base packages: >> [1] stats ? ? graphics ?grDevices datasets ?tools ? ? utils ? ? methods >> [8] base >> >> other attached packages: >> [1] GenomicFeatures_1.1.6 GenomicRanges_1.1.15 ?IRanges_1.7.13 >> [4] weaver_1.15.0 ? ? ? ? codetools_0.2-2 ? ? ? digest_0.4.2 >> >> loaded via a namespace (and not attached): >> [1] BSgenome_1.17.5 ? ?Biobase_2.9.0 ? ? ?Biostrings_2.17.26 DBI_0.2-5 >> [5] RCurl_1.4-2 ? ? ? ?RSQLite_0.9-1 ? ? ?XML_3.1-0 ? ? ? ? ?biomaRt_2.5.1 >> [9] rtracklayer_1.9.3 >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org > Phone: ?(206) 667-5791 > Fax: ? ?(206) 667-1319 >

ADD REPLY • link 13.7 years ago Vincent J. Carey, Jr. 6.7k

Login before adding your answer.