Question: pd.hta.2.0 oddity, possible RSQLite bug?
0
gravatar for Vincent J. Carey, Jr.
3.2 years ago by
United States
Vincent J. Carey, Jr.6.3k wrote:

I am finding that there is some loss of information when using high- vs. low-level DBI to get annotation from a pd.hta.2.0 sqlite database.

low level approach yields details on 'transcript_cluster_id"

> pd.hta.2.0

Class........: AffyHTAPDInfo 
Manufacturer.:  
Genome Build.: The Genome Build 
Chip Geometry: 2572 rows x  2680 columns
> con = pd.hta.2.0@getdb()
> dbListTables(con)
[1] "chrom_dict" "core_mps"   "featureSet" "level_dict" "mmfeature" 
[6] "pmfeature"  "table_info" "type_dict" 

> dbGetQuery(con, "select fsetid, transcript_cluster_id from featureSet limit 1 offset 350000")

    fsetid
1 19010967

                                                                                                                                              transcript_cluster_id

1 TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg

The high level approach:

> fstable = dbReadTable(con, "featureSet")

> fstable[350001,]

         fsetid               man_fsetid strand start stop

350001 19010967 JUC6_dbb_hap3000974.hg.1      1    NA   NA

       transcript_cluster_id exon_id crosshyb_type level junction_start_edge

350001                     0       0            NA    NA             2430189

       junction_stop_edge                junction_sequence has_cds chrom type

350001            2430474 TGAAAATCTTCAGGAGATATGCAAAGCAGAAA      NA   131    1

Note that the transcript_cluster_id now has value "0".  

 

oligo rsqlite pd.hta.2.0 • 507 views
ADD COMMENTlink modified 3.2 years ago by James W. MacDonald51k • written 3.2 years ago by Vincent J. Carey, Jr.6.3k

Hmm. Weird. Seems to have something to do with whether or not you have NA rows for the transcript_cluster_id column or not.

> z <- dbGetQuery(con, "select * from featureSet;")
> zzz <- Rle(ifelse(is.na(z$transcript_cluster_id), 1, 0))
> zzz
numeric-Rle of length 934065 with 939 runs
  Lengths: 19646   979     1 13849     2 ...     3     4     4  1225    64
  Values :     1     0     1     0     1 ...     1     0     1     0     1

> dbGetQuery(con, "select transcript_cluster_id from featureSet limit 10 offset 19646;")
                           transcript_cluster_id
1                                  TC01000001.hg
2                                  TC01000001.hg
3  TC01000001.hg///TC01000001.hg///TC01000001.hg
4                                  TC01000001.hg
5  TC01000001.hg///TC01000001.hg///TC01000001.hg
6                                  TC01000001.hg
7                  TC01000002.hg///TC01000002.hg
8                                  TC01000002.hg
9                                  TC01000004.hg
10                                 TC01000005.hg
> dbGetQuery(con, "select transcript_cluster_id from featureSet limit 10 offset 19645;")
   transcript_cluster_id
1                     NA
2                      0
3                      0
4                      0
5                      0
6                      0
7                      0
8                      0
9                      0
10                     0

 

I don't know what the expectation should be in that case. Does the presence of an NA in the column cause dbGetQuery to coerce characters to numeric? That sounds suboptimal.

ADD REPLYlink written 3.2 years ago by James W. MacDonald51k

This seems to be right from the C code that does the query:

> debug(RSQLite:::sqliteFetch)
> con <- db(pd.hta.2.0)
> dbGetQuery(con, "select transcript_cluster_id from featureSet limit 10 offset 19645;")
debugging in: sqliteFetch(rs, n = -1)
debug: {
    check_valid(res)
    rel <- .Call(rsqlite_query_fetch, res@Id, nrec = as.integer(n))
    if (is.null(rel))
        return(data.frame())
    attr(rel, "row.names") <- .set_row_names(length(rel[[1]]))
    attr(rel, "class") <- "data.frame"
    rel
}
Browse[2]>
debug: check_valid(res)
Browse[2]>
debug: rel <- .Call(rsqlite_query_fetch, res@Id, nrec = as.integer(n))
Browse[2]>
debug: if (is.null(rel)) return(data.frame())
Browse[2]> rel
$transcript_cluster_id
 [1] NA  0  0  0  0  0  0  0  0  0

But apparently not a problem for sqlite itself:

SQLite version 3.8.7.1 2014-10-29 13:59:56
Enter ".help" for usage hints.
sqlite> select transcript_cluster_id from featureSet limit 10 offset 19645;

TC01000001.hg
TC01000001.hg
TC01000001.hg///TC01000001.hg///TC01000001.hg
TC01000001.hg
TC01000001.hg///TC01000001.hg///TC01000001.hg
TC01000001.hg
TC01000002.hg///TC01000002.hg
TC01000002.hg
TC01000004.hg
sqlite>

So maybe this is a question for Seth or Hadley?

ADD REPLYlink written 3.2 years ago by James W. MacDonald51k

And when I update to Seth's github version of RSQLite

> con <- db(pd.hta.2.0)
> dbGetQuery(con, "select transcript_cluster_id from featureSet limit 10 offset 19645;")
                           transcript_cluster_id
1                                           <NA>
2                                  TC01000001.hg
3                                  TC01000001.hg
4  TC01000001.hg///TC01000001.hg///TC01000001.hg
5                                  TC01000001.hg
6  TC01000001.hg///TC01000001.hg///TC01000001.hg
7                                  TC01000001.hg
8                  TC01000002.hg///TC01000002.hg
9                                  TC01000002.hg
10                                 TC01000004.hg
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] pd.hta.2.0_3.12.1    DBI_0.5              oligo_1.37.2        
 [4] Biobase_2.32.0       oligoClasses_1.35.0  RSQLite_1.0.9011    
 [7] Biostrings_2.40.2    XVector_0.12.0       IRanges_2.6.1       
[10] S4Vectors_0.10.2     BiocGenerics_0.18.0  BiocInstaller_1.22.3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7                affxparser_1.44.0         
 [3] knitr_1.13                 splines_3.3.0             
 [5] GenomicRanges_1.24.2       devtools_1.12.0           
 [7] zlibbioc_1.18.0            bit_1.1-12                
 [9] R6_2.1.2                   foreach_1.4.3             
[11] httr_1.2.1                 GenomeInfoDb_1.8.3        
[13] tools_3.3.0                SummarizedExperiment_1.2.3
[15] ff_2.2-13                  git2r_0.15.0              
[17] withr_1.0.2                iterators_1.0.8           
[19] digest_0.6.9               preprocessCore_1.34.0     
[21] affyio_1.42.0              codetools_0.2-14          
[23] curl_0.9.7                 memoise_1.0.0             
[25] compiler_3.3.0            
>
ADD REPLYlink written 3.2 years ago by James W. MacDonald51k

The current maintainer is Kirill Müller and the next release of RSQLite will be from  devtools::install_github("rstats-db/RSQLite") with the intended release on 2016-09-30.

ADD REPLYlink written 3.2 years ago by Martin Morgan ♦♦ 24k
thanks Jim and Martin -- this has been very helpful. On Wed, Sep 14, 2016 at 4:48 PM, Martin Morgan [bioc] < noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User Martin Morgan <https: support.bioconductor.org="" u="" 1513=""/> wrote Comment: > pd.hta.2.0 oddity, possible RSQLite bug? > <https: support.bioconductor.org="" p="" 87091="" #87117="">: > > The current maintainer is Kirill Müller and the next release of RSQLite > will be from devtools::install_github("rstats-db/RSQLite") with the > intended release on 2016-09-30. > > ------------------------------ > > Post tags: oligo, rsqlite, pd.hta.2.0 > > You may reply via email or visit https://support.bioconductor. > org/p/87091/#87117 >
ADD REPLYlink written 3.2 years ago by Vincent J. Carey, Jr.6.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 341 users visited in the last hour