Question

pd.hta.2.0 oddity, possible RSQLite bug?

0

Entering edit mode

Vincent J. Carey, Jr. 6.7k

@vincent-j-carey-jr-4

Last seen 3 months ago

United States

I am finding that there is some loss of information when using high- vs. low-level DBI to get annotation from a pd.hta.2.0 sqlite database.

low level approach yields details on 'transcript_cluster_id"

> pd.hta.2.0

Class........: AffyHTAPDInfo 
Manufacturer.:  
Genome Build.: The Genome Build 
Chip Geometry: 2572 rows x  2680 columns
> con = pd.hta.2.0@getdb()
> dbListTables(con)
[1] "chrom_dict" "core_mps"   "featureSet" "level_dict" "mmfeature" 
[6] "pmfeature"  "table_info" "type_dict" 

> dbGetQuery(con, "select fsetid, transcript_cluster_id from featureSet limit 1 offset 350000")

    fsetid
1 19010967

                                                                                                                                              transcript_cluster_id

1 TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg

The high level approach:

> fstable = dbReadTable(con, "featureSet")

> fstable[350001,]

         fsetid               man_fsetid strand start stop

350001 19010967 JUC6_dbb_hap3000974.hg.1      1    NA   NA

       transcript_cluster_id exon_id crosshyb_type level junction_start_edge

350001                     0       0            NA    NA             2430189

       junction_stop_edge                junction_sequence has_cds chrom type

350001            2430474 TGAAAATCTTCAGGAGATATGCAAAGCAGAAA      NA   131    1

Note that the transcript_cluster_id now has value "0".

oligo rsqlite pd.hta.2.0 • 1.3k views

ADD COMMENT • link updated 8.3 years ago by James W. MacDonald 67k • written 8.3 years ago by Vincent J. Carey, Jr. 6.7k

0

Entering edit mode

Hmm. Weird. Seems to have something to do with whether or not you have NA rows for the transcript_cluster_id column or not.

> z <- dbGetQuery(con, "select * from featureSet;")
> zzz <- Rle(ifelse(is.na(z$transcript_cluster_id), 1, 0))
> zzz
numeric-Rle of length 934065 with 939 runs
  Lengths: 19646   979     1 13849     2 ...     3     4     4  1225    64
  Values :     1     0     1     0     1 ...     1     0     1     0     1

> dbGetQuery(con, "select transcript_cluster_id from featureSet limit 10 offset 19646;")
                           transcript_cluster_id
1                                  TC01000001.hg
2                                  TC01000001.hg
3  TC01000001.hg///TC01000001.hg///TC01000001.hg
4                                  TC01000001.hg
5  TC01000001.hg///TC01000001.hg///TC01000001.hg
6                                  TC01000001.hg
7                  TC01000002.hg///TC01000002.hg
8                                  TC01000002.hg
9                                  TC01000004.hg
10                                 TC01000005.hg
> dbGetQuery(con, "select transcript_cluster_id from featureSet limit 10 offset 19645;")
   transcript_cluster_id
1                     NA
2                      0
3                      0
4                      0
5                      0
6                      0
7                      0
8                      0
9                      0
10                     0

I don't know what the expectation should be in that case. Does the presence of an NA in the column cause dbGetQuery to coerce characters to numeric? That sounds suboptimal.

ADD REPLY • link 8.3 years ago James W. MacDonald 67k

0

Entering edit mode

This seems to be right from the C code that does the query:

> debug(RSQLite:::sqliteFetch)
> con <- db(pd.hta.2.0)
> dbGetQuery(con, "select transcript_cluster_id from featureSet limit 10 offset 19645;")
debugging in: sqliteFetch(rs, n = -1)
debug: {
    check_valid(res)
    rel <- .Call(rsqlite_query_fetch, res@Id, nrec = as.integer(n))
    if (is.null(rel))
        return(data.frame())
    attr(rel, "row.names") <- .set_row_names(length(rel[[1]]))
    attr(rel, "class") <- "data.frame"
    rel
}
Browse[2]>
debug: check_valid(res)
Browse[2]>
debug: rel <- .Call(rsqlite_query_fetch, res@Id, nrec = as.integer(n))
Browse[2]>
debug: if (is.null(rel)) return(data.frame())
Browse[2]> rel
$transcript_cluster_id
 [1] NA  0  0  0  0  0  0  0  0  0

But apparently not a problem for sqlite itself:

SQLite version 3.8.7.1 2014-10-29 13:59:56
Enter ".help" for usage hints.
sqlite> select transcript_cluster_id from featureSet limit 10 offset 19645;

TC01000001.hg
TC01000001.hg
TC01000001.hg///TC01000001.hg///TC01000001.hg
TC01000001.hg
TC01000001.hg///TC01000001.hg///TC01000001.hg
TC01000001.hg
TC01000002.hg///TC01000002.hg
TC01000002.hg
TC01000004.hg
sqlite>

So maybe this is a question for Seth or Hadley?

ADD REPLY • link 8.3 years ago James W. MacDonald 67k

0

Entering edit mode

And when I update to Seth's github version of RSQLite

> con <- db(pd.hta.2.0)
> dbGetQuery(con, "select transcript_cluster_id from featureSet limit 10 offset 19645;")
                           transcript_cluster_id
1                                           <NA>
2                                  TC01000001.hg
3                                  TC01000001.hg
4  TC01000001.hg///TC01000001.hg///TC01000001.hg
5                                  TC01000001.hg
6  TC01000001.hg///TC01000001.hg///TC01000001.hg
7                                  TC01000001.hg
8                  TC01000002.hg///TC01000002.hg
9                                  TC01000002.hg
10                                 TC01000004.hg
> sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 8 (jessie)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] pd.hta.2.0_3.12.1    DBI_0.5              oligo_1.37.2        
 [4] Biobase_2.32.0       oligoClasses_1.35.0  RSQLite_1.0.9011    
 [7] Biostrings_2.40.2    XVector_0.12.0       IRanges_2.6.1       
[10] S4Vectors_0.10.2     BiocGenerics_0.18.0  BiocInstaller_1.22.3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7                affxparser_1.44.0         
 [3] knitr_1.13                 splines_3.3.0             
 [5] GenomicRanges_1.24.2       devtools_1.12.0           
 [7] zlibbioc_1.18.0            bit_1.1-12                
 [9] R6_2.1.2                   foreach_1.4.3             
[11] httr_1.2.1                 GenomeInfoDb_1.8.3        
[13] tools_3.3.0                SummarizedExperiment_1.2.3
[15] ff_2.2-13                  git2r_0.15.0              
[17] withr_1.0.2                iterators_1.0.8           
[19] digest_0.6.9               preprocessCore_1.34.0     
[21] affyio_1.42.0              codetools_0.2-14          
[23] curl_0.9.7                 memoise_1.0.0             
[25] compiler_3.3.0            
>

ADD REPLY • link 8.3 years ago James W. MacDonald 67k

0

Entering edit mode

The current maintainer is Kirill Müller and the next release of RSQLite will be from devtools::install_github("rstats-db/RSQLite") with the intended release on 2016-09-30.

ADD REPLY • link 8.3 years ago Martin Morgan 25k

0

Entering edit mode

thanks Jim and Martin -- this has been very helpful. On Wed, Sep 14, 2016 at 4:48 PM, Martin Morgan [bioc] < noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User Martin Morgan <https: support.bioconductor.org="" u="" 1513=""/> wrote Comment: > pd.hta.2.0 oddity, possible RSQLite bug? > <https: support.bioconductor.org="" p="" 87091="" #87117="">: > > The current maintainer is Kirill Müller and the next release of RSQLite > will be from devtools::install_github("rstats-db/RSQLite") with the > intended release on 2016-09-30. > > ------------------------------ > > Post tags: oligo, rsqlite, pd.hta.2.0 > > You may reply via email or visit https://support.bioconductor. > org/p/87091/#87117 >

ADD REPLY • link 8.3 years ago Vincent J. Carey, Jr. 6.7k