Entering edit mode
I am finding that there is some loss of information when using high- vs. low-level DBI to get annotation from a pd.hta.2.0 sqlite database.
low level approach yields details on 'transcript_cluster_id"
> pd.hta.2.0
Class........: AffyHTAPDInfo
Manufacturer.:
Genome Build.: The Genome Build
Chip Geometry: 2572 rows x 2680 columns
> con = pd.hta.2.0@getdb()
> dbListTables(con)
[1] "chrom_dict" "core_mps" "featureSet" "level_dict" "mmfeature"
[6] "pmfeature" "table_info" "type_dict"
> dbGetQuery(con, "select fsetid, transcript_cluster_id from featureSet limit 1 offset 350000")
fsetid
1 19010967
transcript_cluster_id
1 TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg///TC6_dbb_hap3000142.hg
The high level approach:
> fstable = dbReadTable(con, "featureSet")
> fstable[350001,]
fsetid man_fsetid strand start stop
350001 19010967 JUC6_dbb_hap3000974.hg.1 1 NA NA
transcript_cluster_id exon_id crosshyb_type level junction_start_edge
350001 0 0 NA NA 2430189
junction_stop_edge junction_sequence has_cds chrom type
350001 2430474 TGAAAATCTTCAGGAGATATGCAAAGCAGAAA NA 131 1
Note that the transcript_cluster_id now has value "0".

Hmm. Weird. Seems to have something to do with whether or not you have NA rows for the transcript_cluster_id column or not.
> z <- dbGetQuery(con, "select * from featureSet;") > zzz <- Rle(ifelse(is.na(z$transcript_cluster_id), 1, 0)) > zzz numeric-Rle of length 934065 with 939 runs Lengths: 19646 979 1 13849 2 ... 3 4 4 1225 64 Values : 1 0 1 0 1 ... 1 0 1 0 1 > dbGetQuery(con, "select transcript_cluster_id from featureSet limit 10 offset 19646;") transcript_cluster_id 1 TC01000001.hg 2 TC01000001.hg 3 TC01000001.hg///TC01000001.hg///TC01000001.hg 4 TC01000001.hg 5 TC01000001.hg///TC01000001.hg///TC01000001.hg 6 TC01000001.hg 7 TC01000002.hg///TC01000002.hg 8 TC01000002.hg 9 TC01000004.hg 10 TC01000005.hg > dbGetQuery(con, "select transcript_cluster_id from featureSet limit 10 offset 19645;") transcript_cluster_id 1 NA 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 10 0I don't know what the expectation should be in that case. Does the presence of an NA in the column cause dbGetQuery to coerce characters to numeric? That sounds suboptimal.
This seems to be right from the C code that does the query:
> debug(RSQLite:::sqliteFetch) > con <- db(pd.hta.2.0) > dbGetQuery(con, "select transcript_cluster_id from featureSet limit 10 offset 19645;") debugging in: sqliteFetch(rs, n = -1) debug: { check_valid(res) rel <- .Call(rsqlite_query_fetch, res@Id, nrec = as.integer(n)) if (is.null(rel)) return(data.frame()) attr(rel, "row.names") <- .set_row_names(length(rel[[1]])) attr(rel, "class") <- "data.frame" rel } Browse[2]> debug: check_valid(res) Browse[2]> debug: rel <- .Call(rsqlite_query_fetch, res@Id, nrec = as.integer(n)) Browse[2]> debug: if (is.null(rel)) return(data.frame()) Browse[2]> rel $transcript_cluster_id [1] NA 0 0 0 0 0 0 0 0 0But apparently not a problem for sqlite itself:
So maybe this is a question for Seth or Hadley?
And when I update to Seth's github version of RSQLite
> con <- db(pd.hta.2.0) > dbGetQuery(con, "select transcript_cluster_id from featureSet limit 10 offset 19645;") transcript_cluster_id 1 <NA> 2 TC01000001.hg 3 TC01000001.hg 4 TC01000001.hg///TC01000001.hg///TC01000001.hg 5 TC01000001.hg 6 TC01000001.hg///TC01000001.hg///TC01000001.hg 7 TC01000001.hg 8 TC01000002.hg///TC01000002.hg 9 TC01000002.hg 10 TC01000004.hg > sessionInfo() R version 3.3.0 (2016-05-03) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux 8 (jessie) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] pd.hta.2.0_3.12.1 DBI_0.5 oligo_1.37.2 [4] Biobase_2.32.0 oligoClasses_1.35.0 RSQLite_1.0.9011 [7] Biostrings_2.40.2 XVector_0.12.0 IRanges_2.6.1 [10] S4Vectors_0.10.2 BiocGenerics_0.18.0 BiocInstaller_1.22.3 loaded via a namespace (and not attached): [1] Rcpp_0.12.7 affxparser_1.44.0 [3] knitr_1.13 splines_3.3.0 [5] GenomicRanges_1.24.2 devtools_1.12.0 [7] zlibbioc_1.18.0 bit_1.1-12 [9] R6_2.1.2 foreach_1.4.3 [11] httr_1.2.1 GenomeInfoDb_1.8.3 [13] tools_3.3.0 SummarizedExperiment_1.2.3 [15] ff_2.2-13 git2r_0.15.0 [17] withr_1.0.2 iterators_1.0.8 [19] digest_0.6.9 preprocessCore_1.34.0 [21] affyio_1.42.0 codetools_0.2-14 [23] curl_0.9.7 memoise_1.0.0 [25] compiler_3.3.0 >The current maintainer is Kirill Müller and the next release of RSQLite will be from devtools::install_github("rstats-db/RSQLite") with the intended release on 2016-09-30.