Dear List,
I 'm developing a pipeline for analysing the HTA 2.0 chips. It works smoothly if I use the default settings of the rma function in the oligo package. But when I try to add some extra procedures for bg corrections of the non-specific binding using the probe informations, I found something strange:
The probe number in the pm matrix is less than the probes extracted from the getProbeInfo function:
> nrow(pm(dat)) [1] 6058440 > all.probes <- getProbeInfo(dat,field=c('type'),target='probeset') > nrow(all.probes) [1] 7576209
Only 'main' probes can be extracted:
> unique(all.probes$type) [1] "main" NA
Using raw sql calls confirmed that only 'main' probes are in pm matrix:
> conn <- db(get(annotation(dat))) > pbinfo <- dbGetQuery(conn,"SELECT fid, fsetid as man_fsetid FROM pmfeature ORDER BY fid ASC") > psetType <- dbGetQuery(conn,"SELECT fsetid, type FROM featureSet") > type2typeid <- dbGetQuery(conn,"SELECT type, type_id FROM type_dict") > idx <- match(pbinfo[,c('man_fsetid')],psetType[,c('fsetid')]) > probeTypeIdx <- match(psetType[idx,c('type')],type2typeid$type) > unique(type2typeid$type_id[probeTypeIdx]) [1] "main" NA
Where are the other types of probes?
> type2typeid$type_id [1] "main" "control->affx" [3] "control->chip" "control->bgp->antigenomic" [5] "control->bgp->genomic" "normgene->exon" [7] "normgene->intron" "rescue->FLmRNA->unmapped" [9] "control->affx->bac_spike" "oligo_spike_in" [11] "r1_bac_spike_at" "control->affx->polya_spike" [13] "control->affx->ercc" "control->affx->ercc->step"
By the way, the mm matrix was not empty:
> nrow(mm(dat)) [1] 1121 > head(mm(dat)) HTA2_Liver_BetaSamplePool_1.CEL HTA2_Liver_BetaSamplePool_2.CEL 49142 62 70 49772 130 154 50922 113 166 51012 250 228 51102 58 104
EDIT: Add session info:
> sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=zh_CN.UTF-8 LC_NUMERIC=C [3] LC_TIME=zh_CN.UTF-8 LC_COLLATE=zh_CN.UTF-8 [5] LC_MONETARY=zh_CN.UTF-8 LC_MESSAGES=zh_CN.UTF-8 [7] LC_PAPER=zh_CN.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=zh_CN.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets [8] methods base other attached packages: [1] oligo_1.30.0 Biostrings_2.34.0 XVector_0.6.0 [4] IRanges_2.0.0 S4Vectors_0.4.0 Biobase_2.26.0 [7] oligoClasses_1.28.0 BiocGenerics_0.12.0 loaded via a namespace (and not attached): [1] affxparser_1.38.0 affyio_1.34.0 BiocInstaller_1.16.0 [4] bit_1.1-12 codetools_0.2-8 DBI_0.3.1 [7] ff_2.2-13 foreach_1.4.2 GenomeInfoDb_1.2.2 [10] GenomicRanges_1.18.1 iterators_1.0.7 preprocessCore_1.28.0 [13] splines_3.1.1 zlibbioc_1.12.0
There may be a problem with the package in release. If I build my own version, it works as expected. I have sent Benilton an email directly to see what he says.
But note that you should expect a different number of probes from pm() and getProbeInfo(). The former function is simply extracting all the PM probes. The latter function is extracting all the probes for each probeset, and since there is some re-use of probes (e.g., some probes are used in more than one probeset), you should expect more data from getProbeInfo().