I noticed an intriguing discrepancy when I normalized a set of HTA2.0 arrays (GSE111156, here).
Specifically, a different number of probesets is returned when using the rma or fitProbeLevelModel function in oligo, even if in both cases the argument target=core is used. I initially noticed this when I tried to annotate the normalized data at the transcript level, which returned only NAs when processed through fitProbeLevelModel, but when rma was used this was not the case. Then I noticed that the probesetIDs that are returned are not identical (though both time a 'core analysis' was performed).
Any idea what causes this discrepancy?
> library(oligo)
> library(affycoretools)
> library(hta20transcriptcluster.db)
>
> path <- "/mnt/files/GSE111156"
> affy.data <- read.celfiles(filenames = list.celfiles(path, full.names=TRUE, listGzipped=TRUE)[1:5] )
Reading in : /mnt/files/GSE111156/GSM3024184_Case_1_N.CEL.gz
Reading in : /mnt/files/GSE111156/GSM3024185_Case_1_A.CEL.gz
Reading in : /mnt/files/GSE111156/GSM3024186_Case_1_T.CEL.gz
<<snip>>
> normalized.data.plm <- fitProbeLevelModel(affy.data, target = "core")
Background correcting... OK
Normalizing... OK
Summarizing... OK
Extracting...
Estimates... OK
StdErrors... OK
Weights..... OK
Residuals... OK
Scale....... OK
>
> normalized.data.plm
Probe Level Model
Method..............: plm
# Samples...........: 5
# Probes (processed): 7576209
# Probesets.........: 925032
Annotation..........: pd.hta.2.0
>
> # Note: 925032 probesets
>
> #annotate
> normalized.data.plm <- opset2eset(normalized.data.plm) #1st convert PLMset to eSet!
> normalized.data.plm <- annotateEset(normalized.data.plm, hta20transcriptcluster.db)
>
> #check
> normalized.data.plm
ExpressionSet (storageMode: lockedEnvironment)
assayData: 925032 features, 5 samples
element names: exprs, se.exprs
protocolData
sampleNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz
... GSM3024188_Case_2_A.CEL.gz (5 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
sampleNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz
... GSM3024188_Case_2_A.CEL.gz (5 total)
varLabels: index
varMetadata: labelDescription channel
featureData
featureNames: 2824539_st 2824540_st ... TCONS_l2_00005064_st (925032
total)
fvarLabels: PROBEID ENTREZID SYMBOL GENENAME
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: pd.hta.2.0
>
> # NO probeset could be annotated...??
> sumis.na(fData(normalized.data.plm)$ENTREZID))
[1] 925032
>
> featureNames(normalized.data.plm)[1:5]
[1] "2824539_st" "2824540_st" "2824541_st" "2824542_st" "2824543_st"
>
>
>
> # now process same input using rma function
> normalized.data.rma <- rma(affy.data, target = "core")
Background correcting
Normalizing
Calculating Expression
> normalized.data.rma
ExpressionSet (storageMode: lockedEnvironment)
assayData: 70523 features, 5 samples
element names: exprs
protocolData
rowNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz ...
GSM3024188_Case_2_A.CEL.gz (5 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz ...
GSM3024188_Case_2_A.CEL.gz (5 total)
varLabels: index
varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hta.2.0
>
> # Note: 70523 probesets
>
> Annotate
> normalized.data.rma <- annotateEset(normalized.data.rma, hta20transcriptcluster.db)
>
> # 44% of the probesets could now be annotated...!?
> sumis.na(fData(normalized.data.rma)$ENTREZID))
[1] 39696
>
> featureNames(normalized.data.rma)[1:5]
[1] "2824546_st" "2824549_st" "2824551_st" "2824554_st" "2827992_st"
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Thirty)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] pd.hta.2.0_3.12.2 DBI_1.0.0
[3] RSQLite_2.1.1 hta20transcriptcluster.db_8.7.0
[5] org.Hs.eg.db_3.8.2 AnnotationDbi_1.46.0
[7] affycoretools_1.56.0 oligo_1.48.0
[9] Biostrings_2.52.0 XVector_0.24.0
[11] IRanges_2.18.1 S4Vectors_0.22.0
[13] Biobase_2.44.0 oligoClasses_1.46.0
[15] BiocGenerics_0.30.0

Thanks James, got it! Regarding putting it on a TODO list: on one hand, for the reasons you explained above, I agree it may feel it hasn't high priority. On the other hand, because these arrays can only be processed through the
oligoframework, and withinoligoonly the functionfitProbeLevelModel(but notrma) allows the calculation of quality-control metrics and images (pseudo-images, weights, residuals, RLE, NUSE plots) I would argue it is relevant to correct the SQL query... ;)