I noticed an intriguing discrepancy when I normalized a set of HTA2.0 arrays (GSE111156, here).
Specifically, a different number of probesets is returned when using the rma
or fitProbeLevelModel
function in oligo
, even if in both cases the argument target=core
is used. I initially noticed this when I tried to annotate the normalized data at the transcript level, which returned only NA
s when processed through fitProbeLevelModel
, but when rma
was used this was not the case. Then I noticed that the probesetIDs that are returned are not identical (though both time a 'core analysis' was performed).
Any idea what causes this discrepancy?
> library(oligo)
> library(affycoretools)
> library(hta20transcriptcluster.db)
>
> path <- "/mnt/files/GSE111156"
> affy.data <- read.celfiles(filenames = list.celfiles(path, full.names=TRUE, listGzipped=TRUE)[1:5] )
Reading in : /mnt/files/GSE111156/GSM3024184_Case_1_N.CEL.gz
Reading in : /mnt/files/GSE111156/GSM3024185_Case_1_A.CEL.gz
Reading in : /mnt/files/GSE111156/GSM3024186_Case_1_T.CEL.gz
<<snip>>
> normalized.data.plm <- fitProbeLevelModel(affy.data, target = "core")
Background correcting... OK
Normalizing... OK
Summarizing... OK
Extracting...
Estimates... OK
StdErrors... OK
Weights..... OK
Residuals... OK
Scale....... OK
>
> normalized.data.plm
Probe Level Model
Method..............: plm
# Samples...........: 5
# Probes (processed): 7576209
# Probesets.........: 925032
Annotation..........: pd.hta.2.0
>
> # Note: 925032 probesets
>
> #annotate
> normalized.data.plm <- opset2eset(normalized.data.plm) #1st convert PLMset to eSet!
> normalized.data.plm <- annotateEset(normalized.data.plm, hta20transcriptcluster.db)
>
> #check
> normalized.data.plm
ExpressionSet (storageMode: lockedEnvironment)
assayData: 925032 features, 5 samples
element names: exprs, se.exprs
protocolData
sampleNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz
... GSM3024188_Case_2_A.CEL.gz (5 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
sampleNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz
... GSM3024188_Case_2_A.CEL.gz (5 total)
varLabels: index
varMetadata: labelDescription channel
featureData
featureNames: 2824539_st 2824540_st ... TCONS_l2_00005064_st (925032
total)
fvarLabels: PROBEID ENTREZID SYMBOL GENENAME
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: pd.hta.2.0
>
> # NO probeset could be annotated...??
> sumis.na(fData(normalized.data.plm)$ENTREZID))
[1] 925032
>
> featureNames(normalized.data.plm)[1:5]
[1] "2824539_st" "2824540_st" "2824541_st" "2824542_st" "2824543_st"
>
>
>
> # now process same input using rma function
> normalized.data.rma <- rma(affy.data, target = "core")
Background correcting
Normalizing
Calculating Expression
> normalized.data.rma
ExpressionSet (storageMode: lockedEnvironment)
assayData: 70523 features, 5 samples
element names: exprs
protocolData
rowNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz ...
GSM3024188_Case_2_A.CEL.gz (5 total)
varLabels: exprs dates
varMetadata: labelDescription channel
phenoData
rowNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz ...
GSM3024188_Case_2_A.CEL.gz (5 total)
varLabels: index
varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hta.2.0
>
> # Note: 70523 probesets
>
> Annotate
> normalized.data.rma <- annotateEset(normalized.data.rma, hta20transcriptcluster.db)
>
> # 44% of the probesets could now be annotated...!?
> sumis.na(fData(normalized.data.rma)$ENTREZID))
[1] 39696
>
> featureNames(normalized.data.rma)[1:5]
[1] "2824546_st" "2824549_st" "2824551_st" "2824554_st" "2827992_st"
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Thirty)
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] pd.hta.2.0_3.12.2 DBI_1.0.0
[3] RSQLite_2.1.1 hta20transcriptcluster.db_8.7.0
[5] org.Hs.eg.db_3.8.2 AnnotationDbi_1.46.0
[7] affycoretools_1.56.0 oligo_1.48.0
[9] Biostrings_2.52.0 XVector_0.24.0
[11] IRanges_2.18.1 S4Vectors_0.22.0
[13] Biobase_2.44.0 oligoClasses_1.46.0
[15] BiocGenerics_0.30.0
Thanks James, got it! Regarding putting it on a TODO list: on one hand, for the reasons you explained above, I agree it may feel it hasn't high priority. On the other hand, because these arrays can only be processed through the
oligo
framework, and withinoligo
only the functionfitProbeLevelModel
(but notrma
) allows the calculation of quality-control metrics and images (pseudo-images, weights, residuals, RLE, NUSE plots) I would argue it is relevant to correct the SQL query... ;)