Inconsistent results when normalizing HTA2.0 arrays at "level=core"
1
0
Entering edit mode
Guido Hooiveld ★ 4.1k
@guido-hooiveld-2020
Last seen 7 minutes ago
Wageningen University, Wageningen, the …

I noticed an intriguing discrepancy when I normalized a set of HTA2.0 arrays (GSE111156, here). Specifically, a different number of probesets is returned when using the rma or fitProbeLevelModel function in oligo, even if in both cases the argument target=core is used. I initially noticed this when I tried to annotate the normalized data at the transcript level, which returned only NAs when processed through fitProbeLevelModel, but when rma was used this was not the case. Then I noticed that the probesetIDs that are returned are not identical (though both time a 'core analysis' was performed). Any idea what causes this discrepancy?

> library(oligo)
> library(affycoretools)
> library(hta20transcriptcluster.db)
>
> path <- "/mnt/files/GSE111156"
> affy.data <- read.celfiles(filenames = list.celfiles(path,  full.names=TRUE, listGzipped=TRUE)[1:5] )
Reading in : /mnt/files/GSE111156/GSM3024184_Case_1_N.CEL.gz
Reading in : /mnt/files/GSE111156/GSM3024185_Case_1_A.CEL.gz
Reading in : /mnt/files/GSE111156/GSM3024186_Case_1_T.CEL.gz
<<snip>>

> normalized.data.plm <- fitProbeLevelModel(affy.data, target = "core")
Background correcting... OK
Normalizing... OK
Summarizing... OK
Extracting...
  Estimates... OK
  StdErrors... OK
  Weights..... OK
  Residuals... OK
  Scale....... OK
> 
> normalized.data.plm
Probe Level Model
Method..............: plm
# Samples...........: 5
# Probes (processed): 7576209
# Probesets.........: 925032
Annotation..........: pd.hta.2.0
> 
> # Note: 925032 probesets
>
> #annotate
> normalized.data.plm <- opset2eset(normalized.data.plm) #1st convert PLMset to eSet!
> normalized.data.plm <- annotateEset(normalized.data.plm,  hta20transcriptcluster.db)
> 
> #check
> normalized.data.plm
ExpressionSet (storageMode: lockedEnvironment)
assayData: 925032 features, 5 samples 
  element names: exprs, se.exprs 
protocolData
  sampleNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz
    ... GSM3024188_Case_2_A.CEL.gz (5 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  sampleNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz
    ... GSM3024188_Case_2_A.CEL.gz (5 total)
  varLabels: index
  varMetadata: labelDescription channel
featureData
  featureNames: 2824539_st 2824540_st ... TCONS_l2_00005064_st (925032
    total)
  fvarLabels: PROBEID ENTREZID SYMBOL GENENAME
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: pd.hta.2.0 
>
> # NO probeset could be annotated...??
> sumis.na(fData(normalized.data.plm)$ENTREZID))
[1] 925032
> 
> featureNames(normalized.data.plm)[1:5]
[1] "2824539_st" "2824540_st" "2824541_st" "2824542_st" "2824543_st"
>
> 
>
> # now process same input using rma function
> normalized.data.rma <- rma(affy.data, target = "core")
Background correcting
Normalizing
Calculating Expression
> normalized.data.rma
ExpressionSet (storageMode: lockedEnvironment)
assayData: 70523 features, 5 samples 
  element names: exprs 
protocolData
  rowNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz ...
    GSM3024188_Case_2_A.CEL.gz (5 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz ...
    GSM3024188_Case_2_A.CEL.gz (5 total)
  varLabels: index
  varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hta.2.0 
> 
> # Note: 70523 probesets
>
> Annotate
> normalized.data.rma <- annotateEset(normalized.data.rma,  hta20transcriptcluster.db)
>
> # 44% of the probesets could now be annotated...!?
> sumis.na(fData(normalized.data.rma)$ENTREZID))
[1] 39696
>
> featureNames(normalized.data.rma)[1:5]
[1] "2824546_st" "2824549_st" "2824551_st" "2824554_st" "2827992_st"

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Thirty)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] pd.hta.2.0_3.12.2               DBI_1.0.0                      
 [3] RSQLite_2.1.1                   hta20transcriptcluster.db_8.7.0
 [5] org.Hs.eg.db_3.8.2              AnnotationDbi_1.46.0           
 [7] affycoretools_1.56.0            oligo_1.48.0                   
 [9] Biostrings_2.52.0               XVector_0.24.0                 
[11] IRanges_2.18.1                  S4Vectors_0.22.0               
[13] Biobase_2.44.0                  oligoClasses_1.46.0            
[15] BiocGenerics_0.30.0
oligo HTA2.0 affycoretools RMA fitProbeLevelModel • 1.6k views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 11 hours ago
United States

I don't think fitProbeLevelModel has been updated to correctly process the HTA arrays. If you run RMA, the mapping of probes to probesets is done by the internal function getFidMetaProbesetCore, which does

> oligo:::getFidMetaProbesetCore
function (object, sortBy = "fsetid") 
{
    conn <- db(object)
    on.exit(dbDisconnect(conn))
    if (!is.null(sortBy)) 
        sortBy <- match.arg(sortBy, c("fid", "fsetid"))
    if (!is(object, "HTAFeatureSet")) {
        sql <- "SELECT fid, meta_fsetid as fsetid FROM pmfeature INNER JOIN core_mps USING(fsetid)"
    }
    else {
        sql <- "SELECT fid, core_mps.transcript_cluster_id as fsetid FROM pmfeature INNER JOIN core_mps USING(fsetid)"
    }
    featureInfo <- dbGetQuery(conn, sql)
    if (!is.null(sortBy)) {
        featureInfo <- featureInfo[order(featureInfo[[sortBy]]), 
            ]
        rownames(featureInfo) <- NULL
    }
    return(featureInfo)
}

The operating principle here being the SQL for HTA arrays being different from the other ST type arrays. If you use fitProbeLevelModel, it calls getProbeInfo, which if you follow the if/else tree, ends up doing the following,

        fields <- unique(c("fid", "man_fsetid", field))
        fields[fields == "fsetid"] <- paste(probeTable, "fsetid", 
            sep = ".")
        if (isST | class(object) == "TilingFeatureSet") 
            fields[fields == "man_fsetid"] <- paste(probeTable, 
                "fsetid as man_fsetid", sep = ".")
        fields <- paste(fields, collapse = ", ")
        sql <- paste("SELECT", fields, "FROM", probeTable, "INNER JOIN featureSet", 
            "USING(fsetid)")

Resulting in

Browse[2]> sql
[1] "SELECT fid, man_fsetid, pmfeature.fsetid FROM pmfeature INNER JOIN featureSet USING(fsetid)"

Which is (obviously) a different SQL query, and will not correctly map the probes to probesets. I can put this on my TODO list, but it's not going to be a high priority, given that rma does work correctly, and so far as I can tell, Affy/Fisher have never really got people to use these arrays anyway.

ADD COMMENT
0
Entering edit mode

Thanks James, got it! Regarding putting it on a TODO list: on one hand, for the reasons you explained above, I agree it may feel it hasn't high priority. On the other hand, because these arrays can only be processed through the oligo framework, and within oligo only the function fitProbeLevelModel (but not rma) allows the calculation of quality-control metrics and images (pseudo-images, weights, residuals, RLE, NUSE plots) I would argue it is relevant to correct the SQL query... ;)

ADD REPLY

Login before adding your answer.

Traffic: 692 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6