Question: Inconsistent results when normalizing HTA2.0 arrays at "level=core"
0
gravatar for Guido Hooiveld
5 months ago by
Guido Hooiveld2.5k
Wageningen University, Wageningen, the Netherlands
Guido Hooiveld2.5k wrote:

I noticed an intriguing discrepancy when I normalized a set of HTA2.0 arrays (GSE111156, here). Specifically, a different number of probesets is returned when using the rma or fitProbeLevelModel function in oligo, even if in both cases the argument target=core is used. I initially noticed this when I tried to annotate the normalized data at the transcript level, which returned only NAs when processed through fitProbeLevelModel, but when rma was used this was not the case. Then I noticed that the probesetIDs that are returned are not identical (though both time a 'core analysis' was performed). Any idea what causes this discrepancy?

> library(oligo)
> library(affycoretools)
> library(hta20transcriptcluster.db)
>
> path <- "/mnt/files/GSE111156"
> affy.data <- read.celfiles(filenames = list.celfiles(path,  full.names=TRUE, listGzipped=TRUE)[1:5] )
Reading in : /mnt/files/GSE111156/GSM3024184_Case_1_N.CEL.gz
Reading in : /mnt/files/GSE111156/GSM3024185_Case_1_A.CEL.gz
Reading in : /mnt/files/GSE111156/GSM3024186_Case_1_T.CEL.gz
<<snip>>

> normalized.data.plm <- fitProbeLevelModel(affy.data, target = "core")
Background correcting... OK
Normalizing... OK
Summarizing... OK
Extracting...
  Estimates... OK
  StdErrors... OK
  Weights..... OK
  Residuals... OK
  Scale....... OK
> 
> normalized.data.plm
Probe Level Model
Method..............: plm
# Samples...........: 5
# Probes (processed): 7576209
# Probesets.........: 925032
Annotation..........: pd.hta.2.0
> 
> # Note: 925032 probesets
>
> #annotate
> normalized.data.plm <- opset2eset(normalized.data.plm) #1st convert PLMset to eSet!
> normalized.data.plm <- annotateEset(normalized.data.plm,  hta20transcriptcluster.db)
> 
> #check
> normalized.data.plm
ExpressionSet (storageMode: lockedEnvironment)
assayData: 925032 features, 5 samples 
  element names: exprs, se.exprs 
protocolData
  sampleNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz
    ... GSM3024188_Case_2_A.CEL.gz (5 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  sampleNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz
    ... GSM3024188_Case_2_A.CEL.gz (5 total)
  varLabels: index
  varMetadata: labelDescription channel
featureData
  featureNames: 2824539_st 2824540_st ... TCONS_l2_00005064_st (925032
    total)
  fvarLabels: PROBEID ENTREZID SYMBOL GENENAME
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation: pd.hta.2.0 
>
> # NO probeset could be annotated...??
> sumis.na(fData(normalized.data.plm)$ENTREZID))
[1] 925032
> 
> featureNames(normalized.data.plm)[1:5]
[1] "2824539_st" "2824540_st" "2824541_st" "2824542_st" "2824543_st"
>
> 
>
> # now process same input using rma function
> normalized.data.rma <- rma(affy.data, target = "core")
Background correcting
Normalizing
Calculating Expression
> normalized.data.rma
ExpressionSet (storageMode: lockedEnvironment)
assayData: 70523 features, 5 samples 
  element names: exprs 
protocolData
  rowNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz ...
    GSM3024188_Case_2_A.CEL.gz (5 total)
  varLabels: exprs dates
  varMetadata: labelDescription channel
phenoData
  rowNames: GSM3024184_Case_1_N.CEL.gz GSM3024185_Case_1_A.CEL.gz ...
    GSM3024188_Case_2_A.CEL.gz (5 total)
  varLabels: index
  varMetadata: labelDescription channel
featureData: none
experimentData: use 'experimentData(object)'
Annotation: pd.hta.2.0 
> 
> # Note: 70523 probesets
>
> Annotate
> normalized.data.rma <- annotateEset(normalized.data.rma,  hta20transcriptcluster.db)
>
> # 44% of the probesets could now be annotated...!?
> sumis.na(fData(normalized.data.rma)$ENTREZID))
[1] 39696
>
> featureNames(normalized.data.rma)[1:5]
[1] "2824546_st" "2824549_st" "2824551_st" "2824554_st" "2827992_st"

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora 30 (Thirty)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] pd.hta.2.0_3.12.2               DBI_1.0.0                      
 [3] RSQLite_2.1.1                   hta20transcriptcluster.db_8.7.0
 [5] org.Hs.eg.db_3.8.2              AnnotationDbi_1.46.0           
 [7] affycoretools_1.56.0            oligo_1.48.0                   
 [9] Biostrings_2.52.0               XVector_0.24.0                 
[11] IRanges_2.18.1                  S4Vectors_0.22.0               
[13] Biobase_2.44.0                  oligoClasses_1.46.0            
[15] BiocGenerics_0.30.0
ADD COMMENTlink modified 5 months ago by James W. MacDonald52k • written 5 months ago by Guido Hooiveld2.5k
Answer: Inconsistent results when normalizing HTA2.0 arrays at "level=core"
1
gravatar for James W. MacDonald
5 months ago by
United States
James W. MacDonald52k wrote:

I don't think fitProbeLevelModel has been updated to correctly process the HTA arrays. If you run RMA, the mapping of probes to probesets is done by the internal function getFidMetaProbesetCore, which does

> oligo:::getFidMetaProbesetCore
function (object, sortBy = "fsetid") 
{
    conn <- db(object)
    on.exit(dbDisconnect(conn))
    if (!is.null(sortBy)) 
        sortBy <- match.arg(sortBy, c("fid", "fsetid"))
    if (!is(object, "HTAFeatureSet")) {
        sql <- "SELECT fid, meta_fsetid as fsetid FROM pmfeature INNER JOIN core_mps USING(fsetid)"
    }
    else {
        sql <- "SELECT fid, core_mps.transcript_cluster_id as fsetid FROM pmfeature INNER JOIN core_mps USING(fsetid)"
    }
    featureInfo <- dbGetQuery(conn, sql)
    if (!is.null(sortBy)) {
        featureInfo <- featureInfo[order(featureInfo[[sortBy]]), 
            ]
        rownames(featureInfo) <- NULL
    }
    return(featureInfo)
}

The operating principle here being the SQL for HTA arrays being different from the other ST type arrays. If you use fitProbeLevelModel, it calls getProbeInfo, which if you follow the if/else tree, ends up doing the following,

        fields <- unique(c("fid", "man_fsetid", field))
        fields[fields == "fsetid"] <- paste(probeTable, "fsetid", 
            sep = ".")
        if (isST | class(object) == "TilingFeatureSet") 
            fields[fields == "man_fsetid"] <- paste(probeTable, 
                "fsetid as man_fsetid", sep = ".")
        fields <- paste(fields, collapse = ", ")
        sql <- paste("SELECT", fields, "FROM", probeTable, "INNER JOIN featureSet", 
            "USING(fsetid)")

Resulting in

Browse[2]> sql
[1] "SELECT fid, man_fsetid, pmfeature.fsetid FROM pmfeature INNER JOIN featureSet USING(fsetid)"

Which is (obviously) a different SQL query, and will not correctly map the probes to probesets. I can put this on my TODO list, but it's not going to be a high priority, given that rma does work correctly, and so far as I can tell, Affy/Fisher have never really got people to use these arrays anyway.

ADD COMMENTlink written 5 months ago by James W. MacDonald52k

Thanks James, got it! Regarding putting it on a TODO list: on one hand, for the reasons you explained above, I agree it may feel it hasn't high priority. On the other hand, because these arrays can only be processed through the oligo framework, and within oligo only the function fitProbeLevelModel (but not rma) allows the calculation of quality-control metrics and images (pseudo-images, weights, residuals, RLE, NUSE plots) I would argue it is relevant to correct the SQL query... ;)

ADD REPLYlink written 5 months ago by Guido Hooiveld2.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 211 users visited in the last hour