Question: affycoretools package, HTA 2.0 annotation query
4 months ago by
a.karanikolou0 wrote:

Hi all, I have recently run the HTA 2.0 annotation using the two packages affycoretools and hta20transcriptcluster.db, as outlined in the post https://support.bioconductor.org/p/89308/. However, I have run into an error after running rma(), as follows:

> exprs_mainProbes<-getMainProbes(exprsgs, pd.hta.2.0)
Error in switch(level, core = dbGetQuery(con, paste("select distinct core_mps.transcript_cluster_id, type from featureSet inner join core_mps",  : EXPR must be a length 1 vector


Could you shed light on this problem? Thanks in advance.

annotation affy
modified 4 months ago by James W. MacDonald51k • written 4 months ago by a.karanikolou0
Answer: affycoretools package, HTA 2.0 annotation query
4 months ago by
James W. MacDonald51k wrote:

You need to become familiar with the help system, and to check there to ensure you are doing things correctly. From ?getMainProbes, we have:

Usage:

getMainProbes(input, level = "core")

Arguments:

input: Either a character string (e.g., "pd.hugene.1.0.st.v1") or a
FeatureSet object.

level: The summarization level used when calling rma.



And nowhere in the help page does it say that the second argument should be a pdInfoPackage.
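To make the fix concrete, here is a minimal sketch of the corrected call, assuming exprsgs is the ExpressionSet returned by rma() in the original post:

```r
## The second argument of getMainProbes() is 'level', a character string,
## not a pdInfoPackage. Passing pd.hta.2.0 there is what triggers the
## "EXPR must be a length 1 vector" error from switch().
library(affycoretools)

## Wrong (from the original post):
## exprs_mainProbes <- getMainProbes(exprsgs, pd.hta.2.0)

## Right: let 'level' take its default, "core", which matches the
## default target used by rma()
exprs_mainProbes <- getMainProbes(exprsgs)

## or, equivalently, spell the default out:
exprs_mainProbes <- getMainProbes(exprsgs, level = "core")
```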

Is it mandatory to include level = "core" in getMainProbes(input, level = "core")?

In the help page, the argument "level" refers to "the summarization level used when calling rma". I'm not sure I understand how to apply this argument in getMainProbes.

If an argument has a default, then by definition it isn't necessary to supply a value for it. But perhaps you are actually asking what value you should use? Again, the answer to your question can be found by reading the help page, this time for rma:

Description:

Robust Multichip Average preprocessing methodology. This strategy
allows background subtraction, quantile normalization and
summarization (via median-polish).

Usage:

## S4 method for signature 'ExonFeatureSet'
rma(object, background=TRUE, normalize=TRUE, subset=NULL, target="core")
## S4 method for signature 'HTAFeatureSet'
rma(object, background=TRUE, normalize=TRUE, subset=NULL, target="core")



If you didn't specify a target argument for rma, then you by definition used the default, which matches the default for getMainProbes.
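A hedged sketch of the pairing, for the case where you do override the default; it assumes your CEL files are in the working directory, and it assumes (as the shared wording of "target"/"level" in the two help pages suggests) that getMainProbes() accepts the other rma() targets such as "probeset":

```r
library(oligo)
library(affycoretools)

## Summarize at the probeset level instead of the default "core"
raw  <- read.celfiles(list.celfiles("."))
eset <- rma(raw, target = "probeset")

## The 'level' argument to getMainProbes() must then match the
## target you gave rma()
main <- getMainProbes(eset, level = "probeset")
```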

Dear James, thank you for your previous reply. We used getMainProbes with the default level, matching the rma default; however, instead of the 67,528 main probesets that should have been retained, the result was 67,480. What would cause this difference? Thanks in advance.

I think your math is off.

> eset <- rma(read.celfiles(filenames = dir(".", "gz$")))
> annotation(eset)
[1] "pd.hta.2.0"
> dim(eset)
Features  Samples 
   70523        2 
> con <- db(pd.hta.2.0)
> z <- dbGetQuery(con, "select distinct core_mps.transcript_cluster_id, type from featureSet inner join core_mps using(fsetid);")
> table(z$type)

    1     2     3     4     5     6     7 
67516    23     4     4   155   698   646 
> dim(getMainProbes(eset))
Features  Samples 
   67480        2 


So there are 67480 main probesets out of the expected 67516, or 36 missing probesets. Where might those probesets be?

> reps <- z[duplicated(z[,1]) & !is.na(z$type), 1]
> table(z[z[,1] %in% reps, 2], useNA = "ifany")

1 <NA>
36   36


So there are 36 probesets that are both 'main' and 'NA'. If we look more closely, it's sort of weird:

> zlst <- lapply(seq(along = reps), function(x)
+     dbGetQuery(con, paste0("select fsetid, core_mps.transcript_cluster_id, type from featureSet inner join core_mps using(fsetid) where core_mps.transcript_cluster_id='",
+                            reps[x], "';")))
> zlst[[2]]
fsetid transcript_cluster_id type
1 19069316       TC01004261.hg.1   NA
2 19069317       TC01004261.hg.1    1
3 19069318       TC01004261.hg.1    1
4 19069319       TC01004261.hg.1    1
5 19069320       TC01004261.hg.1    1
> zlst[[3]]
fsetid transcript_cluster_id type
1 19072754       TC01005067.hg.1   NA
2 19072755       TC01005067.hg.1   NA
3 19072756       TC01005067.hg.1   NA
4 19072757       TC01005067.hg.1    1
> zlst[[4]]
fsetid transcript_cluster_id type
1  19112883       TC02002789.hg.1   NA
2  19112884       TC02002789.hg.1   NA
3  19112885       TC02002789.hg.1   NA
4  19112886       TC02002789.hg.1   NA
5  19112887       TC02002789.hg.1   NA
6  19112888       TC02002789.hg.1   NA
7  19112889       TC02002789.hg.1   NA
8  19112890       TC02002789.hg.1    1
9  19112891       TC02002789.hg.1   NA
10 19112892       TC02002789.hg.1   NA
11 19112893       TC02002789.hg.1   NA
12 19112894       TC02002789.hg.1   NA
13 19112895       TC02002789.hg.1    1
14 19112896       TC02002789.hg.1    1
15 19112897       TC02002789.hg.1    1
16 19112898       TC02002789.hg.1    1
17 19112899       TC02002789.hg.1    1


So for whatever reason, Affy calls at least one probe per probeset type NA, so those probesets end up flagged as both 'main' and 'NA', and that is what causes them to be excluded. Given that this only affects 36 out of 67528 probesets (and our core won't use these arrays anyway), it has never been worth it to me to find out any more about it than that.
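The exclusion mechanism can be illustrated with a toy version of the z table above, no database needed (the transcript cluster IDs here are hypothetical, and this is only one plausible reading of why probesets tagged both 'main' and NA fall out of a main-only filter):

```r
## Toy version of the distinct (transcript_cluster_id, type) query result.
## TC_B mimics the 36 odd probesets: it appears with both type 1 ('main')
## and type NA, so a select-distinct on type returns two rows for it.
z <- data.frame(
  transcript_cluster_id = c("TC_A", "TC_B", "TC_B", "TC_C"),
  type = c(1, 1, NA, 7)
)

## Keep probesets whose every row is type 1: the NA row poisons all(),
## so TC_B is dropped despite having a 'main' row.
only_main <- tapply(z$type, z$transcript_cluster_id,
                    function(ty) all(ty == 1))
kept <- names(only_main)[!is.na(only_main) & only_main]

## Keep probesets with *any* type-1 row instead: TC_B survives.
any_main <- tapply(z$type, z$transcript_cluster_id,
                   function(ty) any(ty == 1, na.rm = TRUE))
kept_any <- names(any_main)[any_main]
```

With this toy table, kept contains only "TC_A", while kept_any contains "TC_A" and "TC_B", which mirrors the 67480-versus-67516 gap in the real data.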

Thank you a lot for this. The response from Affy was, and I quote: "67528 Probe sets listed: 44699 Coding and 22829 NonCoding". The categories used as controls sum to 2,995. There is no clear understanding of why Affy calls at least one probe per probeset an NA type, but we will take both your response and theirs into consideration before moving forward with the analysis. Thank you once again.