Question

Filtering on present/absent calls methodology and perfect match only arrays

0

Entering edit mode

svlachavas ▴ 840

@svlachavas-7225

Last seen 13 months ago

Germany/Heidelberg/German Cancer Resear…

Dear Bioconductor community,

recenltly i started analyzing a dataset which has annotation: "primeview", which is a perfect-match-only(PM) array. Thus, i cant use filtering on present/absent calls with mas5calls function from the mas5algorithm, to filter based on absent/present calls. Is there another alternative methodology based on intensity but not on variance for filtering prior to statistical inference, regarding PM only arrays ? I found a paper by Wu, Z. and R. A. Irizarry (2005) about half-price, but i couldnt find any vignette or tutorial about the code or anything.

My second and also important question refers to a methodology i used both on HG-U133a & also HGU-133aplus2 microarray platform datasets, after normalization with GCRMA(as they both have PM & MM probes. More specifically, for instance(about the hgu133a)

eset <- gcrma(cancer_data) # cancer_data my affybatch object

mas5.eset <- mas5calls(cancer_data)

call.matrix <- exprs(mas5.eset)
AbsentCalls.index <-
( ifelse(call.matrix == 'A',1,0) | ifelse(call.matrix == 'M',1,0) )
AbsentCalls.index <- which(AbsentCalls.index == TRUE)
exprs(eset)[AbsentCalls.index] <- NA
filtered.eset <- eset

My basic consern is (because im new in r) if with the above code i "mark" correctly as NA absent & present calls, because i use a different normalization algorithm than mas5 ??

bioconductor mas5 perfectmatch affymetrixchip filter • 3.8k views

ADD COMMENT • link updated 9.8 years ago by James W. MacDonald 67k • written 9.8 years ago by svlachavas ▴ 840

0

Entering edit mode

Moreover, another important problem that i enganged during quality control with the primeview genechip:

during quality control, it gave me this error:

library(simpleaffy)

dataset.qc <- qc(dataset) # dataset is my affybatch object. Any suggestions ?

Error in setQCEnvironment(cdfn) :

Could not find array definition file ' primeviewcdf.qcdef '. Simpleaffy does not know the QC parameters for this array type.

Is this because it doesnt work with perfectmatch genechip arrays, or is something with my cdf file?

ADD REPLY • link 9.8 years ago svlachavas ▴ 840

score 2 · Answer 1 · 2015-02-23

dear svlachavas,

concerning your first question about filtering on intensity, what you could do is to plot the density of the probe set expression values after pre-processing and use the mode (i.e. highest peak of the distribution) as a value for a intensity based filter. example:

eset <- rma( abatch )
dens <- density( exprs( eset )[ , 1 ] )
mode.x <- dens$x[ which.max( dens$y ) ]
## plot it
plot( dens, xlab=expression( log[2]~expression ) )
abline( v=mode.x, col="red" )

you could then use this value to define which probe sets are very low or not at all expressed (assuming that less than ~50% of genes are expressed in the analyzed tissue).

AbsentOrLow.idx <- which( exprs( eset )[ , 1 ] < mode.x )

Regarding your second question: I don't think it is wise to combine data from different pre-processing approaches. To start with, I don't think you should use the mas5 pre-processing (plenty of papers out there) at all and you should also be careful with the absent call, as about 30% of MM probes yield higher signals then their PM counterpart [Naef and Magnasco, Phys. Rev. 2003, vol. 68 (1)]. If at all, I would use the approach from above, specifically so for the GCRMA pre-processing as you usually see there a bi-modal distribution of probe set expression values, with the lower peak most likely corresponding to non- or very low expressed genes.

hope this helps,

cheers, jo

score 1 · Answer 2 · 2015-02-23

1

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 4 hours ago

United States

The xps package has a function dabg.call(), which implements Affy's 'Detection above background' measure, which is what you should probably use if you want something like the P/A calls that you used to be able to get with the PM/MM arrays.

The method you use in your first post does something that I wouldn't recommend. I have a hard time coming up with a good rationale for taking data and converting to NA. This just makes downstream processing of the data more difficult, and may cause things to happen that you didn't expect. Instead, you should think of your data in terms of probesets over all samples, rather than individual probesets per sample.

What I mean by that is you should be looking for probesets where you have little or no evidence for expression over all the samples, and excluding those probesets from consideration, rather than picking and choosing probesets by subject that you don't like. As an example, say you have three groups of 5 subjects (treated, control, vehicle), and you find that probeset X is below detection in the control and vehicle but really high in treated. If you convert all the probeset data to NA for the control and vehicle, you now cannot make any comparisons, and you have effectively thrown that probeset out for no good reason.

Instead, I would recommend doing something like

ind <- rowSums(call.matrix == "P") > 4

eset <- eset[ind,]

In which case you are unbiasedly choosing probesets that appear to be present in at least 5 samples

The error you see with simpleaffy is saying that there isn't a set of QC parameters in the simpleaffy package for this array. See section 5.4 in the simpleaffy vignette (which you should be reading anyway, if you are using that package) http://bioconductor.org/packages/release/bioc/vignettes/simpleaffy/inst/doc/simpleAffy.pdf

You will likely get other errors with simpleaffy, as it was originally designed for PM/MM arrays, and IIRC uses MM measures as part of its processing. If so you might consider looking at arrayQualityMetrics.

ADD COMMENT • link 9.8 years ago James W. MacDonald 67k

0

Entering edit mode

I should note that the filtering criterion I specify above pertains to my example, and not to your data. You have to decide how many samples have to be present, but the usual recommendation is that you should choose the smallest group for that value. In other words, if you have 5 treated and 4 controls, you would require at least 4 samples to be present in order to keep that probeset.

ADD REPLY • link 9.8 years ago James W. MacDonald 67k

0

Entering edit mode

Dear James W. MacDonald,

please excuse me because my laptop didnt show any answers, so i'm answering with some delay. Because Mr johannes Rainer also stated an issue about using mas5calls with gcrma, heres my full code for evaluation:

eset <- gcrma(cancer_data)

mas5.eset <- mas5calls(cancer_data)

call.matrix <- exprs(mas5.eset)
AbsentCalls.index <-
( ifelse(call.matrix == 'A',1,0) | ifelse(call.matrix == 'M',1,0) )
AbsentCalls.index <- which(AbsentCalls.index == TRUE)
exprs(eset)[AbsentCalls.index] <- NA
filtered.eset <- eset

## Data imputation

require(impute)
data.imputed.list <- impute.knn(exprs(filtered.eset), k=10, rowmax=0.3, colmax=0.90)
data.imputed.matrix <- data.imputed.list$data
data.imputed.eset <- filtered.eset
exprs(data.imputed.eset) <- data.imputed.matrix

-----------TRUST FACTOR IMPLEMENTATION ----------
*Trust Factor* computation-The Trust Factor for genes = #Appearances in Replicates/#Replicates/Keep those with at least more than x% Present calls in one condition( trust factor threshold=[0,1])

condition <- as.factor(data.imputed.eset$condition)
samples <- factor(condition, levels(condition)[c(2,1)])

conditions.matrix <- matrix(NA, nrow=dim(call.matrix)[1], ncol=length(levels(samples)))
for( i in 1:length(levels(samples)) ) {

condition.index <- which( samples == levels(samples)[i] )

condition.call.matrix <- as.matrix(call.matrix[, condition.index])

conditions.myCalls.vector <- apply(
(ifelse(condition.call.matrix == "A",0,1) &
ifelse(condition.call.matrix =="M",0,1)), 1, sum)

conditions.matrix[, i] <- conditions.myCalls.vector/length(condition.index)
}

myCalls.vector <- apply(
ifelse( conditions.matrix >= TrustFactor.thres, 1, 0),
1, sum) # TrustFactor.thres=0.6

myCalls.index <- which(myCalls.vector >= 1 )

data.trusted.eset <- data.imputed.eset
data.trusted.matrix <- exprs(data.imputed.eset)[myCalls.index, ]
exprs(data.trusted.eset) <- data.trusted.matrix

To summarize, this is my filtered expression set for moving towards to limma statistical analysis for the above condition implemented in Trust Factor(cancer vs normal samples). Thus, it would be very valuable and important for me to check the whole process, and also regarding the issue Mr Johannes Rainer stated, about using mas5 and gcrma. Because this was an implementation of a routine process we have in our lab, thats why it is very crusial, because i thought that if i only use mas5calls to create P/A/M calls and then put them in an expressionset, i could then easily mark my normalized expressionset after GCRMA, because also GCRMA uses MM probes.

ADD REPLY • link 9.8 years ago svlachavas ▴ 840

0

Entering edit mode

Sorry, but it's time for some tough love. What you are doing here is completely crazy. You are imputing data that only needs to be imputed because you have artificially created missing data! What is your rationale for doing this? Did you see some script somewhere, or is this your own invention? Regardless, you need to stop that now.

In addition, you took something that requires TWO LINES of code to accomplish (and I gave you those two lines), and have blown that out into some huge mess of code. Why are you taking simple processes and making them so complicated? There are plenty of complications here without arbitrarily creating more. Also, don't do this:

data.trusted.eset <- data.imputed.eset
data.trusted.matrix <- exprs(data.imputed.eset)[myCalls.index, ]
exprs(data.trusted.eset) <- data.trusted.matrix

There are only very few reasons (and none that apply here) for you to be extracting data from an ExpressionSet, only to put it back into a different one. If you want to subset an ExpressionSet, just subset the thing.

data.trusted.eset <- data.imputed.eset[myCalls.index,]

While Johannes Rainer did say something about not mixing methods, I believe his main point was that mas5 calls are sort of worthless, given that about 30% of the MM probes are brighter than the corresponding PM probes. You don't have a good measure of background when the background measure is brighter than foreground 30% of the time.

I believe his other point was that you can use a simple rule to decide what expression level is equivalent to 'not expressed' and filter on that, rather than relying on an algorithm that even Affymetrix hasn't used in 10 years.

ADD REPLY • link 9.8 years ago James W. MacDonald 67k

0

Entering edit mode

Of course, not offended at all. As i mentioned, i didnt have any background on programming and im using R for the last 3-4 months, so every comment and help would be beneficial. The above code was a moderated script from one preprossesing pipeline which is used (http://grissomdevweb.ekt.gr/latest_galaxy/)(http://grissom.gr/armada/documentation.html) and the above script has been writen in R- regarding the fact for the imputed data, the rationale for using this is to mark as NA Marginal and Absent calls and then fill the NA values with some algorithm for missing value imputation. This is something that i not invented, and also after that the trust factor implementation, which is similar for the groups you have proposed above

ADD REPLY • link 9.8 years ago svlachavas ▴ 840

0

Entering edit mode

http://grissom.gr/armada/documentation.html

http://grissomdevweb.ekt.gr/latest_galaxy/

ADD REPLY • link 9.8 years ago svlachavas ▴ 840

0

Entering edit mode

Regarding the only PM genechip array("primeview") I also use arrayQualityMetrics, but i dont get any result. The only diagnostic plots i managed to get is with affyPLM package

ADD REPLY • link 9.8 years ago svlachavas ▴ 840