I'm currently preprocessing 34 CEL files in R to find differentially expressed genes using various statistical tests. Before any statistical inference, I would like to ask which way of non-specific filtering is optimal for my normalized ExpressionSet?
Should I use criteria such as variance or standard deviation via the genefilter package, or also filter on present/absent calls?
There isn't really an 'optimal' filtering method. As with most things, there are tradeoffs involved when you are excluding data, and people tend to have their own opinions about what is and isn't a reasonable thing to do.
As you already know, there are methods in the genefilter package that can be used to filter data in a non-specific manner, and you can also remove probesets based on present/absent calls. Your goal as an analyst is to understand the tradeoffs involved with any filtering method you might care to use, and to have a defensible reason for those you choose.
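To make the genefilter option concrete, here is a minimal sketch of a variance-based non-specific filter using `varFilter()`. The simulated matrix, the use of IQR as the variability measure, and the 0.5 quantile cutoff are all assumptions for illustration, not recommendations for your data.

```r
library(Biobase)
library(genefilter)

# Simulated log2 expression values: 1000 probesets x 34 arrays (hypothetical data).
set.seed(1)
expr <- matrix(rnorm(1000 * 34, mean = 7, sd = 1.5),
               nrow = 1000,
               dimnames = list(paste0("probe_", 1:1000),
                               paste0("array_", 1:34)))
eset <- ExpressionSet(assayData = expr)

# Keep probesets whose IQR across arrays is above the 0.5 quantile of all
# IQRs, i.e. drop roughly the least variable half (the cutoff is an assumption).
eset_var <- varFilter(eset, var.func = IQR, var.cutoff = 0.5,
                      filterByQuantile = TRUE)

# Compare the number of probesets before and after filtering.
cat(nrow(exprs(eset_var)), "of", nrow(exprs(eset)), "probesets kept\n")
```

The tradeoff here is the usual one: a stricter cutoff reduces the multiple-testing burden but risks discarding probesets with small, real changes.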
Thank you for your answer! I understand that there is no "gold standard" for non-specific filtering, since it depends on the specific characteristics of the dataset being analyzed. My question is more about whether to use the optional step of filtering on present/absent calls (MAS5.0 or the panp package in R), or instead to perform non-specific filtering with one of the various options after quality control and normalization?
The filtering that is appropriate for a particular data set depends on the downstream analysis that you intend to do with the filtered results and, to a somewhat lesser extent, on how you preprocessed the Affymetrix data.
Filtering out consistently non-expressed probe-sets is by far the most common filtering step, because keeping probe-sets in your analysis that are never expressed is hardly ever useful. Apart from that, it is better not to filter unless you know what you're doing.
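As a rough sketch of what removing consistently non-expressed probe-sets can look like in base R, the example below keeps a probe-set only if it exceeds an intensity threshold on a minimum number of arrays. The simulated data and both thresholds (`min_log2`, `min_arrays`) are hypothetical and would need to be chosen for your own platform and design.

```r
# Simulated log2 expression matrix: 1000 probesets x 34 arrays (hypothetical data).
set.seed(1)
n_probes <- 1000
n_arrays <- 34
expr <- matrix(rnorm(n_probes * n_arrays, mean = 7, sd = 1.5),
               nrow = n_probes,
               dimnames = list(paste0("probe_", seq_len(n_probes)),
                               paste0("array_", seq_len(n_arrays))))

# Assumed thresholds, not recommendations: treat a probeset as expressed on an
# array if its log2 intensity exceeds 5, and keep it if that holds on at
# least 3 arrays (for example, the size of the smallest group).
min_log2 <- 5
min_arrays <- 3

keep <- rowSums(expr > min_log2) >= min_arrays
expr_filtered <- expr[keep, , drop = FALSE]

# Report how many probesets survive the filter.
cat(sum(keep), "of", n_probes, "probesets kept\n")
```

With a real ExpressionSet you would take the matrix from `exprs(eset)` and subset the object itself with `eset[keep, ]`.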
If you plan to use limma for the differential expression analysis, then filtering is not much needed, especially if you use trend=TRUE in the eBayes step. I personally prefer to keep it simple. Do some simple filtering on mean log-expression, or don't filter at all.
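A minimal sketch of that simple route, assuming a two-group design with 17 arrays per group and simulated log2 values; the group labels and the mean log-expression cutoff of 4 are made up for illustration.

```r
library(limma)

# Simulated log2 expression matrix and a hypothetical two-group design (34 arrays).
set.seed(1)
n_probes <- 1000
group <- factor(rep(c("control", "treated"), each = 17))
expr <- matrix(rnorm(n_probes * length(group), mean = 7, sd = 1),
               nrow = n_probes,
               dimnames = list(paste0("probe_", seq_len(n_probes)), NULL))

# Optional simple filter on mean log-expression (the cutoff is an assumption).
keep <- rowMeans(expr) > 4
expr <- expr[keep, , drop = FALSE]

# Fit the linear model; trend = TRUE lets the prior variance depend on the
# average log-expression of each probeset.
design <- model.matrix(~ group)
fit <- lmFit(expr, design)
fit <- eBayes(fit, trend = TRUE)

# Top probesets for the treated vs control coefficient.
top <- topTable(fit, coef = 2, number = 10)
print(top)
```

Skipping the `keep` step entirely is also reasonable here, since the intensity-dependent trend already accounts for the different variability of low-intensity probe-sets.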