I apologize if this is a duplicate question,I read limmas paper (filtering section) it referred that after normalizing microarray data, for downstream analysis, it is usually worthwhile to remove probes that appear not be expressed but on the other hand in this post Non-specific filtering methodogies for ExpressionSet in R/Bioconductor , it referred that, filtering is not much needed if we use limma for microarray analysis , now if you think it is better to apply filtering before the linear modelling and empirical Bayes steps, please let me know what kind of filtering I can use without conflict with distribution of variance ?

I think it is reasonable to filter out probe sets that can be confidently identified as non-expressed before the eBayes step. The purpose of eBayes is to squeeze the variance of each probe set toward the average variance of all probe sets, under the assumption that this average is a reasonable representation of the overall behavior of the data. Ideally, that average should only reflect the biology of your system, not the technical noise in probes that aren't measuring anything. So removal of non-expressed probes should help ensure that. However, it might make little difference in practice.

One thing you definitely want to avoid is to filter probe sets based on their variance. Doing so would significantly bias the eBayes estimate of the average variance of the data set and compromise your entire analysis.

actually the approach of "non-specific intensity filtering", is beneficial prior using limma, in order to remove non-interesting probes(i.e. probes that not expressed in the majority of your samples, & also to reduce the number of multiple hypothesis tests). On the other hand, is also data-dependent and it's up to you if and how you could filter your data prior statistical inference.

If you have more details about the data/technology you are using and your experimental design, would be also more helpful than some general suggestions.

Thank you Dear svlachavas for useful comment, I read the link, I found very helpful points to consider,I am doing a DEG analysis, my study is a class comparison between to condition, normal vs cancer on Affymetrix Human Genome U133 Plus 2.0 Array and Affymetrix Human Genome U133A Array, as you said, I think removing of non-informative probs could be helpful for reducing the number of multiple hypothesis tests
I read the link and I found very intresting notes that you and Dr. Gordon Smyth have mentioned and based of that I changed my scripts, it's kind of you if let me know about any problem or advice

Well, i personally believe that filtering on the A values from above would be too strigent, as you could possibly loose informative probesets, that are lowly expressed in only one of the two groups--except you have already performed DE analysis, and you get very few genes with FDR<5%--or, from an MDS plot or a PCA you don't "see" relative differences between your groups of comparison. Actually, as i saw that you use the affymetrix hgu133a & plus2 arrays, my suggestion (there are still many alternative ways to consider), is to use the nice R package panp (https://www.bioconductor.org/packages/release/bioc/html/panp.html), which has been created for these two specific platforms. In detail, you could create present/absent calls for each of your probesets in each sample (that is if a probeset is expressed or not in a given sample), and then you could follow very simply the following implementation:

# let's say for instance that each group has 10 samples

sel <- rowSums(mat=="P") >10 # where mat the matrix output of panp-see page 5 of the vignette
eset.filtered <- eSet[ind,] # where eSet your normalized expressionSet

Hi Efstathios, first I thank you for your nice offer about panp package,

but I have a question about using this package, in your above scripts, you said that for instance each group has 10 samples, I don't know your mean about each group... !! in my case I have 3 GSE that have 100, 120 and 130 samples ,I think again I need a cutoff to specify the number of samples !!? please let me know if there is an important note about this case

Sorry to reply with delay. I just proposed a simply example of comparison groups, with equal sample size each. So, in your case, which is the sample size of cancer and control samples, respectively ? I assume you analyze each dataset, separately, right ?

The first one>> 20 sample control+ 30 sample ductal carcinoma insitu(DCIS)+30 sample invasive ductal carcinoma(IDC)=100

The second one>>50sample(DCIS) + 70 sample(IDC)=120

The third one>> 30 sample control+ 45 (DCIS)+55 sample (IDC)=130

I want to merge these datasets with combat methode and then DEG analysis(DCIS vs control and IDC vs control)

From the previous post I found that if I use non-spesific filtering I will give a better output from my analyze, so as you suggest me, I decide to perform pa.calls function on my dataset I read its vignette but as I mentioned above I am confused with selecting the correct sample size

So, you have already performed merging of your datasets and ComBat, right ? So, you have a unified dataset: if so, you could very "naively", set a cut-off number of samples by each group-however, as i see from above, the number of samples in each group varies and it is not the same. So, you could "arbitary" choose a number of samples to apply the cutoff--or alternatively (which is a bit more complicated), you could utilize the matrix of the P/A values from PANP, and keep the probesets which are present in half of the samples in at least one condition. The last, sounds more appropriate to me.

excuse me for these noob questions, I tried to perform pa.calls function on my merged eSet but it gave me an error

PAPA<-pa.calls(test1)
Error in AllExprs[NSMPnames, ] : subscript out of bounds
In addition: Warning messages:
1: In if (chip == "hgu133b") { :
the condition has length > 1 and only the first element will be used
2: In if ((chip != "hgu133a") & (chip != "hgu133atag") & (chip != "hgu133plus2")) { :
the condition has length > 1 and only the first element will be used
3: In if ((chip == "hgu133a") | (chip == "hgu133atag")) { :
the condition has length > 1 and only the first element will be used

also I don't know how can I deal with those esets that have more than one group, in your above codes you have supposed that each eSet has one covariate i.e. cancer or control, for example in this code

sel <- rowSums(mat=="P") >10

if the eset has the samples from case and control,this logical condition perform on both of group (case and control) I want to know there is any problem if I use this code for my dataset that has sample from case and control?

What platforms your individual datasets comprise of ? If they don't have the same, i suspect that this is the reason for the above error. If you type test1, it would return a slot called Annotation:

so, there you will probably have more that than one annotation platforms. Thus, if my notion is correct, then you will have to use a different kind of filterning, because PANP can't worked with merged eSets comprising of multiple annotations. In this case, you could consider something similar:

ind <- rowSums(exprs(test1)> expression.cutoff) >=N # where in the expression.cutoff you could use something like 6, and the number of samples is up to you.

Alternatively, if you don't have specific types of comparison you want to perform that the merging of datasets would be beneficial to increase sample size, you should also consider analyzing first each dataset separately.

Thank you Dear svlachavas for useful comment, I read the link, I found very helpful points to consider,I am doing a DEG analysis, my study is a class comparison between to condition, normal vs cancer on Affymetrix Human Genome U133 Plus 2.0 Array and Affymetrix Human Genome U133A Array, as you said, I think removing of non-informative probs could be helpful for reducing the number of multiple hypothesis tests

I read the link and I found very intresting notes that you and Dr. Gordon Smyth have mentioned and based of that I changed my scripts, it's kind of you if let me know about any problem or advice

design <- model.matrix(~factor(test$Disease))

fit <- lmFit(test, design)

keep <- fit$Amean > median(fit$Amean)

fit <- eBayes(fit[keep,], robust=TRUE, trend=TRUE)

tab <- topTable(fit, coef=2, adjust="fdr", n=150)

all the best

Well, i personally believe that filtering on the A values from above would be too strigent, as you could possibly loose informative probesets, that are lowly expressed in only one of the two groups--except you have already performed DE analysis, and you get very few genes with FDR<5%--or, from an MDS plot or a PCA you don't "see" relative differences between your groups of comparison. Actually, as i saw that you use the affymetrix hgu133a & plus2 arrays, my suggestion (there are still many alternative ways to consider), is to use the nice R package panp (

https://www.bioconductor.org/packages/release/bioc/html/panp.html), which has been created for these two specific platforms. In detail, you could create present/absent calls for each of your probesets in each sample (that is if a probeset is expressed or not in a given sample), and then you could follow very simply the following implementation:# let's say for instance that each group has 10 samples

i hope that helps,

Efstathios

ّّI'll try with panp, Thank you alot

Hi Efstathios, first I thank you for your nice offer about panp package,

but I have a question about using this package, in your above scripts, you said that for instance each group has 10 samples, I don't know your mean about each group... !! in my case I have 3 GSE that have 100, 120 and 130 samples ,I think again I need a cutoff to specify the number of samples !!? please let me know if there is an important note about this case

Sincerely

Shamim Sarhadi

Sorry to reply with delay. I just proposed a simply example of comparison groups, with equal sample size each. So, in your case, which is the sample size of cancer and control samples, respectively ? I assume you analyze each dataset, separately, right ?

Hi,Thanks for your reply

Let me explain my case more clearly

I have 3 datasets

The first one>> 20 sample control+ 30 sample ductal carcinoma insitu(DCIS)+30 sample invasive ductal carcinoma(IDC)=100

The second one>>50sample(DCIS) + 70 sample(IDC)=120

The third one>> 30 sample control+ 45 (DCIS)+55 sample (IDC)=130

I want to merge these datasets with combat methode and then DEG analysis(DCIS vs control and IDC vs control)

From the previous post I found that if I use non-spesific filtering I will give a better output from my analyze, so as you suggest me, I decide to perform pa.calls function on my dataset I read its vignette but as I mentioned above I am confused with selecting the correct sample size

Thank you alot

Sincerely

So, you have already performed merging of your datasets and ComBat, right ? So, you have a unified dataset: if so, you could very "naively", set a cut-off number of samples by each group-however, as i see from above, the number of samples in each group varies and it is not the same. So, you could "arbitary" choose a number of samples to apply the cutoff--or alternatively (which is a bit more complicated), you could utilize the matrix of the P/A values from PANP, and keep the probesets which are present in half of the samples in at least one condition. The last, sounds more appropriate to me.

excuse me for these noob questions, I tried to perform pa.calls function on my merged eSet but it gave me an error

also I don't know how can I deal with those esets that have more than one group, in your above codes you have supposed that each eSet has one covariate i.e. cancer or control, for example in this code

`sel <- rowSums(mat=="P") >10`

if the eset has the samples from case and control,this logical condition perform on both of group (case and control) I want to know there is any problem if I use this code for my dataset that has sample from case and control?

Thanks in advance

all the best

What platforms your individual datasets comprise of ? If they don't have the same, i suspect that this is the reason for the above error. If you type test1, it would return a slot called Annotation:

so, there you will probably have more that than one annotation platforms. Thus, if my notion is correct, then you will have to use a different kind of filterning, because PANP can't worked with merged eSets comprising of multiple annotations. In this case, you could consider something similar:

`ind <- rowSums(exprs(test1)> expression.cutoff) >=N`

# where in the expression.cutoff you could use something like 6, and the number of samples is up to you.Alternatively, if you don't have specific types of comparison you want to perform that the merging of datasets would be beneficial to increase sample size, you should also consider analyzing first each dataset separately.

Best,

Efstathios

Hi Efstathios

I am so grateful for your thorough answer, I tried with panp on each eset seperatly and it works,thanks for your answer

Best