I am performing DGE analysis of microarray data from GEO with 26 samples and around 56,000 genes. After applying nsFilter(), the number of genes drops to around 8,000. So my question is: is it necessary to perform a gene filtering step?
I am the author of the limma package and I definitely do not recommend the use of nsFilter(). nsFilter() is actively harmful to an analysis rather than helpful. It does too much filtering and the wrong sort of filtering.
It is much better to perform less heavy-handed filtering based on simple biological principles. limma can run on most microarray datasets without any filtering, but filtering of probes that no longer have recognized annotation or probes that have consistently low intensity in your data is usually helpful.
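As a rough illustration of the annotation-based filter, here is a minimal base R sketch on simulated data. The matrix `expr` and the `symbol` vector are hypothetical stand-ins; in a real analysis the symbols would come from whichever annotation source matches the array platform.

```r
# Simulated stand-in for a probe-level expression matrix (log2 scale).
set.seed(1)
expr <- matrix(rnorm(6 * 4, mean = 7), nrow = 6,
               dimnames = list(paste0("probe", 1:6), paste0("sample", 1:4)))

# Hypothetical annotation: NA or "" marks a probe with no recognized gene symbol.
symbol <- c("TP53", NA, "GAPDH", "", "ACTB", NA)

# Keep only probes that still map to a gene symbol.
has_annotation <- !is.na(symbol) & symbol != ""
expr_annotated <- expr[has_annotation, , drop = FALSE]

# Rows are the retained probes, columns are the samples.
dim(expr_annotated)
```

This removes only probes that cannot be interpreted biologically, so it is far lighter than a variance-based filter.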
If you explain which GEO dataset you are looking at, then it would be possible to give more specific advice. Optimal filtering does depend on the microarray platform and on the background correction and normalization methods that have been used. I would almost always retain more than 8000 probes in an analysis.
Some previous answers on this topic:
Hi Gordon, thank you for your response and advice. I want to analyze dataset GSE18090 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi), which uses GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array. I have normalized the data using "RMA" and batch corrected it using "NOISeq". My question is: do lowly expressed probes or the least variable genes interfere with the analysis? Should I filter them out?
RMA normalization is good but batch correction using NOISeq isn't right. Data shouldn't be batch corrected before the limma analysis and NOISeq is only for sequencing data.
With RMA data, I would suggest just a little filtering of lowly expressed genes.
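The code that accompanied this suggestion is not preserved in this copy of the thread. As a hedged sketch of what a light intensity filter on RMA values could look like, the base R snippet below keeps probes that exceed a cutoff in a minimum number of arrays. The simulated matrix `rma`, the cutoff of 5 and the minimum of 3 arrays are all assumptions chosen for illustration, not values taken from the answer.

```r
# Simulated RMA-like log2 intensities: 1000 probes, 26 arrays (values are made up).
# The first 500 probes are clearly expressed, the last 500 sit near background.
set.seed(42)
n_probes <- 1000
n_arrays <- 26
mu <- rep(c(8, 3), each = n_probes / 2)
rma <- matrix(rnorm(n_probes * n_arrays, mean = mu, sd = 0.5),
              nrow = n_probes,
              dimnames = list(paste0("probe", seq_len(n_probes)), NULL))

# Assumed settings: both numbers are placeholders to tune per dataset.
cutoff     <- 5   # log2 intensity regarded as "expressed" (hypothetical)
min_arrays <- 3   # for example, the size of the smallest group (hypothetical)

# Keep probes above the cutoff in at least min_arrays arrays.
keep <- rowSums(rma > cutoff) >= min_arrays
rma_filtered <- rma[keep, , drop = FALSE]

# Report how many probes survive the filter.
cat(sum(keep), "of", n_probes, "probes retained\n")
```

The filter ignores group labels and variability, so it only removes probes that are consistently near background.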
See also limma: vooma mean-variance trend and data filtration
Please do not filter "least variable genes" as that is unnecessary and will interfere with the analysis.
Hello Gordon, the PCA plot for the RMA-normalized data shows mixing of samples. Is this a batch effect? I ask because when I apply ARSyNseq() and plot the PCA again, the samples group separately.
Code :
BATCH.cor <- readData(NORM.data, factor = PHENO)
BATCH.data <- ARSyNseq(BATCH.cor, factor="Group", batch = FALSE, norm = "n", logtransf = TRUE)
This is Affymetrix microarray data. It is nonsensical to apply functions like NOISeq::readData or NOISeq::ARSyNseq() to Affymetrix log-intensity data because those functions are strictly designed for sequence read count data. Amongst many other problems, you are log-transforming RMA normalized values that were already on the log-scale.
You need to analyse the data by standard methods for microarrays, for example, as explained in the limma User's Guide. There is plenty of advice about analysing Affymetrix data on the Bioconductor website. It is pretty straightforward really. If there is a batch effect, then it can be included in the limma linear model.
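To illustrate including a batch effect in the limma linear model, here is a minimal sketch on simulated data. It assumes the batch labels are known; the `expr` matrix, the `group` and `batch` factors and their level names are hypothetical and not taken from GSE18090.

```r
library(limma)

# Simulated log2 expression: 200 probes, 12 arrays (all values are made up).
set.seed(7)
expr <- matrix(rnorm(200 * 12, mean = 7), nrow = 200,
               dimnames = list(paste0("probe", 1:200), paste0("s", 1:12)))

# Hypothetical sample annotation: two groups, each spread over two batches.
group <- factor(rep(c("control", "case"), each = 6),
                levels = c("control", "case"))
batch <- factor(rep(c("b1", "b2"), times = 6))

# Batch enters the model as a covariate; the expression values are left unchanged.
design <- model.matrix(~ batch + group)

# Fit the linear model and compute moderated statistics.
fit <- lmFit(expr, design)
fit <- eBayes(fit)

# Case versus control, adjusted for batch.
top <- topTable(fit, coef = "groupcase", number = 10)
head(top)
```

Because batch is part of the design matrix, the group comparison is adjusted for it without altering the normalized data beforehand.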
Hi Gordon, I am using .CEL files for the analysis. Should I proceed in the same way as we discussed above?
I already know that you are using CEL files, because you need them to perform RMA normalization. Yes, you should follow the advice I have already given you.