Question

DEG Filtering

0

Amit • 0

@b648b3f5

India

I am performing a DGE analysis of microarray data from GEO with 26 samples and around 56,000 genes. After applying nsFilter(), the number of genes drops to around 8,000. So my question is: is it necessary to perform the gene filtering step?

MicroarrayData limma Genefiltering • 434 views

ADD COMMENT • link updated 4 weeks ago by Gordon Smyth 50k • written 5 weeks ago by Amit • 0

score 2 · Answer 1 · 2024-03-16

2

Gordon Smyth 50k

@gordon-smyth

WEHI, Melbourne, Australia

I am the author of the limma package and I definitely do not recommend the use of nsFilter(). nsFilter() is actively harmful to an analysis rather than helpful. It does too much filtering and the wrong sort of filtering.

It is much better to perform less heavy-handed filtering based on simple biological principles. limma can run on most microarray datasets without any filtering, but filtering of probes that no longer have recognized annotation or probes that have consistently low intensity in your data is usually helpful.
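As a rough illustration of these two filters, here is a minimal sketch in base R on simulated data. The matrix `expr`, the `symbol` annotation vector, and the cutoff of 3 are assumptions for illustration only; a real analysis would take annotation from the platform's annotation package and choose a cutoff suited to the platform and normalization method.

```r
# Toy log2 expression matrix: 1000 probes x 26 samples (simulated, not real data)
set.seed(1)
n_probes <- 1000
n_samples <- 26
probe_mean <- runif(n_probes, min = 2, max = 12)  # probe-specific average intensity
expr <- matrix(rnorm(n_probes * n_samples, mean = probe_mean, sd = 0.5),
               nrow = n_probes,
               dimnames = list(paste0("probe", seq_len(n_probes)),
                               paste0("S", seq_len(n_samples))))

# Hypothetical annotation: NA marks probes with no recognized gene symbol
symbol <- paste0("GENE", seq_len(n_probes))
symbol[sample(n_probes, 100)] <- NA

# Filter 1: keep probes that still have annotation
has_annotation <- !is.na(symbol)

# Filter 2: keep probes whose average log2 intensity exceeds an assumed cutoff
expressed <- rowMeans(expr) > 3

keep <- has_annotation & expressed
expr_filtered <- expr[keep, , drop = FALSE]

# Report how many probes survive each filter
print(c(total = nrow(expr), annotated = sum(has_annotation),
        expressed = sum(expressed), kept = sum(keep)))
```

Neither filter looks at the variability of a probe across samples or at the group labels; they only use annotation status and overall intensity.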

If you explain which GEO dataset you are looking at, then it would be possible to give more specific advice. Optimal filtering does depend on the microarray platform and on the background correction and normalization methods that have been used. I would almost always retain more than 8000 probes in an analysis.

Some previous answers on this topic:

ADD COMMENT • link 5 weeks ago Gordon Smyth 50k

0

Hi Gordon, thank you for your response and advice. I want to analyze dataset GSE18090 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi), which uses GPL570 [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array. I have normalized the data using "RMA" and batch corrected it using "NOISeq". My question is: do probes for lowly expressed genes or the least variable genes interfere with the analysis? Should I filter them out?

ADD REPLY • link 5 weeks ago Amit • 0

1

RMA normalization is good but batch correction using NOISeq isn't right. Data shouldn't be batch corrected before the limma analysis and NOISeq is only for sequencing data.

With RMA data, I would suggest just a little filtering of lowly expressed genes by

keep <- rowMeans(exprs(eset)) > 3
eset <- eset[keep,]

See also: "limma: vooma mean-variance trend and data filtration"

Please do not filter "least variable genes" as that is unnecessary and will interfere with the analysis.

ADD REPLY • link 5 weeks ago Gordon Smyth 50k

0

Hello Gordon, the PCA plot for the RMA-normalized data shows mixing of samples. Is this a batch effect? I ask because when I apply ARSyNseq() and plot the PCA again, the samples group separately.

Code :

BATCH.cor <- readData(NORM.data, factor = PHENO)

BATCH.data <- ARSyNseq(BATCH.cor, factor="Group", batch = FALSE, norm = "n", logtransf = TRUE)

ADD REPLY • link 5 weeks ago Amit • 0

1

This is Affymetrix microarray data. It is nonsensical to apply functions like NOISeq::readData() or NOISeq::ARSyNseq() to Affymetrix log-intensity data because those functions are strictly designed for sequence read count data. Amongst many other problems, you are log-transforming RMA normalized values that were already on the log-scale.
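The general problem with log-transforming values that are already on the log scale can be seen in a small numeric sketch. The intensities below are made up for illustration and are not taken from GSE18090. On the log2 scale that RMA returns, a two-fold change is a difference of 1 at any intensity; after a further log transform, the same two-fold change gives a different value depending on the intensity.

```r
# Made-up raw intensities: each pair differs by exactly two-fold
raw_low  <- c(100, 1000, 10000)
raw_high <- 2 * raw_low

# log2 scale, as returned by RMA: a two-fold change is always a difference of 1
log_diff <- log2(raw_high) - log2(raw_low)

# Logging a second time: the same two-fold change now depends on intensity
double_log_diff <- log2(log2(raw_high)) - log2(log2(raw_low))

print(log_diff)
print(round(double_log_diff, 3))
```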

You need to analyse the data by standard methods for microarrays, for example, as explained in the limma User's Guide. There is plenty of advice about analysing Affymetrix data on the Bioconductor website. It is pretty straightforward really. If there is a batch effect, then it can be included in the limma linear model.
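A minimal sketch of including a batch term in the limma linear model, using simulated data. The `group` and `batch` factors and the matrix `expr` are hypothetical; for a real dataset the batch information would have to come from the sample annotation, and this is only one way of setting up the design.

```r
library(limma)

# Simulated log2 expression: 500 probes x 12 samples (not real data)
set.seed(1)
expr <- matrix(rnorm(500 * 12, mean = 7, sd = 1), nrow = 500,
               dimnames = list(paste0("probe", 1:500), paste0("S", 1:12)))

# Hypothetical sample annotation: two groups, each split across two batches
group <- factor(rep(c("control", "case"), each = 6),
                levels = c("control", "case"))
batch <- factor(rep(c("b1", "b2"), times = 6))

# Batch enters the design matrix as an additive term;
# the expression values themselves are left unchanged
design <- model.matrix(~ batch + group)

fit <- lmFit(expr, design)
fit <- eBayes(fit)

# Test the group effect, adjusted for batch
top <- topTable(fit, coef = "groupcase", number = 10)
print(head(top))
```

Because the batch term is fitted alongside the group term, the group comparison is adjusted for batch without a separate correction step before the limma analysis.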

ADD REPLY • link 5 weeks ago Gordon Smyth 50k

0

Hi Gordon, I am using .CEL files for the analysis. Should I follow the same steps we discussed above?

ADD REPLY • link 5 weeks ago Amit • 0

1

I already know that you are using CEL files because you need them to perform RMA normalization. Yes, you should follow the advice I have already given you.

ADD REPLY • link 4 weeks ago Gordon Smyth 50k