Question

SD filtering before DEG analysis with limma : possible or not ?

0

Entering edit mode

alexia.brunel • 0

@9fc89008

Last seen 17 months ago

United States

Hi everyone,

I have microarray data from Affymetrix Human Genome U133 plus2.0 array to analyze from KO experiment with 2 different constructs and 3 replicates per condition. When I am doing DEG analysis with limma with all my probes (20422) I obtained only one significant gene (the one impacted by the KO) and all the others have the same adj.P.value (near 1) whereas pvalues are different and some have good FC (sup2). In the meantime if I filter my probes based on their STD and only keep fifty percent of the most variable (so 10 211 instead of 20 422) it completely changes statistics and now I have some significant DEG (not a lot, but with adj.p.value very low). I read in some papers that it could be possible to filter minimum variance across all samples, so if a gene isnt changing (constant expression) its inherently not interesting therefore no need to test. However, I am really not sure if I can do it from a mathematical point of view. Could you please confirm (or not) if it is possible or not ? Since the rest of the analysis (eg pathways) is based on this pvalue is really important for me !

Thank you very much for your help!

filtering STD DEG limma • 1.8k views

ADD COMMENT • link updated 18 months ago by Gordon Smyth 52k • written 18 months ago by alexia.brunel • 0

score 1 · Answer 1 · 2023-06-23

1

Entering edit mode

James W. MacDonald 67k

@james-w-macdonald-5106

Last seen 2 days ago

United States

Assuming that your question is in the post title, the answer is no. Filtering on SD will introduce bias. Filtering microarray data is a fraught process, and there's really no upside to doing so.

ADD COMMENT • link 18 months ago James W. MacDonald 67k

0

Entering edit mode

Sorry I had some trouble posting my question but now is good! So the only correct way to do is to keep all 20 422 probes ? Sorry I am new in this areas and I found everything and its opposite on papers and forums!

ADD REPLY • link 18 months ago alexia.brunel • 0

1

Entering edit mode

Presumably you are using the limma package to do the comparisons. If not, you should be. If you are, do note that the eBayes step is estimating a variance hyperparameter, which for purposes of this discussion we can think of as the average within-group variance over all genes. This hyperparameter is used as a variance prior, and all the observed variance estimates are then 'shrunken' towards the prior value (where 'shrunken' doesn't mean 'made smaller' but instead means 'made more similar to').

If you remove all the low variance genes, then the hyperparameter will be larger than it should be, and you will now adjust all the observed variance estimates towards the large variance prior. This is not good!

You would probably be better off using gcrma to account for background as a function of GC content, and then filter based on the number of subjects that are above some level that you want to consider to be background binding. As an example, say you have two groups with five subjects per. You could look at the distribution of the average expression level to say what you think is a level where the genes are either not expressed, or expressed at a level that is so low that noise dominates signal. You could then filter genes based on that cutoff as well as the number of subjects that have to exceed the cutoff. Let's say you decide the gene expression level is 4, and the N that have to exceed that value is 3. You could do

filtind <- rowSums(exprs(eset) > 4) > 3
eset <- eset[filtind,]

And now you have filtered the data in a way that won't affect the eBayes step, and since you used gcrma you can argue that the gene levels have been adjusted for GC content so a single arbitrary gene expression cutoff makes sense.

ADD REPLY • link 18 months ago James W. MacDonald 67k

1

Entering edit mode

Thanks for your elaborated your answer.

IMHO, the relevant article is Independent filtering increases detection power for high-throughput experiments from Bourgon et al.

ADD REPLY • link 18 months ago SamGG ▴ 360

0

Entering edit mode

That article is meant to support the genefilter package, which includes functionality to do row-wise t-tests. In that situation it's perfectly OK to filter on variance. But if the OP is using 'limma` (which is what the OP should be doing), then filtering on SD is a horrible idea.

ADD REPLY • link 18 months ago James W. MacDonald 67k

0

Entering edit mode

Thank you for your valuable comments pointing out the inconsistency of merging stages from different pipelines as I did. This might be the case for the OP.

ADD REPLY • link 18 months ago SamGG ▴ 360

0

Entering edit mode

It is ! Thank you to mention this important inconsistency indeed!

ADD REPLY • link 18 months ago alexia.brunel • 0

0

Entering edit mode

Thank you so much for all your explanation! I did what you mention (with rma instead of gcrma), and filter the data with gene expression level at least at 7.5 in 3 samples (I guess it is a lot but according the results I've obtained with the histogram of the median intensities, I have no choice I think). After this filtering, I obtained 10 995 on which I perform limma analysis and it result at almost exactly same result than without filtering (= no DEG).

Histogram of the median intensities

ADD REPLY • link 18 months ago alexia.brunel • 0

0

Entering edit mode

I'd look at a PCA plot of your data, and probably a histogram of your p-values. It's likely there are no differences between the groups.

ADD REPLY • link 18 months ago James W. MacDonald 67k

0

Entering edit mode

I have 9 different conditions (2 different KO + ctrl, each time in 3 different mouse organ) so it is a lot, in some I have big differences and in other it is not so obvious

ADD REPLY • link 18 months ago alexia.brunel • 0

score 0 · Answer 2 · 2023-06-24

limma offers quite a few options that may increase statistical power, such as array weights, trended eBayes or robust eBayes. It would be worthwhile trying some of those in conjunction with the sort of filtering that James has described. Filter low expressed genes rather than SDs. It sounds like your samples are variable, so some data quality trouble-shooting might be in order, starting with plotMDS etc.