Hi,
I followed the instructions in the following post to filter out probesets in Affymetrix Mouse Gene 2.0 ST arrays based on the antigenomic probes, before the differential expression analysis with limma.
filtering probes in affymetrix data
One problem. I only have 6146 out of 41345 left after the filtering.
I calculated the 95 percentile of the antigenomic probes, and only kept probesets reaching that minimum threshold in at least 4 of my samples (size of my smallest group). The calculated threshold is 8.9
Is it normal to have so few probesets left? What am I doing wrong? I've done the MA plots and the clustering plots. There is a higher variation below 8.9, and the clustering is improved after the filtering, but there are so few genes left for the differential expression analysis. :( Can you propose a different method to calculate the threshold?
I've posted the code below.
Thank you for your help.
####################################################################
# Filter by using antigenomic probesets as a measure of background #
####################################################################
# Connect to the SQLite database
connection <- oligoClasses::dbpd.mogene.2.0.st)
# Select antigenomic probes (23 probes)
antigm <- DBI::dbGetQuery(connection, "select meta_fsetid from core_mps inner join featureSet on core_mps.fsetid=featureSet.fsetid where featureSet.type='4';")
# Calculate 95th quantile of antigenomic probes
bkg <- apply(exprs(eset)[as.character(antigm[,1]),], 2, quantile, probs=0.95)
minval <- max(bkg)
# Apply filter
ind <- genefilter::genefilter(eset, filterfun(kOverA(4, minval)))
eset.filtered<-eset[ind,]
Thank you.
Very helpful and clear explanation.
No wonder I had barely any probesets left after filtering.
I'm reluctant now to use the antigenomic probesets as a filtering criteria.
It's true that I could pick the antigenomic probeset with the most representative AGTC content.
I'm a bit hesitant though to try this technique since I haven't seen used in any publication.
Or has it been used before?
Given the number of people who analyze Mouse Gene 2.0 ST, as well as Human Gene 2.0 ST and Rat Gene 2.0 ST, is there no consensus on how to pick the threshold to use?
Or, is filtering completely unnecessary?
What is the most commonly used filtering approach before running limma? No filtering, filtering 20% less expressed transcripts, filtering 50% less expressed transcripts, ...
Maybe I should feel like the fool, since I was the one who first suggested it...
Anyway, I don't know of a consensus for filtering. I tend to go with simple naive things that can be readily explained in a manuscript. So if I have three groups of 5 arrays each, I would likely do something like
thereby requiring that at least five of my samples for a given gene have an expression level greater than 5. This doesn't bias the filtering by requiring that a certain set of five samples exceed that threshold, but still requires that all the genes appear to be 'expressed' in a significant subset of the samples.
In the manuscript I would then tend to say something like 'Prior to making comparisons we excluded probesets that appeared to be unexpressed in all sample groups.' Or something.
And why use a cutoff of 5? A couple of reasons, both of which are ad hoc. First, if you look at the background probesets that are about 25% ACGT, the expression level is right around 5. Second, if you look at the distribution of the normalized data, the mode is around 5. If you assume that the data are a convolution of a lognormal distribution of probesets that are measuring background and an exponential of probesets that are measuring actual expression, then the mode of the convoluted distribution isn't a completely horrible thing to pick.
I would recommend using some sort of filter, as decreasing the multiplicity is always helpful, and getting rid of probesets that are likely to be more noise than signal can't hurt either.
Thank you.
I just wanted to say that I chose this threshold since the rationale for picking it made sense to me.
(I also deleted the line about feeling like a fool. It's just a learning process.)