I'm going to have to back up a bit on that recommendation. The antigenomic control probes go from almost pure AT to pure GC. Here are the sequences for the 'first' one:
[1] "AAATAATATAATTAAGAGCAATTTA" "AAAATGAATATAATTAGCTAATAAA"
[3] "TTAAATATTAATAACGCATAAAAAT" "AATAATTGAATATGCTATAAAATAA"
[5] "TAATTTTAATAAAGCTAAGATTATA" "TAAAAGCTAATAATATACATAAAAT"
[7] "ATATTATAATGAATTAAAGCAATTT" "ATAAAATAGTTAAAAAGCATTAATT"
[9] "TAAAATAGAAATATAGCATTTAAAT" "TTATTAATATTGAAAAAGCTTATAT"
[11] "TAATAAATAATTAGCTGATAAATTT" "TAAAATAGCTAAATAATATGATTAT"
[13] "TATATAAGCTATAATATTGAAAAAT" "AATAAAAGGCTATAATATATAAAAA"
[15] "TATTAATAATTAGATTAGCAAAATA" "ATTTAATAATATTTAGCGAAAAAAT"
[17] "TTTTTAGCTTAATGATATATATATA" "ATTTTATATTAAAAAGCAGAATAAT"
[19] "ATAAATAATGCTAATAGAAAATAAA" "TTATAATATGAATTTAAAGCTAAAT"
[21] "ATTTAAGCTAAAAAATAGATATTAT" "AATAAATATTTAAATAGCGAAATTT"
[23] "ATTTAATAATTTGAAAAGCTAATAT" "TTATTAGCAATATTAGTTAAAATTA"
[25] "TTTTAAAGCTAATTAGTAATAAATA"
And here are the sequences for the 'last' one:
[1] "CGGCCGGGGGGGCCCGCCGCGCGCG" "CGCGCGCGGGGGGGGCCGGCCGCCG"
[3] "GCGCGGCCGGGGGGGCGCGCGGCGG" "GCGCCGCCGCCCGGGGGCGCCGGCG"
[5] "CGGCGGCCCGCCGGGCGGGCGCGGG" "CGGCGCGGGGGGGCCGCCGCGCGCG"
[7] "CGGCGCGCCGCGGGGCCGCGGGGGC" "CGCGCCCGGGGGGGCCCGCCGCGCG"
[9] "GCGCGCCGCGCGGGGGGGCCGCGGG"
If you do something like

matplot(exprs(eset)[as.character(antigm[,1]),], type = "l")

you will get a plot with the background probesets on the horizontal axis (from 1 to 23) and the summarized expression values on the vertical axis. You can see that the binding goes from really low to quite high, which is sort of expected for high-GC probes. Rafael Irizarry showed years ago that increases in GC content tend to result in increases in background binding, and evidently pure GC probes will bind to just about anything.
Since the antigenomic probes measure background binding at a certain percentage of GC content, you cannot just use them as naively as I suggested. The average percent of ACGT for each antigenomic probeset looks like this:
A C G T
[1,] 0.52160000 0.04320000 0.0768000 0.35840000
[2,] 0.45888199 0.06894410 0.0910559 0.38111801
[3,] 0.40887624 0.09917496 0.1008250 0.39112376
[4,] 0.38130584 0.11940435 0.1205956 0.37869416
[5,] 0.35789934 0.14026258 0.1397374 0.36210066
[6,] 0.34029787 0.16200000 0.1580000 0.33970213
[7,] 0.32158498 0.18002086 0.1799791 0.31841502
[8,] 0.29991597 0.19957983 0.2004202 0.30008403
[9,] 0.28054167 0.21625000 0.2237500 0.27945833
[10,] 0.25923947 0.23864337 0.2413566 0.26076053
[11,] 0.24268595 0.26053719 0.2594628 0.23731405
[12,] 0.21816667 0.27987500 0.2801250 0.22183333
[13,] 0.20252898 0.30048472 0.2995153 0.19747102
[14,] 0.17985462 0.32303219 0.3169678 0.18014538
[15,] 0.16114650 0.33503185 0.3449682 0.15885350
[16,] 0.14144737 0.35833333 0.3616667 0.13855263
[17,] 0.12098940 0.37955241 0.3804476 0.11901060
[18,] 0.09845018 0.39483395 0.4051661 0.10154982
[19,] 0.07942611 0.41319943 0.4268006 0.08057389
[20,] 0.05894017 0.42352137 0.4564786 0.06105983
[21,] 0.04245700 0.43626536 0.4837346 0.03754300
[22,] 0.02164179 0.44761194 0.5123881 0.01835821
[23,] 0.00000000 0.40888889 0.5911111 0.00000000
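As an aside, a table like that can be recomputed directly from the probe sequences. Here is a minimal sketch in base R, assuming `seqs` is a character vector holding the 25-mer probe sequences for one antigenomic probeset (the function name and input are mine, for illustration only):

```r
## Average ACGT composition for one probeset, given its probe sequences.
base_freq <- function(seqs) {
  per_probe <- t(sapply(strsplit(seqs, ""), function(x)
    table(factor(x, levels = c("A", "C", "G", "T"))) / length(x)))
  colMeans(per_probe)
}

## e.g., using the first two probes of the 'first' probeset above
base_freq(c("AAATAATATAATTAAGAGCAATTTA", "AAAATGAATATAATTAGCTAATAAA"))
```

Running that over each antigenomic probeset's probes would reproduce the rows of the table above.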
So you could hypothetically choose your cutoff based on the probeset(s) with the most 'representative' ACGT content. But simply using the 95th percentile is not likely to be a reasonable thing to do.
Thank you.
Very helpful and clear explanation.
No wonder I had barely any probesets left after filtering.
I'm reluctant now to use the antigenomic probesets as a filtering criterion.
It's true that I could pick the antigenomic probeset with the most representative ACGT content.
I'm a bit hesitant to try this technique, though, since I haven't seen it used in any publication.
Or has it been used before?
Given the number of people who analyze Mouse Gene 2.0 ST, as well as Human Gene 2.0 ST and Rat Gene 2.0 ST, is there no consensus on how to pick the threshold to use?
Or, is filtering completely unnecessary?
What is the most commonly used filtering approach before running limma? No filtering, filtering out the 20% least expressed transcripts, filtering out the 50% least expressed transcripts, ...?
Maybe I should feel like the fool, since I was the one who first suggested it...
Anyway, I don't know of a consensus for filtering. I tend to go with simple naive things that can be readily explained in a manuscript. So if I have three groups of 5 arrays each, I would likely do something like
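Something along these lines, assuming `eset` is the normalized ExpressionSet (the object and variable names here are my own):

```r
## Keep probesets for which at least 5 of the 15 samples have a
## normalized (log2) expression value above 5 -- any 5 samples,
## not a fixed set of 5.
keep <- rowSums(exprs(eset) > 5) >= 5
eset.filt <- eset[keep, ]
```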
thereby requiring that, for a given gene, at least five of my samples have an expression level greater than 5. This doesn't bias the filtering by requiring that a particular set of five samples exceed the threshold, but it still requires that every retained gene appears to be 'expressed' in a significant subset of the samples.
In the manuscript I would then tend to say something like 'Prior to making comparisons we excluded probesets that appeared to be unexpressed in all sample groups.' Or something.
And why use a cutoff of 5? A couple of reasons, both of which are ad hoc. First, if you look at the background probesets that are about 25% each of A, C, G, and T, their expression level is right around 5. Second, if you look at the distribution of the normalized data, the mode is around 5. If you assume the data are a convolution of a lognormal distribution of probesets that are measuring background and an exponential distribution of probesets that are measuring actual expression, then the mode of the convolved distribution isn't a completely horrible cutoff to pick.
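If you want to check that second point on your own data, the mode is easy to locate from a kernel density estimate; a quick sketch, again assuming `eset` holds the normalized values:

```r
## Locate the mode of the pooled normalized log2 intensities;
## for these arrays it should land somewhere around 5.
d <- density(as.vector(exprs(eset)))
d$x[which.max(d$y)]
```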
I would recommend using some sort of filter, as decreasing the multiplicity is always helpful, and getting rid of probesets that are likely to be more noise than signal can't hurt either.
Thank you.
I just wanted to say that I chose this threshold since the rationale for picking it made sense to me.
(I also deleted the line about feeling like a fool. It's just a learning process.)