Analysis of Affymetrix Mouse Gene 2.0 ST arrays
@ablanchetcohen-7697

Hi,

I followed the instructions in the following post to filter out probesets in Affymetrix Mouse Gene 2.0 ST arrays based on the antigenomic probes, before the differential expression analysis with limma.

filtering probes in affymetrix data

One problem: I only have 6,146 of 41,345 probesets left after the filtering.

I calculated the 95th percentile of the antigenomic probesets, and only kept probesets reaching that minimum threshold in at least 4 of my samples (the size of my smallest group). The calculated threshold is 8.9.

Is it normal to have so few probesets left? What am I doing wrong? I've done the MA plots and the clustering plots. There is a higher variation below 8.9, and the clustering is improved after the filtering, but there are so few genes left for the differential expression analysis. :( Can you propose a different method to calculate the threshold?

I've posted the code below.

####################################################################
# Filter by using antigenomic probesets as a measure of background #
####################################################################
# Load the annotation package and connect to its SQLite database
library(pd.mogene.2.0.st)
connection <- oligoClasses::db(pd.mogene.2.0.st)

# Select the antigenomic probesets (23 probesets)
antigm <- DBI::dbGetQuery(connection, "select meta_fsetid from core_mps inner join featureSet on core_mps.fsetid=featureSet.fsetid where featureSet.type='4';")

# Calculate the 95th percentile of the antigenomic probesets in each sample,
# and use the maximum across samples as the background threshold
bkg <- apply(exprs(eset)[as.character(antigm[, 1]), ], 2, quantile, probs = 0.95)
minval <- max(bkg)

# Apply filter: keep probesets above the threshold in at least 4 samples
library(genefilter)
ind <- genefilter(eset, filterfun(kOverA(4, minval)))
eset.filtered <- eset[ind, ]

@james-w-macdonald-5106

I'm going to have to back up a bit on that recommendation. The antigenomic control probes go from almost pure AT to pure GC. Here are the sequences for the 'first' one:

 [1] "AAATAATATAATTAAGAGCAATTTA" "AAAATGAATATAATTAGCTAATAAA"
[3] "TTAAATATTAATAACGCATAAAAAT" "AATAATTGAATATGCTATAAAATAA"
[5] "TAATTTTAATAAAGCTAAGATTATA" "TAAAAGCTAATAATATACATAAAAT"
[7] "ATATTATAATGAATTAAAGCAATTT" "ATAAAATAGTTAAAAAGCATTAATT"
[9] "TAAAATAGAAATATAGCATTTAAAT" "TTATTAATATTGAAAAAGCTTATAT"
[11] "TAATAAATAATTAGCTGATAAATTT" "TAAAATAGCTAAATAATATGATTAT"
[13] "TATATAAGCTATAATATTGAAAAAT" "AATAAAAGGCTATAATATATAAAAA"
[15] "TATTAATAATTAGATTAGCAAAATA" "ATTTAATAATATTTAGCGAAAAAAT"
[17] "TTTTTAGCTTAATGATATATATATA" "ATTTTATATTAAAAAGCAGAATAAT"
[19] "ATAAATAATGCTAATAGAAAATAAA" "TTATAATATGAATTTAAAGCTAAAT"
[21] "ATTTAAGCTAAAAAATAGATATTAT" "AATAAATATTTAAATAGCGAAATTT"
[23] "ATTTAATAATTTGAAAAGCTAATAT" "TTATTAGCAATATTAGTTAAAATTA"
[25] "TTTTAAAGCTAATTAGTAATAAATA"

And here are the sequences for the 'last' one:

[1] "CGGCCGGGGGGGCCCGCCGCGCGCG" "CGCGCGCGGGGGGGGCCGGCCGCCG"
[3] "GCGCGGCCGGGGGGGCGCGCGGCGG" "GCGCCGCCGCCCGGGGGCGCCGGCG"
[5] "CGGCGGCCCGCCGGGCGGGCGCGGG" "CGGCGCGGGGGGGCCGCCGCGCGCG"
[7] "CGGCGCGCCGCGGGGCCGCGGGGGC" "CGCGCCCGGGGGGGCCCGCCGCGCG"
[9] "GCGCGCCGCGCGGGGGGGCCGCGGG"

If you do something like

matplot(exprs(eset)[as.character(antigm[, 1]), ], type = "l")

you will get a plot with the antigenomic probesets on the horizontal axis (1 to 23) and the summarized expression values on the vertical axis. You can see that the binding goes from really low to quite high, which is sort of expected with high-GC probes. Rafael Irizarry showed years ago that increases in GC content tend to result in increases in background binding, and evidently if you have pure GC probes they will bind to just about anything.

Since each antigenomic probeset measures background binding at a certain percentage of GC content, you cannot just use them as naively as I suggested. The average proportion of A, C, G and T for each antigenomic probeset looks like this:

              A          C         G          T
[1,] 0.52160000 0.04320000 0.0768000 0.35840000
[2,] 0.45888199 0.06894410 0.0910559 0.38111801
[3,] 0.40887624 0.09917496 0.1008250 0.39112376
[4,] 0.38130584 0.11940435 0.1205956 0.37869416
[5,] 0.35789934 0.14026258 0.1397374 0.36210066
[6,] 0.34029787 0.16200000 0.1580000 0.33970213
[7,] 0.32158498 0.18002086 0.1799791 0.31841502
[8,] 0.29991597 0.19957983 0.2004202 0.30008403
[9,] 0.28054167 0.21625000 0.2237500 0.27945833
[10,] 0.25923947 0.23864337 0.2413566 0.26076053
[11,] 0.24268595 0.26053719 0.2594628 0.23731405
[12,] 0.21816667 0.27987500 0.2801250 0.22183333
[13,] 0.20252898 0.30048472 0.2995153 0.19747102
[14,] 0.17985462 0.32303219 0.3169678 0.18014538
[15,] 0.16114650 0.33503185 0.3449682 0.15885350
[16,] 0.14144737 0.35833333 0.3616667 0.13855263
[17,] 0.12098940 0.37955241 0.3804476 0.11901060
[18,] 0.09845018 0.39483395 0.4051661 0.10154982
[19,] 0.07942611 0.41319943 0.4268006 0.08057389
[20,] 0.05894017 0.42352137 0.4564786 0.06105983
[21,] 0.04245700 0.43626536 0.4837346 0.03754300
[22,] 0.02164179 0.44761194 0.5123881 0.01835821
[23,] 0.00000000 0.40888889 0.5911111 0.00000000

So you could hypothetically choose your cutoff based on the probeset(s) with the most 'representative' ACGT content. But simply using the 95th percentile of all the antigenomic probesets is not likely to be a reasonable thing to do.
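As a rough sketch of that idea (my own illustration, not a validated procedure): re-enter the composition table above as a matrix, find the probeset closest to a uniform 25% of each base, and derive the background cutoff from that probeset alone. Everything below is base R; `comp` is just the table transcribed.

```r
# Average base composition of the 23 antigenomic probesets (table above)
comp <- matrix(c(
  0.52160000, 0.04320000, 0.0768000, 0.35840000,
  0.45888199, 0.06894410, 0.0910559, 0.38111801,
  0.40887624, 0.09917496, 0.1008250, 0.39112376,
  0.38130584, 0.11940435, 0.1205956, 0.37869416,
  0.35789934, 0.14026258, 0.1397374, 0.36210066,
  0.34029787, 0.16200000, 0.1580000, 0.33970213,
  0.32158498, 0.18002086, 0.1799791, 0.31841502,
  0.29991597, 0.19957983, 0.2004202, 0.30008403,
  0.28054167, 0.21625000, 0.2237500, 0.27945833,
  0.25923947, 0.23864337, 0.2413566, 0.26076053,
  0.24268595, 0.26053719, 0.2594628, 0.23731405,
  0.21816667, 0.27987500, 0.2801250, 0.22183333,
  0.20252898, 0.30048472, 0.2995153, 0.19747102,
  0.17985462, 0.32303219, 0.3169678, 0.18014538,
  0.16114650, 0.33503185, 0.3449682, 0.15885350,
  0.14144737, 0.35833333, 0.3616667, 0.13855263,
  0.12098940, 0.37955241, 0.3804476, 0.11901060,
  0.09845018, 0.39483395, 0.4051661, 0.10154982,
  0.07942611, 0.41319943, 0.4268006, 0.08057389,
  0.05894017, 0.42352137, 0.4564786, 0.06105983,
  0.04245700, 0.43626536, 0.4837346, 0.03754300,
  0.02164179, 0.44761194, 0.5123881, 0.01835821,
  0.00000000, 0.40888889, 0.5911111, 0.00000000
), ncol = 4, byrow = TRUE, dimnames = list(NULL, c("A", "C", "G", "T")))

# Squared distance of each probeset's composition from uniform (25% each)
d2 <- rowSums((comp - 0.25)^2)
best <- which.min(d2)   # the most 'representative' probeset (row 10 here)
```

You could then take the 95th percentile of just that probeset, e.g. something like `quantile(exprs(eset)[as.character(antigm[best, 1]), ], 0.95)` — hypothetical, and it assumes the ordering of `antigm` matches the table above.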


Thank you.

No wonder I had barely any probesets left after filtering.

I'm now reluctant to use the antigenomic probesets as a filtering criterion.
It's true that I could pick the antigenomic probeset with the most representative ACGT content,
but I'm hesitant to try this technique, since I haven't seen it used in any publication.
Or has it been used before?

Given the number of people who analyze Mouse Gene 2.0 ST, as well as Human Gene 2.0 ST and Rat Gene 2.0 ST, is there no consensus on how to pick the threshold? Or is filtering completely unnecessary?

What is the most commonly used filtering approach before running limma? No filtering, filtering out the 20% least-expressed transcripts, filtering out the 50% least-expressed transcripts, ...?


Maybe I should feel like the fool, since I was the one who first suggested it...

Anyway, I don't know of a consensus for filtering. I tend to go with simple naive things that can be readily explained in a manuscript. So if I have three groups of 5 arrays each, I would likely do something like

ind <- rowSums(exprs(eset) > 5) >= 5
eset.filtered <- eset[ind, ]

thereby requiring that, for a given gene, at least five of my samples have an expression level greater than 5. This doesn't bias the filtering by requiring that a particular set of five samples exceed the threshold, but it still requires that every retained gene appears to be 'expressed' in a significant subset of the samples.
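To make the semantics concrete, here is a toy example (invented numbers, not real array data) showing which rows that `rowSums` filter keeps:

```r
# Toy matrix: 3 'genes' x 6 'samples' (made-up values for illustration)
m <- rbind(g1 = c(6, 7, 6, 8, 7, 6),   # above 5 in all six samples
           g2 = c(2, 3, 6, 7, 8, 9),   # above 5 in only four samples
           g3 = c(2, 3, 2, 4, 3, 2))   # never above 5
ind <- rowSums(m > 5) >= 5
ind   # g1 TRUE, g2 FALSE, g3 FALSE
```

Note that g2 would be kept under a 'at least 4 samples' rule, which is why the k in kOverA is usually matched to the smallest group size.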

In the manuscript I would then tend to say something like 'Prior to making comparisons we excluded probesets that appeared to be unexpressed in all sample groups.' Or something.

And why use a cutoff of 5? A couple of reasons, both of which are ad hoc. First, if you look at the antigenomic probesets that are about 25% each of A, C, G and T, the expression level is right around 5. Second, if you look at the distribution of the normalized data, the mode is around 5. If you assume that the data are a convolution of a lognormal distribution of probesets that are measuring background and an exponential distribution of probesets that are measuring actual expression, then the mode of the convolved distribution isn't a completely horrible thing to pick.
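The 'mode of the distribution' point can be sketched with simulated data (the mixture below is invented for illustration; on the real data you would run `density()` on `exprs(eset)` instead):

```r
# Simulated log2 intensities: a background component centred near 5
# plus an exponential tail of genuinely expressed probesets
set.seed(1)
background <- rnorm(8000, mean = 5, sd = 0.7)
expressed  <- 5 + rexp(2000, rate = 0.5)
x <- c(background, expressed)

# The mode of the combined distribution sits near the background peak,
# which is the kind of ad hoc cutoff described above
d <- density(x)
cutoff <- d$x[which.max(d$y)]
```

With these made-up settings the estimated mode lands close to 5, the background centre.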

I would recommend using some sort of filter: decreasing the multiple-testing burden is always helpful, and getting rid of probesets that are likely to be more noise than signal can't hurt either.


Thank you.
I just wanted to say that I chose this threshold since the rationale for picking it made sense to me.
(I also deleted the line about feeling like a fool. It's just a learning process.)