I'm going to have to back up a bit on that recommendation. The antigenomic control probes go from almost pure AT to pure GC. Here are the sequences for the 'first' one:
[1] "AAATAATATAATTAAGAGCAATTTA" "AAAATGAATATAATTAGCTAATAAA"
[3] "TTAAATATTAATAACGCATAAAAAT" "AATAATTGAATATGCTATAAAATAA"
[5] "TAATTTTAATAAAGCTAAGATTATA" "TAAAAGCTAATAATATACATAAAAT"
[7] "ATATTATAATGAATTAAAGCAATTT" "ATAAAATAGTTAAAAAGCATTAATT"
[9] "TAAAATAGAAATATAGCATTTAAAT" "TTATTAATATTGAAAAAGCTTATAT"
[11] "TAATAAATAATTAGCTGATAAATTT" "TAAAATAGCTAAATAATATGATTAT"
[13] "TATATAAGCTATAATATTGAAAAAT" "AATAAAAGGCTATAATATATAAAAA"
[15] "TATTAATAATTAGATTAGCAAAATA" "ATTTAATAATATTTAGCGAAAAAAT"
[17] "TTTTTAGCTTAATGATATATATATA" "ATTTTATATTAAAAAGCAGAATAAT"
[19] "ATAAATAATGCTAATAGAAAATAAA" "TTATAATATGAATTTAAAGCTAAAT"
[21] "ATTTAAGCTAAAAAATAGATATTAT" "AATAAATATTTAAATAGCGAAATTT"
[23] "ATTTAATAATTTGAAAAGCTAATAT" "TTATTAGCAATATTAGTTAAAATTA"
[25] "TTTTAAAGCTAATTAGTAATAAATA"
And here are the sequences for the 'last' one:
[1] "CGGCCGGGGGGGCCCGCCGCGCGCG" "CGCGCGCGGGGGGGGCCGGCCGCCG"
[3] "GCGCGGCCGGGGGGGCGCGCGGCGG" "GCGCCGCCGCCCGGGGGCGCCGGCG"
[5] "CGGCGGCCCGCCGGGCGGGCGCGGG" "CGGCGCGGGGGGGCCGCCGCGCGCG"
[7] "CGGCGCGCCGCGGGGCCGCGGGGGC" "CGCGCCCGGGGGGGCCCGCCGCGCG"
[9] "GCGCGCCGCGCGGGGGGGCCGCGGG"
If you do something like

matplot(exprs(eset)[as.character(antigm[,1]),], type = "l")

you will get a plot with the background probesets on the horizontal axis (from 1 to 23) and the summarized expression values on the vertical axis. You can see that the binding goes from really low to quite high, which is sort of expected for high-GC probes. Rafael Irizarry showed years ago that increases in GC content tend to result in increases in background binding, and evidently pure GC probes will bind to just about anything.
Since the antigenomic probes measure background binding at a certain percentage of GC content, you cannot just use them as naively as I suggested. The average percent of ACGT for each antigenomic probeset looks like this:
A C G T
[1,] 0.52160000 0.04320000 0.0768000 0.35840000
[2,] 0.45888199 0.06894410 0.0910559 0.38111801
[3,] 0.40887624 0.09917496 0.1008250 0.39112376
[4,] 0.38130584 0.11940435 0.1205956 0.37869416
[5,] 0.35789934 0.14026258 0.1397374 0.36210066
[6,] 0.34029787 0.16200000 0.1580000 0.33970213
[7,] 0.32158498 0.18002086 0.1799791 0.31841502
[8,] 0.29991597 0.19957983 0.2004202 0.30008403
[9,] 0.28054167 0.21625000 0.2237500 0.27945833
[10,] 0.25923947 0.23864337 0.2413566 0.26076053
[11,] 0.24268595 0.26053719 0.2594628 0.23731405
[12,] 0.21816667 0.27987500 0.2801250 0.22183333
[13,] 0.20252898 0.30048472 0.2995153 0.19747102
[14,] 0.17985462 0.32303219 0.3169678 0.18014538
[15,] 0.16114650 0.33503185 0.3449682 0.15885350
[16,] 0.14144737 0.35833333 0.3616667 0.13855263
[17,] 0.12098940 0.37955241 0.3804476 0.11901060
[18,] 0.09845018 0.39483395 0.4051661 0.10154982
[19,] 0.07942611 0.41319943 0.4268006 0.08057389
[20,] 0.05894017 0.42352137 0.4564786 0.06105983
[21,] 0.04245700 0.43626536 0.4837346 0.03754300
[22,] 0.02164179 0.44761194 0.5123881 0.01835821
[23,] 0.00000000 0.40888889 0.5911111 0.00000000
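As an aside, a table like that can be recomputed directly from the probe sequences. Here is a minimal sketch in base R, assuming `seqs` is a character vector holding the 25-mer probe sequences for one antigenomic probeset (the function name and input are mine, for illustration only):

```r
## Average ACGT composition for one probeset, given its probe sequences.
base_freq <- function(seqs) {
  per_probe <- t(sapply(strsplit(seqs, ""), function(x)
    table(factor(x, levels = c("A", "C", "G", "T"))) / length(x)))
  colMeans(per_probe)
}

## e.g., using the first two probes of the 'first' probeset above
base_freq(c("AAATAATATAATTAAGAGCAATTTA", "AAAATGAATATAATTAGCTAATAAA"))
```

Running that over each antigenomic probeset's probes would reproduce the rows of the table above.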
So you could hypothetically choose your cutoff based on the probeset(s) with the most 'representative' ACGT content. But simply using the 95th percentile is not likely to be a reasonable thing to do.
Thank you.
Very helpful and clear explanation.
No wonder I had barely any probesets left after filtering.
I'm reluctant now to use the antigenomic probesets as a filtering criterion.
It's true that I could pick the antigenomic probeset with the most representative ACGT content.
I'm a bit hesitant to try this technique, though, since I haven't seen it used in any publication.
Or has it been used before?
Given the number of people who analyze Mouse Gene 2.0 ST, as well as Human Gene 2.0 ST and Rat Gene 2.0 ST, is there no consensus on how to pick the threshold to use?
Or, is filtering completely unnecessary?
What is the most commonly used filtering approach before running limma? No filtering, filtering out the 20% least expressed transcripts, filtering out the 50% least expressed transcripts, ...?
Maybe I should feel like the fool, since I was the one who first suggested it...
Anyway, I don't know of a consensus for filtering. I tend to go with simple naive things that can be readily explained in a manuscript. So if I have three groups of 5 arrays each, I would likely do something like
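Something along these lines, assuming `eset` is the normalized ExpressionSet (the object and variable names here are my own):

```r
## Keep probesets for which at least 5 of the 15 samples have a
## normalized (log2) expression value above 5 -- any 5 samples,
## not a fixed set of 5.
keep <- rowSums(exprs(eset) > 5) >= 5
eset.filt <- eset[keep, ]
```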
thereby requiring that, for a given gene, at least five of my samples have an expression level greater than 5. This doesn't bias the filtering by requiring that a particular set of five samples exceed the threshold, but it still requires that every retained gene appears to be 'expressed' in a significant subset of the samples.
In the manuscript I would then tend to say something like 'Prior to making comparisons we excluded probesets that appeared to be unexpressed in all sample groups.' Or something.
And why use a cutoff of 5? A couple of reasons, both of which are ad hoc. First, if you look at the background probesets that are about 25% each of A, C, G, and T, their expression level is right around 5. Second, if you look at the distribution of the normalized data, the mode is around 5. If you assume the data are a convolution of a lognormal distribution of probesets that are measuring background and an exponential distribution of probesets that are measuring actual expression, then the mode of the convolved distribution isn't a completely horrible cutoff to pick.
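If you want to check that second point on your own data, the mode is easy to locate from a kernel density estimate; a quick sketch, again assuming `eset` holds the normalized values:

```r
## Locate the mode of the pooled normalized log2 intensities;
## for these arrays it should land somewhere around 5.
d <- density(as.vector(exprs(eset)))
d$x[which.max(d$y)]
```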
I would recommend using some sort of filter, as decreasing the multiplicity is always helpful, and getting rid of probesets that are likely to be more noise than signal can't hurt either.
Thank you.
I just wanted to say that I chose this threshold since the rationale for picking it made sense to me.
(I also deleted the line about feeling like a fool. It's just a learning process.)