Hi all,
I am trying to make P/A calls at probeset level for Mouse gene ST 2.1 array using Oligo.
Here is what I did for probeset level P/A calls:
celFiles <- list.celfiles()
affyRaw <- read.celfiles(celFiles)
genePS <- rma(affyRaw, target = "probeset")
genePS_dabg <- paCalls(affyRaw,"PSDABG")
genePS_dabg_call = exprs(genePS_dabg)
But when I ran "genePS_dabg_call = exprs(genePS_dabg)", an error came out. It was written as
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘exprs’ for signature ‘"matrix"’
What's the problem here?
Vincent
I get your idea, James. That's why I'm now artificially assigning P/M/A to genePS_dabg by writing
genePS_dabg_call = ifelse(genePS_dabg<=0.06,ifelse(genePS_dabg<0.04,"P","M"),"A").
The values used for assigning P/M/A are based on mas5calls function.
But I'm not sure if paCalls uses the same algorithm to calculate the same p values as mas5calls does. If not, I doubt if I could use the same values to assign P/M/A.
Actually, I also want to try xps package to get the P/M/A results from my ST array but I could not install it in my computer :(
The implementation in oligo is different from mas5calls() in affy. But so is (by definition) the implementation in xps. This is because you are using PM-only arrays.
With mas5calls(), each PM probe had a corresponding MM probe, so it was a simple exercise of saying 'do the PM probes appear in aggregate to have a higher binding than MM probes?'. But the Gene ST arrays don't have MM probes, so what you have to do is find some background probes that are similar in some respect to the PM probes, and then make the test.
Both oligo and xps are implementing Affymetrix's DABG algorithm (detected above background), which 'bins' the foreground and background probes, based on GC content, and then tests to see if the foreground probes in a given bin are arguably binding at a higher level than the background probes. In oligo, DABG does this at the probe level, and PSDABG collapses to the probeset (exon, or PSR) level. There is nothing in oligo that collapses to the transcript level, but xps will compute calls at probe, probeset, and transcript level.
Hi James,
How could I know which probes are background probes?
Concerning DABG and PSDABG algorithms, you said DABG analyzes data in probe level while PSDABG analyzes data in probeset level. Actually, I'm not very clear about the differences about probe, probeset, and transcript levels. Could you clarify that?
I know a transcript cluster of a gene contains a number of probesets. And one probeset contains a number of probes. So when I try to analyze a gene at the transcript level, does it mean I'm grouping all the signals from all the probesets of that gene into one value and that value represents the expression level of that particular gene?
Does one probeset of a gene cover only one particular exon of that gene?
Thank you very much!
Vincent
The background probes are the 'antigenomic' set. I don't have pd.mogene.2.1.st installed, so I will show you using the 2.0 version.
The control->bgp->antigenomic probes are those that are used for testing background, as they have varying amounts of GC content, and are not expected to be found in the mouse genome.
And you are exactly right about probes, probesets and transcripts. Probes are individual 25-mers. Probesets are intended to measure 'probe set regions' (PSRs), which are part of an exon (or the whole thing, if it is short). Transcripts are where all the PSRs for a given gene are piled into one probeset and summarized to give an aggregate measure of the expression for that gene.
I would only recommend using the transcript (core) level for the Gene ST arrays. Here is the distribution of probes/probeset at the PSR level:
Or more to the point
So 81% of the PSR level probesets have < 4 probes (on the Exon ST arrays, the PSRs have mostly 4 probes/probeset, so we don't really expect more than that.). I'm personally not enthused with RMA on 4 probes/probeset, let alone 3 or fewer.
Dear James,
I have read through the Q&A you were discussing with Juan (paCalls oligo package) and I find you guys are discussing the use of "psdabg" and "dabg" in paCalls().
Am I right to say that rma (target = "core") must be used with "psdabg" in paCalls() while rma(target = "probeset") must be used with "dabg"?
If yes, why do they have to be in these combinations? "target = probeset" summarizes data in probeset level but dabg calculates p values in probe level. They are in two different levels. How can I use the p values using dabg to judge if that particular probeset is present or not in a particular RNA sample?
Moreover, you mentioned
"Personally I wouldn't use rma(target = "probeset") for the Gene ST arrays, because tons of the probesets only have one probe at that summarization level."
What does it mean? What is the problem of having only one probe at that summarization level?
Also, you suggested these command lines, which are
N <- 5 ## or whatever
ind <- apply(dagbPS, 1, function(x) sum(x < 0.05) > N)
What exactly is N? For example, in my own microarray analysis, I would like to compare the transcriptomes of 4 types of mice. One type is wild-type control mice. The second type is wild-type mice drug treatment. The third type is disease control mice. And the last type is disease mice with the same drug treatment. And each type has 3 samples. So there are a total of 12 samples.
So what value should I put in N?
Also, I don't quite understand this command line "ind <- apply(dagbPS, 1, function(x) sum(x < 0.05) > N)".
What do you mean by function(x) sum (x<0.05) > N?
I am sorry to ask so many questions. Thank you very much in advance.
Best regards,
Vincent
You would use PSDABG with rma(target = "probeset"). There is no paCalls() for the transcript level summarization in oligo. If you want that, you must use xps.
The reason there are multiple probes/probeset is because Affy realized early on that a 25-mer isn't really long enough give reliable results. By using multiple 25-mers, the idea was that in aggregate you would get a more accurate estimate of gene expression than what you would get from just one 25-mer. The original Affy arrays had 16 probes/probeset, which they reduced to 11 for the last set of 3'-biased arrays, and down to 4 for the random primer arrays (of which the Exon arrays were the first).
The RMA algorithm is intended to estimate (and then ignore) the probe-specific binding. But if you only have a small number of probes/probeset it is difficult to correctly account for the probe-specific binding, so the expression values are likely to still be highly confounded with technical attributes of the individual probes. If you want to do things at the exon level, get an Exon ST array.
In general when you are filtering out probesets, the goal is to do so in an unbiased manner. One way to do that is to say that the expression of a given gene has to be greater than a particular cutoff for at least N of the arrays (where N is the number of samples in the smallest group you have). You aren't saying that a particular set of arrays have to be expressed (which would introduce bias), but you are requiring that at least some of your samples are expressing the gene.
As for the code, this is where you earn your keep. If you plan on using R, the best way to do so is to look at other people's code and figure out what it is doing. I have already told you what it is supposed to be doing, so now you can figure out how.
Hi James,
I used the transcript level for analysis before but later when I checked different probesets representing the same gene, I often found significant differences between them in terms of the signals of the probesets. And I'm afraid if I simply analyze the Gene ST array in transcript level, some significant changes of some probesets would be masked. Moreover, when I tried to perform qPCR to validate microarray data, it's difficult to decide a specific region for amplification if I use the transcript level. But I could specifically design primers to amplify a particular region covered by a probeset. That's why I now switch to probeset level for analysis.
Concerning the background probes, do you mean I could make use of these probes to run mas5calls function? I guess R won't allow me to do so because it would recognize my cel files are from ST arrays instead of expression array. Isn't it?
Vincent