Hi Jim,
I would like to thank you a lot for the deailed explanation. I think I
understand now better, not completely, but better, the differences
between
the PM-MM calculation and the reason to use the RMA algorithm.
cu
Assa
On Wed, Apr 18, 2012 at 16:10, James W. MacDonald <jmacdon@uw.edu>
wrote:
> Hi Assa,
>
>
> On 4/18/2012 9:37 AM, Assa Yeroslaviz wrote:
>
>> Hi Jim,
>>
>>
>> On Wed, Apr 18, 2012 at 15:10, James W. MacDonald
<jmacdon@uw.edu<mailto:>> jmacdon@uw.edu>> wrote:
>>
>> Hi Assa,
>>
>>
>> On 4/18/2012 2:33 AM, Assa Yeroslaviz wrote:
>>
>> Hi Jim,
>>
>>
>> On Tue, Apr 17, 2012 at 20:04, James W. MacDonald
>> <jmacdon@uw.edu <mailto:jmacdon@uw.edu="">
<mailto:jmacdon@uw.edu>> <mailto:jmacdon@uw.edu>>> wrote:
>>
>> Hi Assa,
>>
>>
>> On 4/17/2012 3:09 AM, Assa Yeroslaviz wrote:
>>
>> Hi Jim and bioC-Users,
>>
>>
>>
>> On Mon, Apr 16, 2012 at 17:54, James W. MacDonald
>> <jmacdon@uw.edu <mailto:jmacdon@uw.edu="">
<mailto:jmacdon@uw.edu>> <mailto:jmacdon@uw.edu>> <mailto:jmacdon@uw.edu>> <mailto:jmacdon@uw.edu>
>>
>>
>> <mailto:jmacdon@uw.edu <mailto:jmacdon@uw.edu="">>>> wrote:
>>
>> Hi Assa,
>>
>>
>> On 4/16/2012 9:05 AM, Assa Yeroslaviz wrote:
>>
>> Hi everybody,
>>
>> I have a question about the behavior of the
expresso
>> command when
>> extracting the raw data from an affyBatch.
>>
>> I wanted to evaluate the raw intensities
values of a
>> specific
>> gene from my
>> data set and tried to extract it like that:
>> rawdata<- expresso(totalExpressionData,
>> bg.correct=FALSE,
>> normalize=FALSE,
>>
pmcorrect.method="pmonly",
>> summary.method="avgdiff",
>> verbose=TRUE)
>>
>> I've got the result I wanted:
>> wt1 wt2 wt3 treat1
>> treat2 treat3
>> gene_at 125.5 101 123.5 52.5
63.5
>> 58
>>
>> The problem was that i expected them to be the
other
>> way around.
>>
>>
>> Why did you expect it to be the other way around?
What
>> exactly did
>> you expect this call to expresso() to do?
>>
>>
>> Sorry I didn't explain a bit more. The gene I was
>> investigating is the 'white' gene in Drosophila,
which sets
>> the eye color of the flies. It was used in our
>> experiment to
>> test the flies if they have a certain
>> knockout/mutant/treatment. By 'expect it to be the
>> other way
>> around' I meant that only the treated flies
should'veh ad a
>> high level of the white gene. In this data it is the
>> other way
>> around.
>> We are not sure, if the error is in the data or maybe
>> it was a
>> error made in the lab who analyzed the arrays and
made an
>> erroneous identification symbols for the arrays.
>>
>> That was a little background information about the
>> experiment.
>> I hope it clarify the problem better.
>>
>> Since affy is primarily based on S4 methods, it
can
>> be a bit
>> difficult to figure out what a given function is
>> going to
>> do, so I
>> can understand you not knowing what this call to
>> expresso() is
>> going to end up doing. However, what you are doing
>> is pretty
>> weird, no? The avgdiff method implies pm-mm, so
what
>> do you
>> expect
>> to happen if you then specify pmonly?
>>
>> Given that the pmcorrect.method controls how we
>> correct the PM
>> probes, and there is a subtractmm option, one
would
>> normally
>> assume that the 'difference' part of avgdiff might
>> happen
>> in that
>> step. But you said not to compute that, so all you
>> are left
>> with
>> is the 'avg' part of avgdiff.
>>
>>
>> Now for the point of my command. To see if I did
wrongly
>> classify the experiments or it was given to me as
such, I
>> wanted to extract the raw data from the arrays. As I
said
>> before, I know one can extract the PM and MM values
for
>> each
>> of the probes of a probe-set. What I really would
like
>> to have
>> is _one value for each probe-set and array each_.
>>
>>
>> For that reason I used the summarization of the
values with
>> the expresso command. I will try to answer your
>> questions as I
>> understand them.
>> I used the pmonly correction method, because if I
>> understood
>> it the right way, this is what the RMA algorithm is
>> using when
>> running the normalization procedure on the data. Am I
wrong
>> about that?
>>
>> Which way is the best way to extract the 'real' raw
values
>> from the arrays?
>> I know that asking such question get you never a
>> straightforward answer, but a near one will be good
:-)
>>
>>
>> Why do you need a raw value from the array to see if you
mixed
>> samples? To put the question a different way, if
normalized,
>> background corrected expression values are good enough
for
>> doing
>> statistics, why are they not good enough to determine if
>> you mixed
>> samples up?
>>
>>
>> I basically wanted to show the raw values to the biologists
in
>> the lab. The problem was, that they suggested that I somehow
>> might have switched the way of analyzing the data. I mean
they
>> thought, that instead of doing Treatment- Control I did
>> Control-Treatment, which would have easily explained the
>> results of the higher expression of this gene in the
control.
>>
>> By showing them the raw data, I can give them a proof, that
>> the switch wasn't made by me, but the arrays were switched,
or
>> that the gene is somehow truly down-regulated in the
treatment.
>>
>>
>>
>>
>> If my logical thinking is wrong, will this cmmand
will
>> present
>> me with a more accurate results?
>> rawdatasubtractmm <- expresso(totalExpressionData,
>> bg.correct=FALSE,
>> normalize=FALSE,
>>
>> pmcorrect.method="subtractmm",
>> summary.method="avgdiff",
>> verbose=TRUE)
>>
>> I will run it now and look for the differences.
>>
>>
>> But let's set the logical assumptions aside and
look
>> at the
>> actual
>> code. Going through the code to expresso() is a
bit like
>> following
>> Alice down the rabbit hole, so I will cut to the
>> chase. In the
>> end, you will be calling two pieces of code that
will
>> handle the
>> pm adjustment and the summary statistic
calculation. In
>> your call
>> to expresso() these will be (respectively):
>>
>> > pmcorrect.pmonly
>> function (object)
>> {
>> return(pm(object))
>> }
>>
>> and
>>
>> > generateExprVal.method.avgdiff
>> function (probes, ...)
>> {
>> list(exprs = apply(probes, 2, median), se.exprs
=
>> apply(probes,
>> 2, sd)/sqrt(nrow(probes)))
>> }
>>
>> So the pmonly will just give you the pm probes,
and then
>> avgdiff
>> will give you the column medians of the pm data.
>> Therefore you
>> have basically told expresso() that you want the
median
>> value for
>> the non-background corrected, unnormalized pm
probes.
>>
>>
>> In the vignette from affy it said to give back the
>> average of
>> the values, so I expected it to give me back the
mean,
>> not the
>> median. May be this is why I didn't really understood
the
>> results. Now it make more sense.
>>
>> Was that your intention?
>>
>>
>> As I said earlier, I wanted to extract the 'real'
>> values for
>> the probe-sets and I thought that extracting the PM
values
>> with no correction or normalization will be the best
>> method to
>> achieve it.
>> I ran now several other possibilities:
>> pm.correction summary.method gene wt1 wt2
>> wt3 treat1 treat2 treat3
>> mas mas gene_at 52.01982773 34.16503645
>> 46.87094917 4.081420059 9.505471139 3.538333392
>> pmonly avgdiff gene_at 125.5 101 123.5
>> 52.5
>> 63.5 58
>> pmonly medianpolish gene_at 6.862247387
>> 6.564551864 6.926547933 6.093298307 6.355810999
>> 6.105348101 <tel:6.105348101>
>>
>> subtractmm avgdiff gene_at 36 15 22 2 8 2.5
>> <tel:36%2015%2022%202%208%202.**5>
>>
>>
<tel:36%20%20%20%2015%20%20%**20%2022%20%20%20%202%20%20%20%**>> 208%20%20%20%202.5>
>>
>>
>>
>> This are the results.
>> The mas and the subtractmm gives me the differences
>> between PM
>> and MM. Though the results are quite different, both
of
>> them
>> gives me the same behavior. the expression in higher
in the
>> wild-type.
>> The medianpolish gives me the log2 values of the PM
values.
>> This results I don't understand, a I can clearly see
a
>> difference between the wt and the treatemnt, but the
>> log-scale
>> values show almost no difference at all.
>>
>>
>> Here is where our interpretations of the raw data
diverge.
>> I only
>> see a few probes in your raw data that have very
different
>> expression levels between the sample types. The only
probes
>> that
>> look to be binding at any appreciable level are the first
>> two, and
>> there really isn't any difference between them. In
addition, if
>> you believe that MM probes measure background, it doesn't
>> really
>> look like you have much if any of this transcript being
>> expressed
>> in either sample.
>>
>>
>> Sorry,but this here I don't understand.
>> I didn't present any probes from my data, but just the
results
>> of _one gene_ after running the expresso command with
>> _different options_.
>>
>> How can you say that: "I only see a few probes in your raw
>> data that have very different expression levels between the
>> sample types"?
>> I mean after analyzing the data , I think you are also
right,
>> it just wonder me how could you have known that.
>>
>>
>> What are the data you show just below? You said they were the PM
>> and MM probe data for the gene in question.
>>
>> Oh sorry my mistake.
>> But isn't it very strange to have such high expression intensities
in the
>> MM fraction (2nd group) and a lower intensities in the PM (1st
group).
>> Doesn't it suppose to be the other way around?
>>
>
> It is supposed to be the other way around. But that is based on the
> assumption that MM probes only measure background, and that there is
no
> noise in the system. Rafa Irizarry has shown in several papers that
MM
> probes actually do measure transcript abundance (just not as
efficiently as
> PM probes). And there is _always_ noise in the system, and at the
low end
> of the scale it becomes difficult to distinguish noise from signal,
as you
> see here.
>
>
>
>> If the MM fraction is so high, does it mean that the non-specific
binding
>> is so high. Is that a very good reason to ignore the MM completely
or it is
>> exactly the reason why one should consider running a MAS5 (and not
RMA, as
>> I am using) normalization on the data?
>>
>
> There are two good reasons to ignore the MM probes. As noted above,
they
> do actually measure transcript abundance, so you are subtracting
signal
> when you use PM-MM. In addition, there is variance involved in any
> measurement, and variances are additive. So the variance of PM-MM is
> _higher_ than the variance of PM alone. If we estimate background
based on
> all probes (as in RMA), the variance of the background measure will
be much
> lower than each individual MM probe, so when we do PM-bg, the
resulting
> variance is much lower than the PM-MM.
>
> An overall background measure will be biased however, so there is an
> argument that the background should be estimated in smaller bins of
probes,
> one such way being gcrma(). Another way is to estimate variance
using
> different estimators, depending on the intensity, which is what
vsnrma()
> does.
>
>
>
>>
>> I was just wondering here, why the medianpolish option shows
>> basically no differences at all, while the others three with
>> the mas or subtractmm options shows such a huge difference.
>> Can it be explained?
>>
>>
>> First, these aren't huge differences. Sure, if you do fold
changes
>> they are, but remember that these data range from 0 to 2^14 or
so
>> (hypothetically 2^16, but it is rare to get much over 2^14). A
>> difference of 40 at the very low end of that range is
>> indistinguishable from noise. Second, both mas5 and avgdiff
suffer
>> from the same problems; they assume MM probes accurately measure
>> just background (not true), and they ignore probe-specific
>> binding, which adds nuisance data to the result. Once you
account
>> for probe-specific binding, the differences largely disappear.
>>
>>
>> Is there a certain threshold, were you can say that the gene is
present
>> or absent? You said that only the top tow rows bind by an
appreciable
>> level. But what about probes 11-14? They are higher than average
and also
>> show a higher intensity in the PM (top) than the MM (bottom).
>>
>
> There is a difference between presence of a gene transcript (the
genes are
> always present; we are attempting to measure the transcript
abundance) and
> the intensity of the signal. You cannot look at the intensity of a
signal
> and say definitively that the transcript was or was not there. One
could
> imagine a scenario where the transcript was actually expressed, but
the
> probes designed to interrogate the transcript were so poorly
designed that
> you hardly get any signal. Alternatively one could imagine a
scenario where
> the transcript of interest was not being expressed, but another
similar
> transcript was being expressed at a high level. If the similar
transcript
> could bind at a low (but non-zero) rate, then you could get some
signal
> even though the transcript of interest wasn't there.
>
> Also, remember that a 25-mer is really bad at binding to unique
sequences,
> and the binding affinity is highly correlated to the GC content.
That is
> why a probeset has 11 or so of these 25-mers. The idea is that in
> aggregate, the signal from all 11 probes will be more accurate than
the
> individual probes themselves. By only looking at four probes in
particular,
> you are cherry picking your data to fulfill a preconceived notion of
what
> you think is happening, rather than looking at the data in aggregate
> (summarized in a reasonable manner).
>
> To me, summarizing using rma() is a reasonable thing to do, and when
you
> do that, there is no evidence that the gene is expressed at
different
> levels in the two sample types. End of story.
>
>
> wt1 wt2 wt3 treat1 treat2 treat3
>> PM
>> *probe11 293 321 260 444 618 408
>>
>> probe12 171 167 169 49 75 72
>> probe13 198 197 307 40 67 50
>> probe14 247 265 348 53 60 62*
>>
>> MM
>> probe11 54 83 55 47 64 46
>> probe12 106 98 76 52 66 53
>> probe13 43 48 37 36 52 39
>> probe14 94 92 99 43 43 49
>>
>> I ran a present-absent call (mas5calls) on these arrays to see
whether or
>> not these gene is considered to be present. It is not!
>>
>
> PA calls are not measuring the expression of the gene. They tell you
if
> the PM probes are significantly different from the MM probes. There
can be
> any number of reasons that PM and MM probes have the same
expression, only
> one of which is no expression of the gene.
>
> Best,
>
> Jim
>
>
> Can the reason for that be the fact, that it show a reasonable
level of
>> expression in only two from 14 probes in the probe-set?
>> Is it because the MM fraction shows a much higher intensity than
the PM?
>>
>> Thanks
>>
>> Assa
>>
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> Thanks,
>> Assa
>>
>>
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> I hope it make some more sense and I would be happy
to hear
>> some more suggestions/comments on that topic
>>
>> thanks a lot,
>> Assa
>>
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> So I decided to look into the specific probe
>> values of the
>> probes for this
>> probe-set.
>> This are the values I've got from the PM and
MM
>> respectively:
>> wt1 wt2 wt3 treat1 treat2
treat3
>> probe1 403 379 220 420 530
316
>> probe2 117 84 104 52 57 54
>> probe3 49 49 73 38 58 52
>> probe4 87 67 110 55 43 49
>> probe5 66 61 51 46 72 62
>> probe6 118 100 104 69 87 74
>> probe7 180 142 170 45 46 45
>> probe8 133 102 137 95 132 81
>> probe9 80 71 65 52 54 46
>> probe10 63 45 56 53 53 54
>> probe11 293 321 260 444 618
408
>> probe12 171 167 169 49 75 72
>> probe13 198 197 307 40 67 50
>> probe14 247 265 348 53 60 62
>>
>> probe1 533 519 294 507 739
404
>> probe2 1789 1271 1468 1430 1666
>> 1552
>> probe3 56 66 59 51 45 48
>> probe4 49 52 64 47 45 47
>> probe5 54 47 33 49 65 55
>> probe6 84 72 90 53 92 71
>> probe7 73 72 65 40 53 54
>> probe8 83 108 115 81 111 94
>> probe9 49 56 43 52 41 53
>> probe10 56 46 62 68 77 57
>> probe11 54 83 55 47 64 46
>> probe12 106 98 76 52 66 53
>> probe13 43 48 37 36 52 39
>> probe14 94 92 99 43 43 49
>>
>> When I calculate the average of these two
tables
>> for each
>> array I don't get
>> the same values as presented in the top table.
>> I would like to understand how from the values
>> on the
>> last two
>> tables I
>> come to a summarized value I get. Even if I
>> ignore the
>> MM values
>> completely, which I think it does, I still
don't
>> see how it
>> comes to these
>> values. The two probes (Nr. 1 and 11 of the PM
>> values) are
>> strongly differ
>> from the rest of the probes for this probe-
set. Are
>> they being
>> ignored in
>> the summarization?
>>
>> Thanks in advance
>>
>> Assa
>>
>> [[alternative HTML version deleted]]
>>
>>
______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor@r-project.org
<mailto:bioconductor@r-**project.org<bioconductor@r-project.org>
>> >
>> <mailto:bioconductor@r-**project.org <bioconductor@r-project.org="">
>> <mailto:bioconductor@r-**project.org <bioconductor@r-project.org="">
>> >>
>> <mailto:bioconductor@r-**project.org <bioconductor@r-project.org="">
>> <mailto:bioconductor@r-**project.org <bioconductor@r-project.org="">>
>>
>> <mailto:bioconductor@r-**project.org <bioconductor@r-project.org="">
>> <mailto:bioconductor@r-**project.org <bioconductor@r-project.org="">
>> >>>
>>
>>
https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.ethz.ch="" mailman="" listinfo="" bioconductor="">
>> Search the
>> archives:http://news.gmane.**
>> org/gmane.science.biology.**informatics.conductor<http: news.gmane="" .org="" gmane.science.biology.informatics.conductor="">
>>
>>
>> -- James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>>
>>
>> -- James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>>
>>
>> -- James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>>
>>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
>
>
[[alternative HTML version deleted]]