Hi everybody.
As a newcomer to bioinformatics, I often find it difficult to see how
biological knowledge mixes with statistics. I come from the
Machine Learning field, and I usually have problems with the naming
conventions (well, among several other things, I must admit). Besides,
I am not an expert in statistics, having used only what was strictly
necessary to validate my work.
Well, let's try to be more precise. One of the topics I am working
on most right now is the analysis of methylation array data. As you
surely know, the final processed (and normalized) beta values are
presented in a p x n matrix, where there are p different probes and n
different samples or individuals from which we have obtained the
beta values. I am not currently working with the raw data.
Imagine, for a moment, that we have identified two regions of probes,
A and B: a group of nA probes belongs to A, another group of nB probes
belongs to B, and the two groups do not overlap. Say that
we want to find a way to show there is a statistically significant
difference between the methylation values of the two regions.
As far as I have seen in the literature, comparisons (statistical
tests) are always done on the values of the same probe between case
and control groups of individuals or samples, for example when
trying to find differentially methylated probes.
However, if I think of directly comparing all the beta values from
region A (nA * n values) against the ones in region B (nB * n values)
with, say, a t test, I get the suspicion that something is not being
done the way it should. My knowledge of Biology and Statistics is
still limited and I cannot explain why, but I have the feeling that
there is something formally wrong with this approach. Am I right?
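To make the comparison above concrete, here is a minimal sketch of the
pooled test in question. Everything in it is an assumption for
illustration: the beta matrix is simulated, the sizes p, n, nA and nB are
made up, and the index sets idx_a and idx_b are hypothetical stand-ins
for regions A and B. It only shows what "comparing all values from A
against all values from B" amounts to; it is not meant as a recommended
analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical sizes: p probes, n samples (not taken from any real array)
p, n = 1000, 20
# Simulated p x n matrix of beta values in (0, 1)
beta = rng.beta(2.0, 5.0, size=(p, n))

# Hypothetical, non-overlapping probe index sets for regions A and B
idx_a = np.arange(0, 30)      # nA = 30 probes
idx_b = np.arange(500, 550)   # nB = 50 probes

# Pool every value of each region into one flat vector
values_a = beta[idx_a, :].ravel()   # nA * n values
values_b = beta[idx_b, :].ravel()   # nB * n values

# Two-sample t test on the pooled values (Welch variant, unequal variances).
# Note that the test treats all pooled values as independent observations.
t_stat, p_value = stats.ttest_ind(values_a, values_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```

In this sketch each sample contributes nA values to one vector and nB
values to the other, so the same individual appears many times on both
sides of the test.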
What I have done in similar experiments has been to find
differentially methylated probes, and then test the proportion of
such probes relative to the total number of probes, so that I could
assign a p-value to support the claim that the region of reference
had a significant influence.
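One way such a proportion test could be sketched is shown below. The
choice of Fisher's exact test on a 2x2 table is an assumption made for
this example (the post does not say which test was used), and all counts
are invented placeholders, not results.

```python
from scipy import stats

# Hypothetical counts, for illustration only
diff_in_region = 40       # differentially methylated probes inside the region
total_in_region = 200     # all probes inside the region
diff_elsewhere = 900      # differentially methylated probes outside the region
total_elsewhere = 9800    # all probes outside the region

# 2x2 contingency table: rows = inside / outside the region,
# columns = differentially methylated / not differentially methylated
table = [
    [diff_in_region, total_in_region - diff_in_region],
    [diff_elsewhere, total_elsewhere - diff_elsewhere],
]

# Fisher's exact test (assumed here; other proportion tests are possible)
odds_ratio, p_value = stats.fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.3f}, p = {p_value:.3g}")
```

The table compares the proportion inside the region (40 of 200 in this
made-up case) with the proportion in the rest of the array.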
Several questions here: what would be a sound approach to the
regions A and B problem stated above? Is there any issue with
methylation data that I am not aware of which makes only the
within-probe analysis valid? Are there any bibliographic references
that could help me see the subtleties involved?
As you can see, these concepts are quite tangled in my mind, so any help
would be much appreciated.
Regards,
Gustavo