This is in reply to: dataset dim for siggenes
We frequently use limma to analyze Ct values from PCR arrays similar to yours (although usually with fewer samples), so the analysis that you have already done is basically what we recommend.
We use cyclicloess normalization with house-keeping genes up-weighted. This gives a nice compromise between global normalization and normalization on house-keeping genes. After normalization, we use MDS plots to search for unexpected batch effects. Then we use the usual limma pipeline except that we set robust=TRUE and trend=TRUE when running eBayes(). Sometimes we use treat() instead of eBayes() to give more emphasis to larger fold changes.
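A minimal sketch of that pipeline, run on simulated Ct values rather than real data. The 116 x 60 dimensions are borrowed from the original question. The group labels, the choice of rows 1-5 as house-keeping genes, the up-weighting factor of 50 and the treat() threshold of 1.5-fold are all assumptions made for illustration, not recommended settings.

```r
library(limma)

set.seed(1)

# Simulated stand-in for a PCR array: 116 miRNAs x 60 samples of Ct values.
n_mir  <- 116
n_samp <- 60
ct <- matrix(rnorm(n_mir * n_samp, mean = 28, sd = 2), nrow = n_mir,
             dimnames = list(paste0("miR", seq_len(n_mir)),
                             paste0("S", seq_len(n_samp))))
group <- factor(rep(c("control", "case"), each = n_samp / 2),
                levels = c("control", "case"))

# Assumption: rows 1-5 are the house-keeping genes.
hk <- 1:5
w <- rep(1, n_mir)
w[hk] <- 50  # arbitrary up-weighting factor, for illustration only

# Cyclic loess normalization with the house-keeping genes up-weighted.
ct_norm <- normalizeCyclicLoess(ct, weights = w, method = "fast")

# MDS plot to look for unexpected batch structure.
plotMDS(ct_norm, labels = as.character(group), col = as.integer(group))

# Usual limma pipeline, with trend and robust switched on in eBayes().
# (Depending on the limma version, robust = TRUE may need the statmod package.)
design <- model.matrix(~ group)
fit    <- lmFit(ct_norm, design)
fit_eb <- eBayes(fit, trend = TRUE, robust = TRUE)
top_eb <- topTable(fit_eb, coef = "groupcase", number = 10)

# Alternative: treat() tests against a minimum fold change instead of zero.
fit_tr <- treat(fit, lfc = log2(1.5), trend = TRUE, robust = TRUE)
top_tr <- topTreat(fit_tr, coef = "groupcase", number = 10)
```

Note that on the Ct scale a positive logFC means a higher Ct in cases, that is, lower abundance, so the sign is opposite to the usual expression convention.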
People worry too much about normality. Limma makes similar assumptions to ANOVA, and both are quite robust against non-normality for a two-group comparison.
There are more important things to worry about, for example heteroscedasticity (large Ct values are less precise than small Ct values), outliers and batch effects. Setting trend=TRUE is intended to deal with heteroscedasticity, robust=TRUE with outliers, and limma allows you to correct for batch effects if they exist.
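One common way to correct for a known batch effect in limma is to put the batch factor in the design matrix, sketched below on simulated data. The batch labels, the size of the batch shift and the matrix dimensions are assumptions for illustration. removeBatchEffect() is used here only to clean up the data for plotting, while the test itself comes from the model that includes batch.

```r
library(limma)

set.seed(2)

# Simulated, already-normalized Ct matrix: 116 miRNAs x 60 samples.
ct_norm <- matrix(rnorm(116 * 60, mean = 28, sd = 2), nrow = 116,
                  dimnames = list(paste0("miR", 1:116), paste0("S", 1:60)))
group <- factor(rep(c("control", "case"), each = 30),
                levels = c("control", "case"))
batch <- factor(rep(c("A", "B"), times = 30))  # hypothetical run batches

# Add an artificial shift of 1 Ct to every sample in batch B.
ct_norm[, batch == "B"] <- ct_norm[, batch == "B"] + 1

# Adjust for batch by including it in the linear model.
design <- model.matrix(~ batch + group)
fit <- eBayes(lmFit(ct_norm, design), trend = TRUE, robust = TRUE)
top <- topTable(fit, coef = "groupcase", number = 10)

# For plots only: remove the batch effect while protecting the group effect.
ct_vis <- removeBatchEffect(ct_norm, batch = batch,
                            design = model.matrix(~ group))
plotMDS(ct_vis, labels = as.character(group), col = as.integer(batch))
```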
> Date: Fri, 12 Sep 2014 19:45:27 -0300 (BRT)
> From: ferreirafm at usp.br
> To: jmacdon at u.washington.edu
> Cc: bioconductor <bioconductor at r-project.org>
> Subject: Re: [BioC] dataset dim for siggenes
> Hi Jim,
> Thank you very much for your really nice explanation.
> I'm going to study your answer and, if you don't mind, I would like to come
> back to it later.
> I thought that the Bayesian approach implemented in limma would have
> different assumptions from the t-test and ANOVA.
> Also, in fact, the normality condition doesn't hold for all miRNAs
> across patients. I'll go back to the ANOVA assumptions and run additional
> tests. What should I do if they fail?
> About sampling, we are trying to gather patients as similar as possible
> to those from the first experiment, using several criteria like age, sex,
> weight, heart flow and other factors commonly used for phenotyping. I
> hope we are lucky in the sense that you pointed out.
> ----- Original message -----
>> From: "James W. MacDonald" <jmacdon at uw.edu>
>> To: ferreirafm at usp.br
>> Cc: "bioconductor" <bioconductor at r-project.org>
>> Sent: Friday, 12 September 2014 18:11:29
>> Subject: Re: [BioC] dataset dim for siggenes
>> Hi Fred,
>> I'll take the second question first. The methods that have been
>> developed for analyzing microarray data are all just modifications of
>> the existing linear modeling methods that people have used for years
>> (t-test, ANOVA, linear modeling of continuous covariates, etc). The
>> reason that people have developed these methods is because in general,
>> with microarray data you run into the problem of making tons of
>> comparisons with very little replication. The problem with doing
>> something like that is that a) you need to adjust the p-values to reflect
>> that you are making (possibly thousands of) simultaneous comparisons,
>> and b) you often have maybe 3 or 4 replicates for each group, so your
>> power to detect differences is probably really low. So the goal was to
>> figure out ways to improve the power for these comparisons in a
>> statistically rigorous manner, and there were lots of ways that people
>> developed to do that.
>> There was also some concern that the usual assumption of normally
>> distributed data might not hold for all the genes being compared, so
>> different groups developed ways to increase power and also generate
>> permuted null distributions, so you wouldn't have to make an assumption
>> that might not hold.
>> But in the end, all these methods (limma, siggenes, multtest, etc) are
>> just fitting t-tests that are modified to help increase power. So they
>> are all doing essentially the same thing, but in a slightly different
>> manner. So if you run your samples through limma, and then siggenes,
>> and then multtest, any changes in your results will simply reflect
>> differences in the methods used, but won't give you any more
>> information about your samples. And since you have 15 replicates for
>> each group, you would probably get very similar results if you were to
>> just use 'regular' methods, because you aren't measuring that many
>> genes, and you have pretty good replication.
>> On the other hand, running a new set of samples will tell you a great
>> deal. This has to do with the underlying hypothesis that you are
>> (usually) testing. In general when you are doing a comparison, you are
>> trying to estimate a population parameter using a sample from that
>> population. In other words, you are trying to make a statement about
>> all the members of a population, based on a sample from that
>> population. There is always the possibility that you were unlucky and
>> chose a set of subjects from the two populations you are comparing that
>> are really different, but in truth there is no difference between the
>> two populations. You then make your measurements, and say 'look, gene X
>> appears to be expressed at a much higher level in population 1 as
>> compared to population 2'. But remember, you were unlucky in your
>> choice of subjects to represent the two populations, and there really
>> aren't any differences. So repeating the experiment with new subjects
>> will likely not have the same result, and you will be glad that you
>> didn't try to publish your results.
>> Or alternatively, if you re-run your analysis for the 10 top genes, and
>> they are all significant in the next set of samples, then you have
>> pretty good evidence that there really is a difference between the two
>> populations, because you got the same results with two separate sets of
>> subjects. But of course that assumes you are doing a reasonable job of
>> selecting subjects in an unbiased manner, which is a different topic.
>> For the first question, there are any number of things you can and
>> should test. I won't go into them here because a simple Google search
>> like 'R testing anova assumptions' is likely to bring up all the
>> results you need to answer that question.
>> On Fri, Sep 12, 2014 at 3:53 PM, < ferreirafm at usp.br > wrote:
>>> Hi Jim,
>>> Could you please tell me which tests I should perform
>>> in order to ensure that my data fulfill the linear model assumptions?
>>> Turning back to my question about "performing several different tests to
>>> decide which mirs to take", could you explain a little bit more why
>>> such an approach doesn't make sense?
>>>> From: "James W. MacDonald" < jmacdon at uw.edu >
>>>> To: ferreirafm at usp.br
>>>> Cc: "bioconductor" < bioconductor at r-project.org >
>>>> Sent: Friday, 12 September 2014 12:47:55
>>>> Subject: Re: [BioC] dataset dim for siggenes
>>>> Hi Fred,
>>>> I am assuming you have 116 miRNAs, and 60 samples. In which case you
>>>> could probably just use a conventional t-test or linear model,
>>>> although using limma wouldn't be a controversial decision. Not too
>>>> sure about siggenes though. You have to estimate the proportion of
>>>> true nulls, and I don't know if 116 comparisons are enough.
>>>> But the larger question is the issue of running further statistical
>>>> tests for validation. I am not sure what you mean by that.
>>>> Quantitative PCR is (for better or worse) assumed to be the 'gold
>>>> standard' for quantification of nucleic acid sequences, so there
>>>> doesn't seem to be much more to do. Certainly re-running the analyses
>>>> using a slightly different method isn't useful. That's like weighing
>>>> yourself on a bunch of different scales; it tells you way more about
>>>> the scales than it does about your weight.
>>>> I think the next step (or really, the first step if you haven't
>>>> already done so) is to ensure that your data meet all the underlying
>>>> assumptions for linear modelling, so that you can have confidence in
>>>> the conclusions you draw from the results.
>>>> On Fri, Sep 12, 2014 at 11:18 AM, < ferreirafm at usp.br > wrote:
>>>>> Hi list,
>>>>> I have a 116 x 60 qPCR data set processed with limma. Results showed
>>>>> 30 DE miRNAs. My idea is to pick 10 of them for validation by
>>>>> running further statistical tests and taking the most recurrent mirs
>>>>> from all analyses (does it make sense?). Well, I was thinking of
>>>>> using siggenes; however, its authors recommend it for
>>>>> high-dimensional data. Will siggenes be suitable for my data? If not,
>>>>> could someone suggest other packages and perhaps tests more
>>>>> appropriate for data of this size?
>>>> James W. MacDonald, M.S.
>>>> University of Washington
>>>> Environmental and Occupational Health Sciences
>>>> 4225 Roosevelt Way NE, # 100
>>>> Seattle WA 98105-6099