Hi Ryan,
Your contrast doesn't seem so odd to me. We used a similar contrast, for example, to compare Basal breast cancer with the average of all other breast cancer subtypes:
http://nar.oxfordjournals.org/content/40/17/e133
My worry is that with this contrast, I'm effectively just testing two groups against each other, and by having 4 groups in the design I will be estimating dispersions that are not appropriate for the test that I'm doing, and hence I will overstate my confidence.
The dispersions remain unchanged regardless of the contrast you test. The dispersions have been estimated after removing all differences between the four groups, i.e., without bias.
edgeR is giving you a correct test of the contrast you have specified. You are testing whether an equal mix of the first two groups has the same average expression as an equal mix of the third and fourth groups.
Note that you are not testing whether the difference between the two big groups is large compared to variation within the big groups. The test does not care how large the differences are between A and B, or between C and D.
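To make the contrast concrete, here is a hedged sketch of how it might be set up in edgeR. The object `y`, the group labels A-D and the sample sizes are illustrative placeholders, not taken from Ryan's actual data:

```r
# Sketch only: assumes a DGEList 'y' that has already been filtered and
# normalized, with three samples in each of four hypothetical groups A-D.
library(edgeR)
library(limma)

group <- factor(rep(c("A", "B", "C", "D"), each = 3))
design <- model.matrix(~0 + group)
colnames(design) <- levels(group)

y <- estimateDisp(y, design)     # dispersions estimated within all four groups
fit <- glmQLFit(y, design)

# Equal mix of C and D versus equal mix of A and B
con <- makeContrasts((C + D)/2 - (A + B)/2, levels = design)
qlf <- glmQLFTest(fit, contrast = con)
topTags(qlf)
```

Because the design has all four groups, the dispersions are estimated after removing all between-group differences, whatever contrast is then tested.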
Or, to put it another way, am I doing something equivalent to testing a main effect in a model where an interaction term is present?
No, the test does not suffer from the same objection. However, you may need to be careful interpreting the test when there is lots of DE between A vs B or C vs D. It may be worthwhile first checking A vs B and C vs D.
Best wishes
Gordon
Coding that doesn't reflect the true experimental design is likely to perform badly, and give less significance. That doesn't make it more correct.
You are not asking a well-posed question. For one thing, "fair" and "unfair" are unhelpful concepts. The only consideration is whether the statistical test that is done answers the scientific question being asked. You haven't explained what scientific question you want to answer, so there is no basis for choosing a statistical test.
Fitting a model that matches the experimental conditions and then making comparisons between groups has been the anova method since anova was first invented. It answers what it answers, as I explained in my response to Ryan.
Gordon
It is quite clear, from your and Ryan's comments, that this difference is not the only scientific question to be answered, and so it cannot be the only hypothesis to be tested.
I strongly disagree, as I have already told you. Modeling of subgroups that have a strong effect on the outcome is always good science.
Would I be willing to give a blunderbuss recommendation that you should apply in all situations, regardless of the nature of the groups or the scientific questions at issue? No I wouldn't.
I have been trying to prompt you to clarify what your scientific questions actually are. Once you do so, the appropriate statistical procedure will be readily apparent.
I seem to have answered the same question three times now, without getting the message across. I will make one more attempt, but I will reply to Ryan's original post, not to this email, because much of Ryan's original question and my response has been deleted from the thread below.
Gordon
Dear Ming,
Something is seriously wrong -- you shouldn't get these warnings, you shouldn't get such a large dispersion estimate and, if you do, you shouldn't get such small p-values.
I suspect that the culprit is RSEM. edgeR is designed to accept raw read counts rather than estimated abundance levels as might be output from RSEM or Cufflinks.
There are a number of tools that produce genewise read counts from RNA-seq data. My lab routinely uses subread and featureCounts:
http://www.ncbi.nlm.nih.gov/pubmed/24227677
which are available from the Bioconductor package Rsubread.
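A minimal sketch of that count-level pipeline, assuming single-end FastQ files and the hg19 inbuilt annotation; the file names and index name are placeholders:

```r
# Sketch only: align reads and summarize them to genewise integer counts,
# which is the input edgeR expects. An index must have been built first
# with buildindex(); all file names here are hypothetical.
library(Rsubread)

align(index = "hg19_index",
      readfile1 = "sample1.fastq.gz",
      output_file = "sample1.bam")

fc <- featureCounts(files = "sample1.bam", annot.inbuilt = "hg19")
head(fc$counts)   # raw integer counts, one row per gene
```

The `fc$counts` matrix can be passed directly to `DGEList()` in edgeR.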
Best wishes
Gordon
Dear Ming,
Thanks for the further information.
It is obvious that numbers like 4.67 are not raw integer counts. I suspect that they are posterior expected counts from RSEM. The column heading "raw_count" does seem rather misleading.
If you want to analyse these expected counts from RSEM, then limma-voom would be a far preferable pipeline than edgeR (or DESeq or any other negative binomial based package).
I understand that the RSEM authors did claim that their expected counts could be input to edgeR, but edgeR is designed to accept integers, and your results appear to confirm some of the problems that arise. The problems cannot be solved by changing the filtering.
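For completeness, a hedged sketch of what a limma-voom analysis of such expected counts might look like. The `counts` matrix and `group` factor are placeholders for Ming's actual data:

```r
# Sketch only: voom tolerates non-integer expected counts because it
# models the mean-variance trend rather than assuming a count distribution.
library(edgeR)
library(limma)

dge <- DGEList(counts = counts)            # 'counts' = RSEM expected counts
keep <- filterByExpr(dge, group = group)
dge <- dge[keep, , keep.lib.sizes = FALSE]
dge <- calcNormFactors(dge)

design <- model.matrix(~group)
v <- voom(dge, design)
fit <- eBayes(lmFit(v, design))
topTable(fit, coef = 2)
```

Unlike the negative binomial pipelines, nothing here requires the input to be integers.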
The raw FastQ files for the TCGA samples should be publicly available from GEO and SRA. My lab has downloaded similar TCGA data as SRA files and converted them to FastQ.
Best wishes
Gordon
I think the point is that Ming has already downloaded the v2 data, and the so-called "raw counts" turned out not to be counts.
If you want to dig to find out what the "raw counts" are exactly, that would be a great service, because I am just guessing. The TCGA documentation just says they are from RSEM.
Gordon
Dear Ryan and Aaron,
Given Aaron's reactions to my previous responses, I will make one more attempt to answer in slightly more detail.
The first thing to appreciate is that every statistical test is an answer to a particular question. The contrast test that you mention certainly makes statistical sense, but this is not the issue. The issue is scientific rather than statistical. Whether or not this test is an appropriate answer to your scientific question depends on what your scientific question is. You have not yet laid this out in sufficient detail.
Here are some different scientific contexts that might or might not apply in your situation.
First, you might want to assert that C and D have higher expression than either A or B. If you want to claim that, then clearly you must do individual contrasts C vs A, C vs B, D vs A and D vs B. There is no shortcut. The contrast C+D vs A+B is not sufficient.
Or you might want to assert that the treatments cluster into two big groups, C and D vs A and B. To establish this, you need to show that the CD vs AB separation is large compared to the C vs D and B vs A separations. You could do all pairwise comparisons, but a slightly more efficient method would be to test three contrasts: B-A, D-C and (C+D)/2-(A+B)/2. You can make this assertion if the third contrast is far more significant than the first two. Even if B-A and D-C are statistically significant, you could still establish the claim by showing that the fold changes for (C+D)/2-(A+B)/2 are much larger than those for B-A or D-C.
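The three-contrast approach might be set up as follows. This is a sketch only: it assumes a no-intercept design with columns named A, B, C and D, and a quasi-likelihood fit `fit` from `glmQLFit()`:

```r
# Sketch: the three contrasts for separating "big group" DE from
# within-group DE. 'fit' is a hypothetical glmQLFit object on a
# design whose columns are A, B, C, D.
library(edgeR)
library(limma)

con <- makeContrasts(
  BvsA  = B - A,
  DvsC  = D - C,
  BigCD = (C + D)/2 - (A + B)/2,
  levels = c("A", "B", "C", "D"))

resBvsA <- glmQLFTest(fit, contrast = con[, "BvsA"])
resDvsC <- glmQLFTest(fit, contrast = con[, "DvsC"])
resBig  <- glmQLFTest(fit, contrast = con[, "BigCD"])
```

Comparing the significance and fold changes of `resBig` against `resBvsA` and `resDvsC` is then a direct check of whether the CD vs AB separation dominates the within-group differences.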
Or you might want to assert that a population made up of equal parts C & D would have different expression to a population made of equal parts of A & B. To assert that, you only need to test (C+D)/2-(A+B)/2.
The four groups might arise from two original factors. Suppose that the groups A-D correspond to the factors Big = c(1,1,2,2) and Sub = c(1,2,1,2). You might want to assert that the high level of Big increases expression over the low level regardless of the level of Sub. In that case you need to test the two contrasts C-A and D-B. If both are significantly up, then you can make the assertion.
Or you might want to assert that Big has the same effect on expression regardless of the Sub baseline. In that case you need to show that (C+D)/2-(A+B)/2 is significant but (D-B)-(C-A) is not.
Finally, if you were confident in advance that A and B were not different and C and D were not different, then you could simply pool the A and B samples together and the C and D samples together and do a two-group test. This produces a statistically valid test only if there is no systematic differential expression between A and B or between C and D. But if you knew that in advance, why did you classify the samples into four groups in the first place?
Best wishes
Gordon