I have made a similar post but it was more of a programming question (https://support.bioconductor.org/p/70664/).
I now have multiple clusters in a cohort of 270 leukaemia samples that comprises 3 batches. The samples are grouped according to their cytogenetics or clustering results. Some groups are large (> 20 samples) and some are small (3 samples). Some groups span all batches, while others are found in only one or two batches (some large groups occur in only one batch). Samples that do not belong to any group are labelled "Others".
I want to find the differentially expressed genes of groups A, B, C and D, each compared with the entire cohort. To illustrate the structure of the data, it looks something like the table below
I am only interested in groups A to D, each compared with the entire cohort. I have read that including all the groups lets limma make a better estimate of the actual differential expression of genes. However, when I did that and ran svaseq(), the first surrogate variable did not match the 3 batches (upper box plot). When I included only an A true/false indicator in the design, I got a much better first surrogate variable that explains the 3 batches (lower box plot).
Includes all groups:
Includes only one group (Group A):
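To make the comparison concrete, the two designs could be set up roughly as follows. This is only a minimal sketch on a made-up sample sheet: the group sizes, the reduced set of groups, and the names `pheno`, `design_all`, `design_A` and `mod0` are illustrative assumptions, not my real data.

```r
# Toy sample sheet (hypothetical): 30 samples in 5 groups of uneven size.
groups <- c("A", "B", "C", "D", "Others")
pheno <- data.frame(
  group = factor(rep(groups, times = c(8, 6, 3, 5, 8)), levels = groups)
)

# Design 1: one column per group (no intercept), so that contrasts
# between groups can be formed afterwards.
design_all <- model.matrix(~ 0 + group, data = pheno)
colnames(design_all) <- levels(pheno$group)

# Design 2: a single true/false indicator for membership of group A.
pheno$isA <- pheno$group == "A"
design_A <- model.matrix(~ isA, data = pheno)

# Null model containing no variable of interest (intercept only).
mod0 <- model.matrix(~ 1, data = pheno)

dim(design_all)  # samples x group columns
dim(design_A)    # samples x (intercept + indicator)
```

In each run, one of the two designs would be supplied as `mod` and the intercept-only matrix as `mod0` in the call to svaseq(), so the only difference between the runs is how much of the group structure is protected from the surrogate variables.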
Should I include only one group in each run, instead of including all the groups in one run and making a contrast matrix? And if I do make a contrast matrix, is the following line right?
contr.matrix <- makeContrasts(
    AvsAll = A - (B+C+D+E+F+G+H+I+J+K+Others)/11,
    BvsAll = B - (A+C+D+E+F+G+H+I+J+K+Others)/11,
    CvsAll = C - (A+B+D+E+F+G+H+I+J+K+Others)/11,
    DvsAll = D - (A+B+C+E+F+G+H+I+J+K+Others)/11,
    levels = colnames(design))
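For reference, the same weights can be written out in base R, which makes it easier to see what each contrast tests. This is a sketch that assumes a no-intercept design with one column per group, named A to K plus Others; the names `groups`, `targets` and `contr` are illustrative.

```r
# The 12 group names used in the contrast (A to K, plus "Others").
groups <- c(LETTERS[1:11], "Others")
targets <- c("A", "B", "C", "D")

# One column per contrast: +1 for the target group and -1/11 for each
# of the other 11 groups.
contr <- sapply(targets, function(g) {
  w <- rep(-1 / (length(groups) - 1), length(groups))
  w[groups == g] <- 1
  w
})
rownames(contr) <- groups
colnames(contr) <- paste0(targets, "vsAll")

# The weights of each contrast sum to zero (up to floating-point rounding).
colSums(contr)
```

Written this way, each contrast compares one group mean with the unweighted average of the other 11 group means. As far as I understand, that is not the same as comparing the group with all remaining samples pooled, because a group of 3 samples carries the same weight as a group of more than 20. A matrix of this shape (rows matching the design columns) is what limma's contrasts.fit() takes as its `contrasts` argument.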