Hi all, I am using edgeR to identify genes differentially expressed in my experiment. I have cells which were grown in 4 different cell media and I would like to perform an ANOVA to identify DE genes comparing all media vs all (e.g. media1 vs media2; media1 vs media3; media1 vs media4; media2 vs media3....).
I have been reading the manual for edgeR and this is my code:
> library(edgeR)
Loading required package: limma
> library(statmod)
> rawdata <- read.delim("RawCounts_wNames.txt", check.names=FALSE, stringsAsFactors=FALSE)
> gruppi <- sub("-.*S[[:digit:]]+","",colnames(rawdata[,3:14]))
> y <- DGEList(counts=rawdata[,3:14],genes=rawdata[,1:2],group=gruppi)
> keep <- rowSums(cpm(y)>1) >=2 #filtering
> y <- y[keep, , keep.lib.sizes=FALSE]
> rownames(y$counts) <- rownames(y$genes) <- y$genes$Genei #Replace row.names (==numbers) with gene names.
> y$genes$Geneid <- NULL #Same as above
> y <- calcNormFactors(y)
> group <- factor(y$samples$group) #Design for one-way ANOVA and decide contrasts:
> design <- model.matrix(~0+group, data=y$samples)
> colnames(design) <- levels(group)
> design
H I M T
H-23_S10 1 0 0 0
H-24_S2 1 0 0 0
H-48_S6 1 0 0 0
I-23_S12 0 1 0 0
I-24_S4 0 1 0 0
I-48_S8 0 1 0 0
M-23_S9 0 0 1 0
M-24_S1 0 0 1 0
M-48_S5 0 0 1 0
T-23_S11 0 0 0 1
T-24_S3 0 0 0 1
T-48_S7 0 0 0 1
attr(,"assign")
[1] 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"
> con <- makeContrasts(HvsI = H - I, HvsM = H - M, HvsT = H - T, levels=design)
> con2 <- makeContrasts(HvsI = H - I, levels=design)
> y <- estimateDisp(y,design,robust=TRUE)
> fit <- glmQLFit(y, design,robust=TRUE) #Fitta una glm
> qlf <- glmQLFTest(fit, contrast=con) #Test on glm. Not filtered on FC
> tr <- glmTreat(fit, contrast=con, lfc=log2(1.2))
Error in data.frame(logCPM = glmfit$AveLogCPM, PValue = p.value, row.names = rownames(glmfit)) :
row names supplied are of the wrong length
I cannot find any answer in relation to this error so any help would be mostly appreciated. Thanks Luca
Thanks James for your reply. I am sorry but I don't understand what you mean. According to this guide (link) the object
con
is a matrix...Plus, sorry, what do you mean with "it makes no sense in the context of an ANOVA-like test of multiple comparisons"? Thanks Luca
I'm not sure how you managed to post this three times, but will you delete the extras?
So yes, a contrast matrix is a matrix, and you will be able to find examples of such things in the edgeR User's Guide. But what you won't find is an example of
glmTreat
that uses a contrast matrix. Anyway, the help page is the definitive resource anyway, and here's what it has to say:Where 'numeric vector' means exactly that - a vector of numbers. The reason for this is you are testing the hypothesis of H0: β ≤ |c| versus the alternative of HA: β > |c|, for some log fold change c. In other words, you are asking for evidence that the log fold change for a particular comparison is greater than some value.
With edgeR (and limma) you can make ANOVA like comparisons where you specify multiple contrasts at one time (by passing in a contrast matrix with more than one column), and it will then perform a test that asks if there is any evidence that the coefficient for any of the specified comparisons is different from zero. So when you use
glmQLFTest
with a contrast matrix, you are asking for a quasi-likelihood F-test that any of the coefficients are different from zero.It is not possible (so far as I know) to specify an F-test based on a minimum fold change for a set of comparisons, and even if it is, Gordon hasn't implemented anything like that in edgeR, so that's why it makes no sense to ask for a minimum fold change in the context of an ANOVA-like test.
Sorry for the three posts, I honestly don't know how I did it... Thanks for the detailed explanation, now I get what you mean about the
contrast
. So basically withglmQLFTest
I will find all genes that are significantly differentially expressed (not up-reg or down-reg, just significantly DE) at least in one cell type. Whereas if I want to look in detail one particular comparison (e.g. media1 vs media2), I can performglmTreat
and implement a fold-change filter. Am I right? From a statistical point of view, given my experimental set-up, would this last option (usingglmTreat
to perform multiple times a comparison with a different couple each time) be correct? Or would it be like doing multiple t-test on a dataset instead of performing an ANOVA? Thanks Luca