edgeR paired samples with blocking
2
1
Entering edit mode
anou ▴ 10
@anou-13336
Last seen 3.7 years ago

I have the following experimental design and want to estimate the differences between nonallergic.BPE vs nonallergic.Medium, allergic.BPE vs. allergic.Medium etc. I would like to use patient as a blocking factor

patient    genotype condition
73644       1 nonallergic    Medium
73645       1 nonallergic    BPE250
73646       2 nonallergic    Medium
73647       2 nonallergic    BPE250
73648       3 nonallergic    Medium
73649       3 nonallergic    BPE250
73650       4    allergic    Medium
73651       4    allergic    BPE250
73652       5    allergic    Medium
73653       5    allergic    BPE250
73654       6    allergic    Medium
73655       6    allergic    BPE250
73656       7    allergic    Medium
73657       7    allergic    BPE250
73877       8 nonallergic    Medium
73878       8 nonallergic    BPE250

In limma-voom it did it like this:

group <- factor(paste(genotype, condition, sep="."))
design <- model.matrix(~0 + group)

vm <- voomWithQualityWeights(y, design=design)
corfit <- duplicateCorrelation(vm, design, block = ExpDesign$patient) vm <- voomWithQualityWeights(y, design, block = ExpDesign$patient, correlation = corfit$consensus, plot= TRUE) vfit <- lmFit(vm, design, block=ExpDesign$patient,correlation=corfit$consensus) my.contrasts <- makeContrasts( aMvsnaM = allergic.Medium-nonallergic.Medium, aBPEvsnaBPE = allergic.BPE250-nonallergic.BPE250, aBPEvsaM = allergic.BPE250-allergic.Medium, naBPEvsnaM = nonallergic.BPE250-nonallergic.Medium, levels=design) When I read about the design formula and matrix in edgeR, I understood that the blocking should be set up like this: design <- model.matrix(~0+ patient + group) Unfortunately this gives me a "Design matrix not of full rank" error for following matrix: patient1 patient2 patient3 patient4 patient5 patient6 patient7 patient8 groupallergic.Medium groupnonallergic.BPE250 groupnonallergic.Medium 1 1 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 0 0 0 0 1 0 3 0 1 0 0 0 0 0 0 0 0 1 4 0 1 0 0 0 0 0 0 0 1 0 5 0 0 1 0 0 0 0 0 0 0 1 6 0 0 1 0 0 0 0 0 0 1 0 7 0 0 0 1 0 0 0 0 1 0 0 8 0 0 0 1 0 0 0 0 0 0 0 9 0 0 0 0 1 0 0 0 1 0 0 10 0 0 0 0 1 0 0 0 0 0 0 11 0 0 0 0 0 1 0 0 1 0 0 12 0 0 0 0 0 1 0 0 0 0 0 13 0 0 0 0 0 0 1 0 1 0 0 14 0 0 0 0 0 0 1 0 0 0 0 15 0 0 0 0 0 0 0 1 0 0 1 16 0 0 0 0 0 0 0 1 0 1 0 attr(,"assign") [1] 1 1 1 1 1 1 1 1 2 2 2 attr(,"contrasts") attr(,"contrasts")$patient
[1] "contr.treatment"

attr(,"contrasts")$group [1] "contr.treatment" How do I have to set up my matrix in order to get the differences for my contrasts? Unfortunately I also don't know whether I should use a model with intercept or not, as we don't have any biological data that would suggest/dissuade that the treatment would have more effect in allergic vs nonallergic patients... Do you have any advice on it? How could I figure out which model would be the right one? Cheers limma edger blocking paired samples • 2.3k views ADD COMMENT 3 Entering edit mode Aaron Lun ★ 27k @alun Last seen 12 hours ago The city by the bay This is equivalent to the experimental design covered in Section 3.5 of the edgeR user's guide. design <- model.matrix(~0 + patient + genotype:condition) design <- design[,!grepl("Medium", colnames(design))] # get to full rank colnames(design) <- make.names(colnames(design)) # make colnames syntactically valid. The first 8 coefficients are the blocking terms for the patients, and can be interpreted as the log-average expression of the "Medium" condition in each patient. The remaining two coefficients represent the log-fold change of BPE250 over Medium within the allergic and non-allergic patients, respectively. Setting coef= to 8 or 9 will test for the effect of BPE250 within each level of genotype. Setting contrasts= to: makeContrasts(genotypeallergic.conditionBPE250 - genotypenonallergic.conditionBPE250, levels=design) will test whether there is any difference in BPE250 effect between genotypes, i.e., difference of differences. There is, however, no way to use edgeR to compare "BPE250 allergic" directly to "BPE250 non-allergic" with the full data set. Similarly, you can't directly compare "Medium allergic" to "Medium non-allergic". This is because there is no direct equivalent to duplicateCorrelation in edgeR, which means that we can't account for correlations between samples from the same patient without also eliminating interesting differences between genotypes. If you want these comparisons and you must use edgeR, the only solution is to subset the data set so that only one sample from each patient is present in each subset (e.g., to BPE250-only and Medium-only), and perform the comparisons of interest within each subset. ADD COMMENT 0 Entering edit mode Thank you for your explanation Aaron and the appropriate section in the guide. I don't have to use edgeR, but as we don't know what outcome we expect, I thought it might be a good idea to do the analysis with three packages (edgeR, DeSeq2 and limma-voom) to compare the resulting genes. I should be able to compare the subseted data from edgeR with the analysis from limma-voom, right? Is the analysis/code in limma-voom correct? I was just wondering because I don't get any upregulated interleukin genes (which I somewhat expected)... With this setup don't get a common dispersion factor. is that to be expected? y_edge <- estimateDisp(y_edge, design) y$common.dispersion
NULL
1
Entering edit mode

I'm often bemused by people who want to compare different DE analysis methods on data sets where they do not know the ground truth. If the methods yield different DE lists, it's impossible to say which method is more correct. In fact, it's possible for two DE lists to both be correct (i.e., controls the FDR below the nominal threshold) yet also very different in size and composition, depending on each method's power for particular types of genes, e.g., low/high-abundance, low/high-variance. One hopes that the differences would decrease as the sample size increases, but most RNA-seq experiments are underpowered and different methods will use different assumptions to "fill in the gaps", so to speak.

The justification I often hear for using multiple methods is something like "if we take the intersection of DE lists from all methods, we'll get a confident set of DE genes". This is a specious argument. If I want a more confident set of DE genes, I lower the FDR threshold and trust that my method is controlling the FDR correctly. If it isn't, then no amount of intersections will help. In fact, a naive intersection between DE lists (each detected at a certain FDR threshold) is not guaranteed to yield a set of genes where the FDR is controlled below that same threshold! To be correct, we need to use intersection-union tests instead, and I'm pretty sure that no one who advocates for intersections is actually doing that. Finally, an intersection of any kind will reduce power; a gene that fails to be detected in any method (e.g., because a particular method is underpowered for a particular expression profile) will not be in the intersection, even if it is a true positive.

Okay, that's enough ranting. Onto your questions.

I should be able to compare the subseted data from edgeR with the analysis from limma-voom, right?

Yes and no. Yes, because it is the same contrast. No, because the analysis on the subset will naturally have less power than the analysis of the full data set where more information is available for variance/dispersion estimation. So you'll naturally get different results, without even considering the differences in the models.

Is the analysis/code in limma-voom correct?

I would do a second duplicateCorrelation after the second voom, but otherwise it looks fine to me (assuming you ran contrasts.fit and eBayes correctly). If expected DE genes are not being detected, I would suggest looking at their expression values directly to see whether it is a modelling issue or if there wasn't a difference in the first place. Your design is still simple enough for direct examination of the log-CPMs.

With this setup don't get a common dispersion factor. is that to be expected?

That's because you saved the output of estimateDisp to y_edge, not y.

0
Entering edit mode

Fair enough and you have a point. I just figured that if the same genes were detected by multiple methods, they were more likely to be identified correctly.

I would suggest looking at their expression values directly to see whether it is a modelling issue or if there wasn't a difference in the first place

This is the normalization and filtering I did:
​cpm <- cpm(y)
lcpm <- cpm(y, log = TRUE)
keep <- filterByExpr(y)
summary(keep)
y <- y[keep, ]
keep <- rowSums(cpm(y)>1) >= 3
y <- y[keep, , keep.lib.sizes=FALSE]
y$samples$lib.size <- colSums(y$counts) I've had a look at the lcpm from a few interleuking genes and the values between the 16 samples differ from e.g 7 to 5. Is that difference not large enough to be picked up? ADD REPLY 1 Entering edit mode Depends on the variance within each group, compared to the log-fold change between groups. ADD REPLY 0 Entering edit mode So I ran the modeling with a second duplicateCorrelation as suggested, and got a corfit$consensus of 0.6093099. I was happy with the results, till I checked the pvalue distribution

pvals <- topTable(efit, coef=3, n=Inf)$P.Value sum(pvals <= 0.01)/length(pvals) # 0.1230808 and the of vm$E

Does that mean the duplicateCorrelation didn't work, or do I have more batch effects in my data I haven't removed yet? The MDS plots before and after voom look almost identical

0
Entering edit mode

The MDS plots won't directly respond to any blocking within duplicateCorrelation. In fact, it's impossible for them to do so, as you can't explicitly estimate and remove the batch effects here without also removing the effect of interest. Any difference will be subtle and due to the effect of duplicateCorrelation blocking on the precision of the fitted linear model when correlations between samples are taken into consideration.

I don't understand what you intend to show with the p-value distribution. If you have DE genes in that contrast, obviously you'll have more p-values below the threshold than expected under the null.

0
Entering edit mode
@gordon-smyth
Last seen 6 hours ago
WEHI, Melbourne, Australia

Personally I would do stick to limma-voom for this analysis (rather than edgeR or DESeq2) because of the ability to treat patient as a random effect.

Also, the recommended way to filter genes is:

keep <- filterByExpr(y, group=group)
y <- y[keep,, keep.lib.sizes=FALSE]

0
Entering edit mode

Thank you for the recommendation Gordon, but why would you not do the normalization based on cpm per sample?

keep <- rowSums(cpm(y)>1) >= 3
0
Entering edit mode

Why do you run filterByExpr() but then not use the result? That doesn't seem to make any sense.

The best cpm cutoff depends on the library sizes, and you haven't told us what they are., so I can't tell whether your filter is sensible or not. Also, why do you use ">3" when your minimum group size is 4 rather than 3?

Why don't you just use filterByExpr(), which chooses the cpm cutoff for you, so you can be sure it's right?