Question: EdgeR Multifactorial design with batch effect-trouble with defining contrasts
2
4.3 years ago by
mouchkam20
United States/Portland
mouchkam20 wrote:

Hello,

I'm using EdgeR to identify differentially expressed transcripts from RNASeq data.  I have a fairly complicated experimental design with multiple time points, treatments, and tissue types.  In addition, because of limited amounts of experimental material, the replicates were collected at three different time points, so I'm also trying to include batch effects in my model.  I think I have correctly set up my design matrix, but I'm having trouble defining my contrasts and thus making the appropriate comparisons between treatments groups that are most biologically relevant.

As a bit more biological information, we are interested in identifying genes that play an important role in a behavioral immune response of fruit flies to their parasitic wasps.  We exposed flies to wasps (or no wasps as controls) for 2 or 8 hrs.  We also separated the fly heads from the bodies to look at 'tissue' specific differences.  Therefore, for each time point, we have heads and bodies that are either control or wasp-exposed, with 3 replicates of each.  As previously mentioned, the replicates are independent and were collected on separate weeks.   As for the naming convention in the condition, Con = control, Wasp = wasp-exposed, 2 = 2 hrs, 8 = 8hrs, B = boides, H = heads, such that rep Con_2_B is fly control bodies collected at 2 hrs.  At this point, I'm most interested in comparing control vs. wasp-exposed heads or bodies for each time period, accounting for any batch effects due to collection week (e.g. Con_2_B vs. Wasp_2_B).

I have the count data in 24 separate files and I also have a metadata file, so I used readDGE function to combine the count data. Below is my code, as well as head() outputs for a few of vectors so that you have an idea of the data, etc.

> samples=read.csv("EdgeR_fly_tophat_metadata.csv", header=TRUE)

LibraryName Condition Week                        countf
1        C2B1   Con_2_B    1 C2B1_accepted_hits_counts.txt
2        C2B2   Con_2_B    2 C2B2_accepted_hits_counts.txt
3        C2B3   Con_2_B    3 C2B3_accepted_hits_counts.txt
4        C2H1   Con_2_H    1 C2H1_accepted_hits_counts.txt
5        C2H2   Con_2_H    2 C2H2_accepted_hits_counts.txt
6        C2H3   Con_2_H    3 C2H3_accepted_hits_counts.txt
> samples$countf=paste(samples$LibraryName, "_accepted_hits_counts.txt", sep="")
> counts=readDGE(samples$countf)$counts
> noint = rownames(counts) %in%
+     c("__no_feature","__ambiguous","__too_low_aQual",
+       "__not_aligned","__alignment_not_unique")
> cpms = cpm(counts)
> keep = rowSums(cpms>1)>=3 & !noint
> counts = counts[keep,]
> colnames(counts) = samples$LibraryName > head( counts[,order(samples$Condition)], 5 )

C2B1 C2B2 C2B3 C2H1 C2H2 C2H3 C8B1 C8B2 C8B3 C8H1 C8H2 C8H3
FBgn0000008  635  540  685  589  759  635  509  473  504  666  750  825
FBgn0000014  281  235  287    0   44    9  292  265  270    0    0    0
FBgn0000015  117   98  117    0   15    9  137  116  126    1    0    0
FBgn0000017 3599 3036 3303 3388 3891 3240 3727 3093 3386 3593 4053 4465
FBgn0000018  339  338  336  177  253  187  270  309  331  221  232  259
W2B1 W2B2 W2B3 W2H1 W2H2 W2H3 W8B1 W8B2 W8B3 W8H1 W8H2 W8H3
FBgn0000008  592  602  593  753  743  732  642  507  546  783  810  634
FBgn0000014  265  374  263    0    0    1  303  250  295    0    0    0
FBgn0000015  123  126   97    1    0    0  121  104  107    0    1    1
FBgn0000017 3334 3331 2730 3945 3713 3703 4130 2953 3247 3794 3584 3432
FBgn0000018  351  339  344  198  238  217  335  288  315  259  270  209

> d = DGEList(counts = counts, group = samples$Condition) > d2= calcNormFactors(d) > plotMDS(d2) > Treatment = factor(samples$Condition)

> Treatment
[1] Con_2_B  Con_2_B  Con_2_B  Con_2_H  Con_2_H  Con_2_H  Con_8_B
[8] Con_8_B  Con_8_B  Con_8_H  Con_8_H  Con_8_H  Wasp_2_B Wasp_2_B
[15] Wasp_2_B Wasp_2_H Wasp_2_H Wasp_2_H Wasp_8_B Wasp_8_B Wasp_8_B
[22] Wasp_8_H Wasp_8_H Wasp_8_H
8 Levels: Con_2_B Con_2_H Con_8_B Con_8_H Wasp_2_B ... Wasp_8_H
> Week = factor(samples$Week) > Week [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Levels: 1 2 3 > d3 = d = DGEList(counts = counts, group = Treatment) > design <- model.matrix(~Week+Treatment) > colnames(design) [1] "(Intercept)" "Week2" "Week3" [4] "TreatmentCon_2_H" "TreatmentCon_8_B" "TreatmentCon_8_H" [7] "TreatmentWasp_2_B" "TreatmentWasp_2_H" "TreatmentWasp_8_B" [10] "TreatmentWasp_8_H" > rownames(design) <- colnames(d2) > d3 <- estimateGLMCommonDisp(d2, design, verbose=TRUE) > d4 <- estimateGLMTrendedDisp(d3, design) > d5<- estimateGLMTagwiseDisp(d4, design) > plotBCV(d5) > fit <- glmFit(d5, design) > my.contrasts = makeContrasts( + Con2BvWasp2B = TreatmentWasp_2_B - TreatmentCon_2_B, + Con2HvWasp2H = TreatmentWasp_2_H - TreatmentCon_2_H, + Con8BvWasp8B = TreatmentWasp_8_B - TreatmentCon_8_B, + Con8HvWasp8H = TreatmentWasp_8_H - TreatmentCon_8_H, + Con2BvCon2H = TreatmentCon_2_H - TreatmentCon_2_B, + Wasp2BvWasp2H = TreatmentWasp_2_H - TreatmentWasp_2_B, + levels = design) Error in eval(expr, envir, enclos) : object 'TreatmentCon_2_B' not found In addition: Warning message: In makeContrasts(Con2BvWasp2B = TreatmentWasp_2_B - TreatmentCon_2_B, : Renaming (Intercept) to Intercept > sessionInfo() R version 3.1.2 (2014-10-31) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] splines stats graphics grDevices utils datasets [7] methods base other attached packages: [1] edgeR_3.8.6 limma_3.22.7 loaded via a namespace (and not attached): [1] tools_3.1.2 I'm very new to glms, especially for such complicated designs. I tried to refer to the EdgeR manual, which is very informative and helpful, but I think I'm confused as to the intercept term and how to define the contrasts when using an additive linear model with week as a batch effect AND having multiple factors (the examples tend to have one or the other). For instance, there is no column in the design matrix for Con_2_B (why?), as I thought the intercept term represents Week 1. It seems from the warning message that the intercept also includes Con_2_B? I know there are multiple ways with with to make comparisons, but the makeContrasts method makes the most intuitive sense to me. I'm obviously doing something wrong, however. Also, it may be nice at some point to combine conditions and make comparisons, for instance combining the 2 and 8 hrs time points so that you can simply compare control and wasp-exposed heads (or bodies). How would you define contrasts for these type of comparisons? Any help would be much appreciated. Thanks! ADD COMMENTlink modified 4.3 years ago by James W. MacDonald51k • written 4.3 years ago by mouchkam20 Use the latest version of the software. We're up to Bioconductor 3.1 now, running on R 3.2.0. ADD REPLYlink written 4.3 years ago by Aaron Lun25k Answer: EdgeR Multifactorial design with batch effect-trouble with defining contrasts 4 4.3 years ago by United States James W. MacDonald51k wrote: You will probably be better off using a design without an intercept design <- model.matrix(~ 0+ week + treatment) And then you can make all the comparisons you would like. The intercept term doesn't represent Week 1, it represents the mean expression of Con_2_B at Week 1 (and the Week2 and Week3 coefficients are the difference between those weeks and Week1, thereby controlling for the batch effect). You can tell what the intercept is estimating by looking for the rows of the design matrix that have a one in the intercept column, and zeros in all other columns (note that the rows of the design matrix correspond to the samples in your experiment). The intercept is computing the mean expression of the samples that correspond to those rows (and they will all be Con_2_B samples from Week1). The Con_2_B samples from Week2 will have a 1 in the intercept, and a 1 in Week2, and zeros everywhere else, and the Con_2_B samples from Week3 will have a one in the intercept and a 1 in the Week3 column. So you are in essence saying that the expression of a gene from a Con_2_B sample from Week1 is Gene expression = mean(Con_2_B samples from week 1) + error and the same gene from a week 2 sample is Gene expression = mean(Con_2_B samples from week 1) + mean(Week2 samples - Week1 samples) + error And you are controlling for the differences between Week2 and Week1 so you can make comparisons between the different treatments, without biasing your results due to differences between the batches. Does that make sense? ADD COMMENTlink written 4.3 years ago by James W. MacDonald51k 1 You also need treatment before week in the formula, since only the first variable in an additive model benefits from the no-intercept parametrization. ADD REPLYlink written 4.3 years ago by Ryan C. Thompson7.4k Answer: EdgeR Multifactorial design with batch effect-trouble with defining contrasts 2 4.3 years ago by The Scripps Research Institute, La Jolla, CA Ryan C. Thompson7.4k wrote: Remember that when your design includes an intercept term, the first treatment group and week (TreatmentCon_2_B and Week1 respectively) are designated as the intercept, and then all the other coefficients represent differences from the intercept. It looks like you probably want a no-intercept design instead, using design <- model.matrix(~0 + Treatment + Week). This will give you a coefficient for each treatment effect, and then you can define contrasts between the treatments. ADD COMMENTlink modified 4.3 years ago • written 4.3 years ago by Ryan C. Thompson7.4k Also, to answer the second set of questions; if you want to compare between control heads and wasp heads across time points, you can do something like: con <- makeContrasts( (TreatmentWasp_2_H + TreatmentWasp_8_H)/2 - (TreatmentCon_2_H + TreatmentCon_8_H)/2, levels=design) This assumes that you're using the design suggested by Ryan. The idea here is to compare the average of the wasp-exposed heads across time points against the corresponding average for the control heads. The same can be done with bodies. Of course, an increase here doesn't guarantee that there's an increase in each time point. One can imagine a case where, say, the 2-hour time point has a large increase in wasp over control, whereas the 8-hour time point has no increase (or even a small decrease). This would still result in an increase in the averages. If you want to ensure that there's an increase at each time point, you'll need to compare wasp against control at each time point separately, and then intersect the results. ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by Aaron Lun25k Thank you so much Ryan, James and Aaron. Your responses are very helpful and informative. Just out of curiosity, is there any benefit to having an intercept design? And if I may ask one more question regarding design, I'm wondering if it makes more sense to analyze my head and body datasets separately? The genes being expressed in the head and bodies are very, very different and I will not be making comparisons in gene expression between the two body parts (not the biological question of interest). I'm not entirely sure if this is valid, however? Thanks again! ADD REPLYlink written 4.3 years ago by mouchkam20 In terms of the final results that you get, an intercept design is the same as a non-intercept design. The major difference lies in the ease of interpretation of the coefficients. In some cases, an intercept model will be easier to interpret, as the coefficients would represent log-fold changes over some baseline. In other cases, a non-intercept model is easier to interpret, as each coefficient represent the log-expression of a corresponding biological group in the experimental set-up. I use non-intercept designs as I think they're easier to interpret, but that's my personal preference. Your second question is harder to answer. In general, we try to analyze all of the samples together, in order to maximize the residual d.f. available for dispersion estimation. However, if the samples are very different, it will compromise TMM normalization (which assumes that the majority of genes are non-DE between pairs of samples). I'd recommend checking the MDS plots to see if the head-body difference dominates, and MA plots to see if you get a mass of genes with non-zero M-values between head and body samples. If this is the case, separate analyses may be justified. Note that there may be a way to have your cake and eat it too. You can subset your data into head and body samples, call calcNormFactors separately on each subset, and just use the resulting normalization factor estimate for each sample in the full analysis. This avoids violating TMM's non-DE assumption as you're not normalizing between head and body samples, while still getting residual d.f. from all samples. Of course, it also means that you can't compare between head and body samples in your DE contrasts (as they haven't been normalized to the same scale), but you weren't going to do that anyway. ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by Aaron Lun25k Thanks for the feedback Aaron. All of the head samples cluster together (except for the sample C2H2, which is driving me crazy) and all of the body samples cluster together and far from the head samples in the MDS plot. In addition, if you compare expression between heads and bodies, you get far more genes that are differentially expressed than are not (it looks as though the entire MA plot is red). (See plots here: https://www.dropbox.com/sh/9da571qs7v088v4/AADvoIlrDkitOoH7jwQ5_UgIa?dl=0). At any rate, it looks as though the best plan of action is perform separate analysis or use the sample-specific normalization factors. I guess I'm a little confused as to how to use tissue-specific normalization factors in the full analysis. The normalize factors are used in the dispersion estimations, so would I estimate two different versions of the dispersions, one for the head normalization and one for the body? For instance: H = calcNormFactors(heads) B = calcNormFactors(heads) (with the design code in here) H2 = estimateGLMCommonDisp(H, design, verbose=TRUE) B2 = estimateGLMCommonDisp(B, design, verbose=TRUE) (with further dispersion iterations here) Hfit <- glmFit(H2, design) Bfit = glmFit(B2, design) Would I then set-up the contrasts and call the glm using the separate fits? Wouldn't this essentially be analyzing the data separately since not all of the data is included in the dispersion estimates? Or is it possible to include all of the data in the dispersion estimates? Also...regarding my one 'outlier' sample. I have no biological reason as to why this sample (C2H2) is so different. When I compare control and treatment heads at 2 hrs, a large number of differentially expressed genes are identified. However, a large proportion of these genes are differentially expressed because the counts in C2H2 are orders of magnitude larger than the counts in the treatment reps (but also orders of magnitude larger than the counts between control reps, with the other reps having counts much closer to those of the treatment group). For example, this gene was identified as being significantly DE and here is the count data: gene_ID C2H1 C2H2 C2H3 W2H1 W2H2 W2H3  FBgn0085358 0 358 1 0 0 0 I would think the large amount of variability between the biological replicates would lead to non-significance, but this doesn't seem to be the case. Do you have any advice for when a single sample is driving differential expression? I don't trust this is a biological meaningful result if only one rep is showing the response. I also never know how much weight to give results that are statistically significantly, but have very low counts (i.e. is the difference between average counts of 2 and 6 really that 'different?'). Thanks again. ADD REPLYlink written 4.3 years ago by mouchkam20 1 Yes, it seems like the best course of action may be to analyze the head and body samples separately. Don't worry about the sample-specific normalization factors; on further reflection, it seems like it's too much effort (not to implement, but to explain in a paper) for not much gain. If you really want to know, this is how it's done: H <- calcNormFactors(heads) B <- calcNormFactors(bodies) all.factors <- numeric(ncol(full.data)) all.factors[is.head] <- H$samples$norm.factors all.factors[is.body] <- B$samples$norm.factors full.data$samples\$norm.factors <- all.factors


This assumes that you have a DGEList object of all samples named full.data, and is.head and is.body vectors indicating which columns are head and body samples. As you can see, the normalization factors are estimated separately for head and body, but the ensuing steps (dispersion estimation, etc.) are done with all samples.

As for your outlier; check the top most variable genes (i.e., largest tagwise dispersions) and see if they correspond to some shared biological process. In mouse data, this is often be immunoglobulin variable chains, or some ribosomal RNA genes, or sex-related genes (if replicates are of different sexes). If you can find a set of genes that is driving the outlier behaviour in the offending sample, it should be a simple matter of discarding all genes in that set to recover the sample.

In any case, to protect against outlier expression in your analysis, switch to the glmQLFit and glmQLFTest pipeline. Make sure you set robust=TRUE in glmQLFit. This will avoiding shrinking the dispersions for genes with strong outlier expression, such that they'll have large dispersions and won't be detected as DE. If this doesn't work, then it may be necessary to just drop the offending sample from your analysis.

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by Aaron Lun25k