Hello,
I'm using EdgeR to identify differentially expressed transcripts from RNASeq data. I have a fairly complicated experimental design with multiple time points, treatments, and tissue types. In addition, because of limited amounts of experimental material, the replicates were collected at three different time points, so I'm also trying to include batch effects in my model. I think I have correctly set up my design matrix, but I'm having trouble defining my contrasts and thus making the appropriate comparisons between treatments groups that are most biologically relevant.
As a bit more biological information, we are interested in identifying genes that play an important role in a behavioral immune response of fruit flies to their parasitic wasps. We exposed flies to wasps (or no wasps as controls) for 2 or 8 hrs. We also separated the fly heads from the bodies to look at 'tissue' specific differences. Therefore, for each time point, we have heads and bodies that are either control or wasp-exposed, with 3 replicates of each. As previously mentioned, the replicates are independent and were collected on separate weeks. As for the naming convention in the condition, Con = control, Wasp = wasp-exposed, 2 = 2 hrs, 8 = 8hrs, B = boides, H = heads, such that rep Con_2_B is fly control bodies collected at 2 hrs. At this point, I'm most interested in comparing control vs. wasp-exposed heads or bodies for each time period, accounting for any batch effects due to collection week (e.g. Con_2_B vs. Wasp_2_B).
I have the count data in 24 separate files and I also have a metadata file, so I used readDGE function to combine the count data. Below is my code, as well as head() outputs for a few of vectors so that you have an idea of the data, etc.
> samples=read.csv("EdgeR_fly_tophat_metadata.csv", header=TRUE) > head(samples) LibraryName Condition Week countf 1 C2B1 Con_2_B 1 C2B1_accepted_hits_counts.txt 2 C2B2 Con_2_B 2 C2B2_accepted_hits_counts.txt 3 C2B3 Con_2_B 3 C2B3_accepted_hits_counts.txt 4 C2H1 Con_2_H 1 C2H1_accepted_hits_counts.txt 5 C2H2 Con_2_H 2 C2H2_accepted_hits_counts.txt 6 C2H3 Con_2_H 3 C2H3_accepted_hits_counts.txt > samples$countf=paste(samples$LibraryName, "_accepted_hits_counts.txt", sep="") > counts=readDGE(samples$countf)$counts > noint = rownames(counts) %in% + c("__no_feature","__ambiguous","__too_low_aQual", + "__not_aligned","__alignment_not_unique") > cpms = cpm(counts) > keep = rowSums(cpms>1)>=3 & !noint > counts = counts[keep,] > colnames(counts) = samples$LibraryName > head( counts[,order(samples$Condition)], 5 ) C2B1 C2B2 C2B3 C2H1 C2H2 C2H3 C8B1 C8B2 C8B3 C8H1 C8H2 C8H3 FBgn0000008 635 540 685 589 759 635 509 473 504 666 750 825 FBgn0000014 281 235 287 0 44 9 292 265 270 0 0 0 FBgn0000015 117 98 117 0 15 9 137 116 126 1 0 0 FBgn0000017 3599 3036 3303 3388 3891 3240 3727 3093 3386 3593 4053 4465 FBgn0000018 339 338 336 177 253 187 270 309 331 221 232 259 W2B1 W2B2 W2B3 W2H1 W2H2 W2H3 W8B1 W8B2 W8B3 W8H1 W8H2 W8H3 FBgn0000008 592 602 593 753 743 732 642 507 546 783 810 634 FBgn0000014 265 374 263 0 0 1 303 250 295 0 0 0 FBgn0000015 123 126 97 1 0 0 121 104 107 0 1 1 FBgn0000017 3334 3331 2730 3945 3713 3703 4130 2953 3247 3794 3584 3432 FBgn0000018 351 339 344 198 238 217 335 288 315 259 270 209 > d = DGEList(counts = counts, group = samples$Condition) > d2= calcNormFactors(d) > plotMDS(d2) > Treatment = factor(samples$Condition) > Treatment [1] Con_2_B Con_2_B Con_2_B Con_2_H Con_2_H Con_2_H Con_8_B [8] Con_8_B Con_8_B Con_8_H Con_8_H Con_8_H Wasp_2_B Wasp_2_B [15] Wasp_2_B Wasp_2_H Wasp_2_H Wasp_2_H Wasp_8_B Wasp_8_B Wasp_8_B [22] Wasp_8_H Wasp_8_H Wasp_8_H 8 Levels: Con_2_B Con_2_H Con_8_B Con_8_H Wasp_2_B ... Wasp_8_H > Week = factor(samples$Week) > Week [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Levels: 1 2 3
> d3 = d = DGEList(counts = counts, group = Treatment)
> design <- model.matrix(~Week+Treatment)
> colnames(design)
[1] "(Intercept)" "Week2" "Week3"
[4] "TreatmentCon_2_H" "TreatmentCon_8_B" "TreatmentCon_8_H"
[7] "TreatmentWasp_2_B" "TreatmentWasp_2_H" "TreatmentWasp_8_B"
[10] "TreatmentWasp_8_H"
> rownames(design) <- colnames(d2)
> d3 <- estimateGLMCommonDisp(d2, design, verbose=TRUE)
> d4 <- estimateGLMTrendedDisp(d3, design)
> d5<- estimateGLMTagwiseDisp(d4, design)
> plotBCV(d5)
> fit <- glmFit(d5, design)
> my.contrasts = makeContrasts(
+ Con2BvWasp2B = TreatmentWasp_2_B - TreatmentCon_2_B,
+ Con2HvWasp2H = TreatmentWasp_2_H - TreatmentCon_2_H,
+ Con8BvWasp8B = TreatmentWasp_8_B - TreatmentCon_8_B,
+ Con8HvWasp8H = TreatmentWasp_8_H - TreatmentCon_8_H,
+ Con2BvCon2H = TreatmentCon_2_H - TreatmentCon_2_B,
+ Wasp2BvWasp2H = TreatmentWasp_2_H - TreatmentWasp_2_B,
+ levels = design)
Error in eval(expr, envir, enclos) : object 'TreatmentCon_2_B' not found
In addition: Warning message:
In makeContrasts(Con2BvWasp2B = TreatmentWasp_2_B - TreatmentCon_2_B, :
Renaming (Intercept) to Intercept
> sessionInfo() R version 3.1.2 (2014-10-31) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] splines stats graphics grDevices utils datasets [7] methods base other attached packages: [1] edgeR_3.8.6 limma_3.22.7 loaded via a namespace (and not attached): [1] tools_3.1.2
I'm very new to glms, especially for such complicated designs. I tried to refer to the EdgeR manual, which is very informative and helpful, but I think I'm confused as to the intercept term and how to define the contrasts when using an additive linear model with week as a batch effect AND having multiple factors (the examples tend to have one or the other). For instance, there is no column in the design matrix for Con_2_B (why?), as I thought the intercept term represents Week 1. It seems from the warning message that the intercept also includes Con_2_B? I know there are multiple ways with with to make comparisons, but the makeContrasts method makes the most intuitive sense to me. I'm obviously doing something wrong, however. Also, it may be nice at some point to combine conditions and make comparisons, for instance combining the 2 and 8 hrs time points so that you can simply compare control and wasp-exposed heads (or bodies). How would you define contrasts for these type of comparisons? Any help would be much appreciated.
Thanks!
Use the latest version of the software. We're up to Bioconductor 3.1 now, running on R 3.2.0.