Dear Bioconductors,
I am working on the differential gene expression among Drosophila species after parasitization with wasps. I am following the EdgeR tutorial, and recently updated the package to edgeR_3.8.6 (R version 3.2.2) where some changes were introduced for the estimation of dispersion and quasi-likelihood test. I am interested in both: 1) the (orthologous) genes that are commonly expressed in all species after parasitization as well as 2) species-specific genes. For that I mapped the reads (3 biological replicates per species) to each of the Drosophila species reference genome, did the counts using HTSeq-count and then replaced the species gene for its orthologous gene in D. melanogaster, in order to be able to compare the species. Here is an example of my counts and targets, and the code I used
head(sp.counts) sample28 sample44 sample59 sample25 sample46 sample58 sample21 FBgn0003034 0 0 0 0 0 0 0 FBgn0000055 0 0 1 0 0 0 1600 FBgn0000056 0 0 0 0 0 0 1 FBgn0000078 362 1 13 557 0 24 7 FBgn0001149 6364 4217 2969 7452 867 2917 2377 FBgn0005391 1 3 3 4 4 2 0 head(sp.targets) sample line treatment time Batch sample46 melC1 par 5 2 sample58 melC1 par 5 3 sample21 sec ctl 5 1 sample30 sec ctl 5 2 sample29 sim par 5 2 sample75 sim par 5 3 sample24 yak ctl 5 1 sample55 yak ctl 5 2
sp.group<-paste(sp.targets$line, sp.targets$treatment, sep=".")
head(sp.group) "melC1.par" "melC1.par" "sec.ctl" "sec.ctl" "sec.ctl" "sec.par" "sec.par" "sec.par" "sim.ctl" "sim.ctl" sp.design<-model.matrix(~0+sp.group+Batch,data=sp.targets) colnames(sp.design)<-c("mel.ctl", "mel.par", "sec.ctl", "sec.par", "sim.ctl", "sim.par", "yak.ctl", "yak.par","Batch2", "Batch3") sp.contrasts<-makeContrasts(Par=(mel.par+sec.par+sim.par+yak.par)-(mel.ctl+sec.ctl+sim.ctl+yak.ctl), MelxPar=(mel.par-mel.ctl) ,SecxPar=(sec.par-sec.ctl), SimxPar=(sim.par-sim.ctl), YakxPar=(yak.par-yak.ctl),levels=sp.design) sp.dge<-DGEList(counts=sp.counts, group=sp.group, genes=rownames(sp.counts)) keep<-rowSums(cpm(sp.dge)>5) >= 3 sp.dge.keep<-sp.dge[keep,] sp.dge.keep$samples$lib.size <- colSums(sp.dge.keep$counts) sp.dge.keep<-calcNormFactors(sp.dge.keep) sp.disp <- estimateDisp(sp.dge.keep, sp.design, robust=TRUE) sp.fit <- glmQLFit(sp.disp, sp.design, robust=TRUE) sp.qlf <- glmQLFTest(sp.fit, contrast=sp.contrasts) sp.FDR <- p.adjust(sp.qlf$table$PValue, method="BH")
Because I am comparing different specie, I have a large BCV (about 0.7), but the differences within species are much less than between species (see MDS plot).
I obtained 4 differentially expressed genes, which is way less than what I obtained with the previous EdgeR version (using Tagwise dispersion). I understand that part of the difference relies on the df.prior because in this new version the prior is estimated. However I tried to "force" a change in the priors to see how it affects my results, by doing the following:
fit$s2.fit$df.prior<-5
fit$s2.fit$df.prior<-10
And of course this changed the amount of differentially expressed genes to 5 (df.prior=5) and 20 (df.prior=10). I plotted the CPM for each species and treatment (See "CPM_allspp"red: parasitized, blue: controls; each dot is one replicate) and I find that the results I obtain with df.prior=10 is something I would still interested in because it shows species-specific differential expression. But obviously I am unsure as whether I am violating an important statistic principle by "manually" selecting prior values. I would very much appreciate an opinion on this as well as my procedure (e.g. design and example contrast matrix).
Figures
Its the first time I upload figures, so I'm not sure this will work. These are the url for the figures:
MDS plot: http://i.imgur.com/Fpo3yWY.png
CPM_3prior: http://i.imgur.com/CSrH6cI.png
CPM_5prior: http://i.imgur.com/HeNnauL.png
CPM_10prior: http://i.imgur.com/dhIUa1C.png
Thank you very much!
Laura Salazar
The latest version of
edgeR
is 3.12.0; you are two Bioconductor releases behind. Our advice/comments will be more relevant if you update your R setup and analysis to use the latest versions of all packages.