Question

Comparison gene expression among different Drosophila species with EdgeR

0

Entering edit mode

lauraalazar • 0

@lauraalazar-9318

Last seen 8.1 years ago

Dear Bioconductors,

I am working on the differential gene expression among Drosophila species after parasitization with wasps. I am following the EdgeR tutorial, and recently updated the package to edgeR_3.8.6 (R version 3.2.2) where some changes were introduced for the estimation of dispersion and quasi-likelihood test. I am interested in both: 1) the (orthologous) genes that are commonly expressed in all species after parasitization as well as 2) species-specific genes. For that I mapped the reads (3 biological replicates per species) to each of the Drosophila species reference genome, did the counts using HTSeq-count and then replaced the species gene for its orthologous gene in D. melanogaster, in order to be able to compare the species. Here is an example of my counts and targets, and the code I used

head(sp.counts)

                   sample28 sample44 sample59 sample25 sample46 sample58 sample21
FBgn0003034        0        0        0        0        0        0        0
FBgn0000055        0        0        1        0        0        0     1600
FBgn0000056        0        0        0        0        0        0        1
FBgn0000078      362        1       13      557        0       24        7
FBgn0001149     6364     4217     2969     7452      867     2917     2377
FBgn0005391        1        3        3        4        4        2        0

head(sp.targets)

sample  line treatment time Batch

sample46 melC1       par    5     2
sample58 melC1       par    5     3
sample21   sec       ctl    5     1
sample30   sec       ctl    5     2
sample29   sim       par    5     2
sample75   sim       par    5     3
sample24   yak       ctl    5     1
sample55   yak       ctl    5     2

sp.group<-paste(sp.targets$line, sp.targets$treatment, sep=".")

head(sp.group)
"melC1.par" "melC1.par"  "sec.ctl"   "sec.ctl"   "sec.ctl"   "sec.par"   "sec.par"   "sec.par" "sim.ctl"   "sim.ctl"

sp.design<-model.matrix(~0+sp.group+Batch,data=sp.targets)
colnames(sp.design)<-c("mel.ctl", "mel.par", "sec.ctl", "sec.par", "sim.ctl", "sim.par", "yak.ctl", "yak.par","Batch2", "Batch3")

sp.contrasts<-makeContrasts(Par=(mel.par+sec.par+sim.par+yak.par)-(mel.ctl+sec.ctl+sim.ctl+yak.ctl), MelxPar=(mel.par-mel.ctl) ,SecxPar=(sec.par-sec.ctl), SimxPar=(sim.par-sim.ctl), YakxPar=(yak.par-yak.ctl),levels=sp.design)

sp.dge<-DGEList(counts=sp.counts, group=sp.group, genes=rownames(sp.counts))

keep<-rowSums(cpm(sp.dge)>5) >= 3
sp.dge.keep<-sp.dge[keep,]

sp.dge.keep$samples$lib.size <- colSums(sp.dge.keep$counts)

sp.dge.keep<-calcNormFactors(sp.dge.keep)

sp.disp <- estimateDisp(sp.dge.keep, sp.design, robust=TRUE)
sp.fit <- glmQLFit(sp.disp, sp.design, robust=TRUE)
sp.qlf <- glmQLFTest(sp.fit, contrast=sp.contrasts)      
sp.FDR <- p.adjust(sp.qlf$table$PValue, method="BH")

Because I am comparing different specie, I have a large BCV (about 0.7), but the differences within species are much less than between species (see MDS plot).
I obtained 4 differentially expressed genes, which is way less than what I obtained with the previous EdgeR version (using Tagwise dispersion). I understand that part of the difference relies on the df.prior because in this new version the prior is estimated. However I tried to "force" a change in the priors to see how it affects my results, by doing the following:

fit$s2.fit$df.prior<-5

fit$s2.fit$df.prior<-10

And of course this changed the amount of differentially expressed genes to 5 (df.prior=5) and 20 (df.prior=10). I plotted the CPM for each species and treatment (See "CPM_allspp"red: parasitized, blue: controls; each dot is one replicate) and I find that the results I obtain with df.prior=10 is something I would still interested in because it shows species-specific differential expression. But obviously I am unsure as whether I am violating an important statistic principle by "manually" selecting prior values. I would very much appreciate an opinion on this as well as my procedure (e.g. design and example contrast matrix).

Figures

Its the first time I upload figures, so I'm not sure this will work. These are the url for the figures:
MDS plot: http://i.imgur.com/Fpo3yWY.png

CPM_3prior: http://i.imgur.com/CSrH6cI.png

CPM_5prior: http://i.imgur.com/HeNnauL.png
CPM_10prior: http://i.imgur.com/dhIUa1C.png

Thank you very much!

Laura Salazar

edger • 1.3k views

ADD COMMENT • link updated 8.4 years ago by Aaron Lun ★ 28k • written 8.4 years ago by lauraalazar • 0

0

Entering edit mode

The latest version of edgeR is 3.12.0; you are two Bioconductor releases behind. Our advice/comments will be more relevant if you update your R setup and analysis to use the latest versions of all packages.

ADD REPLY • link 8.4 years ago Aaron Lun ★ 28k

score 1 · Answer 1 · 2015-12-04

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 25 minutes ago

The city by the bay

Several points in no particular order:

From the looks of it, each species/treatment combination is parametrized as a separate group. This means that your large BCV is probably not being driven by differences between species, but instead, by differences between replicates within each species/treatment combination. Which is surprising, actually, because the MDS plot looks fairly tight (but you don't colour according to treatment, so for all we know, the different treatments within each species might be mixed together).
You should divide your Par contrast by three, in order to compute the average log-fold change between treatments across the three species. Otherwise, the reported log-fold change will be the sum of the log-fold changes across those species, which doesn't seem particularly useful.
Manually setting the prior d.f. in the QL method makes no sense, because the whole point of the QL methodology is to model the observed variability in the dispersions. If you do this manually, you won't be able to interpret the p-values as being measures of significance, because they won't have any absolute meaning. If you want to get more DE genes, then the better way to do it would be to use the existing p-values from the QL framework and accept a higher FDR. Your DE list will then be less reliable, because it'll contain more false discoveries - but at least you can accurately quantify this unreliability (because you know the FDR). In contrast, if you fiddle with the prior d.f., who knows what the FDR would be?

Also, if you're going to show images, the more useful plots would be those from plotBCV and plotQLDisp.

ADD COMMENT • link 8.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thank very much Aaron for your reply and suggestions. I now plotted the MDS colouring the treatments, "par" (red) and "ctl" (blue). It does indeed show large difference among replicates. Since my biological replicates were sequenced in 3 different batches I tried to correct for batch effects in the design, is that the correct way to take batch effect into account? And maybe related to the same question, I find 167 genes with p-value<0.05, but only 4 genes with FDR<0,05 and the distribution of p-values is strongly skewed (figure attached). Is this "normal" behaviour ?

Just to clarify, I'm not trying to maximize the amount of DE genes, just to find a method with enough sensitivity to detect differential expression, which I (maybe naively) see in the CPM plots.

With respect to your suggestion of dividing Par by 3, did you actually mean by 4 (i.e. number of species)?, like this:

sp.contrasts<-makeContrasts(Par=(mel.par+sec.par+sim.par+yak.par)/4-(mel.ctl+sec.ctl+sim.ctl+yak.ctl)/4, design=sp.design)

I now also include the plotBCV and plotQLDisp:

MDS: http://imgur.com/mcvMNmK

BCV: http://imgur.com/InIM4fj

QLDisp: http://imgur.com/6R64zHH

Pvalues: http://imgur.com/E3WHcdv

Thank you again!

ADD REPLY • link 8.4 years ago lauraalazar • 0

0

Entering edit mode

Looking at the BCV plot indicates that most of your genes have relatively reasonable dispersions (BCV < 0.5). However, there's a substantial number of genes with very large BCVs, and that's pulling up the trended value. It's also contributing to a low prior d.f. estimate, which is probably responsible for the skew towards large p-values.

I would suggest looking closely at the expression pattern across all of your samples for these genes, and seeing if you have a problematic sample that is inflating the dispersions. This might not show up on a MDS plot, mostly because you have really big differences between species; outliers get pushed back into their group when the scale of the plot increases, but they're still present and will affect dispersion estimation. You'll need to examine the counts directly to pick this up.

If there are no problematic genes, then it just means that your dispersions are highly variable across genes. In such cases, edgeR is correct to use a low prior d.f., because you can't rely on information from one gene to estimate the dispersion for another gene (as it would probably be different). This should usually result in a uniform p-value distribution, but in your case, it seems that the QL dispersions are spread out in a manner that makes it difficult to model accurately with an F-distribution. This is probably why you end up with a funny-looking left-skewed distribution of p-values across all of your genes.

Other comments:

Yes, your approach to batch effects looks fine, provided that Batch is a factor and that it doesn't confound the species/treatment groupings. For example, if your treatment for one species is in one batch and your control is in another, that makes the comparison a bit sketchy (though not entirely impossible, depending on the other samples).
Don't bother looking at the raw p-values unless you know what you're doing. If you want to identify DE genes, focus on the FDRs. These will always be greater than or equal to the raw p-values, so you'll always get fewer significant genes.
Looking at potential DE genes by eye is always overly optimistic, unless you're capable of correcting for multiple testing in your head. In your case, you've got the additional problem of high variability (and high variability in the variability) which increases the likelihood of getting convincing DE patterns just by chance for some non-DE genes.
Yes, use the number of species.

ADD REPLY • link 8.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Very useful comments, thanks a lot!

ADD REPLY • link 8.4 years ago lauraalazar • 0

0

Entering edit mode

I have a follow-up question to the original, I hope its ok to post it here without starting a new post:

The above analysis with all species led to about 13 genes with significant differential expression. Next I tried to analyse each species separated, contrasting the parasitized and control samples (3 replicates per treatment), but I did not get any significant genes. I expected that at least genes that had obvious differences in counts between treatments would show up, for example the 13 that were significant when considering all species together:

                 ctl       ctl      ctl      par      par    par
FBgn0001216       28        6        8       57        4       17
FBgn0010381        2        2        1      115        0       32
FBgn0012042        7        6        2       52        1       49
FBgn0014865        4        1        1      212        1       15
FBgn0028396        0        1        6     1145      315      635
FBgn0032638       22        4       22      892      135     1495
FBgn0033821      173       37        7      712      133      392
FBgn0034407        4        1        3      261       12       36
FBgn0034512       25        4       12      309        9      185
FBgn0041183       37       38       16      517      206     2914
FBgn0041579        3        1        1       96        1       89
FBgn0043578       60        9       18      319        8       72
FBgn0053462        2        1        1      107       32      136

So I did the following experiment: I took this list (13 genes) and started added random genes (100, 1000, 2000, 5000, 6000) to see the effect on the differential expression. My results showed that with 100 genes I recover 11 DE of these 13, with 1000 the DE reduces to 6, with 2000 to 4, with 5000 to 3 and with 6000 to 0. D. melanogaster has more than 10000 genes, so this means that I would never be able to pick up few genes with differential expression from my treatment, even when the counts are consistently higher for one treatment compared to the other (e.g., FBgn0028396, FBgn0032638, FBgn0033821, FBgn334407, FBgn0041183, FBgn0053462 in the table above). Is there a sort of minimum proportion of DE genes that need to be different in order to be identified? What happens in cases where there are few genes but with consistent differences between treatments, shouldn't they be identified as DE?

ADD REPLY • link 8.1 years ago lauraalazar • 0

0

Entering edit mode

Okay, first things first. I'll assume the only thing you've changed is the contrast (i.e., to compare between parasitized and control within species, versus comparing the average across species between parasitized and control). Now, imagine you have a small but consistent change between parasitized and control conditions in each species. If you try to compare within each species, you mightn't have enough evidence to reject the null hypothesis, but if you compare the averages, you can take advantage of the small amount of information in all species to reject the overall null across species. This may explain the differences between your within-species and across-species comparisons.

Furthermore, what seems to be "obvious" to you may be less so when viewed more quantitatively. Library size considerations aside, you can see that middle par sample (and to a lesser extent, the first ctl sample) is quite different from its replicates. This results in large dispersion estimates, which mitigate any large log-fold change when considering significant DE in your analysis. In fact, if the dispersions are large enough, you can easily generate large differences between groups by random chance. To convince yourself, try hitting rnbinom(6, mu=100, size=2) a few thousand times, and you'll eventually get a couple of genes with convincing (but completely spurious) DE between the first and last three counts.

As for your second observation, the more random genes you add, the less DE you will find as the multiple testing correction becomes more severe with an increasing number of hypotheses. This is an inevitability of testing thousands of hypotheses in genomics experiments, and is one of the reasons for independent filtering prior to analysis. If you want to detect subtle changes, then you'll have to increase the power of your study to overcome the correction penalty - most easily by increasing the number of samples, but also by reducing variability between replicates and/or increasing the effect of the treatment.

Finally, there is no requirement for a minimum proportion of DE genes. If a gene has a low enough p-value to offset the correction factor, it will be detected. This includes genes with consistent differences across species, if those differences are strong enough and/or the dispersions are low enough. That said, if you have more DE genes in your experiment, the BH correction will be less severe (simply because at a given FDR threshold, the allowable number of false positives increases with an increase in the total number of DE genes, so we can afford to be less stringent).

ADD REPLY • link 8.1 years ago Aaron Lun ★ 28k