edgeR and DESeq2: model design and estimation of dispersion


Iddo Ben-dov ▴ 20

@iddo-ben-dov-6603

Last seen 9.5 years ago

Hi,

In both edgeR and DESeq2, estimation of dispersion precedes negative binomial GLM fitting.

My question is: can I use a design formula when estimating dispersion that is different from the formula used for GLM fitting? Specifically, I would like to use a simplified design when estimating dispersion and the full design for GLM fitting.

My motivation is that, with the full design, estimation of dispersion is too demanding for my computer and takes too long.

My dataset includes 400 mRNAseq profiles (~22,000 genes). There are 100 controls and 100 cases, and each was sampled twice, before and after the intervention.

Thus, the full design is:

~ group*intervention + individual:group (blocking factor)

As I mentioned, estimating dispersion with the above design is not practical, so I would like to simplify to:

~ group*intervention

and introduce the 'individual' blocking factor only for NB GLM fitting.

Is this statistically valid?

Appreciate any help,
Iddo

edgeR DESeq2 • 2.0k views

Updated 9.8 years ago by Ryan C. Thompson ★ 7.9k • written 9.9 years ago by Iddo Ben-dov ▴ 20


Michael Love 41k

@mikelove

Last seen 1 hour ago

United States

Hi Iddo,

I wouldn't recommend using one design for dispersion estimation and a different one for the GLM. One way to think about it: differences in counts that the GLM would attribute to the individual effect will instead show up as higher dispersion during dispersion estimation, so in general you would end up overly conservative by taking that approach.

As you have many samples and a large design matrix, you could try using linear models, as in voom/limma, which will be faster to fit.

Mike

On Thu, Jun 12, 2014 at 9:51 AM, Iddo Ben-dov wrote:
> my question is, can I use a design formula when estimating dispersion which is different from the formula used for GLM fitting?

Written 9.9 years ago by Michael Love 41k
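The over-conservatism argument above can be illustrated with a small simulation. This is only a rough sketch in base R: it uses ordinary linear models on log counts as a stand-in for the negative binomial dispersion estimators in edgeR and DESeq2, and every name and parameter value in it (`n_per_group`, the subject effect with standard deviation 1, the NB dispersion of 0.05, the mean of 100) is an assumption made for illustration, not something taken from the thread's data.

```r
set.seed(1)

# Simulated paired layout: 2 groups x 20 individuals x 2 time points.
# Individuals are numbered 1..20 within each group.
n_per_group <- 20
samples <- expand.grid(
  intervention = factor(c("before", "after"), levels = c("before", "after")),
  individual   = factor(seq_len(n_per_group)),
  group        = factor(c("control", "case"), levels = c("control", "case"))
)

# One baseline shift per subject (a group/individual pair), on the log scale.
subject <- interaction(samples$group, samples$individual, drop = TRUE)
subject_effect <- rnorm(nlevels(subject), mean = 0, sd = 1)

# Counts for a single hypothetical gene: NB noise around a subject-specific mean.
mu <- 100 * exp(subject_effect[as.integer(subject)])
counts <- rnbinom(nrow(samples), mu = mu, size = 1 / 0.05)
y <- log(counts + 0.5)

# Reduced design ignores the individual; full design blocks on it.
fit_reduced <- lm(y ~ group * intervention, data = samples)
fit_full    <- lm(y ~ group * intervention + individual:group, data = samples)

# Residual variance plays the role of "dispersion" in this proxy.
resid_var_reduced <- summary(fit_reduced)$sigma^2
resid_var_full    <- summary(fit_full)$sigma^2
print(c(reduced = resid_var_reduced, full = resid_var_full))
```

Under these assumptions the reduced design leaves the between-subject variation in the residuals, so its variance estimate is much larger than the one from the full design. Carrying such an estimate into a model that does include the individual term is the mismatch the answer warns about.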


Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

Hi,

The full design as you have specified it is not of full rank, so I would expect the dispersion estimation to fail with an error. This is because the individual factor is (I assume) nested within the group factor (i.e. every individual belongs to exactly one group). I think your situation is similar to a recent post on this list:

https://stat.ethz.ch/pipermail/bioconductor/2014-May/059579.html

In that case, again, there are multiple individuals in each of two groups with before and after treatments. My answer is here:

https://stat.ethz.ch/pipermail/bioconductor/2014-May/059587.html

You could do the same thing for your data, except that you don't have to do the duplicateCorrelation step because you don't have technical replicates. You can use the same design for limma or edgeR. I don't know if there is a way to specify this design for DESeq2.

-Ryan

Written 9.8 years ago by Ryan C. Thompson ★ 7.9k
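Whether `~ group*intervention + individual:group` is full rank depends on how the individuals are labelled. The sketch below checks this on a toy layout (2 groups x 3 individuals x 2 time points) using base R's `qr()` in place of limma's `is.fullrank()`. The helper names `make_samples` and `check_rank` and the toy sizes are assumptions for illustration only.

```r
# Build a toy sample table. If unique_ids is TRUE, individuals get distinct
# labels across groups (1..6); otherwise they are numbered 1..3 in each group.
make_samples <- function(unique_ids) {
  s <- expand.grid(
    intervention = c("before", "after"),
    individual = 1:3,
    group = c("control", "case"),
    stringsAsFactors = FALSE
  )
  if (unique_ids) {
    s$individual <- s$individual + 3 * (s$group == "case")
  }
  s$intervention <- factor(s$intervention, levels = c("before", "after"))
  s$group <- factor(s$group, levels = c("control", "case"))
  s$individual <- factor(s$individual)
  s
}

# Compare the number of design columns with the numerical rank of the matrix.
check_rank <- function(samples) {
  design <- model.matrix(~ group * intervention + individual:group, data = samples)
  c(columns = ncol(design), rank = qr(design)$rank)
}

rank_within <- check_rank(make_samples(unique_ids = FALSE))
rank_unique <- check_rank(make_samples(unique_ids = TRUE))
print(rank_within)
print(rank_unique)
```

With individuals numbered within each group, the number of columns equals the rank. With globally unique labels, `model.matrix()` creates columns for group/individual combinations that never occur, those columns are all zero, and the design is rank deficient.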


Hi Ryan,

I examined the design with:

    > is.fullrank(design)
    [1] TRUE

(In each group the individuals are numbered 1-100.)

This design was okay with voom(), which was indeed fast (as Michael and Gordon suggested), but with edgeR the estimation of dispersion took too long, and in DESeq2 it failed with an error.

Thank you,
Iddo

On Jun 16, 2014, at 1:40 AM, Ryan wrote:
> The full design as you have specified it is not of full rank, so I would expect the dispersion estimation to fail with an error.

Written 9.8 years ago by Iddo Ben-dov ▴ 20
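For readers who want to try the voom/limma route mentioned in this thread, here is a minimal sketch of what such a run could look like with the full design. It is not the code anyone in the thread ran. The count matrix is simulated null data, the sizes (10 individuals per group, 500 genes) are arbitrary assumptions, and the calls are skipped if limma and edgeR are not installed.

```r
set.seed(42)

# Toy paired layout with individuals numbered within each group.
samples <- expand.grid(
  intervention = factor(c("before", "after"), levels = c("before", "after")),
  individual   = factor(1:10),
  group        = factor(c("control", "case"), levels = c("control", "case"))
)
design <- model.matrix(~ group * intervention + individual:group, data = samples)

# Simulated counts with no real effects; only there to exercise the calls.
n_genes <- 500
counts <- matrix(
  rnbinom(n_genes * nrow(samples), mu = 200, size = 10),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), NULL)
)

voom_ran <- FALSE
if (requireNamespace("limma", quietly = TRUE) &&
    requireNamespace("edgeR", quietly = TRUE)) {
  dge <- edgeR::DGEList(counts = counts)
  dge <- edgeR::calcNormFactors(dge)

  # voom estimates the mean-variance trend and returns precision weights.
  v <- limma::voom(dge, design)
  fit <- limma::eBayes(limma::lmFit(v, design))

  # Interaction term: does the intervention effect differ between groups?
  top <- limma::topTable(fit, coef = "groupcase:interventionafter", number = 5)
  print(top)
  voom_ran <- TRUE
}
```

The same `design` matrix is used for the variance modelling step and for the linear model fit, which is consistent with the advice earlier in the thread not to mix designs.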
