Question

Limitations in edgeR?

0

Entering edit mode

Eleanor Su ▴ 110

@eleanor-su-6460

Last seen 9.7 years ago

Hi All, I'm currently trying to analyze differential expression of piRNAs in some small data sets but am coming across issues that I didn't before when I analyzed with a larger data set. The larger data set contained 324 piRNA identities while the smaller data set contained half as many piRNA identities. Is there a minimum number of gene identities required in order to analyze differential expression in edgeR? Best, Eleanor [[alternative HTML version deleted]]

• 1.0k views

ADD COMMENT • link updated 10.1 years ago by Gordon Smyth 50k • written 10.1 years ago by Eleanor Su ▴ 110

score 0 · Answer 1 · 2014-04-02

0

Entering edit mode

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi Eleanor, On Tue, Apr 1, 2014 at 11:09 AM, Eleanor Su <eleanorjinsu at="" gmail.com=""> wrote: > Hi All, > > I'm currently trying to analyze differential expression of piRNAs in some > small data sets but am coming across issues that I didn't before when I > analyzed with a larger data set. The larger data set contained 324 piRNA > identities while the smaller data set contained half as many piRNA > identities. Is there a minimum number of gene identities required in order > to analyze differential expression in edgeR? It's hard to help without knowing what the issues are that you are running into, so ... what's going wrong? One way you could explore this question yourself is to use the larger (324 piRNA) dataset that "went well" and simply take half of the data from it and rerun the same analysis on the smaller set. Do you get different results? While you're playing with that idea, please provide a follow up email with more specific details about what the issues are that you are running into with your new (smaller) dataset. HTH, -steve -- Steve Lianoglou Computational Biologist Genentech

ADD COMMENT • link 10.1 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Steve, I'm running the same analysis on both datasets (the larger and the smaller). When I rerun the analysis on the smaller dataset (which actually IS half of the identities from the larger data set), I come across an error message when estimating glm trended dispersion. Here are the commands I'm using: > rawdata<-read.delim("piRNAtotalcount>10.txt", check.names=FALSE, stringsAsFactors=FALSE) > y <- DGEList(counts=rawdata[,2:11], genes=rawdata[,1]) > Family<-factor(c(6,6,9,9,11,11,26,26,28,28)) > Treatment<-factor(c("C","H","C","H","C","H","C","H","C","H")) > data.frame(Sample=colnames(y),Family,Treatment) Sample Family Treatment 1 6C 6 C 2 6H 6 H 3 9C 9 C 4 9H 9 H 5 11C 11 C 6 11H 11 H 7 26C 26 C 8 26H 26 H 9 28C 28 C 10 28H 28 H > design<-model.matrix(~Family+Treatment) > rownames(design)<-colnames(y) > y<-estimateGLMTrendedDisp(y,design) Error in optim(par0, fun, y = y.nonzero[i, ], design = design, offset = offset.nonzero[i, : function cannot be evaluated at initial parameters I only encounter this error when running the smaller dataset. Best, Eleanor On Wed, Apr 2, 2014 at 9:49 AM, Steve Lianoglou <lianoglou.steve@gene.com>wrote: > Hi Eleanor, > > On Tue, Apr 1, 2014 at 11:09 AM, Eleanor Su <eleanorjinsu@gmail.com> > wrote: > > Hi All, > > > > I'm currently trying to analyze differential expression of piRNAs in some > > small data sets but am coming across issues that I didn't before when I > > analyzed with a larger data set. The larger data set contained 324 piRNA > > identities while the smaller data set contained half as many piRNA > > identities. Is there a minimum number of gene identities required in > order > > to analyze differential expression in edgeR? > > It's hard to help without knowing what the issues are that you are > running into, so ... what's going wrong? > > One way you could explore this question yourself is to use the larger > (324 piRNA) dataset that "went well" and simply take half of the data > from it and rerun the same analysis on the smaller set. Do you get > different results? > > While you're playing with that idea, please provide a follow up email > with more specific details about what the issues are that you are > running into with your new (smaller) dataset. > > HTH, > -steve > > -- > Steve Lianoglou > Computational Biologist > Genentech > [[alternative HTML version deleted]]

ADD REPLY • link 10.1 years ago Eleanor Su ▴ 110

0

Entering edit mode

Hi, On Wed, Apr 2, 2014 at 9:58 AM, Eleanor Su <eleanorjinsu at="" gmail.com=""> wrote: > Hi Steve, > > I'm running the same analysis on both datasets (the larger and the > smaller). When I rerun the analysis on the smaller dataset (which actually > IS half of the identities from the larger data set), I come across an error > message when estimating glm trended dispersion. Is it the same data as the bigger data set but just cut in half? Or is this a separate dataset where you only measured half of the things from a previous experiment? Do you know what I mean? If it's the former, what happens when you take the other half? :-) The error you get is hard for me to debug as I'm not familiar with the internals of the code working there and what edge cases that it might be susceptible to and all I could really do is provide (rather non insightful) strategies to help smoke out potential problems in your data itself ... could you take various subsets of your data to see if you can sidestep the problem? Also, are you running on bioc2.13 (R 3.0.3) and the latest version of edgeR? (providing the output of sessionInfo() at the end of posts looking for help is usually a good idea) If you are not running the latest and greatest, upgrade. Also, from the looks of your design matrix: is it true that you don't have any replication? Was it the same exact design matrix that worked your previous dataset? And neither here nor there, is "piRNAtotalcount>10.txt" really the name of your file? Although it may work for you (are you on windows?) it's not a bad idea to get into the habit of using more boring filenames (">" is the redirection operator on linux). HTH, -steve > Here are the commands I'm > using: > >> rawdata<-read.delim("piRNAtotalcount>10.txt", check.names=FALSE, > stringsAsFactors=FALSE) >> y <- DGEList(counts=rawdata[,2:11], genes=rawdata[,1]) >> Family<-factor(c(6,6,9,9,11,11,26,26,28,28)) >> Treatment<-factor(c("C","H","C","H","C","H","C","H","C","H")) >> data.frame(Sample=colnames(y),Family,Treatment) > Sample Family Treatment > 1 6C 6 C > 2 6H 6 H > 3 9C 9 C > 4 9H 9 H > 5 11C 11 C > 6 11H 11 H > 7 26C 26 C > 8 26H 26 H > 9 28C 28 C > 10 28H 28 H >> design<-model.matrix(~Family+Treatment) >> rownames(design)<-colnames(y) >> y<-estimateGLMTrendedDisp(y,design) > Error in optim(par0, fun, y = y.nonzero[i, ], design = design, offset = > offset.nonzero[i, : > function cannot be evaluated at initial parameters > > I only encounter this error when running the smaller dataset. > > Best, > Eleanor -- Steve Lianoglou Computational Biologist Genentech

ADD REPLY • link 10.1 years ago Steve Lianoglou ★ 13k

0

Entering edit mode

Hi Steve, It is indeed the former. I'll see what happens when I take the other half and do the analysis. I'll also look into the version of edgeR I'm running and determine if I need to upgrade. My replications are dictated by my treatments. I have 5 samples of a Treatment and 5 samples of a control. Technically I have only one treatment and control from each family, but I wanted to look at differential expression between Treatments and control with family as a blocking factor. Thus I followed the example in the edgeR user manual on page 56. Is this not the correct way of running the analysis? I also recently sent out an email about setting up a design matrix with multiple factors. Each of the families that I've listed has one of two mitochondrial haplotypes (Standard or Divergent). I want to set up a design matrix where I can look at the interaction between the Treatment (H and C) and mitochondrial haplotypes (S and D) with family as a blocking factor. These are the preliminary commands that I've set up: > rawdata<-read.delim("piRNAtotalcount.txt", check.names=FALSE, stringsAsFactors=FALSE) > y <- DGEList(counts=rawdata[,2:11], genes=rawdata[,1]) > Family<-factor(c(6,6,9,9,11,11,26,26,28,28)) > Treatment<-factor(c("C","H","C","H","C","H","C","H","C","H")) > mitoHap<-factor(c("S","S","S","S","S","S","D","D","D","D")) > data.frame(Sample=colnames(y),Family,Treatment,mitoHap) Sample Family Treatment mitoHap 1 6C (S) 6 C S 2 6H (S) 6 H S 3 9C (S) 9 C S 4 9H (S) 9 H S 5 11C (S) 11 C S 6 11H (S) 11 H S 7 26C (D) 26 C D 8 26H (D) 26 H D 9 28C (D) 28 C D 10 28H (D) 28 H D >design<-model.matrix( ) I've looked at other examples online especially one from Daniella in 2012, but I'm having a difficult time writing a design matrix that allows me to look at the interaction while still maintaining family as just a blocking factor. Any help would be greatly appreciated. Thank you for your time Steve. Best, Eleanor On Wed, Apr 2, 2014 at 11:43 AM, Steve Lianoglou <lianoglou.steve@gene.com>wrote: > Hi, > > On Wed, Apr 2, 2014 at 9:58 AM, Eleanor Su <eleanorjinsu@gmail.com> wrote: > > Hi Steve, > > > > I'm running the same analysis on both datasets (the larger and the > > smaller). When I rerun the analysis on the smaller dataset (which > actually > > IS half of the identities from the larger data set), I come across an > error > > message when estimating glm trended dispersion. > > Is it the same data as the bigger data set but just cut in half? Or is > this a separate dataset where you only measured half of the things > from a previous experiment? Do you know what I mean? > > If it's the former, what happens when you take the other half? :-) > > The error you get is hard for me to debug as I'm not familiar with the > internals of the code working there and what edge cases that it might > be susceptible to and all I could really do is provide (rather non > insightful) strategies to help smoke out potential problems in your > data itself ... could you take various subsets of your data to see if > you can sidestep the problem? > > Also, are you running on bioc2.13 (R 3.0.3) and the latest version of > edgeR? (providing the output of sessionInfo() at the end of posts > looking for help is usually a good idea) > > If you are not running the latest and greatest, upgrade. > > Also, from the looks of your design matrix: is it true that you don't > have any replication? Was it the same exact design matrix that worked > your previous dataset? > > And neither here nor there, is "piRNAtotalcount>10.txt" really the > name of your file? Although it may work for you (are you on windows?) > it's not a bad idea to get into the habit of using more boring > filenames (">" is the redirection operator on linux). > > HTH, > -steve > > > Here are the commands I'm > > using: > > > >> rawdata<-read.delim("piRNAtotalcount>10.txt", check.names=FALSE, > > stringsAsFactors=FALSE) > >> y <- DGEList(counts=rawdata[,2:11], genes=rawdata[,1]) > >> Family<-factor(c(6,6,9,9,11,11,26,26,28,28)) > >> Treatment<-factor(c("C","H","C","H","C","H","C","H","C","H")) > >> data.frame(Sample=colnames(y),Family,Treatment) > > Sample Family Treatment > > 1 6C 6 C > > 2 6H 6 H > > 3 9C 9 C > > 4 9H 9 H > > 5 11C 11 C > > 6 11H 11 H > > 7 26C 26 C > > 8 26H 26 H > > 9 28C 28 C > > 10 28H 28 H > >> design<-model.matrix(~Family+Treatment) > >> rownames(design)<-colnames(y) > >> y<-estimateGLMTrendedDisp(y,design) > > Error in optim(par0, fun, y = y.nonzero[i, ], design = design, offset = > > offset.nonzero[i, : > > function cannot be evaluated at initial parameters > > > > I only encounter this error when running the smaller dataset. > > > > Best, > > Eleanor > > -- > Steve Lianoglou > Computational Biologist > Genentech > [[alternative HTML version deleted]]

ADD REPLY • link 10.1 years ago Eleanor Su ▴ 110

score 0 · Answer 2 · 2014-04-04

Dear Eleanor, Well, a couple of comments. First, edgeR does not have a limitation on the number of genes it can run on. I suggest that you upgrade the most recent version of edgeR, which I suspect you do not have, and run y <- estimateDisp(y,design) Second, given that you have already analyzed the full set of piRNAs successfully, why in the world would you need to rerun the analysis on just half of them? This does seem like a self-inflicted problem. Gordon > Date: Wed, 2 Apr 2014 09:58:23 -0700 > From: Eleanor Su <eleanorjinsu at="" gmail.com=""> > To: Steve Lianoglou <lianoglou.steve at="" gene.com=""> > Cc: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> > Subject: Re: [BioC] Limitations in edgeR? > > Hi Steve, > > I'm running the same analysis on both datasets (the larger and the > smaller). When I rerun the analysis on the smaller dataset (which actually > IS half of the identities from the larger data set), I come across an error > message when estimating glm trended dispersion. Here are the commands I'm > using: > >> rawdata<-read.delim("piRNAtotalcount>10.txt", check.names=FALSE, > stringsAsFactors=FALSE) >> y <- DGEList(counts=rawdata[,2:11], genes=rawdata[,1]) >> Family<-factor(c(6,6,9,9,11,11,26,26,28,28)) >> Treatment<-factor(c("C","H","C","H","C","H","C","H","C","H")) >> data.frame(Sample=colnames(y),Family,Treatment) > Sample Family Treatment > 1 6C 6 C > 2 6H 6 H > 3 9C 9 C > 4 9H 9 H > 5 11C 11 C > 6 11H 11 H > 7 26C 26 C > 8 26H 26 H > 9 28C 28 C > 10 28H 28 H >> design<-model.matrix(~Family+Treatment) >> rownames(design)<-colnames(y) >> y<-estimateGLMTrendedDisp(y,design) > Error in optim(par0, fun, y = y.nonzero[i, ], design = design, offset = > offset.nonzero[i, : > function cannot be evaluated at initial parameters > > I only encounter this error when running the smaller dataset. > > Best, > Eleanor > > > > On Wed, Apr 2, 2014 at 9:49 AM, Steve Lianoglou <lianoglou.steve at="" gene.com="">wrote: > >> Hi Eleanor, >> >> On Tue, Apr 1, 2014 at 11:09 AM, Eleanor Su <eleanorjinsu at="" gmail.com=""> >> wrote: >>> Hi All, >>> >>> I'm currently trying to analyze differential expression of piRNAs in some >>> small data sets but am coming across issues that I didn't before when I >>> analyzed with a larger data set. The larger data set contained 324 piRNA >>> identities while the smaller data set contained half as many piRNA >>> identities. Is there a minimum number of gene identities required in >> order >>> to analyze differential expression in edgeR? >> >> It's hard to help without knowing what the issues are that you are >> running into, so ... what's going wrong? >> >> One way you could explore this question yourself is to use the larger >> (324 piRNA) dataset that "went well" and simply take half of the data >> from it and rerun the same analysis on the smaller set. Do you get >> different results? >> >> While you're playing with that idea, please provide a follow up email >> with more specific details about what the issues are that you are >> running into with your new (smaller) dataset. >> >> HTH, >> -steve >> >> -- >> Steve Lianoglou >> Computational Biologist >> Genentech ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}