edgeR problem

Nima Rafati ▴ 20

@nima-rafati-5795

Last seen 10.5 years ago

Dear all,

I have RNA-seq libraries from 12 individuals in two groups (6 replicates each). I would like to run a differential expression analysis using a GLM with the effects of group and sex on the transcripts. I followed the manual, and in the last step, the dispersion calculation (estimateGLMCommonDisp), I received a high value along with warnings. Here are all the commands I used:

    data.TMM<-read.table("Mod-H-transcripts.ount.matrix.TMM_normalized.FPKM",row.names=1,header=T)
    sex<-factor(c("M","F","M","F","M","F","M","F","M","F","F","M"))
    grp<-factor(c("W","W","W","W","W","W","D","D","D","D","D","D"))
    y.TMM<-DGEList(count=data.TMM.new,group=group.D.W)
    data.frame(Sample=colnames(y.TMM),grp,sex)
    design<-model.matrix(~grp+sex)
    rownames(design)<-colnames(y.TMM)
    y.TMM <- estimateGLMCommonDisp(y.TMM, design, verbose=TRUE)

    Disp = 3.99994 , BCV = 2
    There were 50 or more warnings (use warnings() to see the first 50)

Despite the warnings, is the estimated dispersion reliable? Can I continue with the analysis?

Best regards,
Nima

ADD COMMENT • link updated 12.1 years ago by h.soueidan@nki.nl ▴ 30 • written 12.1 years ago by Nima Rafati ▴ 20
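
For readers following along, the design described in the question can be inspected with base R alone. This is only a minimal sketch that reuses the `sex` and `grp` factors from the post; it does not need edgeR and does not touch the count data. It shows which coefficients `model.matrix(~ grp + sex)` produces and that the two factors are balanced against each other.

```r
# Factors copied from the question
sex <- factor(c("M","F","M","F","M","F","M","F","M","F","F","M"))
grp <- factor(c("W","W","W","W","W","W","D","D","D","D","D","D"))

# Additive design: intercept, group effect, sex effect.
# Factor levels are sorted alphabetically, so "D" and "F" are the reference levels.
design <- model.matrix(~ grp + sex)

colnames(design)   # names of the fitted coefficients
dim(design)        # one row per sample, one column per coefficient
table(grp, sex)    # how many samples fall in each group/sex cell
```

If "W" should be the baseline instead, `relevel(grp, ref = "W")` before building the design would change the reference level.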

h.soueidan@nki.nl ▴ 30

@hsoueidannkinl-5657

Last seen 10.5 years ago

Hi Nima,

I have never seen such a high value for the BCV. In my analyses (mouse and human RNA-seq), the BCV is usually well below 1. From the name of your data file, it looks like you have normalized FPKM data. edgeR expects raw count data (integers), so that might be causing the problem. Could you post the head of your data.TMM data frame? Could you also 1) provide the output of sessionInfo() and 2) post some of the warnings?

Regards,
Sam.

ADD COMMENT • link 12.1 years ago h.soueidan@nki.nl ▴ 30

Dear Hayssam,

Thanks for your reply. I used Trinity and followed the instructions on their website:

1- I generated a matrix of counts from all 12 samples.
2- "Using the counts.matrix file created above, perform TMM normalization and generate the FPKM values per transcript and sample as follows:
   $TRINITY_HOME/Analysis/DifferentialExpression/run_TMM_normalization_write_FPKM_matrix.pl --matrix counts.matrix --transcript_lengths feature_lengths.txt" (from the Trinity website).
3- Then I followed the code from the edgeR manual and ended up with the high values I posted.

BUT I also tried the original count.matrix (raw count data), without running it through the script above, and received the same dispersion and BCV values. Here is the head of my count.matrix:

                            ERR162262  ERR162225  ERR162226  ERR162215  ERR162243  ERR162235  ERR162224  ERR162219  ERR162218  ERR162266  ERR162239  ERR162263
Contig8320                      43.00      71.00      44.21      39.00      35.00      25.00      18.00      19.92      28.00      28.00       7.00      37.00
comp28560_c2_seq1-len=504      239.00     231.00     239.00     214.00     223.00     155.00     211.00     203.00     212.00     225.00      11.00     294.00
comp36723_c0_seq1-len=635       83.67      79.02      38.28      52.13      72.07      27.00      46.88      55.23      51.12      46.00      24.50      63.12
comp24326_c0_seq2-len=1093      18.00      16.00       9.00      23.00      30.00      18.00      30.00      28.00      12.00      17.00      70.00      42.00

You also asked about the warnings:

Warning messages:
1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
  non-integer x = 0.940000
2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
  non-integer x = 0.500000
3: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
  non-integer x = 0.010000
4: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
  non-integer x = 0.410000
5: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
  non-integer x = 1.740000

I appreciate your help,
Regards,
Nima

ADD REPLY • link 12.1 years ago Nima Rafati ▴ 20
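
The warnings quoted above can be reproduced with base R, which may help show why they matter. The sketch below calls `dnbinom()` directly with one of the non-integer values from the warning list; the dispersion of 0.1 and the mean of 1 are arbitrary illustrative values, not taken from the data, and this does not reproduce edgeR's internal estimation.

```r
# Illustrative (assumed) parameters, not estimated from the thread's data
dispersion <- 0.1
mu <- 1

# A non-integer "count" triggers the same warning as in the thread and
# has zero probability under the negative binomial, i.e. -Inf on the log scale.
msg <- tryCatch(
  dnbinom(0.94, size = 1 / dispersion, mu = mu, log = TRUE),
  warning = function(w) conditionMessage(w)
)
ll_nonint <- suppressWarnings(
  dnbinom(0.94, size = 1 / dispersion, mu = mu, log = TRUE)
)

# An integer count gives a finite log-probability and no warning.
ll_int <- dnbinom(1, size = 1 / dispersion, mu = mu, log = TRUE)

print(msg)
print(c(non_integer = ll_nonint, integer = ll_int))
```

A log-likelihood of `-Inf` for any observation gives a likelihood-based fit nothing meaningful to maximize, which is one plausible reason a dispersion estimate computed alongside such warnings should not be trusted.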

Hi Nima,

The head of your data frame shows that you have both integer and non-integer values (e.g. the last sample of comp36723_c0_seq1). These non-integer values are causing the bad estimate of the BCV as well as the warnings. You can check how many non-integer values you have with, e.g.:

    table(floor(count.matrix)==count.matrix)

Trinity is supposed to play well with edgeR (see [1]). How did you run Trinity?

[1] http://trinityrnaseq.sourceforge.net/analysis/diff_expression_analysis.html

ADD REPLY • link 12.1 years ago h.soueidan@nki.nl ▴ 30
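
As a worked illustration of the check suggested in this reply, the sketch below applies it to the four rows of count.matrix posted earlier in the thread. The object name `count.head` is made up for this example; the values are copied from the post, and only base R is used.

```r
# The four rows of count.matrix shown in the thread, entered by hand
samples <- c("ERR162262", "ERR162225", "ERR162226", "ERR162215",
             "ERR162243", "ERR162235", "ERR162224", "ERR162219",
             "ERR162218", "ERR162266", "ERR162239", "ERR162263")
transcripts <- c("Contig8320", "comp28560_c2_seq1-len=504",
                 "comp36723_c0_seq1-len=635", "comp24326_c0_seq2-len=1093")
count.head <- matrix(
  c( 43.00,  71.00,  44.21,  39.00,  35.00,  25.00,  18.00,  19.92,  28.00,  28.00,  7.00,  37.00,
    239.00, 231.00, 239.00, 214.00, 223.00, 155.00, 211.00, 203.00, 212.00, 225.00, 11.00, 294.00,
     83.67,  79.02,  38.28,  52.13,  72.07,  27.00,  46.88,  55.23,  51.12,  46.00, 24.50,  63.12,
     18.00,  16.00,   9.00,  23.00,  30.00,  18.00,  30.00,  28.00,  12.00,  17.00, 70.00,  42.00),
  nrow = 4, byrow = TRUE, dimnames = list(transcripts, samples)
)

# TRUE where the value is a whole number, FALSE where it is fractional
is_int <- floor(count.head) == count.head

table(is_int)                              # overall tally, as in the reply
rowSums(!is_int)                           # fractional values per transcript
rownames(count.head)[rowSums(!is_int) > 0] # transcripts with fractional values
```

Breaking the tally down by row shows which transcripts carry fractional values, which can help trace them back to the step that produced the matrix.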
