edgeR on microRNA data

Gordon Smyth 52k

@gordon-smyth

Last seen 9 hours ago

WEHI, Melbourne, Australia

Dear Helena,

Compared with mRNA-Seq, you have an unusually small number of transcripts but a relatively large number of biological replicates. This suggests that you should use a relatively small value for prior.n but a relatively large value for prop.used. I am concerned that you have decreased prop.used from its default value of 0.3. I would tend to increase this rather than decrease it.

On the other hand, you have increased prior.n from its default value, which for your data would be a little over 0.5. Is this simply because it gave better looking results? Anyway, increasing prior.n does not result in overfitting. The risk with larger prior.n is simply that it may start to return differentially expressed miRs that are increased or decreased in only a few of the samples, rather than consistently for all samples in a group.
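The figure of "a little over 0.5" is consistent with a default that fixes the prior at a set number of degrees of freedom and divides by the residual degrees of freedom (samples minus groups). That rule, the value of 20 prior degrees of freedom, and the helper name `default_prior_n` are assumptions made for illustration; none of them is stated in this thread.

```r
# Hypothetical helper: prior.n implied by a fixed number of prior degrees of freedom.
# Assumptions (not stated in the thread): prior_df = 20 and
# residual df = number of samples - number of groups.
default_prior_n <- function(n_samples, n_groups, prior_df = 20) {
  residual_df <- n_samples - n_groups
  prior_df / residual_df
}

# Design described in the question: 10 + 15 + 15 samples in 3 groups.
prior_n_40 <- default_prior_n(n_samples = 40, n_groups = 3)
print(round(prior_n_40, 2))  # 20 / 37, i.e. 0.54
```

Under the same assumption, more replicates give a smaller prior.n, so each miR's own data carries more weight relative to the prior.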

Your experience with prior.n is unintuitive to me. Generally speaking, choosing prior.n small means that each miR gets to set its own dispersion, so that miRs with large variance will not appear in the topTags list. When you say "variance outliers", do you mean large or small variance?

Since your minimum group sample size is 10, I would have required miRs to satisfy your cpm requirement in >= 10 samples rather than 5.
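The filtering rule above can be sketched in base R as follows. This is only an illustration: the count matrix is simulated, and the object names (`counts`, `cpm`, `keep`, `filtered`) are assumptions rather than anything from the original analysis.

```r
# Simulated counts stand in for the real miRNA table (assumption):
# 600 miRs by 40 samples, with mean expression spread over a wide dynamic range.
set.seed(1)
n_genes <- 600
n_samples <- 40
mu <- 10^runif(n_genes, -2, 5)
counts <- matrix(rnbinom(n_genes * n_samples, mu = rep(mu, n_samples), size = 4),
                 nrow = n_genes, ncol = n_samples)

# Counts per million: scale each column by its library size.
lib_size <- colSums(counts)
cpm <- t(t(counts) / lib_size) * 1e6

# Keep miRs with cpm >= 0.2 in at least as many samples as the smallest group.
min_group_size <- 10
keep <- rowSums(cpm >= 0.2) >= min_group_size
filtered <- counts[keep, , drop = FALSE]

print(dim(filtered))
```

Tying `min_group_size` to the smallest group means a miR expressed in only one group can still pass the filter.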

Best wishes
Gordon

> Date: Thu, 29 Sep 2011 05:25:14 +0000
> From: Helena Persson <helena.persson at ki.se>
> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
> Subject: [BioC] edgeR on microRNA data
>
> Hi,
>
> I would be grateful for some input on using edgeR for small RNA sequence data. I have been testing edgeR on a set of miRNA data (3 groups with n=10, 15 and 15). After removing genes that are not expressed at >= 0.2 cpm in >= 5 samples I have ~600 rows left. I tried calculating the tagwise dispersion estimate with:
>
> cds1 <- estimateTagwiseDisp(cds1, prior.n=2, trend=TRUE, prop.used=0.1, grid=FALSE)
>
> Increasing the prior to e.g. 10 gives more differentially expressed genes that do not look bad. Decreasing the prior to 0 leaves me with extremely few differentially expressed genes that are mainly variance outliers. I guess that miRNA data is likely to behave differently from mRNA data since there are so few genes (but still a very large dynamic range). Is it possible that I am over-fitting the estimate? Would you recommend changing any other parameters?
>
> Best regards,
> Helena
> _________________________________
>
> Helena Persson, PhD
>
> Karolinska Institutet
> Dept of Biosciences and Nutrition
> Hälsovägen 7-9
> SE-141 83 Huddinge
> Sweden
>
> Helena.Persson at ki.se

microRNA edgeR miRNA • 3.1k views

ADD COMMENT • link 13.6 years ago • updated 3.9 years ago Gordon Smyth 52k

Gordon Smyth 52k

Dear Helena,

How large are the common and tagwise dispersions for your data?

Best wishes
Gordon

ADD COMMENT • link 13.6 years ago • updated 3.9 years ago Gordon Smyth 52k

Gordon Smyth 52k

Works fine for me. My guess is that you can't install any Bioc package, not just edgeR.

Gordon

On Mon, 3 Oct 2011, Helena Persson wrote:

> Well, I installed the devel version of R, but if I then do:
>
>     source("http://www.bioconductor.org/biocLite.R")
>     biocLite("edgeR")
>
> I get the error message:
>
>     Using R version 2.14.0, biocinstall version 2.8.4.
>     Installing Bioconductor version 2.8 packages:
>     [1] "edgeR"
>     Please wait...
>     Warning: unable to access index for repository http://bioconductor.org/packages/2.8/bioc/bin/macosx/leopard/contrib/2.14
>     Warning: unable to access index for repository http://bioconductor.org/packages/2.8/data/annotation/bin/macosx/leopard/contrib/2.14
>     Warning: unable to access index for repository http://bioconductor.org/packages/2.8/data/experiment/bin/macosx/leopard/contrib/2.14
>     Warning: unable to access index for repository http://bioconductor.org/packages/2.8/extra/bin/macosx/leopard/contrib/2.14
>     Warning: unable to access index for repository http://brainarray.mbni.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.14
>     Warning message:
>     In getDependencies(pkgs, dependencies, available, lib) :
>       package ‘edgeR’ is not available (for R Under development)
>
> ... so I guess I am doing something wrong.
>
> Helena

ADD COMMENT • link 13.6 years ago Gordon Smyth 52k

Gordon Smyth 52k

Dear Helena,

You will find it very helpful to upgrade your version of edgeR to the current development version (although you will need to be using R devel, aka R 2.14, to do so). You will find that exactTest() is now much faster and less memory consuming. The current release version is time consuming when the counts are large, mainly because of a change to the way in which the rejection region is computed that we implemented two months ago.

Fair comment about adding comments on prop.used. We had not considered that users would generally change this.

If you choose prior.n very small, then edgeR will simply use the genewise dispersion estimate that depends on the data from that gene alone. This is not over-fitting in itself. However, it can lead to an increase in the FDR, because the significance tests in edgeR do not take into account the uncertainty with which the dispersion is estimated.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.wehi.edu.au
http://www.statsci.org/smyth

------------ original message --------------

On Mon, 3 Oct 2011, Helena Persson wrote:

> Dear Gordon,
>
> I guess I should start with some clarifications:
>
>> I am concerned that you have decreased prop.used from its default value of 0.3. I would tend to increase this rather than decrease it.
>
> For the microRNA data I have few genes but a relatively large expression range. My reason for decreasing prop.used was that I suspected that using 30% would bin genes that had very different means of expression. I did not give this a lot of thought at the time and have now rerun the analysis using 0.3. Maybe it would be good to comment a bit more on this parameter in the R help page or the edgeR vignette?
>
>> On the other hand, you have increased prior.n from its default value, which for your data would be a little over 0.5. Is this simply because it gave better looking results? Anyway, increasing prior.n does not result in overfitting. The risk with larger prior.n is simply that it may start to return differentially expressed miRs that are increased or decreased in only a few of the samples, rather than consistently for all samples in a group.
>
> I decided to remove two of the samples in the control group because they appeared to be outliers from the rest, so my smallest group is actually 8 samples. I did not put together the control samples, but judging from the clinical data I received, the control group is more heterogeneous than the patient groups. Choosing 2 for prior.n was a compromise (I realised I should go quite low for my dataset, but using 0, as suggested by someone I talked to, produced very short lists of genes that did not look any better judging from boxplots). Actually, I was wondering if setting the prior very low (rather than high) could lead to overfitting of the variance estimate.
>
>> How large are the common and tagwise dispersions for your data?
>
> The common dispersion varies a little depending on how I group the samples:
>
>     [1] 0.2681829 (three groups, 8 ctrl and 2 x 15 patients)
>     [1] 0.2788752 (two groups, 8 ctrl and 32 patients)
>
> The tagwise dispersions (cds1 <- estimateTagwiseDisp(cds1, prior.n=1, trend=TRUE, prop.used=0.3, grid=FALSE)):
>
>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>     0.09599 0.18370 0.24550 0.28160 0.31190 2.23000
>
>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>      0.1022  0.1894  0.2534  0.2916  0.3183  2.1890
>
> A strange thing: when I run exactTest for the miRNA data (618 genes x 38 samples), edgeR becomes extremely memory-consuming, basically using up all of the 8 GB RAM on my laptop, and then becomes painfully slow as the system starts swapping memory. When I run exactTest for CAGE data for the same samples (15066 genes x 40 samples) it never goes above 3 GB and finishes rapidly. I use a grid search for the CAGE data tagwise dispersion estimate and the library sizes are smaller (around 7 million counts vs 15 million), but otherwise the previous steps are basically the same. Any experience (or qualified guess) of what might make the analysis use so much memory?
>
> Thanks again,
> Helena

ADD COMMENT • link 13.6 years ago Gordon Smyth 52k

Dear Gordon,

Upgrading sounds like a good idea. How do I install the devel version of edgeR?

Best,
Helena

ADD REPLY • link 13.6 years ago Helena Persson ▴ 50

Gordon Smyth 52k

On Mon, 3 Oct 2011, Helena Persson wrote:

> The tagwise dispersions (cds1 <- estimateTagwiseDisp(cds1, prior.n=1, trend=TRUE, prop.used=0.3, grid=FALSE)):
>
>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>     0.09599 0.18370 0.24550 0.28160 0.31190 2.23000
>
>        Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>      0.1022  0.1894  0.2534  0.2916  0.3183  2.1890

A small number of the dispersions are alarmingly high. Have you filtered the gene list as I suggested? This may remove the "variance outliers".
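One way to judge how high these values are is to take the square root of each dispersion, which reads a negative binomial dispersion as a coefficient of variation between replicates. A minimal sketch using only numbers quoted in this thread (the object names are illustrative):

```r
# Common dispersions quoted in the thread for the two groupings.
common_disp <- c(three_groups = 0.2681829, two_groups = 0.2788752)

# The two maximum tagwise dispersions from the quoted summaries.
max_tagwise <- c(2.23000, 2.1890)

# Square root of a negative binomial dispersion gives the coefficient of
# variation of the underlying expression level between replicates.
bcv_common <- sqrt(common_disp)
bcv_max <- sqrt(max_tagwise)

print(round(bcv_common, 2))
print(round(bcv_max, 2))
```

A common dispersion near 0.27 corresponds to a coefficient of variation of about 0.52, whereas the largest tagwise values correspond to roughly 1.5, which is why they stand out.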

Does a plot of the dispersions versus abundance suggest that the trend is smooth?
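A rough way to look at this without relying on any particular edgeR version is to plot a simple per-miR dispersion estimate against abundance and overlay a smoother. The sketch below uses simulated counts and a method-of-moments estimate that ignores group structure and library-size differences, so it is not the estimator edgeR uses; all object names are illustrative.

```r
# Simulated data (assumption): 600 miRs, 38 samples, true dispersion 0.25,
# with mean expression spread over a wide dynamic range.
set.seed(2)
n_genes <- 600
n_samples <- 38
mu <- 10^runif(n_genes, 1, 5)
counts <- matrix(rnbinom(n_genes * n_samples, mu = rep(mu, n_samples), size = 4),
                 nrow = n_genes)

# Method-of-moments dispersion per miR, from var = mean + disp * mean^2.
gene_mean <- rowMeans(counts)
gene_var <- apply(counts, 1, var)
mom_disp <- pmax((gene_var - gene_mean) / gene_mean^2, 0)

# Smooth trend of dispersion against log abundance.
log_abundance <- log2(gene_mean)
trend <- lowess(log_abundance, mom_disp, f = 0.3)

plot(log_abundance, mom_disp, pch = 16, cex = 0.5,
     xlab = "log2 mean count", ylab = "moment estimate of dispersion")
lines(trend, col = "red", lwd = 2)
```

In simulated data with a constant dispersion the red line should be close to flat; abrupt jumps or a few points far above the line in real data would be the kind of irregularity the question is asking about.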

Best wishes
Gordon

ADD COMMENT • link 13.6 years ago • updated 3.9 years ago Gordon Smyth 52k

Dear Gordon,

> A small number of the dispersions are alarmingly high. Have you filtered the gene list as I suggested? This may remove the "variance outliers".

Depending on how I filter the counts table for expression, there are ~5 genes with very high dispersion estimates. These are the genes I have been referring to as variance outliers; in the table for differential expression from DESeq they have very high residual variances. By raising the expression threshold or the number of required samples within reasonable limits I can remove a few, but not the last 3 genes. Of course this also removes a substantial number of genes from the counts table. I attach plots for the variance estimates generated with plotMeanVar().

Best,
Helena

A non-text attachment was scrubbed...
Name: varianceFitPlots.pdf
Type: application/pdf
Size: 307226 bytes
URL: https://stat.ethz.ch/pipermail/bioconductor/attachments/20111003/c8ae39aa/attachment.pdf

ADD REPLY • link 13.6 years ago Helena Persson ▴ 50

Gordon Smyth 52k

You need to install the devel version of R, available from CRAN. Then you get the devel version of edgeR and other Bioconductor packages automatically.

Gordon

On Mon, 3 Oct 2011, Helena Persson wrote:

> Dear Gordon,
>
> Upgrading sounds like a good idea. How do I install the devel version of edgeR?
>
> Best,
> Helena
>
> ________________________________________
> From: Gordon K Smyth [smyth at wehi.EDU.AU]
> Sent: 3 October 2011 05:33
> To: Helena Persson
> Cc: Bioconductor mailing list
> Subject: Re: edgeR on microRNA data
>
> Dear Helena,
>
> You will find it very helpful to upgrade your version of edgeR to the current developmental version (although you will need to be using R devel, aka R 2.14, to do so). You will find that exactTest() is now much faster and less memory consuming. The current release version is time consuming when the counts are large, mainly because of a change to the way in which the rejection region is computed that we implemented two months ago.
>
> Fair comment about adding comments on prop.used. We had not considered that users would generally change this.
>
> If you choose prior.n very small, then edgeR will simply use the genewise dispersion estimate that depends on the data from that gene alone. This is not over-fitting in itself. However, it can lead to an increase in the FDR because, when doing significance tests, edgeR does not take into account the uncertainty with which the dispersion is estimated.
>
> Best wishes
> Gordon
>
> ---------------------------------------------
> Professor Gordon K Smyth,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> http://www.wehi.edu.au
> http://www.statsci.org/smyth
>
> ------------ original message --------------
>
> On Mon, 3 Oct 2011, Helena Persson wrote:
>
>> Dear Gordon,
>>
>> I guess I should start with some clarifications:
>>
>>> I am concerned that you have decreased prop.used from its default
>>> value of 0.3. I would tend to increase this rather than decrease it.
>>
>> For the microRNA data I have few genes but a relatively large expression range. My reason for decreasing prop.used was that I suspected that using 30% would bin genes that had very different mean expression. I did not give this a lot of thought at the time and have now rerun the analysis using 0.3. Maybe it would be good to comment a bit more on this parameter in the R help page or the edgeR vignette?
>>
>>> On the other hand, you have increased prior.n from its default value, which
>>> for your data would be a little over 0.5. Is this simply because it gave
>>> better looking results? Anyway, increasing prior.n does not result in
>>> overfitting. The risk with larger prior.n is simply that it may start to
>>> return differentially expressed miRs that are increased or decreased in only
>>> a few of the samples, rather than consistently for all samples in a group.
>>
>> I decided to remove two of the samples in the control group because they appeared to be outliers from the rest, so my smallest group is actually 8 samples. I did not put together the control samples, but judging from the clinical data I got, that group is more heterogeneous than the patient groups. Choosing 2 for prior.n was a compromise (I realised I should go quite low for my dataset, but using 0, as suggested by someone I talked to, produced very short lists of genes that did not look any better judging from boxplots). Actually, I was wondering if setting the prior very low (rather than high) could lead to overfitting of the variance estimate.
>>
>>> How large are the common and tagwise dispersions for your data?
>>
>> The common dispersion varies a little depending on how I group the samples:
>>
>> [1] 0.2681829 (three groups, 8 ctrl and 2 x 15 patients)
>> [1] 0.2788752 (two groups, 8 ctrl and 32 patients)
>>
>> The tagwise dispersions (cds1 <- estimateTagwiseDisp(cds1, prior.n=1, trend=TRUE, prop.used=0.3, grid=FALSE)):
>>
>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>> 0.09599 0.18370 0.24550 0.28160 0.31190 2.23000
>>
>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>  0.1022  0.1894  0.2534  0.2916  0.3183  2.1890
>>
>> A strange thing: when I run exactTest for the miRNA data (618 genes x 38 samples), edgeR becomes extremely memory-consuming, basically using up all of the 8 GB RAM on my laptop, and then becomes painfully slow as the memory starts swapping. When I run exactTest for CAGE data for the same samples (15066 genes x 40 samples) it never goes above 3 GB and finishes rapidly. I use a grid search for the CAGE data tagwise dispersion estimate and the library sizes are smaller (around 7 million counts vs 15 million), but otherwise the previous steps are basically the same. Any experience (or qualified guess) of what might make the analysis use so much memory?
>>
>> Thanks again,
>> Helena

13.6 years ago Gordon Smyth 52k
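The filtering rule discussed in this thread (keep a miRNA only if it reaches a cpm threshold in at least as many samples as the smallest group, which Helena reports is 8 after removing two controls) can be sketched in base R without edgeR. This is a minimal sketch on made-up counts, not the analysis from the thread: the matrix `counts`, the names `min.cpm` and `min.samples`, and the fixed library size of 15 million are all assumptions chosen for illustration, and `min.samples` is set to 3 only because the toy data have 6 samples.

```r
# Toy count matrix: 4 hypothetical miRNAs (rows) x 6 samples (columns).
counts <- matrix(
  c(  0,   0,   1,   0,   0,   0,
      5,   8,   6,   7,   9,   4,
      0,   4,   0,   5,   0,   2,
    100, 120,  90, 110, 130,  95),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("miR-", 1:4), paste0("S", 1:6))
)

# Assumed total library size per sample (about 15 million reads, as
# mentioned for the miRNA libraries; the exact values are invented).
lib.size <- rep(15e6, ncol(counts))

# Counts per million: divide each column by its library size.
cpm <- sweep(counts, 2, lib.size, "/") * 1e6

min.cpm     <- 0.2  # threshold used in the thread
min.samples <- 3    # toy value; the thread suggests the smallest group size

# Keep rows that reach the cpm threshold in at least min.samples samples.
keep     <- rowSums(cpm >= min.cpm) >= min.samples
filtered <- counts[keep, , drop = FALSE]
print(keep)
```

With real data, `lib.size` would normally be the column sums of the full, unfiltered count table, and `min.samples` would be set to the size of the smallest group.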


Well, I installed the devel version of R, but if I then do:

    > source("http://www.bioconductor.org/biocLite.R")
    > biocLite("edgeR")

I get the error message:

    Using R version 2.14.0, biocinstall version 2.8.4.
    Installing Bioconductor version 2.8 packages:
    [1] "edgeR"
    Please wait...

    Warning: unable to access index for repository http://bioconductor.org/packages/2.8/bioc/bin/macosx/leopard/contrib/2.14
    Warning: unable to access index for repository http://bioconductor.org/packages/2.8/data/annotation/bin/macosx/leopard/contrib/2.14
    Warning: unable to access index for repository http://bioconductor.org/packages/2.8/data/experiment/bin/macosx/leopard/contrib/2.14
    Warning: unable to access index for repository http://bioconductor.org/packages/2.8/extra/bin/macosx/leopard/contrib/2.14
    Warning: unable to access index for repository http://brainarray.mbni.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.14
    Warning message:
    In getDependencies(pkgs, dependencies, available, lib) :
      package ‘edgeR’ is not available (for R Under development)

... so I guess I am doing something wrong.

Helena

13.6 years ago Helena Persson ▴ 50
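One way to read the failing log is to compare the Bioconductor release and the R version embedded in the repository URLs. The sketch below does this with base R string functions only. The first URL is copied from the warnings above; the second is the download URL from a later reply in this thread, where the installation succeeds. The variable names are illustrative, and the interpretation (an installer script targeting Bioconductor 2.8 being used with R 2.14) is a plausible reading rather than something the thread confirms.

```r
# Repository URLs as they appear in the thread's installer output.
urls <- c(
  failing = "http://bioconductor.org/packages/2.8/bioc/bin/macosx/leopard/contrib/2.14",
  working = "http://www.bioconductor.org/packages/2.9/bioc/bin/macosx/leopard/contrib/2.14/edgeR_2.3.52.tgz"
)

# Pull out the Bioconductor release (after "/packages/") and the
# R version (after "/contrib/") from each URL.
bioc.release <- sub(".*/packages/([0-9.]+)/.*", "\\1", urls)
r.version    <- sub(".*/contrib/([0-9.]+).*", "\\1", urls)

# Side-by-side view: same R version, different Bioconductor release.
print(data.frame(bioc.release, r.version))
```

Both URLs ask for R 2.14 binaries, but only the one pointing at the 2.9 repository resolves to an edgeR package in the thread's output.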


Hi Helena,

Did you manage to solve this problem? This works for me with R 2.14 alpha:

    > source("http://bioconductor.org/biocLite.R")
    BiocInstaller version 1.1.28, ?biocLite for help
    > biocLite("edgeR")
    BioC_mirror: 'http://www.bioconductor.org'
    Using R version 2.14, BiocInstaller version 1.1.28.
    Installing package(s) 'edgeR'
    Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/macosx/leopard/contrib/2.14
    trying URL 'http://www.bioconductor.org/packages/2.9/bioc/bin/macosx/leopard/contrib/2.14/edgeR_2.3.52.tgz'
    Content type 'application/x-gzip' length 1169631 bytes (1.1 Mb)
    opened URL
    ==================================================
    downloaded 1.1 Mb

    The downloaded packages are in
        /tmp/Rtmp895H8Z/downloaded_packages
    Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/macosx/leopard/contrib/2.14

    > sessionInfo()
    R version 2.14.0 alpha (2011-10-11 r57214)
    Platform: i386-apple-darwin9.8.0/i386 (32-bit)

    locale:
    [1] C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] BiocInstaller_1.1.28

    loaded via a namespace (and not attached):
    [1] tools_2.14.0

How did you install R 2.14 on your Mac? Mine is coming from http://r.research.att.com/, which is the recommended place for getting the latest R devel binary for Mac.

Please provide your sessionInfo().

Thanks!
H.

13.5 years ago Hervé Pagès 16k
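The details requested above (R version, platform, and what is installed) can be gathered in one go. The following is a small sketch using only functions from base R and the utils package; the variable names are illustrative, and the check for edgeR simply reports whether the package is present in the current library paths.

```r
# Collect the details usually requested when an installation fails.
r.ver    <- getRversion()        # running R version, e.g. 2.14.0
platform <- R.version$platform   # build platform string

# Is edgeR installed in any of the current library paths?
edgeR.installed <- "edgeR" %in% rownames(installed.packages())

cat("R version:", as.character(r.ver), "\n")
cat("Platform: ", platform, "\n")
if (edgeR.installed) {
  cat("edgeR version:", as.character(packageVersion("edgeR")), "\n")
} else {
  cat("edgeR is not installed in the current library paths\n")
}

# Full report to paste into a reply.
print(sessionInfo())
```

Posting the complete `sessionInfo()` output, as asked in the reply above, lets others see at a glance whether the R version and the installed Bioconductor packages belong together.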


Hi Hervé and Dan,

Thanks for your advice! Yes, I downloaded R from http://r.research.att.com/ as well. I originally solved the problem by using the Package Installer instead, which worked fine. For some reason, trying the BiocInstaller today works with the same installation of R that I was using before (I guess they must have changed something in the Matrix...). I will still update my R 2.14 though.

Thanks again,
Helena

    > sessionInfo()
    R Under development (unstable) (2011-10-01 r57123)
    Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

    locale:
    [1] sv_SE.UTF-8/sv_SE.UTF-8/sv_SE.UTF-8/C/sv_SE.UTF-8/sv_SE.UTF-8

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] BiocInstaller_1.1.28

    loaded via a namespace (and not attached):
    [1] tools_2.14.0

> On Sat, 1 Oct 2011, Gordon K Smyth wrote:
>
>> Dear Helena,
>>
>> Compared with mRNA-Seq, you have an unusually small number of transcripts but
>> a relatively large number of biological replicates. This suggests that you
>> should use a relatively small value for prior.n but a relatively large value
>> for prop.used. I am concerned that you have decreased prop.used from its default
>> value of 0.3. I would tend to increase this rather than decrease it.
>>
>> On the other hand, you have increased prior.n from its default value, which
>> for your data would be a little over 0.5. Is this simply because it gave
>> better looking results? Anyway, increasing prior.n does not result in
>> overfitting. The risk with larger prior.n is simply that it may start to
>> return differentially expressed miRs that are increased or decreased in only
>> a few of the samples, rather than consistently for all samples in a group.
>>
>> Your experience with prior.n is unintuitive to me. Generally speaking,
>> choosing prior.n small means that each miR gets to set its own dispersion, so
>> that miR with large variance will not appear in the topTag list. When you
>> say "variance outliers", do you mean large or small variance?
>>
>> Since your minimum group sample size is 10, I would have required miRs to
>> satisfy your cpm requirement in >= 10 samples rather than 5.
>> >> Best wishes >> Gordon >> >>> Date: Thu, 29 Sep 2011 05:25:14 +0000 >>> From: Helena Persson<helena.persson at="" ki.se=""> >>> To: "bioconductor at stat.math.ethz.ch"<bioconductor at="" stat.math.ethz.ch=""> >>> Subject: [BioC] edgeR on microRNA data >>> >>> Hi, >> >>> I would be grateful for some input on using edgeR for small RNA sequence >>> data. I have been testing edgeR on a set of miRNA data (3 groups with n=10, >>> 15 and 15). After removing genes that are not expressed at>= 0.2 cpm in>= >>> 5 samples I have ~600 rows left. I tried calculating the tagwise dispersion >>> estimate with: >>> >>> cds1<- estimateTagwiseDisp(cds1, prior.n=2, trend=TRUE, prop.used=0.1, >>> grid=FALSE) >>> >>> Increasing the prior to e.g. 10 gives more differentially expressed genes >>> that do not look bad. Decreasing the prior to 0 leaves me with extremely >>> few differentially expressed genes that are mainly variance outliers. I >>> guess that miRNA data is likely to behave differently from mRNA data since >>> there are so few genes (but still a very large dynamic range). Is it >>> possible that I am over-fitting the estimate? Would you recommend changing >>> any other parameters? >>> >>> Best regards, >>> Helena >>> _________________________________ >>> >>> Helena Persson, PhD >>> >>> Karolinska Institutet >>> Dept of Biosciences and Nutrition >>> H?lsov?gen 7-9 >>> SE-141 83 Huddinge >>> Sweden >>> >>> Helena.Persson at ki.se >>> >>> tel. 
+46-(0)8-52481058 > > ______________________________________________________________________ > The information in this email is confidential and intend...{{dropped:13}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
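The filtering rule quoted above (keep a miR only if it reaches the cpm threshold in at least as many samples as the smallest group) can be sketched in base R. This is only an illustration on simulated counts, not the poster's data: the object names (`counts`, `group`, `cpm_mat`, `keep`, `filtered`) and all simulated values are assumptions, and the group sizes of 8, 15 and 15 simply mirror the numbers mentioned in the thread. edgeR's own `cpm()` function could replace the manual calculation.

```r
# Simulated count matrix: 600 hypothetical miRs x 38 samples (not real data)
set.seed(1)
group <- factor(rep(c("ctrl", "patA", "patB"), times = c(8, 15, 15)))
counts <- matrix(rnbinom(600 * 38, mu = 50, size = 2), nrow = 600,
                 dimnames = list(paste0("miR", 1:600), NULL))
counts[1:50, ] <- 0L  # make some rows unexpressed so the filter has an effect

# Counts per million, computed from the raw library sizes
lib_size <- colSums(counts)
cpm_mat <- t(t(counts) / lib_size) * 1e6

# Require >= 0.2 cpm in at least as many samples as the smallest group
min_group <- min(table(group))
keep <- rowSums(cpm_mat >= 0.2) >= min_group
filtered <- counts[keep, , drop = FALSE]

cat("Smallest group size:", min_group, "\n")
cat("Rows kept:", nrow(filtered), "of", nrow(counts), "\n")
```

Tying the sample threshold to `min(table(group))` rather than a fixed number means the rule adjusts automatically if samples are dropped, as happened here when two controls were removed.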

ADD REPLY • link 13.5 years ago Helena Persson ▴ 50

On Mon, Oct 3, 2011 at 12:09 AM, Helena Persson <helena.persson at ki.se> wrote:
> Well, I installed the devel version of R, but if I then do:
>
>> source("http://www.bioconductor.org/biocLite.R")
>> biocLite("edgeR")
>
> I get the error message:
>
> Using R version 2.14.0, biocinstall version 2.8.4.
> Installing Bioconductor version 2.8 packages:
> [1] "edgeR"
> Please wait...
>
> Warning: unable to access index for repository http://bioconductor.org/packages/2.8/bioc/bin/macosx/leopard/contrib/2.14
> Warning: unable to access index for repository http://bioconductor.org/packages/2.8/data/annotation/bin/macosx/leopard/contrib/2.14
> Warning: unable to access index for repository http://bioconductor.org/packages/2.8/data/experiment/bin/macosx/leopard/contrib/2.14
> Warning: unable to access index for repository http://bioconductor.org/packages/2.8/extra/bin/macosx/leopard/contrib/2.14
> Warning: unable to access index for repository http://brainarray.mbni.med.umich.edu/bioc/bin/macosx/leopard/contrib/2.14
> Warning message:
> In getDependencies(pkgs, dependencies, available, lib) :
>   package 'edgeR' is not available (for R Under development)
>
> ... so I guess I am doing something wrong.

My guess is your version of R-2.14 is a little old. If you're using R-2.14 newer than SVN revision 55733, you'd be using a new installation mechanism (the BiocInstaller package), and you would see the word BiocInstaller in your output instead of "biocinstall".

Try updating to the latest R-devel, as Hervé suggests, from http://r.research.att.com.

If you still have trouble, do send us the output of sessionInfo() as suggested earlier.

Dan
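The diagnosis above depends on which R build is actually running. A minimal sketch for checking this from the R prompt follows. The revision number 55733 is taken from the reply; the variable names are illustrative, and the comparison is only meaningful for R 2.14-era builds. Only base R and utils functions are used.

```r
# SVN revision of the running R build; some builds report "unknown",
# in which case the conversion yields NA
svn_rev <- suppressWarnings(as.integer(R.version[["svn rev"]]))
cat("R version string:", R.version.string, "\n")
cat("SVN revision:", svn_rev, "\n")

# Revision quoted in the reply as the point after which the
# BiocInstaller mechanism is used
threshold_rev <- 55733L
if (!is.na(svn_rev) && svn_rev > threshold_rev) {
  message("This build is newer than revision ", threshold_rev, ".")
} else {
  message("This build is not newer than revision ", threshold_rev,
          ", or its revision could not be determined.")
}

# Session details worth pasting into a follow-up post
print(sessionInfo())
```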

ADD REPLY • link 13.5 years ago Dan Tenenbaum ★ 8.2k
