Question: questions on edgeR package
1
gravatar for Gordon Smyth
5.4 years ago by
Gordon Smyth37k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth37k wrote:
Dear Ming, I have said this in a separate thread, but thought I should summarize it for the original thread. After some investigation, it is clear that edgeR's problem with the TCGA RSEM "raw counts" is that they are not integers. edgeR is a likelihood- based package and it evaluates the negative binomial likelihood, which is always exactly zero at non-integer values. Hence edgeR cannot estimate the dispersion sensibly if any of the input counts are fractional. If you round the RSEM counts to integers, then edgeR will run ok. I have made some changes to the edgeR GLM code on Bioc-devel so that it does this rounding automatically. (The edgeR classic code was doing this anyway.) Note that this work-around is only sensible for input values that are estimates of counts. Best wishes Gordon On Wed, 29 Jan 2014, Gordon K Smyth wrote: > Dear Ming, > > Something is seriously wrong -- you shouldn't get these warnings, you > shouldn't get such a large dispersion estimate and, if you do, you shouldn't > get such small p-values. > > I suspect that the culprit is RSEM. edgeR is designed to accept raw read > counts rather than estimated abundance levels as might be output from RSEM or > Cufflinks. > > There are a number of tools that produce genewise read counts from RNA-seq > data. My lab routinely uses subread and featureCounts: > > http://www.ncbi.nlm.nih.gov/pubmed/24227677 > > which are available from the Bioconductor package Rsubread. > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > http://www.statsci.org/smyth > > On Tue, 28 Jan 2014, Ming Yi wrote: > >> >> Hi, >> >> I have a few questions on edgeR package. I used it a while ago. Recently, I >> started to use it again for an ongoing project and realized that it has >> been updated a lot with many new feaures. >> >> The data I am working on has 29 pairs of tumor and matched normal tissue >> samples from the same patients that have been run through RNAseq, the data >> has been processed by the popular RSEM method to be raw counts for each >> sample (tumor or macthed normal tissue of each patients). So I have a >> matrix of raw counts for 58 columns of data for all genes. we are looking >> for differential expressed genes between tumors and normals. Here are some >> main commands I used, eliminating some basic commands for simplicity: >> >>> show(head(targets)); >> SampleName Subject Type >> 1 T4626_01A 38_4626 Tumor >> 2 N4626_11A 38_4626 Normal >> 3 T2668_01A 44_2668 Tumor >> 4 N2668_11A 44_2668 Normal >> ...... >>> Patient<-factor(targets$Subject); >>> Type<-factor(targets$Type); >> ...... >> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]); >>> show(dim(y)); >> [1] 20531 58 >> >>> #filter low expressed genes >>> y<-y[rowSums(y$counts)>=50,] >>> #Recompute the library sizes: >>> y$samples$lib.size <- colSums(y$counts) >>> #Use Entrez Gene IDs as row names: >>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID >>> ##Apply TMM normalization >>> y <- calcNormFactors(y,method=c("TMM")); >>> show(y$samples); >> group lib.size norm.factors >> T4626_01A 1 97435132 0.9834587 >> N4626_11A 1 62583672 0.9154837 >> T2668_01A 1 77034746 0.9687860 >> N2668_11A 1 58631311 0.8644352 >> ...... >>> design<-model.matrix(~Patient+Type); >>> rownames(design) <- colnames(y); >>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); >> Disp = 3.99994 , BCV = 2 >> There were 50 or more warnings (use warnings() to see the first 50) >>> show(warnings()); >> Warning messages: >> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : >> non-integer x = 4.520000 >> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : >> non-integer x = 5.380000 >> .... >>> y <- estimateGLMTrendedDisp(y, design); >> Loading required package: splines >> There were 50 or more warnings (use warnings() to see the first 50) >>> y <- estimateGLMTagwiseDisp(y, design) >> There were 50 or more warnings (use warnings() to see the first 50) >>> fit <- glmFit(y, design); >>> lrt <- glmLRT(fit); >>> show(topTags(lrt)); >> Coefficient: TypeTumor >> symbols entrezID gene_id logFC logCPM LR PValue >> FDR >> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 >> 0 >> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 >> 0 >> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 >> 0 >> ....... >>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue"); >>> cpmRslt<-cpm(y) >>> cpmRslt<-cpmRslt[rownames(Rslt),]; >>> show(cpmRslt[1:10,1:6]); >> T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A >> 6532 2.0558647 2.140002e+02 0.02679881 1.568574e+01 0.2147611 >> 1301 1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056 >> 5047 13.4413898 1.221761e-01 7.62426085 3.946099e-02 60.7774032 >> ..... >>> show(Rslt[1:10,]); >> Coefficient: TypeTumor >> symbols entrezID gene_id logFC logCPM LR PValue >> FDR >> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 >> 0 >> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 >> 0 >> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 >> 0 >> ..... >>> show(summary(de <- decideTestsDGE(lrt))); >> [,1] >> -1 6179 >> 0 4832 >> 1 7204 >> >> My questions are below: >> >> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower >> expressed using 50 as cutoff here,kind of arbitrary. is there any better >> way to assess the cutoff value? can any one suggest a good cutoff based on >> his or her experience in such setting? >> >> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives >> Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why >> we got warning here (see warning message as above inside my commands), any >> issue on the warnings? >> >> 3. The top gene lists shown in both cpmRslt and Rslt are quite consistent, >> for example, gene 6532 always have tumor samples consistently much lower in >> cpm counts compared to that of matched normal samples, which is consistent >> with the logFC result in Rslt. So from the top genes, the result looks very >> promising. However, when I look into the full table of Rslt, there are so >> many genes appear to be differential expressed (see result of summary(de <- >> decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in >> the dataset) in this table with FDR <0.05, which make me a bit nervous >> considering the issues as I asked in last two questions. Is this what was >> usually seen in tumor vs normal comparison. This is a paired test of >> course, which may increase the power and may see more differntial genes. >> what if I try pooled tumors vs pooled normals disregard of matched tumor vs >> normal sample pairs? which one shall be better for RNA-seq data, paired >> test vs pooled test? >> >> 4. For pooled test, I may use exactTest(), also based on the guide, there >> is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm >> model. anything like that worthy of trying? Any advice? >> >> 5. I did MDS plot, which looks nice in that the tumor and normal are >> separated very well in one side vs other. In the plotSmear plot, I did see >> some red spots inside the black spots zones in the center zone (supposed to >> be lower logFC), why they were called as significant? May this explain why >> we get so many DEGs >13k of them? Or large variations across samples? >> >> Thanks so much in advance for your help! >> >> Mike >> NIH >> Maryland, USA. >> >>> show(sessionInfo()); >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> attached base packages: >> [1] splines stats graphics grDevices utils datasets methods >> [8] base >> other attached packages: >> [1] edgeR_3.4.2 limma_3.16.8 >> >> >> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
rnaseq edger rsubread • 1.0k views
ADD COMMENTlink modified 5.4 years ago by Yi, Ming NIH/NCI [C]100 • written 5.4 years ago by Gordon Smyth37k
Answer: questions on edgeR package
0
gravatar for Yi, Ming NIH/NCI [C]
5.4 years ago by
United States
Yi, Ming NIH/NCI [C]100 wrote:
Dear Dr. Smyth: Thanks so much for your help and advice. Indeed, I did try to use edger on rounded RSEM processed raw counts, result is quite different, here are the comparisons FYI: Not rounded: > y <- estimateGLMCommonDisp(y, design, verbose=TRUE); Disp = 3.99994 , BCV = 2 After rounded the RSEM raw counts to integers: > y <- estimateGLMCommonDisp(y, design, verbose=TRUE); Disp = 0.32356 , BCV = 0.5688 Not rounded: > show(summary(de <- decideTestsDGE(lrt))); [,1] -1 6179 0 4832 1 7204 After rounded the RSEM raw counts to integers: > show(summary(de <- decideTestsDGE(lrt))); [,1] -1 4948 0 7778 1 5490 So observation is: after rounded, the dispersion is much smaller, and DEGs also has about 3k less genes, which appears to be promising. Then compared with limma-voom result of the same data: show(summary(de <- decideTests(fit))); ... PatientTCGA_91_6836 TypeTumor -1 13 4917 0 18186 7276 1 16 6022 The number of EDGs seems rather close to that when using edgeR on rounded RSEM raw counts. But what I am trying to ask your opinion here: in these situations, you would like more on the limma-voom result or the edgeR result on rounded RSEM raw counts? Since I used limma a lot on array data, but not much except this one on RNA-seq data. Thanks again for all your help and advice! Best, Ming -----Original Message----- From: Gordon K Smyth [mailto:smyth@wehi.edu.au] Sent: Thursday, February 06, 2014 12:40 AM To: Ming Yi Cc: Bioconductor mailing list Subject: Re: [BioC] questions on edgeR package Dear Ming, I have said this in a separate thread, but thought I should summarize it for the original thread. After some investigation, it is clear that edgeR's problem with the TCGA RSEM "raw counts" is that they are not integers. edgeR is a likelihood- based package and it evaluates the negative binomial likelihood, which is always exactly zero at non-integer values. Hence edgeR cannot estimate the dispersion sensibly if any of the input counts are fractional. If you round the RSEM counts to integers, then edgeR will run ok. I have made some changes to the edgeR GLM code on Bioc-devel so that it does this rounding automatically. (The edgeR classic code was doing this anyway.) Note that this work-around is only sensible for input values that are estimates of counts. Best wishes Gordon On Wed, 29 Jan 2014, Gordon K Smyth wrote: > Dear Ming, > > Something is seriously wrong -- you shouldn't get these warnings, you > shouldn't get such a large dispersion estimate and, if you do, you shouldn't > get such small p-values. > > I suspect that the culprit is RSEM. edgeR is designed to accept raw read > counts rather than estimated abundance levels as might be output from RSEM or > Cufflinks. > > There are a number of tools that produce genewise read counts from RNA-seq > data. My lab routinely uses subread and featureCounts: > > http://www.ncbi.nlm.nih.gov/pubmed/24227677 > > which are available from the Bioconductor package Rsubread. > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > http://www.statsci.org/smyth > > On Tue, 28 Jan 2014, Ming Yi wrote: > >> >> Hi, >> >> I have a few questions on edgeR package. I used it a while ago. Recently, I >> started to use it again for an ongoing project and realized that it has >> been updated a lot with many new feaures. >> >> The data I am working on has 29 pairs of tumor and matched normal tissue >> samples from the same patients that have been run through RNAseq, the data >> has been processed by the popular RSEM method to be raw counts for each >> sample (tumor or macthed normal tissue of each patients). So I have a >> matrix of raw counts for 58 columns of data for all genes. we are looking >> for differential expressed genes between tumors and normals. Here are some >> main commands I used, eliminating some basic commands for simplicity: >> >>> show(head(targets)); >> SampleName Subject Type >> 1 T4626_01A 38_4626 Tumor >> 2 N4626_11A 38_4626 Normal >> 3 T2668_01A 44_2668 Tumor >> 4 N2668_11A 44_2668 Normal >> ...... >>> Patient<-factor(targets$Subject); >>> Type<-factor(targets$Type); >> ...... >> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]); >>> show(dim(y)); >> [1] 20531 58 >> >>> #filter low expressed genes >>> y<-y[rowSums(y$counts)>=50,] >>> #Recompute the library sizes: >>> y$samples$lib.size <- colSums(y$counts) >>> #Use Entrez Gene IDs as row names: >>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID >>> ##Apply TMM normalization >>> y <- calcNormFactors(y,method=c("TMM")); >>> show(y$samples); >> group lib.size norm.factors >> T4626_01A 1 97435132 0.9834587 >> N4626_11A 1 62583672 0.9154837 >> T2668_01A 1 77034746 0.9687860 >> N2668_11A 1 58631311 0.8644352 >> ...... >>> design<-model.matrix(~Patient+Type); >>> rownames(design) <- colnames(y); >>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); >> Disp = 3.99994 , BCV = 2 >> There were 50 or more warnings (use warnings() to see the first 50) >>> show(warnings()); >> Warning messages: >> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : >> non-integer x = 4.520000 >> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : >> non-integer x = 5.380000 >> .... >>> y <- estimateGLMTrendedDisp(y, design); >> Loading required package: splines >> There were 50 or more warnings (use warnings() to see the first 50) >>> y <- estimateGLMTagwiseDisp(y, design) >> There were 50 or more warnings (use warnings() to see the first 50) >>> fit <- glmFit(y, design); >>> lrt <- glmLRT(fit); >>> show(topTags(lrt)); >> Coefficient: TypeTumor >> symbols entrezID gene_id logFC logCPM LR PValue >> FDR >> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 >> 0 >> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 >> 0 >> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 >> 0 >> ....... >>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue"); >>> cpmRslt<-cpm(y) >>> cpmRslt<-cpmRslt[rownames(Rslt),]; >>> show(cpmRslt[1:10,1:6]); >> T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A >> 6532 2.0558647 2.140002e+02 0.02679881 1.568574e+01 0.2147611 >> 1301 1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056 >> 5047 13.4413898 1.221761e-01 7.62426085 3.946099e-02 60.7774032 >> ..... >>> show(Rslt[1:10,]); >> Coefficient: TypeTumor >> symbols entrezID gene_id logFC logCPM LR PValue >> FDR >> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 >> 0 >> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 >> 0 >> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 >> 0 >> ..... >>> show(summary(de <- decideTestsDGE(lrt))); >> [,1] >> -1 6179 >> 0 4832 >> 1 7204 >> >> My questions are below: >> >> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower >> expressed using 50 as cutoff here,kind of arbitrary. is there any better >> way to assess the cutoff value? can any one suggest a good cutoff based on >> his or her experience in such setting? >> >> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives >> Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why >> we got warning here (see warning message as above inside my commands), any >> issue on the warnings? >> >> 3. The top gene lists shown in both cpmRslt and Rslt are quite consistent, >> for example, gene 6532 always have tumor samples consistently much lower in >> cpm counts compared to that of matched normal samples, which is consistent >> with the logFC result in Rslt. So from the top genes, the result looks very >> promising. However, when I look into the full table of Rslt, there are so >> many genes appear to be differential expressed (see result of summary(de <- >> decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in >> the dataset) in this table with FDR <0.05, which make me a bit nervous >> considering the issues as I asked in last two questions. Is this what was >> usually seen in tumor vs normal comparison. This is a paired test of >> course, which may increase the power and may see more differntial genes. >> what if I try pooled tumors vs pooled normals disregard of matched tumor vs >> normal sample pairs? which one shall be better for RNA-seq data, paired >> test vs pooled test? >> >> 4. For pooled test, I may use exactTest(), also based on the guide, there >> is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm >> model. anything like that worthy of trying? Any advice? >> >> 5. I did MDS plot, which looks nice in that the tumor and normal are >> separated very well in one side vs other. In the plotSmear plot, I did see >> some red spots inside the black spots zones in the center zone (supposed to >> be lower logFC), why they were called as significant? May this explain why >> we get so many DEGs >13k of them? Or large variations across samples? >> >> Thanks so much in advance for your help! >> >> Mike >> NIH >> Maryland, USA. >> >>> show(sessionInfo()); >> R version 3.0.1 (2013-05-16) >> Platform: x86_64-unknown-linux-gnu (64-bit) >> locale: >> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >> [7] LC_PAPER=C LC_NAME=C >> [9] LC_ADDRESS=C LC_TELEPHONE=C >> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >> attached base packages: >> [1] splines stats graphics grDevices utils datasets methods >> [8] base >> other attached packages: >> [1] edgeR_3.4.2 limma_3.16.8 >> >> >> > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:9}}
ADD COMMENTlink written 5.4 years ago by Yi, Ming NIH/NCI [C]100
Ming, Given RSEM counts, I would personally use the limma-voom for preference because it doesn't require you to round the counts. But you could use either in your case. I said the same to you yesterday: https://www.stat.math.ethz.ch/pipermail/bioconductor/2014-February/057 538.html Gordon On Thu, 6 Feb 2014, Yi, Ming (NIH/NCI) [C] wrote: > > Dear Dr. Smyth: > > Thanks so much for your help and advice. Indeed, I did try to use edger > on rounded RSEM processed raw counts, result is quite different, here > are the comparisons FYI: > > Not rounded: >> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); > Disp = 3.99994 , BCV = 2 > After rounded the RSEM raw counts to integers: >> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); > Disp = 0.32356 , BCV = 0.5688 > > > Not rounded: >> show(summary(de <- decideTestsDGE(lrt))); > [,1] > -1 6179 > 0 4832 > 1 7204 > After rounded the RSEM raw counts to integers: >> show(summary(de <- decideTestsDGE(lrt))); > [,1] > -1 4948 > 0 7778 > 1 5490 > > So observation is: after rounded, the dispersion is much smaller, and > DEGs also has about 3k less genes, which appears to be promising. > > Then compared with limma-voom result of the same data: > show(summary(de <- decideTests(fit))); > ... > PatientTCGA_91_6836 TypeTumor > -1 13 4917 > 0 18186 7276 > 1 16 6022 > > > The number of EDGs seems rather close to that when using edgeR on > rounded RSEM raw counts. But what I am trying to ask your opinion here: > in these situations, you would like more on the limma-voom result or the > edgeR result on rounded RSEM raw counts? Since I used limma a lot on > array data, but not much except this one on RNA-seq data. > > Thanks again for all your help and advice! > > Best, > > Ming > > > > > > -----Original Message----- > From: Gordon K Smyth [mailto:smyth at wehi.edu.au] > Sent: Thursday, February 06, 2014 12:40 AM > To: Ming Yi > Cc: Bioconductor mailing list > Subject: Re: [BioC] questions on edgeR package > > Dear Ming, > > I have said this in a separate thread, but thought I should summarize it > for the original thread. > > After some investigation, it is clear that edgeR's problem with the TCGA > RSEM "raw counts" is that they are not integers. edgeR is a likelihood- > based package and it evaluates the negative binomial likelihood, which is > always exactly zero at non-integer values. Hence edgeR cannot estimate > the dispersion sensibly if any of the input counts are fractional. > > If you round the RSEM counts to integers, then edgeR will run ok. > > I have made some changes to the edgeR GLM code on Bioc-devel so that it > does this rounding automatically. (The edgeR classic code was doing this > anyway.) Note that this work-around is only sensible for input values > that are estimates of counts. > > Best wishes > Gordon > > On Wed, 29 Jan 2014, Gordon K Smyth wrote: > >> Dear Ming, >> >> Something is seriously wrong -- you shouldn't get these warnings, you >> shouldn't get such a large dispersion estimate and, if you do, you shouldn't >> get such small p-values. >> >> I suspect that the culprit is RSEM. edgeR is designed to accept raw read >> counts rather than estimated abundance levels as might be output from RSEM or >> Cufflinks. >> >> There are a number of tools that produce genewise read counts from RNA-seq >> data. My lab routinely uses subread and featureCounts: >> >> http://www.ncbi.nlm.nih.gov/pubmed/24227677 >> >> which are available from the Bioconductor package Rsubread. >> >> Best wishes >> Gordon >> >> --------------------------------------------- >> Professor Gordon K Smyth, >> Bioinformatics Division, >> Walter and Eliza Hall Institute of Medical Research, >> 1G Royal Parade, Parkville, Vic 3052, Australia. >> http://www.statsci.org/smyth >> >> On Tue, 28 Jan 2014, Ming Yi wrote: >> >>> >>> Hi, >>> >>> I have a few questions on edgeR package. I used it a while ago. Recently, I >>> started to use it again for an ongoing project and realized that it has >>> been updated a lot with many new feaures. >>> >>> The data I am working on has 29 pairs of tumor and matched normal tissue >>> samples from the same patients that have been run through RNAseq, the data >>> has been processed by the popular RSEM method to be raw counts for each >>> sample (tumor or macthed normal tissue of each patients). So I have a >>> matrix of raw counts for 58 columns of data for all genes. we are looking >>> for differential expressed genes between tumors and normals. Here are some >>> main commands I used, eliminating some basic commands for simplicity: >>> >>>> show(head(targets)); >>> SampleName Subject Type >>> 1 T4626_01A 38_4626 Tumor >>> 2 N4626_11A 38_4626 Normal >>> 3 T2668_01A 44_2668 Tumor >>> 4 N2668_11A 44_2668 Normal >>> ...... >>>> Patient<-factor(targets$Subject); >>>> Type<-factor(targets$Type); >>> ...... >>> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]); >>>> show(dim(y)); >>> [1] 20531 58 >>> >>>> #filter low expressed genes >>>> y<-y[rowSums(y$counts)>=50,] >>>> #Recompute the library sizes: >>>> y$samples$lib.size <- colSums(y$counts) >>>> #Use Entrez Gene IDs as row names: >>>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID >>>> ##Apply TMM normalization >>>> y <- calcNormFactors(y,method=c("TMM")); >>>> show(y$samples); >>> group lib.size norm.factors >>> T4626_01A 1 97435132 0.9834587 >>> N4626_11A 1 62583672 0.9154837 >>> T2668_01A 1 77034746 0.9687860 >>> N2668_11A 1 58631311 0.8644352 >>> ...... >>>> design<-model.matrix(~Patient+Type); >>>> rownames(design) <- colnames(y); >>>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); >>> Disp = 3.99994 , BCV = 2 >>> There were 50 or more warnings (use warnings() to see the first 50) >>>> show(warnings()); >>> Warning messages: >>> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : >>> non-integer x = 4.520000 >>> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : >>> non-integer x = 5.380000 >>> .... >>>> y <- estimateGLMTrendedDisp(y, design); >>> Loading required package: splines >>> There were 50 or more warnings (use warnings() to see the first 50) >>>> y <- estimateGLMTagwiseDisp(y, design) >>> There were 50 or more warnings (use warnings() to see the first 50) >>>> fit <- glmFit(y, design); >>>> lrt <- glmLRT(fit); >>>> show(topTags(lrt)); >>> Coefficient: TypeTumor >>> symbols entrezID gene_id logFC logCPM LR PValue >>> FDR >>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 >>> 0 >>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 >>> 0 >>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 >>> 0 >>> ....... >>>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue"); >>>> cpmRslt<-cpm(y) >>>> cpmRslt<-cpmRslt[rownames(Rslt),]; >>>> show(cpmRslt[1:10,1:6]); >>> T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A >>> 6532 2.0558647 2.140002e+02 0.02679881 1.568574e+01 0.2147611 >>> 1301 1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056 >>> 5047 13.4413898 1.221761e-01 7.62426085 3.946099e-02 60.7774032 >>> ..... >>>> show(Rslt[1:10,]); >>> Coefficient: TypeTumor >>> symbols entrezID gene_id logFC logCPM LR PValue >>> FDR >>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 >>> 0 >>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 >>> 0 >>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 >>> 0 >>> ..... >>>> show(summary(de <- decideTestsDGE(lrt))); >>> [,1] >>> -1 6179 >>> 0 4832 >>> 1 7204 >>> >>> My questions are below: >>> >>> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower >>> expressed using 50 as cutoff here,kind of arbitrary. is there any better >>> way to assess the cutoff value? can any one suggest a good cutoff based on >>> his or her experience in such setting? >>> >>> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives >>> Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why >>> we got warning here (see warning message as above inside my commands), any >>> issue on the warnings? >>> >>> 3. The top gene lists shown in both cpmRslt and Rslt are quite consistent, >>> for example, gene 6532 always have tumor samples consistently much lower in >>> cpm counts compared to that of matched normal samples, which is consistent >>> with the logFC result in Rslt. So from the top genes, the result looks very >>> promising. However, when I look into the full table of Rslt, there are so >>> many genes appear to be differential expressed (see result of summary(de <- >>> decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in >>> the dataset) in this table with FDR <0.05, which make me a bit nervous >>> considering the issues as I asked in last two questions. Is this what was >>> usually seen in tumor vs normal comparison. This is a paired test of >>> course, which may increase the power and may see more differntial genes. >>> what if I try pooled tumors vs pooled normals disregard of matched tumor vs >>> normal sample pairs? which one shall be better for RNA-seq data, paired >>> test vs pooled test? >>> >>> 4. For pooled test, I may use exactTest(), also based on the guide, there >>> is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm >>> model. anything like that worthy of trying? Any advice? >>> >>> 5. I did MDS plot, which looks nice in that the tumor and normal are >>> separated very well in one side vs other. In the plotSmear plot, I did see >>> some red spots inside the black spots zones in the center zone (supposed to >>> be lower logFC), why they were called as significant? May this explain why >>> we get so many DEGs >13k of them? Or large variations across samples? >>> >>> Thanks so much in advance for your help! >>> >>> Mike >>> NIH >>> Maryland, USA. >>> >>>> show(sessionInfo()); >>> R version 3.0.1 (2013-05-16) >>> Platform: x86_64-unknown-linux-gnu (64-bit) >>> locale: >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> [7] LC_PAPER=C LC_NAME=C >>> [9] LC_ADDRESS=C LC_TELEPHONE=C >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> attached base packages: >>> [1] splines stats graphics grDevices utils datasets methods >>> [8] base >>> other attached packages: >>> [1] edgeR_3.4.2 limma_3.16.8 ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD REPLYlink written 5.4 years ago by Gordon Smyth37k
Thx for your advice, Ming > Date: Fri, 7 Feb 2014 09:15:36 +1100 > From: smyth@wehi.EDU.AU > To: yiming@mail.nih.gov > CC: yi02@hotmail.com; bioconductor@r-project.org > Subject: RE: [BioC] questions on edgeR package > > Ming, > > Given RSEM counts, I would personally use the limma-voom for preference > because it doesn't require you to round the counts. But you could use > either in your case. I said the same to you yesterday: > > https://www.stat.math.ethz.ch/pipermail/bioconductor/2014-February/0 57538.html > > Gordon > > On Thu, 6 Feb 2014, Yi, Ming (NIH/NCI) [C] wrote: > > > > > Dear Dr. Smyth: > > > > Thanks so much for your help and advice. Indeed, I did try to use edger > > on rounded RSEM processed raw counts, result is quite different, here > > are the comparisons FYI: > > > > Not rounded: > >> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); > > Disp = 3.99994 , BCV = 2 > > After rounded the RSEM raw counts to integers: > >> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); > > Disp = 0.32356 , BCV = 0.5688 > > > > > > Not rounded: > >> show(summary(de <- decideTestsDGE(lrt))); > > [,1] > > -1 6179 > > 0 4832 > > 1 7204 > > After rounded the RSEM raw counts to integers: > >> show(summary(de <- decideTestsDGE(lrt))); > > [,1] > > -1 4948 > > 0 7778 > > 1 5490 > > > > So observation is: after rounded, the dispersion is much smaller, and > > DEGs also has about 3k less genes, which appears to be promising. > > > > Then compared with limma-voom result of the same data: > > show(summary(de <- decideTests(fit))); > > ... > > PatientTCGA_91_6836 TypeTumor > > -1 13 4917 > > 0 18186 7276 > > 1 16 6022 > > > > > > The number of EDGs seems rather close to that when using edgeR on > > rounded RSEM raw counts. But what I am trying to ask your opinion here: > > in these situations, you would like more on the limma-voom result or the > > edgeR result on rounded RSEM raw counts? Since I used limma a lot on > > array data, but not much except this one on RNA-seq data. > > > > Thanks again for all your help and advice! > > > > Best, > > > > Ming > > > > > > > > > > > > -----Original Message----- > > From: Gordon K Smyth [mailto:smyth@wehi.edu.au] > > Sent: Thursday, February 06, 2014 12:40 AM > > To: Ming Yi > > Cc: Bioconductor mailing list > > Subject: Re: [BioC] questions on edgeR package > > > > Dear Ming, > > > > I have said this in a separate thread, but thought I should summarize it > > for the original thread. > > > > After some investigation, it is clear that edgeR's problem with the TCGA > > RSEM "raw counts" is that they are not integers. edgeR is a likelihood- > > based package and it evaluates the negative binomial likelihood, which is > > always exactly zero at non-integer values. Hence edgeR cannot estimate > > the dispersion sensibly if any of the input counts are fractional. > > > > If you round the RSEM counts to integers, then edgeR will run ok. > > > > I have made some changes to the edgeR GLM code on Bioc-devel so that it > > does this rounding automatically. (The edgeR classic code was doing this > > anyway.) Note that this work-around is only sensible for input values > > that are estimates of counts. > > > > Best wishes > > Gordon > > > > On Wed, 29 Jan 2014, Gordon K Smyth wrote: > > > >> Dear Ming, > >> > >> Something is seriously wrong -- you shouldn't get these warnings, you > >> shouldn't get such a large dispersion estimate and, if you do, you shouldn't > >> get such small p-values. > >> > >> I suspect that the culprit is RSEM. edgeR is designed to accept raw read > >> counts rather than estimated abundance levels as might be output from RSEM or > >> Cufflinks. > >> > >> There are a number of tools that produce genewise read counts from RNA-seq > >> data. My lab routinely uses subread and featureCounts: > >> > >> http://www.ncbi.nlm.nih.gov/pubmed/24227677 > >> > >> which are available from the Bioconductor package Rsubread. > >> > >> Best wishes > >> Gordon > >> > >> --------------------------------------------- > >> Professor Gordon K Smyth, > >> Bioinformatics Division, > >> Walter and Eliza Hall Institute of Medical Research, > >> 1G Royal Parade, Parkville, Vic 3052, Australia. > >> http://www.statsci.org/smyth > >> > >> On Tue, 28 Jan 2014, Ming Yi wrote: > >> > >>> > >>> Hi, > >>> > >>> I have a few questions on edgeR package. I used it a while ago. Recently, I > >>> started to use it again for an ongoing project and realized that it has > >>> been updated a lot with many new feaures. > >>> > >>> The data I am working on has 29 pairs of tumor and matched normal tissue > >>> samples from the same patients that have been run through RNAseq, the data > >>> has been processed by the popular RSEM method to be raw counts for each > >>> sample (tumor or macthed normal tissue of each patients). So I have a > >>> matrix of raw counts for 58 columns of data for all genes. we are looking > >>> for differential expressed genes between tumors and normals. Here are some > >>> main commands I used, eliminating some basic commands for simplicity: > >>> > >>>> show(head(targets)); > >>> SampleName Subject Type > >>> 1 T4626_01A 38_4626 Tumor > >>> 2 N4626_11A 38_4626 Normal > >>> 3 T2668_01A 44_2668 Tumor > >>> 4 N2668_11A 44_2668 Normal > >>> ...... > >>>> Patient<-factor(targets$Subject); > >>>> Type<-factor(targets$Type); > >>> ...... > >>> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]); > >>>> show(dim(y)); > >>> [1] 20531 58 > >>> > >>>> #filter low expressed genes > >>>> y<-y[rowSums(y$counts)>=50,] > >>>> #Recompute the library sizes: > >>>> y$samples$lib.size <- colSums(y$counts) > >>>> #Use Entrez Gene IDs as row names: > >>>> rownames(y$counts) <- rownames(y$genes) <- y$genes$entrezID > >>>> ##Apply TMM normalization > >>>> y <- calcNormFactors(y,method=c("TMM")); > >>>> show(y$samples); > >>> group lib.size norm.factors > >>> T4626_01A 1 97435132 0.9834587 > >>> N4626_11A 1 62583672 0.9154837 > >>> T2668_01A 1 77034746 0.9687860 > >>> N2668_11A 1 58631311 0.8644352 > >>> ...... > >>>> design<-model.matrix(~Patient+Type); > >>>> rownames(design) <- colnames(y); > >>>> y <- estimateGLMCommonDisp(y, design, verbose=TRUE); > >>> Disp = 3.99994 , BCV = 2 > >>> There were 50 or more warnings (use warnings() to see the first 50) > >>>> show(warnings()); > >>> Warning messages: > >>> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : > >>> non-integer x = 4.520000 > >>> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) : > >>> non-integer x = 5.380000 > >>> .... > >>>> y <- estimateGLMTrendedDisp(y, design); > >>> Loading required package: splines > >>> There were 50 or more warnings (use warnings() to see the first 50) > >>>> y <- estimateGLMTagwiseDisp(y, design) > >>> There were 50 or more warnings (use warnings() to see the first 50) > >>>> fit <- glmFit(y, design); > >>>> lrt <- glmLRT(fit); > >>>> show(topTags(lrt)); > >>> Coefficient: TypeTumor > >>> symbols entrezID gene_id logFC logCPM LR PValue > >>> FDR > >>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 > >>> 0 > >>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 > >>> 0 > >>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 > >>> 0 > >>> ....... > >>>> Rslt<-topTags(lrt, n=nrow(y),adjust.method="BH", sort.by="PValue"); > >>>> cpmRslt<-cpm(y) > >>>> cpmRslt<-cpmRslt[rownames(Rslt),]; > >>>> show(cpmRslt[1:10,1:6]); > >>> T4626_01A N4626_11A T2668_01A N2668_11A T6145_01A > >>> 6532 2.0558647 2.140002e+02 0.02679881 1.568574e+01 0.2147611 > >>> 1301 1.5236358 1.047224e-01 214.77904585 1.973049e-02 132.6917056 > >>> 5047 13.4413898 1.221761e-01 7.62426085 3.946099e-02 60.7774032 > >>> ..... > >>>> show(Rslt[1:10,]); > >>> Coefficient: TypeTumor > >>> symbols entrezID gene_id logFC logCPM LR PValue > >>> FDR > >>> 6532 SLC6A4 6532 SLC6A4|6532 -7.941097 6.329882 2946.202 0 > >>> 0 > >>> 1301 COL11A1 1301 COL11A1|1301 7.535040 6.150749 2105.109 0 > >>> 0 > >>> 5047 PAEP 5047 PAEP|5047 7.083690 5.655034 1498.046 0 > >>> 0 > >>> ..... > >>>> show(summary(de <- decideTestsDGE(lrt))); > >>> [,1] > >>> -1 6179 > >>> 0 4832 > >>> 1 7204 > >>> > >>> My questions are below: > >>> > >>> 1. I used the command: y<-y[rowSums(y$counts)>=50,] to filter lower > >>> expressed using 50 as cutoff here,kind of arbitrary. is there any better > >>> way to assess the cutoff value? can any one suggest a good cutoff based on > >>> his or her experience in such setting? > >>> > >>> 2. the command: y <- estimateGLMCommonDisp(y, design, verbose=TRUE) gives > >>> Disp = 3.99994 , BCV = 2, which is quite large, any issue on this? also why > >>> we got warning here (see warning message as above inside my commands), any > >>> issue on the warnings? > >>> > >>> 3. The top gene lists shown in both cpmRslt and Rslt are quite consistent, > >>> for example, gene 6532 always have tumor samples consistently much lower in > >>> cpm counts compared to that of matched normal samples, which is consistent > >>> with the logFC result in Rslt. So from the top genes, the result looks very > >>> promising. However, when I look into the full table of Rslt, there are so > >>> many genes appear to be differential expressed (see result of summary(de <- > >>> decideTestsDGE(lrt))). There are more than 13k genes (total 20531 genes in > >>> the dataset) in this table with FDR <0.05, which make me a bit nervous > >>> considering the issues as I asked in last two questions. Is this what was > >>> usually seen in tumor vs normal comparison. This is a paired test of > >>> course, which may increase the power and may see more differntial genes. > >>> what if I try pooled tumors vs pooled normals disregard of matched tumor vs > >>> normal sample pairs? which one shall be better for RNA-seq data, paired > >>> test vs pooled test? > >>> > >>> 4. For pooled test, I may use exactTest(), also based on the guide, there > >>> is no estimateGLMTrendedDisp needed like in calling glmFit/glmLRT in glm > >>> model. anything like that worthy of trying? Any advice? > >>> > >>> 5. I did MDS plot, which looks nice in that the tumor and normal are > >>> separated very well in one side vs other. In the plotSmear plot, I did see > >>> some red spots inside the black spots zones in the center zone (supposed to > >>> be lower logFC), why they were called as significant? May this explain why > >>> we get so many DEGs >13k of them? Or large variations across samples? > >>> > >>> Thanks so much in advance for your help! > >>> > >>> Mike > >>> NIH > >>> Maryland, USA. > >>> > >>>> show(sessionInfo()); > >>> R version 3.0.1 (2013-05-16) > >>> Platform: x86_64-unknown-linux-gnu (64-bit) > >>> locale: > >>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > >>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > >>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > >>> [7] LC_PAPER=C LC_NAME=C > >>> [9] LC_ADDRESS=C LC_TELEPHONE=C > >>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > >>> attached base packages: > >>> [1] splines stats graphics grDevices utils datasets methods > >>> [8] base > >>> other attached packages: > >>> [1] edgeR_3.4.2 limma_3.16.8 > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:9}}
ADD REPLYlink written 5.4 years ago by Ming Yi350
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 285 users visited in the last hour