Question: limma-voom
1
5.0 years ago by
Ming Yi350
Ming Yi350 wrote:
Hi, Dr. Gordon and all: Just follow up on my RNAseq analysis on the TCGA RSEM processed so- called raw counts using limma-voom package as Dr. Gordon suggested due to the issue of so-called RSEM raw counts but not true raw counts that edgeR required. just try to get your opinion on it. Here are the commands I used (only show most relevant ones) > library(edgeR); > library(limma) > show(head(targets)); SampleName Subject Type 1 T4626_01A TCGA_38_4626 Tumor 2 N4626_11A TCGA_38_4626 Normal 3 T2668_01A TCGA_44_2668 Tumor 4 N2668_11A TCGA_44_2668 Normal > Patient<-factor(targets$Subject); > Type<-factor(targets$Type); > y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]); > show(dim(y)); [1] 20531 58 ##Keep genes with least 1 count-per-million reads (cpm) in at least 20 samples: > isexpr <- rowSums(cpm(y)>1) >= 20; > y2 <- y[isexpr,] > show(dim(y2)); [1] 14735 58 > #Recompute the library sizes: > y2$samples$lib.size <- colSums(y2$counts) > #Use Entrez Gene IDs as row names: > rownames(y2$counts) <- rownames(y2$genes) <- y2$genes$entrezID; > ##Apply TMM normalization > y2 <- calcNormFactors(y2,method=c("TMM")); > plotMDS(y2,main="y2"); > show(data.frame(Sample=colnames(y2),Patient,Type)); Sample Patient Type 1 T4626_01A TCGA_38_4626 Tumor 2 N4626_11A TCGA_38_4626 Normal 3 T2668_01A TCGA_44_2668 Tumor 4 N2668_11A TCGA_44_2668 Normal ... > design<-model.matrix(~Patient+Type); > rownames(design) <- colnames(y2); > show(design[1:6,]); (Intercept) PatientTCGA_38_4626 PatientTCGA_38_4627 T4626_01A 1 1 0 N4626_11A 1 1 0 T2668_01A 1 0 0 ... > v <- voom(y2,design,plot=TRUE); > plotMDS(v, top=50,main="common for v of y2",labels=paste(substr(Type ,1,1),substr(Patient,9,12)),col=ifelse(Type=="Tumor","red","blue"),gen e.selection="common"); > plotMDS(v, top=50,main="pairwise for v of y2",labels=paste(substr(Ty pe,1,1),substr(Patient,9,12)),col=ifelse(Type=="Tumor","red","blue"),g ene.selection="pairwise"); > fit <- lmFit(v,design) > fit <- eBayes(fit) > options(digits=3) > Rslt<-topTable(fit,coef="TypeTumor",n=Inf,sort="p"); > show(summary(de <- decideTests(fit))); (Intercept) PatientTCGA_38_4626 PatientTCGA_38_4627 PatientTCGA_38_4632 -1 126 1 3 1 0 1805 14732 14725 14720 1 12804 2 7 14 ... PatientTCGA_91_6836 TypeTumor -1 4 3917 0 14718 5426 1 13 5392 Here are my questions: 1. I used isexpr <- rowSums(cpm(y)>1) >= 20 (y2 filtering) to do the filtering (based on one example in the package guide). I have 29 paired of matched tumor and normal samples (total 58 samples), is this reasonble to use for filtering, I asked this last time as well? I have a bit concern, since after filtering, the number of genes went from 20531 to 14735. Shall I use something like y3<-y[rowSums(y$counts)>=50,] (y3 filtering) to filter (this filtering leaves about 18215 genes, about 4k more genes than the 14735 from the other filtering way)? 2. I did plot with plotMDS(y2,main="y2"); v <- voom(y2,design,plot=TRUE); plotMDS(v, top=50,main...). The plotMDS all show very nice separation of tumors vs normals in dim1. However, the v <- voom(y2,design,plot=TRUE) plot slightly different between filtering with isexpr <- rowSums(cpm(y)>1) >= 20 vs filtering with y3<-y[rowSums(y$counts)>=50,]. The first filtering way seems got similar plot as the one on page 116 of the user guide, but the second way seems got bending toward to the down side around log2(count size +0.5) between 0 to 5, and the data points also shift to 0 side due to more of lower reads data points (the first option only have very few data points between 0 to 5 in x-axis). Does this suggest the first filtering way (y2) is better? 3. the above command show(summary(de <- decideTests(fit))); showed y2 filtering got 3917 +5392=9309 DEGs at FDR5% (7661 at FDR 1%), whereas the y3 filtering (similar commands as y2, not shown) got final 4917+6022=10939 DEGs at FDR5%, both filtering seem getting too many DEGs, any issues here? 4. using edgeR and the same data, I got more than 13k DEGs (total 18215 genes in the dataset after filtering) with FDR <0.05 (filtering like y3), now using limma voom, I still got quite a lot of DEGs (10939 DEGs at FDR5%) as shown in my question 3, although slight less than edgeR. Any potenital issue? Thanks so much in advance for your input and advice! Ming > From: yi02@hotmail.com > To: smyth@wehi.edu.au > Date: Thu, 30 Jan 2014 03:07:56 +0000 > CC: bioconductor@r-project.org > Subject: Re: [BioC] limma-voom > > Hi, Dr. Smyth: > > Thanks for the info. I used limma a lot but not much on voom. I will give a shot on voom. Thanks again and best Ming > > > Date: Thu, 30 Jan 2014 13:05:57 +1100 > > From: smyth@wehi.EDU.AU > > To: yi02@hotmail.com > > CC: bioconductor@r-project.org > > Subject: limma-voom > > > > Hi Ming, > > > > voom is part of the limma package. There is a voom case study in the > > limma User's Guide with complete working code. It is the last case study > > in the users guide. > > > > voom works fine with either counts, or fractional counts, or scaled > > counts. > > > > We are currently finalizing additional voom code to detect differential > > splicing. That code isn't in the public package, but will be soon. > > > > Best wishes > > Gordon > > > > On Thu, 30 Jan 2014, Ming Yi wrote: > > > > > I found the following documents > > > > > > http://stuff.mit.edu/afs/athena/software/r/r_v2.15.1/lib/R/libra ry/limma/html/voom.html > > > http://www.bioconductor.org/packages/release/bioc/vignettes/limm a/inst/doc/usersguide.pdf > > > http://www.statsci.org/smyth/pubs/VoomPreprint.pdf > > > > > > But I could not find the real limma-voom package document. Is voom just > > > as normalization function prior to using limma linear model? Any good > > > examples of case study of working commands available as example? > > > > > > Thanks again! > > > Ming > > > > ______________________________________________________________________ > > The information in this email is confidential and inte...{{dropped:9}} > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]] rnaseq normalization limma edger • 3.0k views ADD COMMENTlink modified 5.0 years ago by Gordon Smyth36k • written 5.0 years ago by Ming Yi350 Answer: limma-voom 0 5.0 years ago by Gordon Smyth36k Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia Gordon Smyth36k wrote: > Hi, Dr. Gordon and all: > > Just follow up on my RNAseq analysis on the TCGA RSEM processed > so-called raw counts using limma-voom package as Dr. Gordon suggested > due to the issue of so-called RSEM raw counts but not true raw counts > that edgeR required. edgeR will work OK if you simply round the RSEM expected counts to integers before running edgeR. However voom is slightly preferable because it can use the fractional counts as they are. > just try to get your opinion on it. Here are the > commands I used (only show most relevant ones) > >> library(edgeR); >> library(limma) >> show(head(targets)); > SampleName Subject Type > 1 T4626_01A TCGA_38_4626 Tumor > 2 N4626_11A TCGA_38_4626 Normal > 3 T2668_01A TCGA_44_2668 Tumor > 4 N2668_11A TCGA_44_2668 Normal >> Patient<-factor(targets$Subject); >> Type<-factor(targets$Type); >> y<-DGEList(count=rawdata[,4:ncol(rawdata)],genes=rawdata[,1:3]); >> show(dim(y)); > [1] 20531 58 > > ##Keep genes with least 1 count-per-million reads (cpm) in at least 20 samples: >> isexpr <- rowSums(cpm(y)>1) >= 20; >> y2 <- y[isexpr,] >> show(dim(y2)); > [1] 14735 58 >> #Recompute the library sizes: >> y2$samples$lib.size <- colSums(y2$counts) >> #Use Entrez Gene IDs as row names: >> rownames(y2$counts) <- rownames(y2$genes) <- y2$genes$entrezID; > >> ##Apply TMM normalization >> y2 <- calcNormFactors(y2,method=c("TMM")); >> plotMDS(y2,main="y2"); > >> show(data.frame(Sample=colnames(y2),Patient,Type)); > Sample Patient Type > 1 T4626_01A TCGA_38_4626 Tumor > 2 N4626_11A TCGA_38_4626 Normal > 3 T2668_01A TCGA_44_2668 Tumor > 4 N2668_11A TCGA_44_2668 Normal > ... >> design<-model.matrix(~Patient+Type); >> rownames(design) <- colnames(y2); >> show(design[1:6,]); > (Intercept) PatientTCGA_38_4626 PatientTCGA_38_4627 > T4626_01A 1 1 0 > N4626_11A 1 1 0 > T2668_01A 1 0 0 > ... >> v <- voom(y2,design,plot=TRUE); >> plotMDS(v, top=50,main="common for v of y2",labels=paste(substr(Typ e,1,1),substr(Patient,9,12)),col=ifelse(Type=="Tumor","red","blue"),ge ne.selection="common"); >> plotMDS(v, top=50,main="pairwise for v of y2",labels=paste(substr(T ype,1,1),substr(Patient,9,12)),col=ifelse(Type=="Tumor","red","blue"), gene.selection="pairwise"); >> fit <- lmFit(v,design) >> fit <- eBayes(fit) >> options(digits=3) >> Rslt<-topTable(fit,coef="TypeTumor",n=Inf,sort="p"); >> show(summary(de <- decideTests(fit))); > (Intercept) PatientTCGA_38_4626 PatientTCGA_38_4627 PatientTCGA_38_4632 > -1 126 1 3 1 > 0 1805 14732 14725 14720 > 1 12804 2 7 14 > ... > PatientTCGA_91_6836 TypeTumor > -1 4 3917 > 0 14718 5426 > 1 13 5392 > > > Here are my questions: > > 1. I used isexpr <- rowSums(cpm(y)>1) >= 20 (y2 filtering) to do the > filtering (based on one example in the package guide). I have 29 paired > of matched tumor and normal samples (total 58 samples), is this > reasonble to use for filtering, I asked this last time as well? I have a > bit concern, since after filtering, the number of genes went from 20531 > to 14735. Why would this be a concern? One wouldn't expect the whole genome to be expressed at worthwhile levels in each particular cell type. Some genes are specific to cell types. > Shall I use something like y3<-y[rowSums(y$counts)>=50,] (y3 > filtering) to filter (this filtering leaves about 18215 genes, about 4k > more genes than the 14735 from the other filtering way)? > > 2. I did plot with plotMDS(y2,main="y2"); v <- > voom(y2,design,plot=TRUE); plotMDS(v, top=50,main...). The plotMDS all > show very nice separation of tumors vs normals in dim1. I have analysed similar data from TCGA, and the separation is really much too good. Furthermore there is little evidence of matching. It would appear that the matched normal cells are not really comparable cells to the tumours. The matched normal samples may actually be profiles of whole blood, a cell type which has a radically different expression profile to epithelial cells. I really wonder what one can hope to learn about cancer by comparing epithelial tumours to blood. > However, the v <- voom(y2,design,plot=TRUE) plot slightly different > between filtering with isexpr <- rowSums(cpm(y)>1) >= 20 vs filtering > with y3<-y[rowSums(y$counts)>=50,]. The first filtering way seems got > similar plot as the one on page 116 of the user guide, but the second > way seems got bending toward to the down side around log2(count size > +0.5) between 0 to 5, and the data points also shift to 0 side due to > more of lower reads data points (the first option only have very few > data points between 0 to 5 in x-axis). Does this suggest the first > filtering way (y2) is better? Yes, but one will good results either way. > 3. the above command show(summary(de <- decideTests(fit))); showed y2 > filtering got 3917 +5392=9309 DEGs at FDR5% (7661 at FDR 1%), whereas > the y3 filtering (similar commands as y2, not shown) got final > 4917+6022=10939 DEGs at FDR5%, both filtering seem getting too many > DEGs, any issues here? This is what I would expect to get if I compared epithelial tumour cells to normal blood. If this is "too many DEGs", then the problem is with the data rather than the analysis. > 4. using edgeR and the same data, I got more than 13k DEGs (total 18215 > genes in the dataset after filtering) with FDR <0.05 (filtering like > y3), now using limma voom, I still got quite a lot of DEGs (10939 DEGs > at FDR5%) as shown in my question 3, although slight less than edgeR. > Any potenital issue? The edgeR analysis on fractional counts is not correct. edgeR is a likelihood based package. It evaluates the negative binomial likelihood, which is always exactly zero at non-integers. So there is no point in comparing the edgeR results to the voom results. Best wishes Gordon > Thanks so much in advance for your input and advice! > > Ming > > > > > > >> From: yi02 at hotmail.com >> To: smyth at wehi.edu.au >> Date: Thu, 30 Jan 2014 03:07:56 +0000 >> CC: bioconductor at r-project.org >> Subject: Re: [BioC] limma-voom >> >> Hi, Dr. Smyth: >> >> Thanks for the info. I used limma a lot but not much on voom. I will give a shot on voom. Thanks again and best Ming >> >>> Date: Thu, 30 Jan 2014 13:05:57 +1100 >>> From: smyth at wehi.EDU.AU >>> To: yi02 at hotmail.com >>> CC: bioconductor at r-project.org >>> Subject: limma-voom >>> >>> Hi Ming, >>> >>> voom is part of the limma package. There is a voom case study in the >>> limma User's Guide with complete working code. It is the last case study >>> in the users guide. >>> >>> voom works fine with either counts, or fractional counts, or scaled >>> counts. >>> >>> We are currently finalizing additional voom code to detect differential >>> splicing. That code isn't in the public package, but will be soon. >>> >>> Best wishes >>> Gordon >>> >>> On Thu, 30 Jan 2014, Ming Yi wrote: >>> >>>> I found the following documents >>>> >>>> http://stuff.mit.edu/afs/athena/software/r/r_v2.15.1/lib/R/librar y/limma/html/voom.html >>>> http://www.bioconductor.org/packages/release/bioc/vignettes/limma /inst/doc/usersguide.pdf >>>> http://www.statsci.org/smyth/pubs/VoomPreprint.pdf >>>> >>>> But I could not find the real limma-voom package document. Is voom just >>>> as normalization function prior to using limma linear model? Any good >>>> examples of case study of working commands available as example? >>>> >>>> Thanks again! >>>> Ming >>> >>> ______________________________________________________________________ >>> The information in this email is confidential and inte...{{dropped:9}} >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
Hi, Just wanted to comment on the "paired" TCGA data: On Wed, Feb 5, 2014 at 2:38 PM, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: [snip] > I have analysed similar data from TCGA, and the separation is really much > too good. Furthermore there is little evidence of matching. It would > appear that the matched normal cells are not really comparable cells to the > tumours. The matched normal samples may actually be profiles of whole > blood, a cell type which has a radically different expression profile to > epithelial cells. I really wonder what one can hope to learn about cancer > by comparing epithelial tumours to blood. There are samples where you have the "correct" matched sample for gene expression, as indeed comparing a solid tumor to "normal blood" to identify changes in gene expression would make very little sense. You just need to be careful in parsing the sample barcodes to ensure you are working with the correct stuff. An explanation of the barcode scheme is here: https://wiki.nci.nih.gov/display/TCGA/TCGA+Barcode You are interested in decoding the "sample" value, which ranges from 01-29. Explanation is here: https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeTab le=Sample%20type There are samples taken from "normal blood" (code = 10), however code=11 are solid tissue normals (presumably from the correct tissue ;-) The preponderance of blood samples is, I believe, primarily to be used in order to identify somatic mutations -- presumably using the exome-seq from "normal blood" would be cleaner than primary tissue, depending on how close to the tumor the tissue was taken from. To figure out what was extracted from the "sample" (DNA, RNA, etc), you just decode the "Analyte" part of the barcode: https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeTab le=portion%20analyte HTH, -steve -- Steve Lianoglou Computational Biologist Genentech
Hi, Just wanted to add in a little more on the "normal" adjacent samples that TCGA uses. I work with breast cancer samples, and it has appeared that some(if not many) of these "normal" adjacent samples are really similar to the tumour themselves. This has been reported earlier (see http://cancerres.aacrjournals.org/cgi/content/meeting_abstract/72/24_M eetingAbstracts/P1-03-02 ). Just thought it will be worth sharing. Cheers, Jeremy On Thu, Feb 6, 2014 at 7:23 AM, Steve Lianoglou <lianoglou.steve@gene.com>wrote: > Hi, > > Just wanted to comment on the "paired" TCGA data: > > On Wed, Feb 5, 2014 at 2:38 PM, Gordon K Smyth <smyth@wehi.edu.au> wrote: > [snip] > > I have analysed similar data from TCGA, and the separation is really much > > too good. Furthermore there is little evidence of matching. It would > > appear that the matched normal cells are not really comparable cells to > the > > tumours. The matched normal samples may actually be profiles of whole > > blood, a cell type which has a radically different expression profile to > > epithelial cells. I really wonder what one can hope to learn about > cancer > > by comparing epithelial tumours to blood. > > There are samples where you have the "correct" matched sample for gene > expression, as indeed comparing a solid tumor to "normal blood" to > identify changes in gene expression would make very little sense. > > You just need to be careful in parsing the sample barcodes to ensure > you are working with the correct stuff. > > An explanation of the barcode scheme is here: > https://wiki.nci.nih.gov/display/TCGA/TCGA+Barcode > > You are interested in decoding the "sample" value, which ranges from 01-29. > > Explanation is here: > > https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeT able=Sample%20type > > There are samples taken from "normal blood" (code = 10), however > code=11 are solid tissue normals (presumably from the correct tissue > ;-) > > The preponderance of blood samples is, I believe, primarily to be used > in order to identify somatic mutations -- presumably using the > exome-seq from "normal blood" would be cleaner than primary tissue, > depending on how close to the tumor the tissue was taken from. > > To figure out what was extracted from the "sample" (DNA, RNA, etc), > you just decode the "Analyte" part of the barcode: > > https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm?codeT able=portion%20analyte > > > HTH, > -steve > > -- > Steve Lianoglou > Computational Biologist > Genentech > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]