Question: Design matrix and BCV
5.5 years ago by Gordon Smyth, Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth wrote:
On Sun, 5 May 2013, Manoj Hariharan wrote:

> Dear Gordon,
>
> Thanks again for your inputs. I am quite clear of the method now. I
> agree that the DE genes are exactly the same (and in same order)
> whichever tissue I would take as a base group (the one that gets
> absorbed as the intercept).
>
> I was referring to the values of logFC.XX that is obtained from the
> topTags table. This is quite different based on the tissue that I use as
> base group. I guess this is not the log 2 fold change compared to the
> average across all groups, whereas, it is the fold change compared to
> the base group.

Yes, that is correct. The toptable shows you the estimated coefficients from the fitted model, and in this case you defined the coefficients relative to the base group.

You can easily get the fold change compared to the average across all groups, if you wish, but that's not usually a very useful quantity. What is your question?

> I have attached a screen-shot of the topTags table for the top 47 DE
> genes in a few tissues to make the point, by using three different
> tissues as the basegroup.

No need to give examples. This is just documented behaviour of the software.

Best wishes
Gordon

With FT as base group:

  tiss_groups <- factor(c("AAFT","AAFT","AAFT","AD","AD","AO","AO","BL", ...))
  design <- model.matrix(~tiss_groups)
  QLF_lrt <- glmQLFTest(fit,coef=2:18)
  toptags_QLFLRT <- topTags(QLF_lrt, n=nrow(D$counts))
  toptags_QLFLRT_table <- toptags_QLFLRT$table
  write.table(toptags_QLFLRT_table, "All37Cmprd_QLTLRTTable_BaseGroupFT_toptags", sep="\t", quote=FALSE)

With PO as base group:

  tiss_groups_PO <- factor(c("AAPO","AAPO","AAPO","AD","AD","AO","AO","BL", ...))
  write.table(toptags_QLFLRT_table_PO, "All37Cmprd_QLTLRTTable_BaseGroupPO_toptags", sep="\t", quote=FALSE)

With SB as base group:

  tiss_groups_SB <- factor(c("AASB","AASB","AASB","AD","AD","AO","AO","BL", ...))
  write.table(toptags_QLFLRT_table_SB, "All37Cmprd_QLTLRTTable_BaseGroupSB_toptags", sep="\t", quote=FALSE)

Thanks again for your time and valuable guidance.

Regards,
Manoj.

Manoj Hariharan
Staff Researcher
The Salk Institute for Biological Studies
La Jolla, CA 92037
Office: 858.453.4100 x2143

________________________________
From: Gordon K Smyth <smyth at wehi.edu.au>
To: Manoj Hariharan <h_manoj at yahoo.com>
Cc: Bioconductor mailing list <bioconductor at r-project.org>
Sent: Sunday, April 28, 2013 1:07 AM
Subject: Re: Design matrix and BCV

Dear Manoj,

---------------------------------------------
Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852, http://www.statsci.org/smyth

On Sat, 27 Apr 2013, Manoj Hariharan wrote:

> Dear Gordon,
>
> Thanks very much for your response. I updated to the latest version of
> edgeR (edgeR_3.2.3).
>
> 1. I checked the BCV of unrelated individuals mentioned in page 69 -
> that was from study based on cell lines ("RNA-Seq profiles were made
> from lymphoblastoid cell lines"). They are grown in controlled
> conditions, uniformly. But, in my case, the samples are tissues
> dissected from donors just after death.

Each lymphoblastoid cell line is from a different person. But, yes, I agree that samples from human tissue donors will be very variable.

> Anyway, I now filtered out the outliers by using a more
> stringent cut-off of "keep <- rowSums(cpm(D)>1) >= 30" and I get a
> BCV of 51% ("Disp = 0.26425 , BCV = 0.5141"). I have
> also attached the BCV plot.
>
> 2. About the ANOVA-type test: I still do not understand why the first
> group gets treated as the baseline. In my case, all samples (or groups)
> are normal. So all of these are in one sense the "wild-type".
> And, when the first group gets absorbed in the intercept, the comparison
> of gene expression is made to the first group (as it gets treated as the
> baseline). I thought this approach does not require one group to be used
> as a wild-type.

The reason why one of the groups is absorbed into the intercept is that it is only possible to make 17 independent comparisons between 18 groups. So it is only meaningful to have 17 coefficients in the model apart from the intercept. You seem to be jumping to the conclusion that the reference sample must be a control sample, but this is not correct. The use of one group as a reference in the intercept term is purely for mathematical convenience.

The ANOVA test result remains exactly the same regardless of which group is absorbed into the intercept. Indeed you can fit any design matrix you like, and define any test of 17 independent contrasts, and you will get the same ANOVA test. It makes no difference, providing the null hypothesis remains that all 18 groups are equal. You could for example use

  design <- model.matrix(~0+tiss_groups)

and then define any set of 17 pairwise comparisons between the groups. This would lead to exactly the same ANOVA test. It is just more convenient to do as you do below.

> So should I use the following to get the actual expression values of
> genes in each sample:
>
>   fit <- glmFit(D, design)
>   Fit_FittedVals <- fit$fitted.values

edgeR is not designed to estimate actual expression values. However, if you would like to get the average logCPM value for each tissue group, then your code will do that provided you have defined the design matrix by model.matrix(~0+tiss_groups).

> and use the following to get the logFC of groups after the DE test:
>
>   QLF_lrt <- glmQLFTest(fit,coef=2:18)
>   QLTLRT_Table <- QLF_lrt$table

I don't understand what you mean by "logFC of groups". To get a logFC, it is necessary to compare one group with another. Which two groups do you want to compare?
You have 18 tissue groups, so for each gene there are 153 possible pairwise comparisons between the groups. That's a lot of logFCs.

Best wishes
Gordon

> Thanks again for your advice. I would much appreciate on these follow-up
> doubts too.
>
> Regards,
> Manoj.

________________________________
From: Gordon K Smyth <smyth at wehi.edu.au>
To: Manoj Hariharan <h_manoj at yahoo.com>
Cc: Bioconductor mailing list <bioconductor at r-project.org>
Sent: Thursday, April 25, 2013 11:38 PM
Subject: Design matrix and BCV

Dear Manoj,

First of all, can I please persuade you to install the latest version of edgeR? You need R 3.0.0 and Bioconductor Release 2.12.

> Date: Wed, 24 Apr 2013 13:33:05 -0700
> From: Manoj Hariharan <h_manoj at yahoo.com>
> To: "bioconductor at stat.math.ethz.ch" <bioconductor at stat.math.ethz.ch>
> Subject: [BioC] Design matrix and BCV
>
> Hello,
>
> I am new to RNA-seq analysis. I have worked on a few not-too-complicated
> projects and have found edgeR to be right for my work. In this project I
> have RNA-seq data from 18 human tissues (normal, no treatment). All
> tissues except 5 of 18 have at least 2 replicates. The replicate tissues
> are obtained from separate individuals (they are of different age and
> sex). There are a few issues I need to discuss with the experts in the
> group:
>
> 1. The BCV value is quite high (Disp = 0.36621 , BCV = 0.6052). I think
> this is partly due to the way we have collected replicates - they are
> from separate individuals - different age and sex. Is this really bad -
> I had read in the User Guide that BCV of ~40% is acceptable in tumor
> samples? Does adjusting the prior.df help (I've attached the BCV
> plots)? At a later stage I plan to include age and sex as "factors" and
> re-do the analysis.

I would view this BCV as unacceptably high in my own research. Page 69 of the edgeR User's Guide shows a BCV plot for unrelated individuals:

  http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

and I don't think that the BCV should get much higher than this for a designed experiment. Another concern is that the dispersion trend in your data looks a bit strange. In your case, I'd be looking for outliers or batch effects or other problems. The prior.df does not affect the common dispersion.

> 2. I am interested in the differentially expressed genes - across these
> 18 tissues. I guess I should be using the approach explained in section
> 3.2.5 of the User Guide (ANOVA-like test).

Yes.

> Below, is the output. The problem is that the first tissue "AD" is
> absorbed into the intercept. I have read in other discussion threads
> that this is normal.

Yes, this is normal. I don't see why it should cause any problem.

> But I do need the logFC values for the AD tissue also.

The fitted model gives you logFC for AD vs each of the other tissues.

> If I use the "design <- model.matrix(~0+tiss_groups, data=D$samples)",
> I can get the AD column in the design matrix, but then, I would not be
> able to get the baseline intercept column, and I get all genes
> differentially expressed. Is there a work-around? How can I handle
> this issue?

There is no reason to do this.

> 3. How best can I decide on the prior.df? I read the threads on choosing
> the value based on the number of libraries and groups. But I am not
> sure. So I tried with prior.df default (20), 10 and 2 with varying
> number of DE genes.

There is no need to set the prior.df, because the glmQLFTest() function estimates the prior.df for you automatically. The idea is to use estimateGLMTrendedDisp() then call glmQLFTest(). Alternatively and better, please upgrade to the current version of edgeR and follow the case study in Section 4.6.
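
For readers on a recent edgeR release, the quasi-likelihood route recommended above can be sketched roughly as follows. This is only a minimal sketch on simulated counts, not the poster's data: the three-level factor `group`, the gene and sample names, and the negative binomial parameters are all invented for illustration, and it assumes an edgeR version that provides estimateDisp() and glmQLFit(), which are not the calls used in the edgeR 3.0.x session quoted in this thread.

```r
library(edgeR)

set.seed(1)

# Hypothetical toy data: 200 genes, 3 groups with 3 replicates each.
group <- factor(rep(c("A", "B", "C"), each = 3))
counts <- matrix(rnbinom(200 * 9, mu = 50, size = 5), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:9)))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# Intercept design: group A is absorbed into the intercept.
design <- model.matrix(~group)

# Estimate dispersions; no tagwise prior.df is chosen by hand here.
y <- estimateDisp(y, design)

# Quasi-likelihood fit; the prior df for the QL dispersions is
# estimated from the data inside glmQLFit().
fit <- glmQLFit(y, design)

# ANOVA-like test: are all non-intercept coefficients zero?
qlf <- glmQLFTest(fit, coef = 2:3)
tab <- topTags(qlf, n = Inf)$table
head(tab)
```

With 18 tissue groups the same call would use coef = 2:18, as in the original session; the toy example only has two non-intercept coefficients.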

It is not actually correct to input tagwise dispersion estimates to glmQLFTest. There was no check against this in edgeR version 3.0.X, but there is in the current release.

Best wishes
Gordon

> R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
> Copyright (C) 2012 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>
>   Natural language support but running in an English locale
>
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
>
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
>
> Loading required package: DBI
> Loading required package: AnnotationDbi
> Loading required package: BiocGenerics
>
> Attaching package: 'BiocGenerics'
>
> The following object(s) are masked from 'package:stats':
>
>     xtabs
>
> The following object(s) are masked from 'package:base':
>
>     anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find,
>     get, intersect, lapply, Map, mapply, mget, order, paste, pmax,
>     pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int,
>     rownames, sapply, setdiff, table, tapply, union, unique
>
> Loading required package: Biobase
> Welcome to Bioconductor
>
>     Vignettes contain introductory material; view with
>     'browseVignettes()'. To cite Bioconductor, see
>     'citation("Biobase")', and for packages 'citation("pkgname")'.
>
> Loading Tcl/Tk interface ... done
>
> KEGG.db contains mappings based on older data because the original
>   resource was removed from the the public domain before the most
>   recent update was produced.
>   This package should now be considered
>   deprecated and future versions of Bioconductor may not have it
>   available. One possible alternative to consider is to look at the
>   reactome.db package
>
> [Workspace loaded from /users/manoj/.RData]
>
>> setwd('/Users/manoj/Data/SDEC_hg19/AllCountDataStrndd/')
> Warning message:
> package 'AnnotationDbi' was built under R version 2.15.2
>> library(edgeR)
> Loading required package: limma
> Warning messages:
> 1: package 'edgeR' was built under R version 2.15.2
> 2: package 'limma' was built under R version 2.15.2
>> targets <- read.delim("AllCountData_AllTiss_Info", stringsAsFactors = FALSE, header=TRUE)
>> D <- readDGE(targets)
>> keep <- rowSums(cpm(D)>1) >= 10
>> D <- D[keep,]
>> tiss_groups <- factor(c("AD","AD","AO","AO","BL","EG","EG","FT","FT","FT","GA","GA","GA","LG","LG","LI","LV","LV","OV","PA","PA","PO","PO","PO","RA","RV","RV","SB","SB","SB","SG","SG","SG","SX","SX","SX","TH"))
>> design <- model.matrix(~tiss_groups)
>> design
>    (Intercept) tiss_groupsAO tiss_groupsBL tiss_groupsEG tiss_groupsFT tiss_groupsGA tiss_groupsLG tiss_groupsLI tiss_groupsLV tiss_groupsOV ...
> attr(,"assign")
>  [1] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> attr(,"contrasts")
> attr(,"contrasts")$tiss_groups
> [1] "contr.treatment"
>
>> D <- calcNormFactors(D)
>> D <- estimateGLMCommonDisp(D, design, verbose=TRUE)
> Disp = 0.36621 , BCV = 0.6052
>> D <- estimateGLMTrendedDisp(D, design)
> Loading required package: splines
>> D <- estimateGLMTagwiseDisp(D, design)
>> plotBCV(D, main="BCV Plot: default prior df")
>> D <- estimateGLMTagwiseDisp(D, design, prior.df=10)
>> plotBCV(D, main="BCV Plot: default prior df of 10")
>> D <- estimateGLMTagwiseDisp(D, design, prior.df=2)
>> plotBCV(D, main="BCV Plot: default prior df of 2")
>
>> fit <- glmFit(D, design)
>> QLF_lrt <- glmQLFTest(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 8308
>
>> glm_lrt <- glmLRT(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(glm_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 11255
>
> Using different parameters (prior.df) for estimateGLMTagwiseDisp:
>
>> D <- calcNormFactors(D)
>> D <- estimateGLMCommonDisp(D, design, verbose=TRUE)
> Disp = 0.36621 , BCV = 0.6052
>> D <- estimateGLMTrendedDisp(D, design)
>> fit <- glmFit(D, design)
>> QLF_lrt <- glmQLFTest(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 8308
>
>> D <- estimateGLMTagwiseDisp(D, design)
>> fit <- glmFit(D, design)
>> QLF_lrt <- glmQLFTest(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 10935
>
>> D <- estimateGLMTagwiseDisp(D, design, prior.df=2)
>> fit <- glmFit(D, design)
>> QLF_lrt <- glmQLFTest(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 12622
>
>> D <- estimateGLMTagwiseDisp(D, design, prior.df=10)
>> fit <- glmFit(D, design)
>> QLF_lrt <- glmQLFTest(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 12033
>
> Design matrix without intercept:
>
>> design <- model.matrix(~0+tiss_groups, data=D$samples)
>> design
>    tiss_groupsAD tiss_groupsAO tiss_groupsBL tiss_groupsEG tiss_groupsFT tiss_groupsGA tiss_groupsLG tiss_groupsLI tiss_groupsLV tiss_groupsOV ...
> attr(,"assign")
>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> attr(,"contrasts")
> attr(,"contrasts")$tiss_groups
> [1] "contr.treatment"
>
>> D <- estimateGLMCommonDisp(D, design, verbose=TRUE)
> Disp = 0.36621 , BCV = 0.6052
>> D <- estimateGLMTrendedDisp(D, design)
>> fit <- glmFit(D, design)
>> QLF_lrt <- glmQLFTest(fit,coef=2:18)
>> FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
>> sum(FDR_Stsfd < 0.05)
> [1] 20364
>
>> sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] edgeR_3.0.7          limma_3.14.3         AnnotationDbi_1.20.3 Biobase_2.18.0       BiocGenerics_0.4.0   RSQLite_0.11.2       DBI_0.2-5
>
> loaded via a namespace (and not attached):
>  [1] clusterProfiler_1.6.0 colorspace_1.2-0      dichromat_1.2-4       digest_0.6.0          DO.db_2.5.0           DOSE_1.4.0
>  [7] ggplot2_0.9.3         GO.db_2.8.0           GOSemSim_1.16.1       grid_2.15.1           gtable_0.1.2          igraph_0.6-3
> [13] IRanges_1.16.4        KEGG.db_2.8.0         labeling_0.1          MASS_7.3-23           munsell_0.4           parallel_2.15.1
> [19] plyr_1.8              proto_0.3-10          qvalue_1.32.0         RColorBrewer_1.0-5    reshape2_1.2.2        scales_0.2.3
> [25] stats4_2.15.1         stringr_0.6.2         tcltk_2.15.1          tools_2.15.1
>
> Thanks,
> Manoj.
>
> ------------------------------
>
> Manoj Hariharan, Ph.D.
> Staff Researcher
> The Salk Institute for Biological Studies
> La Jolla, CA 92037
> Office: 858.453.4100 x2143
>
> [Attachments scrubbed by the mailing list: BCVPlot_dfDefault.png, BCVPlot_df10.png, BCVPlot_df2.png]
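
In the no-intercept run quoted above, coef=2:18 no longer refers to differences from a base group: with ~0+tiss_groups each coefficient is a group's own log-abundance, so that test asks whether 17 of those are zero, which would plausibly explain why nearly every gene was flagged. The contrast-based alternative mentioned in the reply can be sketched as below. This is a minimal sketch on simulated counts, assuming a recent edgeR (estimateDisp(), glmQLFit()) together with makeContrasts() from limma; the groups A, B, C and all data are hypothetical.

```r
library(edgeR)  # attaches limma as well, which provides makeContrasts()

set.seed(2)

# Hypothetical toy data: 300 genes, 3 groups with 3 replicates each.
group <- factor(rep(c("A", "B", "C"), each = 3))
counts <- matrix(rnbinom(300 * 9, mu = 50, size = 5), nrow = 300,
                 dimnames = list(paste0("gene", 1:300), paste0("s", 1:9)))
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)

# Parameterization 1: intercept design, group A absorbed into the intercept.
design1 <- model.matrix(~group)
y <- estimateDisp(y, design1)   # dispersions estimated once, reused below
fit1 <- glmQLFit(y, design1)
anova1 <- glmQLFTest(fit1, coef = 2:3)

# Parameterization 2: one coefficient per group, no intercept.
design0 <- model.matrix(~0 + group)
colnames(design0) <- levels(group)
fit0 <- glmQLFit(y, design0)

# Any set of (number of groups - 1) independent pairwise comparisons
# spans the same null hypothesis "all groups are equal".
con <- makeContrasts(B - A, C - B, levels = design0)
anova0 <- glmQLFTest(fit0, contrast = con)

# Largest difference between the two sets of ANOVA-like p-values.
max_diff <- max(abs(anova1$table$PValue - anova0$table$PValue))
max_diff
```

The individual logFC columns differ between the two tables because they describe different comparisons; only the overall F-test and its p-value are meant to agree.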
modified 5.5 years ago by Manoj Hariharan • written 5.5 years ago by Gordon Smyth
5.5 years ago by Manoj Hariharan
Manoj Hariharan wrote:
Dear Gordon,

Thanks again for your inputs. I am quite clear of the method now. I agree that the DE genes are exactly the same (and in same order) whichever tissue I would take as a base group (the one that gets absorbed as the intercept).

I was referring to the values of logFC.XX that are obtained from the topTags table. These are quite different depending on the tissue that I use as base group. I guess this is not the log 2 fold change compared to the average across all groups; rather, it is the fold change compared to the base group.

I have attached a screen-shot of the topTags table for the top 47 DE genes in a few tissues to make the point, by using three different tissues as the base group.

With FT as base group:

  tiss_groups <- factor(c("AAFT","AAFT","AAFT","AD","AD","AO","AO","BL", ...))
  design <- model.matrix(~tiss_groups)
  QLF_lrt <- glmQLFTest(fit,coef=2:18)
  toptags_QLFLRT <- topTags(QLF_lrt, n=nrow(D$counts))
  toptags_QLFLRT_table <- toptags_QLFLRT$table
  write.table(toptags_QLFLRT_table, "All37Cmprd_QLTLRTTable_BaseGroupFT_toptags", sep="\t", quote=FALSE)

With PO as base group:

  tiss_groups_PO <- factor(c("AAPO","AAPO","AAPO","AD","AD","AO","AO","BL", ...))
  write.table(toptags_QLFLRT_table_PO, "All37Cmprd_QLTLRTTable_BaseGroupPO_toptags", sep="\t", quote=FALSE)

With SB as base group:

  tiss_groups_SB <- factor(c("AASB","AASB","AASB","AD","AD","AO","AO","BL", ...))
  write.table(toptags_QLFLRT_table_SB, "All37Cmprd_QLTLRTTable_BaseGroupSB_toptags", sep="\t", quote=FALSE)

Thanks again for your time and valuable guidance.

Regards,
Manoj.
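
The dependence of the logFC columns on the base group can be illustrated with a small base-R sketch. It deliberately swaps in an ordinary linear model (lm) on made-up log-expression values for a single gene instead of edgeR's negative binomial GLM, so it only illustrates how treatment-contrast coefficients behave; the tissue labels are borrowed from the thread, and the numbers and variable names are invented.

```r
set.seed(42)

# Hypothetical log2 expression of one gene in three tissues, 3 replicates each.
tissue <- factor(rep(c("AD", "AO", "FT"), each = 3))
true_means <- c(AD = 5, AO = 7, FT = 4)
log_expr <- true_means[as.character(tissue)] + rnorm(9, sd = 0.2)

# Base group AD (first level): coefficients are AO - AD and FT - AD.
coef_AD <- coef(lm(log_expr ~ tissue))

# Base group FT: coefficients are now AD - FT and AO - FT.
tissue_FT <- relevel(tissue, ref = "FT")
coef_FT <- coef(lm(log_expr ~ tissue_FT))

# Observed group means, and each group relative to the average of all groups.
group_means <- tapply(log_expr, tissue, mean)
vs_average <- group_means - mean(group_means)

coef_AD      # (Intercept), tissueAO, tissueFT
coef_FT      # (Intercept), tissue_FTAD, tissue_FTAO
vs_average   # deviations from the average across groups; they sum to zero
```

The fitted group means are the same under both codings; only the reported differences change, which matches the behaviour of the logFC columns described above.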
Manoj Hariharan Staff Researcher The Salk Institute for Biological Studies La Jolla, CA 92037 Office: 858.453.4100 x2143 ________________________________ From: Gordon K Smyth <smyth at="" wehi.edu.au=""> To: Manoj Hariharan <h_manoj at="" yahoo.com=""> Cc: Bioconductor mailing list <bioconductor at="" r-project.org=""> Sent: Sunday, April 28, 2013 1:07 AM Subject: Re: Design matrix and BCV Dear Manoj, --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, http://www.statsci.org/smyth On Sat, 27 Apr 2013, Manoj Hariharan wrote: > Dear Gordon, > Thanks very much for your response. I updated to the latest version of > edgeR (edgeR_3.2.3). > 1. I checked the BCV of unrelated individuals mentioned in page 69 - > that was from study based on cell lines ("RNA-Seq profiles were made > from lymphoblastoid cell lines"). They are grown in controlled > conditions, uniformly. But, in my case, the samples are tissues > dissected from donors just after death. Each lymphoblastoid cell line is from a different person.? But, yes, I agree that samples from human tissue donors will be vary variable. > Anyway, I now filtered out the outliers by using a more > stringent cut-off of "keep <- rowSums(cpm(D)>1) > >= 30" and I get a BCV of 51% ("Disp = 0.26425 , BCV = 0.5141"). I have > also attached the BCV plot. > 2. About the ANOVA-type test: I still do not understand why the first > group gets treated as the baseline. In my case, all samples (or groups) > are normal. So all of these are in one sense the "wild-type". And, when > the first group gets absorbed in the intercept, the comparison of gene > expression is made to the first group (as it gets treated as the > baseline). I thought this approach does not require one group to be used > as a wild-type. 
The reason why one of the groups is absorbed into the intercept is that it is only possible to make 17 independent comparisons between 18 groups. So it is only meaningful to have 17 coefficients in the model apart from the intercept. You seem to be jumping to the conclusion that the reference sample must be a control sample, but this is not correct.? The use of one group as a reference in the intercept term is purely for mathematical convenience. The ANOVA test result remains exactly the same regardless of which group is absorbed into the intercept.? Indeed you can fit any design matrix you like, and define any test of 17 independent contrasts, and you will get the same ANOVA test.? It makes no difference, providing the null hypothesis remains that all 18 groups are equal.? You could for example use ? design <- model.matrix(~0+tiss_groups) and then define any set of 17 pairwise comparisons between the groups. This would lead to exactly the same ANOVA test.? It is just more convenient to do as you do below. > So should I use the following to get the actual expression values of > genes in each sample: > fit <- glmFit(D, design) > Fit_FittedVals <- fit$fitted.values edgeR is not designed to estimate actual expression values.? However, if you would like to get the average logCPM value for each tissue group, then you code will do that provided you have defined the design matrix by model.matrix(~0+tiss_groups). > and use the following to get the logFC of groups after the DE test: > QLF_lrt <- glmQLFTest(fit,coef=2:18) > QLTLRT_Table <- QLF_lrt$table I don't understand what you mean by "logFC of groups".? To get a logFC, it is necessary to compare one group with another.? Which two groups do you want to compare?? You have 18 tissue groups, so for each gene there are 153 possible pairwise comparisons between the groups.? That's a lot of logFCs. Best wishes Gordon > Thanks again for your advice. I would much appreciate on these follow-up > doubts too. > Regards, > Manoj. 
> ------------------------------ > Manoj Hariharan > Staff Researcher > The Salk Institute for Biological Studies > La Jolla, CA 92037 > Office: 858.453.4100 x2143 ________________________________ ? From: Gordon K Smyth <smyth at="" wehi.edu.au=""> To: Manoj Hariharan <h_manoj at="" yahoo.com=""> Cc: Bioconductor mailing list <bioconductor at="" r-project.org=""> Sent: Thursday, April 25, 2013 11:38 PM Subject: Design matrix and BCV Dear Manoj, First of all, can I please persuade you to install the latest version of edgeR?? You need R 3.0.0 and Bioconductor Release 2.12. > Date: Wed, 24 Apr 2013 13:33:05 -0700 > From: Manoj Hariharan <h_manoj at="" yahoo.com=""> > To: "bioconductor at stat.math.ethz.ch" <bioconductor at="" stat.math.ethz.ch=""> > Subject: [BioC] Design matrix and BCV > > Hello, > > I am new to RNA-seq analysis. I have worked on a few not-too- complicated > projects and have found edgeR to be right for my work. In this project I > have RNA-seq data from 18 human tissues (normal, no treatment). All > tissues except 5 of 18 have at least 2 replicates. The replicate tissues > are obtained from separate individuals (they are of different age and > sex). There are a few issues I need to discuss with the experts in the > group: > > 1. The BCV value is quite high (Disp = 0.36621 , BCV = 0.6052). I think > this is partly due to the way we have collected replicates - they are > from separate individuals - different age and sex. Is this really bad - > I had read in the User Guide that BCV of ~40% is acceptable in tumor > samples? Does adjusting the prior.df? help (I've attached the BCV > plots)? At a later stage I plan to include age and sex as "factors" and > re-do the analysis. I would view this BCV as unacceptably high in my own research.? 
Page 69 of the edgeR User's Guide shows a BCV plot for unrelated individuals: http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst /doc/edgeRUsersGuide.pdf and I don't think that the BCV should get much higher than this for a designed experiment.? Another concern is that the dispersion trend in your data looks a bit strange. In your case, I'd be looking for outliers or batch effects or other problems.? The prior.df does not affect the common dispersion. > 2. I am interested in the differentially expressed genes - across these > 18 tissues. I guess I should be using the approach explained in section > 3.2.5 of the User Guide (ANOVA-like test). Yes. > Below, is the output. The problem is that the first tissue "AD" is > absorbed into the intercept. I have read in other discussion threads > that this is normal. Yes, this is normal.? I don't see why it should cause any problem. > But I do need the logFC values for the AD tissue also. The fitted model gives you logFC for AD vs each of the other tissues. > If I use the "design <- > model.matrix(~0+tiss_groups, data=D$samples)", I can get the AD column > in the design matrix, but then, I would not be able to get the baseline > intercept column, and I get all genes differentially expressed. Is there > a work-around? How can I handle this issue? There is no reason to do this. > 3. How best can I decide on the prior.df? I read the threads on choosing > the value based on the number of libraries and groups. But I am not > sure. So I tried with prior.df default (20), 10 and 2 with varying > number of DE genes. There is no need to set the prior.df, because the glmQLFTest() function estimates the prior.df for you automatically.? The idea is to use estimateGLMTrendedDisp() then call glmQLFTest(). Alternatively and better, please upgrade to the current version of edgeR and follow the case study in Section 4.6. It is not actually correct to input tagwise dispersion estimates to glmQLFTest.? 
There was no check against in this in edgeR version 3.0.X, but there is in the current release. Best wishes Gordon > R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows" > Copyright (C) 2012 The R Foundation for Statistical Computing > ISBN 3-900051-07-0 > Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit) > > R is free software and comes with ABSOLUTELY NO WARRANTY. > You are welcome to redistribute it under certain conditions. > Type 'license()' or 'licence()' for distribution details. > > ? Natural language support but running in an English locale > > R is a collaborative project with many contributors. > Type 'contributors()' for more information and > 'citation()' on how to cite R or R packages in publications. > > Type 'demo()' for some demos, 'help()' for on-line help, or > 'help.start()' for an HTML browser interface to help. > Type 'q()' to quit R. > > Loading required package: DBI > Loading required package: AnnotationDbi > Loading required package: BiocGenerics > > Attaching package: ?BiocGenerics? > > The following object(s) are masked from ?package:stats?: > > ??? xtabs > > The following object(s) are masked from ?package:base?: > > ??? anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find, > ??? get, intersect, lapply, Map, mapply, mget, order, paste, pmax, > ??? pmax.int, pmin, pmin.int, Position, rbind, Reduce, rep.int, > ??? rownames, sapply, setdiff, table, tapply, union, unique > > Loading required package: Biobase > Welcome to Bioconductor > > ??? Vignettes contain introductory material; view with > ??? 'browseVignettes()'. To cite Bioconductor, see > ??? 'citation("Biobase")', and for packages 'citation("pkgname")'. > > > Loading Tcl/Tk interface ... done > > KEGG.db contains mappings based on older data because the original > ? resource was removed from the the public domain before the most > ? recent update was produced. This package should now be considered > ? deprecated and future versions of Bioconductor may not have it > ? 
available. One possible alternative to consider is to look at the
>   reactome.db package
>
> [Workspace loaded from /users/manoj/.RData]
>
> > setwd('/Users/manoj/Data/SDEC_hg19/AllCountDataStrndd/')
> Warning message:
> package 'AnnotationDbi' was built under R version 2.15.2
> > library(edgeR)
> Loading required package: limma
> Warning messages:
> 1: package 'edgeR' was built under R version 2.15.2
> 2: package 'limma' was built under R version 2.15.2
> > targets <- read.delim("AllCountData_AllTiss_Info", stringsAsFactors = FALSE, header=TRUE)
> > D <- readDGE(targets)
> > keep <- rowSums(cpm(D)>1) >= 10
> > D <- D[keep,]
> > tiss_groups <- factor(c("AD","AD","AO","AO","BL","EG","EG","FT","FT","FT","GA","GA","GA","LG","LG","LI","LV","LV","OV","PA","PA","PO","PO","PO","RA","RV","RV","SB","SB","SB","SG","SG","SG","SX","SX","SX","TH"))
> > design <- model.matrix(~tiss_groups)
> > design
>    (Intercept) tiss_groupsAO tiss_groupsBL tiss_groupsEG tiss_groupsFT tiss_groupsGA tiss_groupsLG tiss_groupsLI tiss_groupsLV tiss_groupsOV ...
> attr(,"assign")
>  [1] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> attr(,"contrasts")
> attr(,"contrasts")$tiss_groups
> [1] "contr.treatment"
>
> > D <- calcNormFactors(D)
> > D <- estimateGLMCommonDisp(D, design, verbose=TRUE)
> Disp = 0.36621 , BCV = 0.6052
> > D <- estimateGLMTrendedDisp(D, design)
> Loading required package: splines
> > D <- estimateGLMTagwiseDisp(D, design)
> > plotBCV(D, main="BCV Plot: default prior df")
> > D <- estimateGLMTagwiseDisp(D, design, prior.df=10)
> > plotBCV(D, main="BCV Plot: default prior df of 10")
> > D <- estimateGLMTagwiseDisp(D, design, prior.df=2)
> > plotBCV(D, main="BCV Plot: default prior df of 2")
>
> > fit <- glmFit(D, design)
> > QLF_lrt <- glmQLFTest(fit,coef=2:18)
> > FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
> > sum(FDR_Stsfd < 0.05)
> [1] 8308
>
> > glm_lrt <- glmLRT(fit,coef=2:18)
> > FDR_Stsfd <- p.adjust(glm_lrt$table$PValue, method="BH")
> > sum(FDR_Stsfd < 0.05)
> [1] 11255
>
> Using different parameters (prior.df) for estimateGLMTagwiseDisp:
>
> > D <- calcNormFactors(D)
> > D <- estimateGLMCommonDisp(D, design, verbose=TRUE)
> Disp = 0.36621 , BCV = 0.6052
> > D <- estimateGLMTrendedDisp(D, design)
> > fit <- glmFit(D, design)
> > QLF_lrt <- glmQLFTest(fit,coef=2:18)
> > FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
> > sum(FDR_Stsfd < 0.05)
> [1] 8308
>
> > D <- estimateGLMTagwiseDisp(D, design)
> > fit <- glmFit(D, design)
> > QLF_lrt <- glmQLFTest(fit,coef=2:18)
> > FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
> > sum(FDR_Stsfd < 0.05)
> [1] 10935
>
> > D <- estimateGLMTagwiseDisp(D, design, prior.df=2)
> > fit <- glmFit(D, design)
> > QLF_lrt <- glmQLFTest(fit,coef=2:18)
> > FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
> > sum(FDR_Stsfd < 0.05)
> [1] 12622
>
> > D <- estimateGLMTagwiseDisp(D, design, prior.df=10)
> > fit <- glmFit(D, design)
> > QLF_lrt <- glmQLFTest(fit,coef=2:18)
> > FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
> > sum(FDR_Stsfd < 0.05)
> [1] 12033
>
> Design matrix without intercept:
>
> > design <- model.matrix(~0+tiss_groups, data=D$samples)
> > design
>    tiss_groupsAD tiss_groupsAO tiss_groupsBL tiss_groupsEG tiss_groupsFT tiss_groupsGA tiss_groupsLG tiss_groupsLI tiss_groupsLV tiss_groupsOV ...
> attr(,"assign")
>  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> attr(,"contrasts")
> attr(,"contrasts")$tiss_groups
> [1] "contr.treatment"
>
> > D <- estimateGLMCommonDisp(D, design, verbose=TRUE)
> Disp = 0.36621 , BCV = 0.6052
> > D <- estimateGLMTrendedDisp(D, design)
> > fit <- glmFit(D, design)
> > QLF_lrt <- glmQLFTest(fit,coef=2:18)
> > FDR_Stsfd <- p.adjust(QLF_lrt$table$PValue, method="BH")
> > sum(FDR_Stsfd < 0.05)
> [1] 20364
>
> > sessionInfo()
> R version 2.15.1 (2012-06-22)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] edgeR_3.0.7          limma_3.14.3         AnnotationDbi_1.20.3 Biobase_2.18.0       BiocGenerics_0.4.0   RSQLite_0.11.2       DBI_0.2-5
>
> loaded via a namespace (and not attached):
>  [1] clusterProfiler_1.6.0 colorspace_1.2-0      dichromat_1.2-4       digest_0.6.0          DO.db_2.5.0           DOSE_1.4.0
>  [7] ggplot2_0.9.3         GO.db_2.8.0           GOSemSim_1.16.1       grid_2.15.1           gtable_0.1.2          igraph_0.6-3
> [13] IRanges_1.16.4        KEGG.db_2.8.0         labeling_0.1          MASS_7.3-23           munsell_0.4           parallel_2.15.1
> [19] plyr_1.8              proto_0.3-10          qvalue_1.32.0         RColorBrewer_1.0-5    reshape2_1.2.2        scales_0.2.3
> [25] stats4_2.15.1         stringr_0.6.2         tcltk_2.15.1          tools_2.15.1
>
> Thanks,
> Manoj.
>
> ------------------------------
>
> Manoj Hariharan, Ph.D.
> Staff Researcher
> The Salk Institute for Biological Studies
> La Jolla, CA 92037
> Office: 858.453.4100 x2143
>
> [Attachments scrubbed by the list archive: BCVPlot_dfDefault.png, BCVPlot_df10.png, BCVPlot_df2.png]

[Attachment scrubbed by the list archive: Screen shot 2013-05-03 at 11.54.32 AM.png]
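A note on the no-intercept run in the quoted session, where 20364 genes pass FDR < 0.05: with model.matrix(~0+tiss_groups) each coefficient is a group's own log-abundance, so coef=2:18 asks whether 17 group means are zero rather than whether the groups differ, and almost every gene is rejected. The base-R sketch below is not from the thread. It uses a made-up 4-group factor, with lm() standing in for the GLM, to show that a contrast matrix of differences recovers the intercept-model coefficients from the no-intercept fit. The names g, y and con are illustrative assumptions.

```r
# Toy stand-in for the 18-tissue factor: 4 groups, 2 samples each (hypothetical data).
set.seed(1)
g <- factor(rep(c("AD", "AO", "BL", "EG"), each = 2))
y <- rnorm(length(g), mean = as.integer(g))   # made-up response on a log scale

fit_int  <- lm(y ~ g)       # intercept = first group (AD); other coefs = differences from AD
fit_zero <- lm(y ~ 0 + g)   # one coefficient per group; each one is a group mean

# Contrast matrix that turns the group means into "group minus AD" differences.
k   <- nlevels(g)
con <- rbind(-1, diag(k - 1))
dimnames(con) <- list(names(coef(fit_zero)),
                      paste0(levels(g)[-1], "_vs_", levels(g)[1]))

diffs_from_zero_design <- drop(t(con) %*% coef(fit_zero))
diffs_from_int_design  <- unname(coef(fit_int)[-1])

print(con)
print(all.equal(unname(diffs_from_zero_design), diffs_from_int_design))
```

In edgeR a matrix like con could be passed through the contrast argument of glmLRT() or glmQLFTest(), which accept one column per contrast. That usage is stated here as an assumption about how one would apply it, not something run in the thread.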
0
5.5 years ago by
Manoj Hariharan110 wrote:
Thanks Gordon.

I was wondering if I could have a quantitative value for the deviation of each group from the average, for each of the DE genes. I understand that the F value (from the F-statistic) is a measure of how far the gene is compared to the expression of all others across the samples. One option I could think of is to just get the normalized counts for each of the samples, for the set of DE genes:

  de_lrt <- rownames(top_lrt[top_lrt$FDR<0.05,])
  scale <- D$samples$lib.size*D$samples$norm.factors
  normCounts <- round(t(t(D$counts)/scale)*mean(scale))
  write.table(log(normCounts[de_lrt[1:5690],]+1), "All37_NormCounts_DEGenes", sep="\t", quote=FALSE)

Essentially, I am trying to get the list of genes that show a more "tissue-specific" behaviour. Most genes are not expressed strictly in one particular tissue - there would be related tissues where their expression is almost the same. So I would like to rank them based on their expression values, and for that I need all the values to be comparable. Then I could consider those samples where the expression of the gene is higher than the 90th percentile.

Hope that makes sense!

Thanks,
Manoj.

From: Gordon K Smyth <smyth@wehi.edu.au>
Cc: Bioconductor mailing list <bioconductor@r-project.org>
Sent: Sunday, May 5, 2013 5:46 PM
Subject: Re: Design matrix and BCV
> [...]
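As a rough illustration of the scaling in the post above, the base-R sketch below (with made-up counts and made-up normalization factors, none of it from the thread) compares that rescaling with counts per million computed on the same effective library sizes. The two differ only by a constant factor, so they order samples within a gene identically. The object names are illustrative assumptions.

```r
# Hypothetical toy counts: 5 genes x 4 samples (the +1 keeps every count positive).
set.seed(2)
counts <- matrix(rpois(20, lambda = 50) + 1L, nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("s", 1:4)))
lib.size     <- colSums(counts)
norm.factors <- c(1.10, 0.90, 1.05, 0.96)   # made-up TMM-style factors

# Scaling used in the post: divide by effective library size, rescale to the mean
# (rounding omitted so the comparison is exact).
scale      <- lib.size * norm.factors
normCounts <- t(t(counts) / scale) * mean(scale)

# Counts per million on the same effective library sizes.
cpm_manual <- t(t(counts) / scale) * 1e6

# The ratio is one constant, mean(scale) / 1e6, for every gene and sample.
ratio <- normCounts / cpm_manual
print(range(ratio))

# A log2 scale with a small offset is a common way to compare samples.
log_cpm <- log2(cpm_manual + 1)
```

edgeR's own cpm() function, already used in the filtering step of the session, computes counts per million from the normalized library sizes by default, so it could replace the manual arithmetic.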
Dear Manoj,

Why not simply find genes that are higher in one group than the average of the other groups? edgeR can do this sort of thing easily.

Let's suppose you are going to use the quasi-likelihood approach of glmQLFTest() rather than glmLRT(). First define a design matrix for which the intercept is the overall mean:

  contrasts(tiss_groups) <- contr.sum(tiss_groups)
  design <- model.matrix(~tiss_groups)

Then estimate the trended dispersions:

  y <- estimateGLMCommonDisp(y, design)
  y <- estimateGLMTrendedDisp(y, design)

Then fit the basic linear model:

  fit <- glmFit(y, design)

Then you can extract all the lists you want. For example

  ql <- glmQLFTest(fit, coef=2)
  top1 <- topTags(ql)

will give you genes specifically up or specifically down in tissue 1, as compared to the average of all the other groups.

  ql <- glmQLFTest(fit, coef=3)
  top2 <- topTags(ql)

will give you genes specifically up/down in tissue 2, and so on up to

  ql <- glmQLFTest(fit, coef=18)
  top17 <- topTags(ql)

will give you genes specifically up/down in tissue 17. Finally, to get genes up/down in tissue 18:

  cont <- rep(-1,18)
  cont[1] <- 0
  ql <- glmQLFTest(fit, contrast=cont)
  top18 <- topTags(ql)

What you propose doesn't quite make sense to me. If you want to put genes on the same scale (and you don't need to for the above analysis), would it not be better to use rpkm()?

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth

On Mon, 6 May 2013, Manoj Hariharan wrote:
> [...]
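To make the meaning of these coefficients concrete, here is a minimal base-R sketch, not taken from the thread, with a made-up 4-group factor and lm() standing in for the GLM. It checks that under sum-to-zero coding the intercept is the mean of the group means, that coefficients 2..k are deviations of groups 1..k-1 from that mean, and that the contrast c(0, -1, ..., -1) gives the deviation of the last group. Here contr.sum() is called with the number of levels; all names are illustrative assumptions.

```r
set.seed(3)
g <- factor(rep(c("T1", "T2", "T3", "T4"), each = 3))
y <- rnorm(length(g), mean = c(5, 7, 6, 10)[as.integer(g)])   # made-up log-expression

contrasts(g) <- contr.sum(nlevels(g))   # sum-to-zero coding
fit <- lm(y ~ g)
b   <- coef(fit)

group_means <- sapply(split(y, g), mean)
grand_mean  <- mean(group_means)        # unweighted mean of the group means
k <- nlevels(g)

# Coefficients 2..k: deviations of groups 1..k-1 from the grand mean.
dev_first <- unname(b[2:k])

# Last group: minus the sum of the other deviations (the rep(-1, k) contrast with cont[1] <- 0).
cont <- rep(-1, k)
cont[1] <- 0
dev_last <- sum(cont * b)

# Deviation from the grand mean is (k-1)/k times "group minus average of the other groups".
others_mean <- (sum(group_means) - group_means) / (k - 1)
scaled_diff <- (k - 1) / k * (group_means - others_mean)

print(round(c(dev_first, dev_last), 3))
print(round(unname(scaled_diff), 3))
```

One consequence of the last identity: testing a coefficient against zero is the same hypothesis as "this group equals the average of the others", while the reported effect size is that difference multiplied by (k-1)/k, which would be 17/18 for 18 tissues.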
Hi Gordon,

Actually, I had never used glmLRT() - I've always been using glmQLFTest(). And, as you had suggested, when I run

  contrasts(tiss_groups) <- contr.sum(tiss_groups)

I get the following error:

  > contrasts(tiss_groups) <- contr.sum(tiss_groups)
  Error in `contrasts<-`(`*tmp*`, value = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0,  :
    wrong number of contrast matrix rows

I didn't really understand what difference it makes to use

  contrasts(tiss_groups) <- contr.sum(tiss_groups)
  design <- model.matrix(~tiss_groups)

rather than specifying the design without the "contrasts(tiss_groups) <- contr.sum(tiss_groups)" line, as below:

  design <- model.matrix(~tiss_groups)

I would still have the intercept, and I get the following for fit$design:

  attr(,"assign")
   [1] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  attr(,"contrasts")
  attr(,"contrasts")$tiss_groups
  [1] "contr.treatment"

Thanks,
Manoj.

From: Gordon K Smyth <smyth@wehi.edu.au>
Cc: Bioconductor mailing list <bioconductor@r-project.org>
Sent: Monday, May 6, 2013 11:39 PM
Subject: edgeR: finding tissue specific genes [was: Design matrix and BCV]
> [...]
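The error message can be reproduced on a small factor. The sketch below is base R only, with a made-up 8-sample, 5-level factor that is not from the thread. It suggests why the call fails: contr.sum() treats a vector argument as the set of levels, so passing the whole factor yields one row per sample instead of one row per level.

```r
g <- factor(c("AD", "AD", "AO", "AO", "BL", "EG", "EG", "FT"))   # 8 samples, 5 levels

dim(contr.sum(g))           # 8 7 : one row per sample
dim(contr.sum(levels(g)))   # 5 4 : one row per level

# Assigning the per-sample matrix fails because its row count is not nlevels(g).
res <- tryCatch({
  contrasts(g) <- contr.sum(g)
  "ok"
}, error = function(e) conditionMessage(e))
print(res)

# Assigning the per-level matrix works and gives an intercept plus 4 effect columns.
contrasts(g) <- contr.sum(levels(g))
design <- model.matrix(~g)
print(colnames(design))
```

Base R also accepts the function name as a string, contrasts(g) <- "contr.sum", which avoids building the matrix by hand.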
Sorry, the first command should have been

  contrasts(tiss_groups) <- contr.sum(levels(tiss_groups))

Your linear model can be parametrized in terms of any set of 18 coefficients. This command says that you want the effects to "sum" to zero; in other words, the effects should be relative to the grand mean.

Best wishes
Gordon

On Tue, 7 May 2013, Manoj Hariharan wrote:

Hi Gordon,

Actually, I had never used glmLRT() - I've always been using glmQLFTest(). And, as you had suggested, when I do the contrasts(tiss_groups) <- contr.sum(tiss_groups) I get the following error:

  > contrasts(tiss_groups) <- contr.sum(tiss_groups)
  Error in `contrasts<-`(`*tmp*`, value = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0,  :
    wrong number of contrast matrix rows

I didn't really understand what difference it makes to use

  contrasts(tiss_groups) <- contr.sum(tiss_groups)
  design <- model.matrix(~tiss_groups)

rather than specifying the design without the "contrasts(tiss_groups) <- contr.sum(tiss_groups)", as below:

  design <- model.matrix(~tiss_groups)

I would still have the intercept and have the following for fit$design:

  attr(,"assign")
   [1] 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  attr(,"contrasts")
  attr(,"contrasts")$tiss_groups
  [1] "contr.treatment"

Thanks,
Manoj.

________________________________
From: Gordon K Smyth <smyth@wehi.edu.au>
To: Manoj Hariharan <h_manoj@yahoo.com>
Cc: Bioconductor mailing list <bioconductor@r-project.org>
Sent: Monday, May 6, 2013 11:39 PM
Subject: edgeR: finding tissue specific genes [was: Design matrix and BCV]

Dear Manoj,

Why not simply find genes that are higher in one group than the average of the other groups? edgeR can do this sort of thing easily. Let's suppose you are going to use the quasi-likelihood approach of glmQLFTest() rather than glmLRT().

First define a design matrix for which the intercept is the overall mean:

  contrasts(tiss_groups) <- contr.sum(tiss_groups)
  design <- model.matrix(~tiss_groups)

Then estimate the trended dispersions:

  y <- estimateGLMCommonDisp(y, design)
  y <- estimateGLMTrendedDisp(y, design)

Then fit the basic linear model:

  fit <- glmFit(y, design)

Then you can extract all the lists you want. For example

  ql <- glmQLFTest(fit, coef=2)
  top1 <- topTags(ql)

will give you genes specifically up or specifically down in tissue 1, as compared to the average of all the other groups.

  ql <- glmQLFTest(fit, coef=3)
  top2 <- topTags(ql)

will give you genes specifically up/down in tissue 2, and so on up to

  ql <- glmQLFTest(fit, coef=18)
  top17 <- topTags(ql)

which will give you genes specifically up/down in tissue 17. Finally, to get genes up/down in tissue 18:

  cont <- rep(-1,18)
  cont[1] <- 0
  ql <- glmQLFTest(fit, contrast=cont)
  top18 <- topTags(ql)

What you propose doesn't quite make sense to me. If you want to put genes on the same scale (and you don't need to for the above analysis), would it not be better to use rpkm()?

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth

On Mon, 6 May 2013, Manoj Hariharan wrote:

Thanks Gordon.

I was wondering if I could have a quantitative value for the deviance of each group from the average, for each of the DE genes. I understand that the F value (from the F-statistic) is a measure of how far the expression of the gene in one group is from its expression across all the other samples.

One option I could think of is to just get the normalized counts for each of the samples, for the set of DE genes:

  de_lrt <- rownames(top_lrt[top_lrt$FDR<0.05,])
  scale <- D$samples$lib.size*D$samples$norm.factors
  normCounts <- round(t(t(D$counts)/scale)*mean(scale))
  write.table(log(normCounts[de_lrt[1:5690],]+1), "All37_NormCounts_DEGenes", sep="\t", quote=FALSE)

Essentially, I am trying to get the list of genes that show a more "tissue-specific" behaviour. Most genes are not expressed strictly in one particular tissue - there would be related tissues where expression is almost the same. So I would like to rank the genes based on their expression values, and for that I need all the values to be comparable. Then I could consider those samples where the expression of the gene is higher than the 90th percentile.

Hope that makes sense!

Thanks,
Manoj.
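The difference between the two parametrizations, and the reason the original command failed, can be checked in base R without edgeR. The following is only a minimal sketch on a made-up three-group factor with noise-free values; the names `g`, `mu`, `y` and the group means are hypothetical and are not taken from the thread's data.

```r
# Toy factor: 3 hypothetical groups (A, B, C), two samples each.
g <- factor(c("A", "A", "B", "B", "C", "C"))
g_default <- g                        # copy that keeps the default treatment coding

# Passing the factor itself makes contr.sum() treat all 6 sample labels as
# levels, so the contrast matrix has 6 rows rather than 3 and assignment fails.
bad <- tryCatch({ contrasts(g) <- contr.sum(g); "no error" },
                error = function(e) conditionMessage(e))
print(bad)

# Passing the levels gives a 3-row sum-to-zero contrast matrix.
contrasts(g) <- contr.sum(levels(g))
design <- model.matrix(~g)            # columns: (Intercept), g1, g2
print(design)

# Made-up group means; noise-free response so coefficients are exact.
mu <- c(A = 1, B = 2, C = 6)
y <- unname(mu[as.character(g)])

# Sum-to-zero coding: intercept is the mean of the group means,
# g1 and g2 are the deviations of A and B from that mean.
beta <- unname(coef(lm(y ~ g)))
print(beta)

# Default treatment coding: intercept is group A, others are relative to A.
beta_default <- unname(coef(lm(y ~ g_default)))
print(beta_default)

# The last group's deviation is minus the sum of the other deviations,
# which is what the contrast (0, -1, -1) extracts.
cont <- c(0, -1, -1)
effect_C <- sum(cont * beta)
print(effect_C)
```

With 18 tissues the same pattern applies: coefficients 2 to 18 are the deviations for tissues 1 to 17, and the contrast with a zero followed by seventeen -1 entries recovers tissue 18.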
Got it! It works well, and I think I finally have what I wanted. Thanks a lot.

Manoj.

Manoj Hariharan
Staff Researcher
The Salk Institute for Biological Studies
La Jolla, CA 92037
Office: 858.453.4100 x2143
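For readers following the thread, the sum-to-zero parametrization can be written out explicitly. This is a sketch under the assumption that the group effects act additively on the log-expression scale; the symbols $\mu_k$, $\beta_0$, $\alpha_k$ and $K$ are introduced here for illustration and do not appear in the original messages.

```latex
% K = 18 tissues; \mu_k is the log-scale mean expression of a gene in tissue k.
\begin{align*}
\mu_k &= \beta_0 + \alpha_k, \qquad \sum_{k=1}^{K} \alpha_k = 0, \\
\beta_0 &= \bar{\mu} = \frac{1}{K}\sum_{k=1}^{K} \mu_k, \qquad
\alpha_k = \mu_k - \bar{\mu}, \\
\alpha_K &= -\sum_{k=1}^{K-1} \alpha_k
  \quad\text{(the contrast } (0,-1,\dots,-1)\text{)}, \\
\alpha_k &= \frac{K-1}{K}\left(\mu_k - \frac{1}{K-1}\sum_{j \neq k} \mu_j\right).
\end{align*}
```

The last line shows why testing a single coefficient against zero is equivalent to comparing that tissue with the average of the other tissues: the two quantities differ only by the constant factor $(K-1)/K$, which is $17/18$ here, so the reported effect size is slightly smaller than the tissue-versus-others difference while the test itself is unchanged.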