Search
Question: EdgeR GLM gene expression analysis of 3 groups
0
7.2 years ago by
Shona Wood70
Shona Wood70 wrote:
Hi, I have been trying to compare RNA-seq data 3 groups; young, middle aged and old. I have done pairwise comparisons between these groups but i would like to compare all three together. I have been trying to use the GLM method to do this, following the tutorial, 11 Case study: Oral carcinomas vs matched normal tissue. i understand that this only compares two conditions but i tried to adapt it to my needs. Being new at this I am unsure if I am doing it right and I get an error when trying to estimateCRdisp. Though I think the problem is actually to do with the way i have specified the design because "middle" doesnt seem to be in the design. Please see below: R version 2.12.2 (2011-02-25) Copyright (C) 2011 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: i386-pc-mingw32/i386 (32-bit) library(edgeR) setwd ("/Users/swood/Documents/rna-seq analysis") targets<-read.delim (file="targets.txt", + stringsAsFactors = FALSE, row.names = "label") Error in data[[rowvar]] : attempt to select less than one element targets<-read.delim (file="targets.txt") targets$age<-factor(targets$age) targets$sample<-factor(targets$sample) targets X files age sample 1 p1 p1.txt young 1 2 p2 p2.txt young 2 3 p3 p3.txt young 3 4 p4 p4.txt old 4 5 p5 p5.txt old 5 6 p6 p6.txt old 6 7 p7 p7.txt middle 7 8 p8 p8.txt middle 8 9 p9 p9.txt middle 9 d<-readDGE(targets, skip=1, comment.char='#') colnames(d)<-row.names(targets) d<-calcNormFactors(d) d An object of class "DGEList" $samples X files age sample lib.size norm.factors 1 p1 p1.txt young 1 674816.7 0.7598895 2 p2 p2.txt young 2 648856.9 0.9095203 3 p3 p3.txt young 3 449797.3 1.2774867 4 p4 p4.txt old 4 699036.0 0.8532141 5 p5 p5.txt old 5 441649.7 1.1919854 6 p6 p6.txt old 6 670755.9 0.8147330 7 p7 p7.txt middle 7 699630.1 1.0262252 8 p8 p8.txt middle 8 505795.8 1.2979023 9 p9 p9.txt middle 9 478527.1 1.0262468$counts 1 2 3 4 5 6 7 ENSRNOG00000000007 38.23700 37.89900 34.85820 43.6703 40.81260 43.4009 42.4969 ENSRNOG00000000008 0.00000 0.00000 0.00000 0.0000 0.00000 0.0000 0.0000 ENSRNOG00000000009 0.00000 0.00000 0.00000 0.0000 0.00000 0.0000 0.0000 ENSRNOG00000000010 1.40963 8.28797 2.09491 17.1853 1.76635 14.7567 1.0840 ENSRNOG00000000012 0.00000 0.00000 0.00000 0.0000 0.00000 0.0000 0.0000 8 9 ENSRNOG00000000007 45.81270 48.6937 ENSRNOG00000000008 0.00000 0.0000 ENSRNOG00000000009 0.00000 0.0000 ENSRNOG00000000010 5.44889 17.7295 ENSRNOG00000000012 0.00000 0.0000 29510 more rows ... d<-d[rowSums(d$counts)>9,] design <- model.matrix(~ sample + age, data = targets) design (Intercept) sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 1 1 0 0 0 0 0 0 0 0 2 1 1 0 0 0 0 0 0 0 3 1 0 1 0 0 0 0 0 0 4 1 0 0 1 0 0 0 0 0 5 1 0 0 0 1 0 0 0 0 6 1 0 0 0 0 1 0 0 0 7 1 0 0 0 0 0 1 0 0 8 1 0 0 0 0 0 0 1 0 9 1 0 0 0 0 0 0 0 1 ageold ageyoung 1 0 1 2 0 1 3 0 1 4 1 0 5 1 0 6 1 0 7 0 0 8 0 0 9 0 0 attr(,"assign") [1] 0 1 1 1 1 1 1 1 1 2 2 attr(,"contrasts") attr(,"contrasts")$sample [1] "contr.treatment" attr(,"contrasts")age [1] "contr.treatment" d<-estimateCRDisp(d, design) Error in beta[i, ] <- z %*% X : number of items to replace is not a multiple of replacement length If you could help me out with the design or suggest a reason for the error i would really appreciate it. Thanks! ADD COMMENTlink modified 7.2 years ago by Davis McCarthy260 • written 7.2 years ago by Shona Wood70 0 7.2 years ago by Davis McCarthy260 wrote: Hi Shona I'll do my best to help with your query. From looking at the output you gave there are a few problems. First, I am concerned that you have non-integer counts. The methods in edgeR are designed to work on raw counts, that is the raw number of mapped reads that align to your features of interest (whether they be genes, exons or other genomic regions of interest). Where do your counts come from in this dataset? You should absolutely not do any RPKM normalization or anything like that before differential expression (DE) analysis in edgeR. Second, the model will not fit because your design matrix is overspecified. With that design matrix you are actually trying to fit more parameters (columns of the design matrix, 11), than you have samples (9). I hope you can recognise that this would be problematic! You should not be trying to fit a parameter for each sample. All you need to do here is drop the sample indicator from your call to model.matrix. So you would have: R> design <- model.matrix(~ age, data=targets) This fits has a parameter for each of your age groups (so three parameters/columns in design matrix, one of which is the intercept term). Do you have any other covariate information? If so, this could be included in the design matrix, but the design simply with age will allow you to perform the analysis that you have said you want to do. Third, you were worried that you seemed to be missing "middle" in the design matrix. This is actually correct. It is to do with the default way in which R forms the design matrix, using "treatment contrasts" (google-able). See the help for model.matrix and contrasts. From the contrasts help file: "contr.treatment contrasts each level with the baseline level (specified by base): the baseline level is omitted." This means that your design matrix is set up to contrast (compare) each age with the baseline level, which here is "middle". So the "ageold" column of the design contrasts "old" with "middle" and "ageyoung" contrasts "young" with "middle". When you have an intercept in the model, as you do here, the "middle" baseline level is absorbed into that and so 'disappears' from the design matrix. It is nothing to worry about. So, to test the effect of having three different ages as opposed to having no different ages, you would form the design matrix as I've described above and then call the following to estimate the tagwise dispersions. R> d <-estimateCRDisp(d, design) ## NB: this function is being deprecated in the next release of edgeR, coming up in about 2 weeks--- we will have new, better GLM methods for RNA-seq analysis! ## Apologies for the uninformative error message here - we're working hard on the package to improve it. To carry out your actual test, you need to fit the full model: R> glmfit <- glmFit(d, design, dispersion=dCR.tagwise.dispersion) Then you can conduct the likelihood-ratio test (chi-square on 2 degrees of freedom): R> results <- glmLRT(d, glmfit, contrast=c(0,1,1)) Then you can look at the top DE genes: R> topTags(results) And so on. Hope this helps with your analysis. Best wishes Davis On Mar 24, 2011, at 10:35 PM, Shona Wood wrote: > Hi, > > I have been trying to compare RNA-seq data 3 groups; young, middle aged and old. > I have done pairwise comparisons between these groups but i would like to > compare all three together. I have been trying to use the GLM method to do this, > following the tutorial, 11 Case study: Oral carcinomas vs matched normal tissue. > i understand that this only compares two conditions but i tried to adapt it to > my needs. Being new at this I am unsure if I am doing it right and I get an > error when trying to estimateCRdisp. Though I think the problem is actually to > do with the way i have specified the design because "middle" doesnt seem to be > in the design. Please see below: > > R version 2.12.2 (2011-02-25) > Copyright (C) 2011 The R Foundation for Statistical Computing > ISBN 3-900051-07-0 > Platform: i386-pc-mingw32/i386 (32-bit) > > library(edgeR) > setwd ("/Users/swood/Documents/rna-seq analysis") > targets<-read.delim (file="targets.txt", > + stringsAsFactors = FALSE, row.names = "label") > Error in data[[rowvar]] : attempt to select less than one element > targets<-read.delim (file="targets.txt") > targets$age<-factor(targets$age) > targets$sample<-factor(targets$sample) > targets > X files age sample > 1 p1 p1.txt young 1 > 2 p2 p2.txt young 2 > 3 p3 p3.txt young 3 > 4 p4 p4.txt old 4 > 5 p5 p5.txt old 5 > 6 p6 p6.txt old 6 > 7 p7 p7.txt middle 7 > 8 p8 p8.txt middle 8 > 9 p9 p9.txt middle 9 > d<-readDGE(targets, skip=1, comment.char='#') > colnames(d)<-row.names(targets) > d<-calcNormFactors(d) > d > An object of class "DGEList" > $samples > X files age sample lib.size norm.factors > 1 p1 p1.txt young 1 674816.7 0.7598895 > 2 p2 p2.txt young 2 648856.9 0.9095203 > 3 p3 p3.txt young 3 449797.3 1.2774867 > 4 p4 p4.txt old 4 699036.0 0.8532141 > 5 p5 p5.txt old 5 441649.7 1.1919854 > 6 p6 p6.txt old 6 670755.9 0.8147330 > 7 p7 p7.txt middle 7 699630.1 1.0262252 > 8 p8 p8.txt middle 8 505795.8 1.2979023 > 9 p9 p9.txt middle 9 478527.1 1.0262468 > >$counts > 1 2 3 4 5 6 7 > ENSRNOG00000000007 38.23700 37.89900 34.85820 43.6703 40.81260 43.4009 42.4969 > ENSRNOG00000000008 0.00000 0.00000 0.00000 0.0000 0.00000 0.0000 0.0000 > ENSRNOG00000000009 0.00000 0.00000 0.00000 0.0000 0.00000 0.0000 0.0000 > ENSRNOG00000000010 1.40963 8.28797 2.09491 17.1853 1.76635 14.7567 1.0840 > ENSRNOG00000000012 0.00000 0.00000 0.00000 0.0000 0.00000 0.0000 0.0000 > 8 9 > ENSRNOG00000000007 45.81270 48.6937 > ENSRNOG00000000008 0.00000 0.0000 > ENSRNOG00000000009 0.00000 0.0000 > ENSRNOG00000000010 5.44889 17.7295 > ENSRNOG00000000012 0.00000 0.0000 > 29510 more rows ... > > d<-d[rowSums(d$counts)>9,] > design <- model.matrix(~ sample + age, data = targets) > design > (Intercept) sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 > 1 1 0 0 0 0 0 0 0 0 > 2 1 1 0 0 0 0 0 0 0 > 3 1 0 1 0 0 0 0 0 0 > 4 1 0 0 1 0 0 0 0 0 > 5 1 0 0 0 1 0 0 0 0 > 6 1 0 0 0 0 1 0 0 0 > 7 1 0 0 0 0 0 1 0 0 > 8 1 0 0 0 0 0 0 1 0 > 9 1 0 0 0 0 0 0 0 1 > ageold ageyoung > 1 0 1 > 2 0 1 > 3 0 1 > 4 1 0 > 5 1 0 > 6 1 0 > 7 0 0 > 8 0 0 > 9 0 0 > attr(,"assign") > [1] 0 1 1 1 1 1 1 1 1 2 2 > attr(,"contrasts") > attr(,"contrasts")$sample > [1] "contr.treatment" > > attr(,"contrasts")$age > [1] "contr.treatment" > > d<-estimateCRDisp(d, design) > Error in beta[i, ] <- z %*% X : > number of items to replace is not a multiple of replacement length > > > If you could help me out with the design or suggest a reason for the error i > would really appreciate it. > > Thanks! > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ---------------------------------------------------------------------- -- Davis J McCarthy Research Technician Bioinformatics Division Walter and Eliza Hall Institute of Medical Research 1G Royal Parade, Parkville, Vic 3052, Australia dmccarthy at wehi.edu.au http://www.wehi.edu.au ______________________________________________________________________ The information in this email is confidential and intended solely for the addressee. You must not disclose, forward, print or use it without the permission of the sender. ______________________________________________________________________ ADD COMMENTlink written 7.2 years ago by Davis McCarthy260 0 7.2 years ago by Mark Robinson1.1k Mark Robinson1.1k wrote: Hi Shona. 1. Your counts aren't integers. edgeR expects raw counts. 2. My guess is you don't really need a sample-specific effect. Try: design <- model.matrix(~ age, data = targets) Cheers, Mark On 2011-03-24, at 10:35 PM, Shona Wood wrote: > Hi, > > I have been trying to compare RNA-seq data 3 groups; young, middle aged and old. > I have done pairwise comparisons between these groups but i would like to > compare all three together. I have been trying to use the GLM method to do this, > following the tutorial, 11 Case study: Oral carcinomas vs matched normal tissue. > i understand that this only compares two conditions but i tried to adapt it to > my needs. Being new at this I am unsure if I am doing it right and I get an > error when trying to estimateCRdisp. Though I think the problem is actually to > do with the way i have specified the design because "middle" doesnt seem to be > in the design. Please see below: > > R version 2.12.2 (2011-02-25) > Copyright (C) 2011 The R Foundation for Statistical Computing > ISBN 3-900051-07-0 > Platform: i386-pc-mingw32/i386 (32-bit) > > library(edgeR) > setwd ("/Users/swood/Documents/rna-seq analysis") > targets<-read.delim (file="targets.txt", > + stringsAsFactors = FALSE, row.names = "label") > Error in data[[rowvar]] : attempt to select less than one element > targets<-read.delim (file="targets.txt") > targets$age<-factor(targets$age) > targets$sample<-factor(targets$sample) > targets > X files age sample > 1 p1 p1.txt young 1 > 2 p2 p2.txt young 2 > 3 p3 p3.txt young 3 > 4 p4 p4.txt old 4 > 5 p5 p5.txt old 5 > 6 p6 p6.txt old 6 > 7 p7 p7.txt middle 7 > 8 p8 p8.txt middle 8 > 9 p9 p9.txt middle 9 > d<-readDGE(targets, skip=1, comment.char='#') > colnames(d)<-row.names(targets) > d<-calcNormFactors(d) > d > An object of class "DGEList" >$samples > X files age sample lib.size norm.factors > 1 p1 p1.txt young 1 674816.7 0.7598895 > 2 p2 p2.txt young 2 648856.9 0.9095203 > 3 p3 p3.txt young 3 449797.3 1.2774867 > 4 p4 p4.txt old 4 699036.0 0.8532141 > 5 p5 p5.txt old 5 441649.7 1.1919854 > 6 p6 p6.txt old 6 670755.9 0.8147330 > 7 p7 p7.txt middle 7 699630.1 1.0262252 > 8 p8 p8.txt middle 8 505795.8 1.2979023 > 9 p9 p9.txt middle 9 478527.1 1.0262468 > > $counts > 1 2 3 4 5 6 7 > ENSRNOG00000000007 38.23700 37.89900 34.85820 43.6703 40.81260 43.4009 42.4969 > ENSRNOG00000000008 0.00000 0.00000 0.00000 0.0000 0.00000 0.0000 0.0000 > ENSRNOG00000000009 0.00000 0.00000 0.00000 0.0000 0.00000 0.0000 0.0000 > ENSRNOG00000000010 1.40963 8.28797 2.09491 17.1853 1.76635 14.7567 1.0840 > ENSRNOG00000000012 0.00000 0.00000 0.00000 0.0000 0.00000 0.0000 0.0000 > 8 9 > ENSRNOG00000000007 45.81270 48.6937 > ENSRNOG00000000008 0.00000 0.0000 > ENSRNOG00000000009 0.00000 0.0000 > ENSRNOG00000000010 5.44889 17.7295 > ENSRNOG00000000012 0.00000 0.0000 > 29510 more rows ... > > d<-d[rowSums(d$counts)>9,] > design <- model.matrix(~ sample + age, data = targets) > design > (Intercept) sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 > 1 1 0 0 0 0 0 0 0 0 > 2 1 1 0 0 0 0 0 0 0 > 3 1 0 1 0 0 0 0 0 0 > 4 1 0 0 1 0 0 0 0 0 > 5 1 0 0 0 1 0 0 0 0 > 6 1 0 0 0 0 1 0 0 0 > 7 1 0 0 0 0 0 1 0 0 > 8 1 0 0 0 0 0 0 1 0 > 9 1 0 0 0 0 0 0 0 1 > ageold ageyoung > 1 0 1 > 2 0 1 > 3 0 1 > 4 1 0 > 5 1 0 > 6 1 0 > 7 0 0 > 8 0 0 > 9 0 0 > attr(,"assign") > [1] 0 1 1 1 1 1 1 1 1 2 2 > attr(,"contrasts") > attr(,"contrasts")$sample > [1] "contr.treatment" > > attr(,"contrasts")$age > [1] "contr.treatment" > > d<-estimateCRDisp(d, design) > Error in beta[i, ] <- z %*% X : > number of items to replace is not a multiple of replacement length > > > If you could help me out with the design or suggest a reason for the error i > would really appreciate it. > > Thanks! > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Mark Robinson, PhD (Melb) Epigenetics Laboratory, Garvan Bioinformatics Division, WEHI e: mrobinson at wehi.edu.au e: m.robinson at garvan.org.au p: +61 (0)3 9345 2628 f: +61 (0)3 9347 0852 ------------------------------ ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}