Paired comparisons in EdgeR
3
0
Entering edit mode
Jussi Salmi ▴ 20
@jussi-salmi-6542
Last seen 10.1 years ago
Hello! Thank you for the nice software and clear user guide. I am using EdgeR to analyse RNA-seq data. In the experiment there are several cell cultures with different knock-downs. The cell cultures are stimulated in two different ways. In the end, I want to compare the treated cultures to the untreated ones. The same untreated knock-down is compared to the same treated knock-down culture. I think that the study design produces paired comparisons because the same cultures are first used as the untreated ones and then they are treated and compared against the untreated. I have come up with the following code, based on the user guide: x<-read.delim("htseqout.edger", sep=" ",row.names="Symbol") group<-factor(c(21,21,19,19,6,6,8,8,12,12,18,18,20,20,11,11,9,9,23... y<-DGEList(counts=x,group=group) design<-model.matrix(~0+group, data=y$samples) colnames(design)<-levels(y$samples$group) y<-estimateGLMTrendedDisp(y,design) y<-estimateGLMTagwiseDisp(y,design) fit<-glmFit(y,design) ?lrt0203<-glmLRT(fit,contrast=c(0,1,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0)) tt0203<-topTags(lrt0203, n=13000) write.table(tt0203,file="tt0203") [several other comparisons) I tried to understand whether this is a good way to analyse paired samples or is there a better way? Thanks, Jussi Salmi PhD computer Science Centre for Biotechnology, Turku, Finland > sessionInfo() R version 3.1.0 (2014-04-10) Platform: x86_64-pc-linux-gnu (64-bit) locale: [1] LC_CTYPE=fi_FI.UTF-8 LC_NUMERIC=C LC_TIME=fi_FI.UTF-8 [4] LC_COLLATE=fi_FI.UTF-8 LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=fi_FI.UTF-8 [7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] splines stats graphics grDevices utils datasets methods base other attached packages: [1] edgeR_3.4.2 limma_3.18.13 loaded via a namespace (and not attached): [1] tools_3.1.0 -- Jussi Salmi, PhD http://www.btk.fi/index.php?id=12&sort=&pid=282
• 2.0k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 15 hours ago
United States
Hi Jussi, On 5/8/2014 7:24 AM, Jussi Salmi wrote: > Hello! > > > Thank you for the nice software and clear user guide. I am using EdgeR to analyse RNA-seq data. In the experiment there are several cell cultures with different knock-downs. The cell cultures are stimulated in two different ways. In the end, I want to compare the treated cultures to the untreated ones. The same untreated knock-down is compared to the same treated knock-down culture. I think that the study design produces paired comparisons because the same cultures are first used as the untreated ones and then they are treated and compared against the untreated. > > I have come up with the following code, based on the user guide: > > x<-read.delim("htseqout.edger", sep=" ",row.names="Symbol") > group<-factor(c(21,21,19,19,6,6,8,8,12,12,18,18,20,20,11,11,9,9,23... > y<-DGEList(counts=x,group=group) > design<-model.matrix(~0+group, data=y$samples) You want to include a factor for treatment as well. Something like trt <- factor(rep(1:2, length(group)/2)) design <- model.matrix(~group+trt) fit <- glmFit(y, design) lrt <- glmLRT(fit) topTags(lrt) Will give you the genes that change between treatments, after controlling for the paired nature of your experiment. Best, Jim > colnames(design)<-levels(y$samples$group) > y<-estimateGLMTrendedDisp(y,design) > y<-estimateGLMTagwiseDisp(y,design) > fit<-glmFit(y,design) > ?lrt0203<-glmLRT(fit,contrast=c(0,1,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0)) > tt0203<-topTags(lrt0203, n=13000) > write.table(tt0203,file="tt0203") > > [several other comparisons) > > I tried to understand whether this is a good way to analyse paired samples or is there a better way? > > Thanks, > > Jussi Salmi > PhD computer Science > Centre for Biotechnology, Turku, Finland > >> sessionInfo() > R version 3.1.0 (2014-04-10) > Platform: x86_64-pc-linux-gnu (64-bit) > > locale: > [1] LC_CTYPE=fi_FI.UTF-8 LC_NUMERIC=C LC_TIME=fi_FI.UTF-8 > [4] LC_COLLATE=fi_FI.UTF-8 LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=fi_FI.UTF-8 > [7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C LC_ADDRESS=C > [10] LC_TELEPHONE=C LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] splines stats graphics grDevices utils datasets methods base > > other attached packages: > [1] edgeR_3.4.2 limma_3.18.13 > > loaded via a namespace (and not attached): > [1] tools_3.1.0 > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099
ADD COMMENT
0
Entering edit mode
Jussi Salmi ▴ 20
@jussi-salmi-6542
Last seen 10.1 years ago
Thank you for the advice! There is a further problem, though. I got the error "Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset, : Design matrix not of full rank. The following coefficients not estimable: treatment2 treatment3" My code is now: x<-read.delim(".. group<-factor(c(21,19,6,8,12,18,20,11,9,... treatment<-factor(c(3,3,3,1,1,2,3,1,1,... y<-DGEList(counts=x,group=group) design<-model.matrix(~group+treatment, data=y$samples) y<-estimateGLMTrendedDisp(y,design) y<-estimateGLMTagwiseDisp(y,design) fit<-glmFit(y,design) lrt <- glmLRT(fit) tt<-topTags(lrt) write.table(tt,file="tt") I have been looking at this all day, but it seems I am too stupid to understand this. I have 72 samples in 24 groups (3 replicates in each group). There are 3 treatments, and factor "treatment"'s length is 72. For each sample I have given the group in factor "group". There are three treatments: no treatment (1), treatment A (2) and treatment B (3). In the design matrix there is an intercept column, then the 24 groups and finally two treatment columns. Jussi PhD, Computer Science University of Turku, Finland 08.05.2014 17:23, James W. MacDonald kirjoitti: > Hi Jussi, > > On 5/8/2014 7:24 AM, Jussi Salmi wrote: >> Hello! >> >> >> Thank you for the nice software and clear user guide. I am using EdgeR to analyse RNA-seq data. In the experiment there are several cell cultures with different knock-downs. The cell cultures are stimulated in two different ways. In the end, I want to compare the treated cultures to the untreated ones. The same untreated knock-down is compared to the same treated knock-down culture. I think that the study design produces paired comparisons because the same cultures are first used as the untreated ones and then they are treated and compared against the untreated. >> >> I have come up with the following code, based on the user guide: >> >> x<-read.delim("htseqout.edger", sep=" ",row.names="Symbol") >> group<-factor(c(21,21,19,19,6,6,8,8,12,12,18,18,20,20,11,11,9,9,23... >> y<-DGEList(counts=x,group=group) >> design<-model.matrix(~0+group, data=y$samples) > > You want to include a factor for treatment as well. Something like > > trt <- factor(rep(1:2, length(group)/2)) > > design <- model.matrix(~group+trt) > fit <- glmFit(y, design) > lrt <- glmLRT(fit) > topTags(lrt) > > Will give you the genes that change between treatments, after > controlling for the paired nature of your experiment. > > Best, > > Jim > > >> colnames(design)<-levels(y$samples$group) >> y<-estimateGLMTrendedDisp(y,design) >> y<-estimateGLMTagwiseDisp(y,design) >> fit<-glmFit(y,design) >> ?lrt0203<-glmLRT(fit,contrast=c(0,1,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0)) >> tt0203<-topTags(lrt0203, n=13000) >> write.table(tt0203,file="tt0203") >> >> [several other comparisons) >> >> I tried to understand whether this is a good way to analyse paired samples or is there a better way? >> >> Thanks, >> >> Jussi Salmi >> PhD computer Science >> Centre for Biotechnology, Turku, Finland >> >>> sessionInfo() >> R version 3.1.0 (2014-04-10) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=fi_FI.UTF-8 LC_NUMERIC=C LC_TIME=fi_FI.UTF-8 >> [4] LC_COLLATE=fi_FI.UTF-8 LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=fi_FI.UTF-8 >> [7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C LC_ADDRESS=C >> [10] LC_TELEPHONE=C LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] splines stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] edgeR_3.4.2 limma_3.18.13 >> >> loaded via a namespace (and not attached): >> [1] tools_3.1.0 >> > -- Jussi Salmi, PhD http://www.btk.fi/index.php?id=12&sort=&pid=282
ADD COMMENT
0
Entering edit mode
Hi Jussi, Do any of the groups span multiple treatments, or are all the samples in each group part of the same treatment? If it is the latter, then note that splitting the samples by group also splits them by treatment as well, so also including treatment in the model is redundant. This is why you get the error: your model effectively has the treatment effects included twice. Perhaps you want to model each treatment as the mean of its constituent groups? If so, you may want to try something like this: group<-factor(c(21,19,6,8,12,18,20,11,9,...)) treatment<-factor(c(3,3,3,1,1,2,3,1,1,...)) design <- model.matrix(~0+group) g.by.t <- split(group, treatment) # Here we express each treatment symbolically as the mean of its # constituent groups treatment.expressions <- sapply(g.by.t, function(g) { g <- levels(droplevels(g)) numerator <- paste("group", g, sep="", collapse=" + ") denominator <- as.character(length(g)) sprintf("(( %s ) / %s)", numerator, denominator) }) ## Here we construct the contrasts between the treatment expressions. desired.contrasts <- c( Trt.1vs2=sprintf("%s - %s", treatment.expressions[["2"]], treatment.expressions[["1"]]), Trt.1vs3=sprintf("%s - %s", treatment.expressions[["3"]], treatment.expressions[["1"]])) y<-DGEList(counts=x,group=group) y<-estimateGLMTrendedDisp(y,design) y<-estimateGLMTagwiseDisp(y,design) fit<-glmFit(y,design) ## We conduct a separate test and get a separate top table for each ## treatment contrast. The "makeContrasts" function converts the ## symbolic expressions that we constructed into numeric contrast ## matrices. lrts <- lapply(desired.contrasts, function(ct) glmLRT(fit, contrast=makeContrasts(contrasts=ct, levels=design))) tts<-lapply(lrt, topTags) ## We need to write each table to a separate file. for (i in names(tts)) { write.table(tts[[i]],file=sprintf("tt_%s", i)) } However, note that modeling the treatments as the mean of their constituent groups is not equivalent to treating the groups as samples from a population. In particular, this will only model the random variation within groups, while variation between groups of the same treatment is effectively ignored (it is averaged out by taking the mean). To properly model both sample-within-group variation and group-within-treatment variation, I believe you would need to use a mixed model or similar method. I think you could possibly fit an appropriate model using limma, voom, and limma's duplicateCorrelation function. For information about this see the limma user's guide section 9.7 "Multi-level Experiments". In that case, you would use ~treatment as your design and group as the blocking factor (random effect) for duplicateCorrelation. Hopefully one of the above tips helps you. -Ryan On 5/12/14, 4:36 AM, Jussi Salmi wrote: > Thank you for the advice! There is a further problem, though. > > I got the error > > "Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset, : > Design matrix not of full rank. The following coefficients not estimable: > treatment2 treatment3" > > > My code is now: > x<-read.delim(".. > group<-factor(c(21,19,6,8,12,18,20,11,9,... > treatment<-factor(c(3,3,3,1,1,2,3,1,1,... > y<-DGEList(counts=x,group=group) > design<-model.matrix(~group+treatment, data=y$samples) > y<-estimateGLMTrendedDisp(y,design) > y<-estimateGLMTagwiseDisp(y,design) > fit<-glmFit(y,design) > > lrt <- glmLRT(fit) > tt<-topTags(lrt) > write.table(tt,file="tt") > > > I have been looking at this all day, but it seems I am too stupid to understand this. I have 72 samples in 24 groups (3 replicates in each group). There are 3 treatments, and factor "treatment"'s length is 72. For each sample I have given the group in factor "group". There are three treatments: no treatment (1), treatment A (2) and treatment B (3). > > In the design matrix there is an intercept column, then the 24 groups and finally two treatment columns. > > Jussi > > PhD, Computer Science > University of Turku, Finland > > > 08.05.2014 17:23, James W. MacDonald kirjoitti: >> Hi Jussi, >> >> On 5/8/2014 7:24 AM, Jussi Salmi wrote: >>> Hello! >>> >>> >>> Thank you for the nice software and clear user guide. I am using EdgeR to analyse RNA-seq data. In the experiment there are several cell cultures with different knock-downs. The cell cultures are stimulated in two different ways. In the end, I want to compare the treated cultures to the untreated ones. The same untreated knock-down is compared to the same treated knock-down culture. I think that the study design produces paired comparisons because the same cultures are first used as the untreated ones and then they are treated and compared against the untreated. >>> >>> I have come up with the following code, based on the user guide: >>> >>> x<-read.delim("htseqout.edger", sep=" ",row.names="Symbol") >>> group<-factor(c(21,21,19,19,6,6,8,8,12,12,18,18,20,20,11,11,9,9,23... >>> y<-DGEList(counts=x,group=group) >>> design<-model.matrix(~0+group, data=y$samples) >> You want to include a factor for treatment as well. Something like >> >> trt <- factor(rep(1:2, length(group)/2)) >> >> design <- model.matrix(~group+trt) >> fit <- glmFit(y, design) >> lrt <- glmLRT(fit) >> topTags(lrt) >> >> Will give you the genes that change between treatments, after >> controlling for the paired nature of your experiment. >> >> Best, >> >> Jim >> >> >>> colnames(design)<-levels(y$samples$group) >>> y<-estimateGLMTrendedDisp(y,design) >>> y<-estimateGLMTagwiseDisp(y,design) >>> fit<-glmFit(y,design) >>> ?lrt0203<-glmLRT(fit,contrast=c(0,1,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ,0,0,0,0,0,0,0)) >>> tt0203<-topTags(lrt0203, n=13000) >>> write.table(tt0203,file="tt0203") >>> >>> [several other comparisons) >>> >>> I tried to understand whether this is a good way to analyse paired samples or is there a better way? >>> >>> Thanks, >>> >>> Jussi Salmi >>> PhD computer Science >>> Centre for Biotechnology, Turku, Finland >>> >>>> sessionInfo() >>> R version 3.1.0 (2014-04-10) >>> Platform: x86_64-pc-linux-gnu (64-bit) >>> >>> locale: >>> [1] LC_CTYPE=fi_FI.UTF-8 LC_NUMERIC=C LC_TIME=fi_FI.UTF-8 >>> [4] LC_COLLATE=fi_FI.UTF-8 LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=fi_FI.UTF-8 >>> [7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C LC_ADDRESS=C >>> [10] LC_TELEPHONE=C LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C >>> >>> attached base packages: >>> [1] splines stats graphics grDevices utils datasets methods base >>> >>> other attached packages: >>> [1] edgeR_3.4.2 limma_3.18.13 >>> >>> loaded via a namespace (and not attached): >>> [1] tools_3.1.0 >>> > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Jussi Salmi ▴ 10
@jussi-salmi-6547
Last seen 10.1 years ago
Thank you for the advice! There is a further problem, though. I got the error "Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset, : Design matrix not of full rank. The following coefficients not estimable: treatment2 treatment3" My code is now: x<-read.delim(".. group<-factor(c(21,19,6,8,12,18,20,11,9,... treatment<-factor(c(3,3,3,1,1,2,3,1,1,... y<-DGEList(counts=x,group=group) design<-model.matrix(~group+treatment, data=y$samples) y<-estimateGLMTrendedDisp(y,design) y<-estimateGLMTagwiseDisp(y,design) fit<-glmFit(y,design) lrt <- glmLRT(fit) tt<-topTags(lrt) write.table(tt,file="tt") I have been looking at this all day, but it seems I am too stupid to understand this. I have 72 samples in 24 groups (3 replicates in each group). There are 3 treatments, and factor "treatment"'s length is 72. For each sample I have given the group in factor "group". There are three treatments: no treatment (1), treatment A (2) and treatment B (3). In the design matrix there is an intercept column, then the 24 groups and finally two treatment columns. Jussi PhD, Computer Science University of Turku, Finland 08.05.2014 17:23, James W. MacDonald kirjoitti: > Hi Jussi, > > On 5/8/2014 7:24 AM, Jussi Salmi wrote: >> Hello! >> >> >> Thank you for the nice software and clear user guide. I am using EdgeR to analyse RNA-seq data. In the experiment there are several cell cultures with different knock-downs. The cell cultures are stimulated in two different ways. In the end, I want to compare the treated cultures to the untreated ones. The same untreated knock-down is compared to the same treated knock-down culture. I think that the study design produces paired comparisons because the same cultures are first used as the untreated ones and then they are treated and compared against the untreated. >> >> I have come up with the following code, based on the user guide: >> >> x<-read.delim("htseqout.edger", sep=" ",row.names="Symbol") >> group<-factor(c(21,21,19,19,6,6,8,8,12,12,18,18,20,20,11,11,9,9,23... >> y<-DGEList(counts=x,group=group) >> design<-model.matrix(~0+group, data=y$samples) > > You want to include a factor for treatment as well. Something like > > trt <- factor(rep(1:2, length(group)/2)) > > design <- model.matrix(~group+trt) > fit <- glmFit(y, design) > lrt <- glmLRT(fit) > topTags(lrt) > > Will give you the genes that change between treatments, after > controlling for the paired nature of your experiment. > > Best, > > Jim > > >> colnames(design)<-levels(y$samples$group) >> y<-estimateGLMTrendedDisp(y,design) >> y<-estimateGLMTagwiseDisp(y,design) >> fit<-glmFit(y,design) >> ?lrt0203<-glmLRT(fit,contrast=c(0,1,-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0)) >> tt0203<-topTags(lrt0203, n=13000) >> write.table(tt0203,file="tt0203") >> >> [several other comparisons) >> >> I tried to understand whether this is a good way to analyse paired samples or is there a better way? >> >> Thanks, >> >> Jussi Salmi >> PhD computer Science >> Centre for Biotechnology, Turku, Finland >> >>> sessionInfo() >> R version 3.1.0 (2014-04-10) >> Platform: x86_64-pc-linux-gnu (64-bit) >> >> locale: >> [1] LC_CTYPE=fi_FI.UTF-8 LC_NUMERIC=C LC_TIME=fi_FI.UTF-8 >> [4] LC_COLLATE=fi_FI.UTF-8 LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=fi_FI.UTF-8 >> [7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C LC_ADDRESS=C >> [10] LC_TELEPHONE=C LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C >> >> attached base packages: >> [1] splines stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] edgeR_3.4.2 limma_3.18.13 >> >> loaded via a namespace (and not attached): >> [1] tools_3.1.0 >> > -- Jussi Salmi, PhD http://www.btk.fi/index.php?id=12&sort=&pid=282
ADD COMMENT

Login before adding your answer.

Traffic: 548 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6