EdgeR: Using estimateCommondisp for housekeeping genes

0

Entering edit mode

Tonya Mariko Brunetti ▴ 10

@tonya-mariko-brunetti-4956

Last seen 9.6 years ago

Hello, My name is Tonya and I am very new to both R and edgeR so sorry if this seems silly. I have recently gotten back results of two samples from a 454 and do not have replicates of either. I was reading the edgeR manual section about what to do about calculating common dispersion factors if no replicates are available. One of the options was to use genes that are not suppose to be differentially expressed (ie housekeeping genes) to determine the common dispersion. In the manaul they show an example do<-estimateCommonDisp(d[housekeeping,]). Would anyone please explain to me how I can use the housekeeping genes in the data I have collected to estimate this value. I have tried numerous things for input into the function estimateCommonDisp (see below some of what I have tried) but I guess I don't know how to specify just the housekeeping genes?? Or if anyone has a method for common dispersion in edgeR that will work for no replicates that would be appreciated as well! estimateCommonDisp(d['RpS2','RpS28b]) (where the stuff in brackets are my housekeeping genes and d is my normalized DGEList estimateCommonDisp(d[RpS2,RpS28b]) Thank you so much! Tonya

edgeR edgeR • 2.8k views

ADD COMMENT • link updated 12.5 years ago by Mark Robinson ▴ 880 • written 12.5 years ago by Tonya Mariko Brunetti ▴ 10

0

Entering edit mode

Mark Robinson ▴ 880

@mark-robinson-4908

Last seen 5.5 years ago

Hi Tonya, I believe your question comes down to how to subset a DGEList object. As the example 'd[housekeeping,]' suggests, it is much like subletting a matrix (if you don't know how to do this, you should consult an Intro to R manual). Here is an example: y <- matrix(rnbinom(80,size=1/0.2,mu=10),nrow=20,ncol=4) rownames(y) <- paste("Gene",1:nrow(y),sep=".") group <- factor(c(1,1,2,2)) d <- DGEList(counts=y,group=group,lib.size=rep(1000,4)) If you knew your housekeeping genes were in rows {4,6,10,15} of your table, you could simply call: do <- estimateCommonDisp(d[c(4,6,10,15),]) Of course, there are lots of ways to subset, e.g.: http://www.ats.ucla.edu/stat/r/modules/subsetting.htm Equivalent to above but slight different approach, you could do: gkeep <- paste("Gene",c(4,6,10,15),sep=".") do <- estimateCommonDisp(d[rownames(d) %in% gkeep,]) ? and so on. On the whole enterprise of doing these analyses w/o replicates, there has been a lot of discussion: http://seqanswers.com/forums/showthread.php?t=4055 http://seqanswers.com/forums/showthread.php?t=10137 http://seqanswers.com/forums/showthread.php?t=11081 https://stat.ethz.ch/pipermail/bioconductor/2011-July/040296.html ? and so on. All the best, Mark ---------- Prof. Dr. Mark Robinson Bioinformatics Institute of Molecular Life Sciences University of Zurich Winterthurerstrasse 190 8057 Zurich Switzerland v: +41 44 635 4848 f: +41 44 635 6898 e: mark.robinson at imls.uzh.ch o: Y32-J-34 w: http://tiny.cc/mrobin On 12.11.2011, at 23:12, Tonya Mariko Brunetti wrote: > Hello, > > My name is Tonya and I am very new to both R and edgeR so sorry if this seems silly. I have recently gotten back results of two samples from a 454 and do not have replicates of either. I was reading the edgeR manual section about what to do about calculating common dispersion factors if no replicates are available. > > One of the options was to use genes that are not suppose to be differentially expressed (ie housekeeping genes) to determine the common dispersion. In the manaul they show an example do<-estimateCommonDisp(d[housekeeping,]). > > Would anyone please explain to me how I can use the housekeeping genes in the data I have collected to estimate this value. I have tried numerous things for input into the function estimateCommonDisp (see below some of what I have tried) but I guess I don't know how to specify just the housekeeping genes?? Or if anyone has a method for common dispersion in edgeR that will work for no replicates that would be appreciated as well! > > estimateCommonDisp(d['RpS2','RpS28b]) (where the stuff in brackets are my housekeeping genes and d is my normalized DGEList > estimateCommonDisp(d[RpS2,RpS28b]) > > > Thank you so much! > > Tonya > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.5 years ago Mark Robinson ▴ 880

0

Entering edit mode

This was very helpful to me. But when I compare an R created subset to an expression set I created manually based on the same criteria, they "look" the same, but the R function equals() says FALSE: > equals(c_eset, co) [1] FALSE > equals.Verbose(c_eset, co) [1] FALSE attr(,"reason") [1] "Not same class" > class(c_eset) [1] "matrix" > class(co) [1] "matrix" > head(c_eset) 01A_H1_08a_446VD.CEL 01A_H1_08b_446VD.CEL 01A_H1_22a_446VD.CEL 01A_H1_22b_446VD.CEL 02A_H1_10a_446VD.CEL 02A_H1_10b_446VD.CEL 02A_H1_24a_446VD.CEL 02A_H1_24b_446VD.CEL hsa-miR-105_st 6.795107 6.913433 3.580257 2.749580 6.854192 7.477093 3.834115 3.678890 hsa-miR-10b_st 5.949392 6.248459 6.824051 6.804110 6.812655 5.530932 5.739557 6.231381 hsa-miR-1180_st 6.294834 5.884889 7.562048 6.952669 6.779352 5.939806 7.055804 7.188879 hsa-miR-1225-5p_st 4.754944 5.403615 3.803957 3.752518 5.513498 6.534705 5.981105 5.429791 hsa-miR-1260_st 3.995493 3.621449 2.846316 2.297834 3.051231 2.566250 3.323427 3.978176 hsa-miR-127-5p_st 6.252869 6.829449 7.170236 7.086132 5.743895 5.812096 6.813619 6.844393 > head(co) 01A_H1_08a_446VD.CEL 01A_H1_08b_446VD.CEL 01A_H1_22a_446VD.CEL 01A_H1_22b_446VD.CEL 02A_H1_10a_446VD.CEL 02A_H1_10b_446VD.CEL 02A_H1_24a_446VD.CEL 02A_H1_24b_446VD.CEL hsa-miR-105_st 6.795107 6.913433 3.580257 2.749580 6.854192 7.477093 3.834115 3.678890 hsa-miR-10b_st 5.949392 6.248459 6.824051 6.804110 6.812655 5.530932 5.739557 6.231381 hsa-miR-1180_st 6.294834 5.884889 7.562048 6.952669 6.779352 5.939806 7.055804 7.188879 hsa-miR-1225-5p_st 4.754944 5.403615 3.803957 3.752518 5.513498 6.534705 5.981105 5.429791 hsa-miR-1260_st 3.995493 3.621449 2.846316 2.297834 3.051231 2.566250 3.323427 3.978176 hsa-miR-127-5p_st 6.252869 6.829449 7.170236 7.086132 5.743895 5.812096 6.813619 6.844393 > summary(co) 01A_H1_08a_446VD.CEL 01A_H1_08b_446VD.CEL 01A_H1_22a_446VD.CEL 01A_H1_22b_446VD.CEL 02A_H1_10a_446VD.CEL 02A_H1_10b_446VD.CEL 02A_H1_24a_446VD.CEL 02A_H1_24b_446VD.CEL Min. : 1.877 Min. : 2.176 Min. : 1.767 Min. : 1.727 Min. : 1.774 Min. : 2.022 Min. : 1.779 Min. : 1.738 1st Qu.: 3.976 1st Qu.: 4.085 1st Qu.: 2.789 1st Qu.: 2.669 1st Qu.: 3.881 1st Qu.: 4.085 1st Qu.: 2.618 1st Qu.: 2.585 Median : 5.621 Median : 5.331 Median : 3.767 Median : 3.812 Median : 5.557 Median : 5.522 Median : 3.568 Median : 3.581 Mean : 5.567 Mean : 5.616 Mean : 4.247 Mean : 4.176 Mean : 5.643 Mean : 5.708 Mean : 4.087 Mean : 4.180 3rd Qu.: 6.816 3rd Qu.: 6.861 3rd Qu.: 5.316 3rd Qu.: 5.221 3rd Qu.: 7.040 3rd Qu.: 6.828 3rd Qu.: 5.098 3rd Qu.: 5.184 Max. :10.387 Max. :10.703 Max. :11.126 Max. :11.105 Max. :10.939 Max. :10.958 Max. :10.796 Max. :10.930 > summary(c_eset) 01A_H1_08a_446VD.CEL 01A_H1_08b_446VD.CEL 01A_H1_22a_446VD.CEL 01A_H1_22b_446VD.CEL 02A_H1_10a_446VD.CEL 02A_H1_10b_446VD.CEL 02A_H1_24a_446VD.CEL 02A_H1_24b_446VD.CEL Min. : 1.877 Min. : 2.176 Min. : 1.767 Min. : 1.727 Min. : 1.774 Min. : 2.022 Min. : 1.779 Min. : 1.738 1st Qu.: 3.976 1st Qu.: 4.085 1st Qu.: 2.789 1st Qu.: 2.669 1st Qu.: 3.881 1st Qu.: 4.085 1st Qu.: 2.618 1st Qu.: 2.585 Median : 5.621 Median : 5.331 Median : 3.767 Median : 3.812 Median : 5.557 Median : 5.522 Median : 3.568 Median : 3.581 Mean : 5.567 Mean : 5.616 Mean : 4.247 Mean : 4.176 Mean : 5.643 Mean : 5.708 Mean : 4.087 Mean : 4.180 3rd Qu.: 6.816 3rd Qu.: 6.861 3rd Qu.: 5.316 3rd Qu.: 5.221 3rd Qu.: 7.040 3rd Qu.: 6.828 3rd Qu.: 5.098 3rd Qu.: 5.184 Max. :10.387 Max. :10.703 Max. :11.126 Max. :11.105 Max. :10.939 Max. :10.958 Max. :10.796 Max. :10.930 I don't understand why equals() gives "FALSE". They seem to be identical. thanks for any suggestions or references. Tom MMI DNA Services Core Facility 503-494-2442 kellert at ohsu.edu Office: 6588 RJH (CROET/BasicScience) OHSU Shared Resources On Nov 13, 2011, at 7:35 AM, Mark Robinson wrote: > Hi Tonya, > > I believe your question comes down to how to subset a DGEList object. As the example 'd[housekeeping,]' suggests, it is much like subletting a matrix (if you don't know how to do this, you should consult an Intro to R manual). > > Here is an example: > > y <- matrix(rnbinom(80,size=1/0.2,mu=10),nrow=20,ncol=4) > rownames(y) <- paste("Gene",1:nrow(y),sep=".") > group <- factor(c(1,1,2,2)) > d <- DGEList(counts=y,group=group,lib.size=rep(1000,4)) > > If you knew your housekeeping genes were in rows {4,6,10,15} of your table, you could simply call: > > do <- estimateCommonDisp(d[c(4,6,10,15),]) > > Of course, there are lots of ways to subset, e.g.: > http://www.ats.ucla.edu/stat/r/modules/subsetting.htm > > Equivalent to above but slight different approach, you could do: > > gkeep <- paste("Gene",c(4,6,10,15),sep=".") > do <- estimateCommonDisp(d[rownames(d) %in% gkeep,]) > > ? and so on. > > On the whole enterprise of doing these analyses w/o replicates, there has been a lot of discussion: > > http://seqanswers.com/forums/showthread.php?t=4055 > http://seqanswers.com/forums/showthread.php?t=10137 > http://seqanswers.com/forums/showthread.php?t=11081 > https://stat.ethz.ch/pipermail/bioconductor/2011-July/040296.html > ? and so on. > > All the best, > Mark > > ---------- > Prof. Dr. Mark Robinson > Bioinformatics > Institute of Molecular Life Sciences > University of Zurich > Winterthurerstrasse 190 > 8057 Zurich > Switzerland > > v: +41 44 635 4848 > f: +41 44 635 6898 > e: mark.robinson at imls.uzh.ch > o: Y32-J-34 > w: http://tiny.cc/mrobin > > > > > On 12.11.2011, at 23:12, Tonya Mariko Brunetti wrote: > >> Hello, >> >> My name is Tonya and I am very new to both R and edgeR so sorry if this seems silly. I have recently gotten back results of two samples from a 454 and do not have replicates of either. I was reading the edgeR manual section about what to do about calculating common dispersion factors if no replicates are available. >> >> One of the options was to use genes that are not suppose to be differentially expressed (ie housekeeping genes) to determine the common dispersion. In the manaul they show an example do<-estimateCommonDisp(d[housekeeping,]). >> >> Would anyone please explain to me how I can use the housekeeping genes in the data I have collected to estimate this value. I have tried numerous things for input into the function estimateCommonDisp (see below some of what I have tried) but I guess I don't know how to specify just the housekeeping genes?? Or if anyone has a method for common dispersion in edgeR that will work for no replicates that would be appreciated as well! >> >> estimateCommonDisp(d['RpS2','RpS28b]) (where the stuff in brackets are my housekeeping genes and d is my normalized DGEList >> estimateCommonDisp(d[RpS2,RpS28b]) >> >> >> Thank you so much! >> >> Tonya >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.5 years ago Tom Keller ▴ 70

0

Entering edit mode

Dear all, I am starting to analyze the results of a DGE experiment, and I have some doubts about how to perform multi-group comparisons. Our data set consists in just one developmental stage, but I have data from 12 different genotypes (I have duplicates of the samples). I will like to identify all the differentially expressed genes among the different genotypes. I will like to know your opinion about edgeR and baySeq packages as both seam to be able to perform multi-group analysis. And I will also appreciate if you could give me any idea about the best way to design mi contrast hypothesis. *When working with edgeR I have employ this type of contrast matrix for my analysis* > cultivar <- factor(c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12)) > design <- model.matrix(~cultivar) > design (Intercept) cultivar2 cultivar3 cultivar4 cultivar5 cultivar6 cultivar7 .... 1 1 0 0 0 0 0 0 2 1 0 0 0 0 0 0 3 1 1 0 0 0 0 0 4 1 1 0 0 0 0 0 5 1 0 1 0 0 0 0 6 1 0 1 0 0 0 0 7 1 0 0 1 0 0 0 8 1 0 0 1 0 0 0 9 1 0 0 0 1 0 0 10 1 0 0 0 1 0 0 : : : attr(,"assign") [1] 0 1 1 1 1 1 1 1 1 1 1 1 attr(,"contrasts") attr(,"contrasts")$cultivar [1] "contr.treatment" *However I believe that will just help me to identify the genes that are differentially expressed in one of the cultivars with respect to the rest. Besides if I keep going doing all the contrast at the same time as people from the bioconductor mail list suggest me to extract the differentially expressed genes I get a mistake.* > glmfit.d2 <- glmFit(d2, design, dispersion = d2$common.dispersion) > names(glmfit.d2) [1] "coefficients" "fitted.values" "deviance" "df.residual" [5] "abundance" "design" "offset" "dispersion" [9] "method" "counts" "samples" > lrt.d2 <- glmLRT(d2, glmfit.d2, coef =2:12) > summary(decideTestsDGE(lrt.d2, adjust.method="BH", p.value=0.05)) Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), : attempt to set an attribute on NULL *If I check just one of my possible contrasts it works* > lrt.d2test <- glmLRT(d2, glmfit.d2, coef =2) > summary(decideTestsDGE(lrt.d2test, adjust.method="BH", p.value=0.05)) [,1] -1 875 0 14026 1 1460 *However I think I am just getting genes whose expression is different in one of the cultivars, but not changing in the rest. And I will like just to get all the genes differentially expressed no matter in which cultivar or cultivars.* *I have also try to carry these analysis with baySeq package, and I have design this contrast hypothesis.* Slot "groups": $NDE [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 $DE [1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 *But, I believe with these approach I will not be able to identify genes that for example may have the same expression in cultivars 1,2,3,4 but different to the expression in cultivars 5,6,7,8. What do you think about it? Do you think I need to add model for each of the possible combinations? * *I will appreciate any suggestion that you could give me about the best way to analyze those data.* *Thanks in advance for your attention* *Laura Pascual* [[alternative HTML version deleted]]

ADD REPLY • link 12.5 years ago lpascual ▴ 50

Login before adding your answer.