DESeq for multiple groups with no replicates


Yolande Tra ▴ 120

@yolande-tra-4309

Last seen 11.3 years ago

Dear list members,

I would like to know whether it is possible to use DESeq for data with multiple groups (more than two) and no replicates. I went through the vignette, but it only illustrates the case of two groups with no replicates. If this is not possible at the moment, do you know of a method to study the similarity/dissimilarity of gene/protein count data across multiple groups with no replicates?

I appreciate your feedback.

Yolande

DESeq DESeq • 2.5k views

ADD COMMENT • link updated 14.7 years ago by Wolfgang Huber ★ 13k • written 14.7 years ago by Yolande Tra ▴ 120


Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 3 months ago

EMBL European Molecular Biology Laborat…

Dear Yolande,

Thanks. Can you clarify what you want to do?

- per gene: ANOVA-like estimation of effects in a factorial design
- per sample: clustering or classification of samples based on overall (dis)similarity of expression profiles

From your post, I assume it's the latter. This is described in the section "Sample Clustering" of the vignette.

(For the former, there are the functions nbinomFitGLM and nbinomGLMTest, but with the current implementation of DESeq these only work if each cell of the design is replicated.)

Thank you and best wishes,
Wolfgang

(In reply to Yolande Tra's message of 04/04/11 00:49.)

Wolfgang Huber, EMBL, http://www.embl.de/research/units/genome_biology/huber
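For the per-sample option, a minimal base-R sketch of clustering samples by overall profile similarity might look like the following. The toy counts, the object names, and the log2(x + 1) transformation are assumptions made for illustration only; the "Sample Clustering" section of the vignette remains the reference for what DESeq itself provides.

```r
# Toy count matrix (made-up data): rows are proteins, columns are samples.
set.seed(1)
counts <- matrix(rpois(6 * 8, lambda = 50), nrow = 6,
                 dimnames = list(paste0("protein", 1:6),
                                 paste0("sample", 1:8)))

# Stand-in transformation (assumption): log2 with a pseudocount, so that
# highly abundant proteins do not dominate the distances.
logCounts <- log2(counts + 1)

# dist() measures distances between rows, so transpose the matrix to get
# sample-to-sample Euclidean distances.
sampleDists <- dist(t(logCounts))

# Hierarchical clustering of the samples with average linkage.
sampleTree <- hclust(sampleDists, method = "average")

# plot(sampleTree) would draw the dendrogram.
```

No condition labels are used anywhere in this sketch, which is why unreplicated samples are not an obstacle for this kind of exploratory clustering.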

ADD COMMENT • link 14.7 years ago Wolfgang Huber ★ 13k


Dear Wolfgang,

I have protein count data for 40 people (so no replicates). These are healthy people, with no grouping. The goal is to look at the similarity/dissimilarity of the 40 samples based on protein counts (differential expression, IF POSSIBLE) AND to cluster the proteins. As you said, clustering of samples can be done following the section "Sample Clustering" of the vignette. How would I go about clustering the proteins and looking for differential expression (IF POSSIBLE)?

Thanks,
Yolande

(In reply to Wolfgang Huber's message of Mon, Apr 4, 2011 at 5:11 AM.)

ADD REPLY • link 14.7 years ago Yolande Tra ▴ 120


Hi Yolande,

On 04/04/2011 03:42 PM, Yolande Tra wrote:
> I do have protein count data for 40 people (so no replicate). These are
> healthy people, no grouping. The goal is to look at similarity/dissimilarity
> of the 40 samples based on protein count (differential expression IF
> POSSIBLE) AND clustering of the proteins. As you said, clustering of samples
> can be done with the section "Sample Clustering" of the vignette. How would
> I go for clustering the proteins and look for differential expression (IF
> POSSIBLE).

The whole point of DESeq is to allow you to work in a small sample-size setting, where you need to pool data from several genes to get useful dispersion estimates.

With 40 people, you are beyond that, and you can use any conventional test that is suitable for overdispersed count data.

I don't quite know what you mean by differential expression in this case anyway. No two persons will have the same protein level, so everything is differentially expressed in some way.

Maybe you want to estimate the variance of the proteins and look for strongly varying versus weakly varying ones. Supplementary Note A of our paper on DESeq describes a simple method-of-moments estimate for the biological variance that subtracts the Poisson noise and deals with different sequencing depths.

As for the clustering, DESeq's variance-stabilizing transformation might help for clustering genes in a similar way as for clustering samples.

Simon
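The variance idea above can be sketched in base R roughly as follows. This is a sketch under stated assumptions, not a transcription of Supplementary Note A: the toy data, all object names, the median-of-ratios size factors, and the exact form of the Poisson correction (mean normalized count times the average of 1/size factor) are assumptions, so the paper should be checked before relying on it.

```r
# Toy data (made up): 5 proteins x 40 samples of overdispersed counts.
set.seed(42)
trueMeans <- c(20, 50, 100, 200, 500)
counts <- t(sapply(trueMeans, function(m)
  rnbinom(40, mu = m, size = 5)))        # size = 5 means dispersion 1/5
rownames(counts) <- paste0("protein", seq_along(trueMeans))

# Size factors (assumed median-of-ratios rule): compare each sample with
# the per-protein geometric mean, ignoring proteins whose mean is zero.
geoMeans    <- exp(rowMeans(log(counts)))
sizeFactors <- apply(counts, 2, function(k)
  median((k / geoMeans)[geoMeans > 0]))

# Bring all samples to a common scale.
normCounts <- sweep(counts, 2, sizeFactors, "/")

baseMean  <- rowMeans(normCounts)         # mean normalized count
totalVar  <- apply(normCounts, 1, var)    # observed variance per protein
shotNoise <- baseMean * mean(1 / sizeFactors)  # expected Poisson part

# Method-of-moments estimate: observed variance minus Poisson noise,
# truncated at zero because the difference can come out negative.
bioVar <- pmax(totalVar - shotNoise, 0)

# Squared coefficient of variation, to compare proteins of different abundance.
cv2 <- bioVar / baseMean^2

result <- data.frame(baseMean, totalVar, shotNoise, bioVar, cv2)
result[order(-result$cv2), ]
```

Sorting by `cv2` rather than by `bioVar` is a design choice of this sketch: the raw variance grows with abundance, so a scale-free measure makes strongly and weakly varying proteins easier to compare.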

ADD REPLY • link 14.7 years ago Simon Anders ★ 3.8k


Hi Simon,

Thank you for your reply. For clustering the proteins, what would conds (defined in the vignette) be in

    cds3 <- newCountDataSet(countsTable, conds)

since there are many proteins with no specific condition?

Yolande

(In reply to Simon Anders's message of Mon, Apr 4, 2011 at 1:11 PM.)

ADD REPLY • link 14.7 years ago Yolande Tra ▴ 120


Hi Yolande,

On 04/04/2011 07:34 PM, Yolande Tra wrote:
> For clustering the proteins what would be conds (defined in the
> vignette) in
> cds3 <- newCountDataSet(countsTable, conds)
> since there are many proteins with no specific condition.

Just put something, e.g.,

    cds3 <- newCountDataSet( countsTable, rep( "dummy", ncol(countsTable) ) )

This assigns the same condition to all of them. Of course, you cannot use the 'nbinomTest' function if you have only one condition, but that wouldn't make sense anyway.

> In the vignette there were two conditions: "N" and "T".
> For three different people (not 40), how you would define the cds in the command
> res <- nbinomTest(cds, "N", "T").

In a contrast, you compare two conditions. In the vignette example, we ask: Is the expression in "T" stronger or weaker than in "N"?

How would you generalize such a question to three conditions? I don't see what other option there is than to make three pairwise comparisons, unless you want to move away from the "which condition has the stronger expression?" kind of question to some other hypothesis formulation.

Simon
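To make the "three pairwise comparisons" concrete, the pairs can be enumerated in base R as below. The condition labels "A", "B", "C" and the variable names are hypothetical, and the sketch only builds the list of contrasts; the two-condition test itself would be run once per pair.

```r
# Hypothetical condition labels for three groups.
conds <- c("A", "B", "C")

# All unordered pairs of conditions: one column per comparison.
pairs <- combn(conds, 2)

# A readable label for each contrast. A two-condition test such as the
# nbinomTest(cds, "N", "T") call from the vignette would be applied to
# each column of 'pairs' in turn.
comparisons <- apply(pairs, 2, function(p) paste(p[1], "vs", p[2]))
print(comparisons)
# [1] "A vs B" "A vs C" "B vs C"
```

With k conditions this gives choose(k, 2) comparisons, so the number of tests grows quickly as groups are added.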

ADD REPLY • link 14.7 years ago Simon Anders ★ 3.8k


Hi Simon,

I have tried clustering the proteins. First I transposed the data and applied the minimal set of commands, but could not go further. Here is my code and the error message.

    tspectral=t(spectral)
    cds2=newCountDataSet(tspectral,rep( "dummy", ncol(tspectral)))
    > cds2
    CountDataSet (storageMode: environment)
    assayData: 5 features, 36 samples
      element names: counts
    protocolData: none
    phenoData
      sampleNames: Gene_01 Gene_02 Gene_03 Gene_04 ... Gene_36 (36 total)
      varLabels: sizeFactor condition
      varMetadata: labelDescription
    featureData: none
    experimentData: use 'experimentData(object)'
    Annotation:
    > cds2=estimateSizeFactors(cds2)
    > cds2=estimateVarianceFunctions(cds2)
    Error in estimateVarianceFunctions(cds2) :
      NAs found in size factors. Have you called already 'estimateSizeFactors'?

Thank you for your help,
Yolande

(In reply to Simon Anders's message of Tue, Apr 5, 2011 at 5:22 AM.)

ADD REPLY • link 14.7 years ago Yolande Tra ▴ 120


Dear Yolande,

Did you note the error message, and have you checked whether there are NA or other non-numeric values in your dataset?

Wolfgang

(In reply to Yolande Tra's message of Apr/13/11 2:31 AM.)
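A few base-R checks along these lines are sketched below. The toy table and all object names are made up, and the size-factor calculation is a stand-in median-of-ratios rule written for this example, not DESeq's own code. Under that assumption, the sketch shows one way NA size factors could arise even when the input contains no NA: every row contains at least one zero. Whether that is what happened in this thread is not established.

```r
# Toy table (made up): proteins in rows, samples in columns.
spectral <- matrix(c(12,  0, 30,  7,
                      5,  9,  0, 14,
                      0, 21, 18,  3),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(paste0("protein", 1:3),
                                   paste0("sample", 1:4)))

# Basic sanity checks on the input values.
isNumeric  <- is.numeric(spectral)
anyMissing <- anyNA(spectral)
nonInteger <- any(spectral != round(spectral), na.rm = TRUE)
negative   <- any(spectral < 0, na.rm = TRUE)

# Rows with at least one zero have a geometric mean of zero.
rowsWithZero <- rowSums(spectral == 0) > 0

# Stand-in median-of-ratios size factors (assumption), using only rows
# with a positive geometric mean. If no such row exists, the median is
# taken over an empty set and the result is NA.
geoMeans    <- exp(rowMeans(log(spectral)))
sizeFactors <- apply(spectral, 2, function(k)
  median((k / geoMeans)[geoMeans > 0]))

print(rowsWithZero)
print(sizeFactors)
```

With only a handful of rows and many columns, as in the 5 x 36 object printed earlier in the thread, it becomes more likely that no row is free of zeros, so checking `rowsWithZero` alongside the NA check may be worthwhile.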

ADD REPLY • link 14.7 years ago Wolfgang Huber ★ 13k



In the vignette there were two conditions: "N" and "T". For three different people (not 40), how would you define the cds in the command res <- nbinomTest(cds, "N", "T")?

Thanks,
Yolande

(In reply to Simon Anders's message of Mon, Apr 4, 2011 at 1:11 PM.)

ADD REPLY • link 14.7 years ago Yolande Tra ▴ 120
