DESeq2: Paired Test

2

Entering edit mode

Pankaj Agarwal ▴ 100

@pankaj-agarwal-6492

Last seen 7.8 years ago

United States

Hi, This query is in reference to the following post https://stat.ethz.ch/pipermail/bioconductor/2010-April/032869.html where it was mentioned that "DESeq does not support paired tests (yet)". I have a need for performing rna-seq data analysis with matched tumor/normal pairs of data and was wondering if this capability is now available in the latest version of DESeq2. I searched the list site and also the latest vignette, but could not find anything indicating this capability is now available in DEseq. I would be grateful if, in case this feature is not yet available, someone could advice me how to do the analysis. I have samples from two patients, and from each patient there is a "normal" and a "tumor" rna-seq sample data. Thanks, - Pankaj [[alternative HTML version deleted]]

DESeq DESeq2 DESeq DESeq2 • 16k views

ADD COMMENT • link 11.9 years ago • updated 11.8 years ago Pankaj Agarwal ▴ 100

0

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

hi Pankaj, You should follow the multifactor analysis section of the vignette. You would include the patient effect in the design like so: design(dds) <- ~ patient + condition Where 'patient' and 'condition' are factors which are columns in colData(dds), and condition has levels: normal, tumor (where normal is the base level). Then results(dds) will build a result table for tumor vs normal, controlling for the patient effect. Mike On Thu, Apr 10, 2014 at 3:52 PM, Pankaj Agarwal <p.agarwal@duke.edu> wrote: > Hi, > > This query is in reference to the following post > https://stat.ethz.ch/pipermail/bioconductor/2010-April/032869.html > where it was mentioned that "DESeq does not support paired tests (yet)". > I have a need for performing rna-seq data analysis with matched > tumor/normal pairs of data and was wondering if this capability is now > available in the latest version of DESeq2. I searched the list site and > also the latest vignette, but could not find anything indicating this > capability is now available in DEseq. I would be grateful if, in case this > feature is not yet available, someone could advice me how to do the > analysis. I have samples from two patients, and from each patient there is > a "normal" and a "tumor" rna-seq sample data. > > Thanks, > > - Pankaj > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]

ADD COMMENT • link 11.9 years ago Michael Love 43k

0

Entering edit mode

Hi Michael, I followed the vignette for the section on htseq-count and your suggestion as best as I could and got some results. But I am not sure if I did everything right, so I would really appreciate if you could verify if the following steps is the correct way to analyze this data set: There are 4 samples for two patients: Patient A: abc.htseq.count (Tumor), pqr.htseq.count (Normal) Patient B: def.htseq.count (Tumor), xyz.htseq.count (Normal) > sampleFiles=list.files(directory) > sampleFiles [1] "abc.htseq.count" "def.htseq.count" [3] "pqr.htseq.count" "xyz.htseq.count" > > sampleCondition=c("Tumor","Tumor","Normal","Normal") > samplePatient=c("A","B","A","B") > sampleTable=data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition, patient=samplePatient) > dds=DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory=directory, design= ~ patient + condition) > > colData(dds) DataFrame with 4 rows and 2 columns condition patient <factor> <factor> abc.htseq.count Tumor A def.htseq.count Tumor B pqr.htseq.count Normal A xyz.htseq.countt Normal B > > dds$condition [1] Tumor Tumor Normal Normal Levels: Normal Tumor > > dds$patient [1] A B A B Levels: A B > > dds = DESeq(dds) estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing > > result=results(dds) > > resOrdered2=result[order(result$padj),] > > head(resOrdered2,20) DataFrame with 20 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> GAL1 873531.57 -11.645190 0.7175710 -16.22863 3.164692e-59 5.627771e-55 ... ... I did not get the normalization factors in the output, whereas it does show up in the results in the vignette, so I am not sure if I missed some step. Also, would you expect to get meaningful results with just two matched pairs of samples with this analysis? I am wondering if it is better to just report the FC rather than the p-values. Thank you, - Pankaj ________________________________ From: Michael Love [michaelisaiahlove@gmail.com] Sent: Thursday, April 10, 2014 4:08 PM To: Pankaj Agarwal Cc: bioconductor@r-project.org Subject: Re: [BioC] DESeq2: Paired Test hi Pankaj, You should follow the multifactor analysis section of the vignette. You would include the patient effect in the design like so: design(dds) <- ~ patient + condition Where 'patient' and 'condition' are factors which are columns in colData(dds), and condition has levels: normal, tumor (where normal is the base level). Then results(dds) will build a result table for tumor vs normal, controlling for the patient effect. Mike On Thu, Apr 10, 2014 at 3:52 PM, Pankaj Agarwal <p.agarwal@duke.edu<mailto:p.agarwal@duke.edu>> wrote: Hi, This query is in reference to the following post https://stat.ethz.ch/pipermail/bioconductor/2010-April/032869.html where it was mentioned that "DESeq does not support paired tests (yet)". I have a need for performing rna-seq data analysis with matched tumor/normal pairs of data and was wondering if this capability is now available in the latest version of DESeq2. I searched the list site and also the latest vignette, but could not find anything indicating this capability is now available in DEseq. I would be grateful if, in case this feature is not yet available, someone could advice me how to do the analysis. I have samples from two patients, and from each patient there is a "normal" and a "tumor" rna-seq sample data. Thanks, - Pankaj [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor@r-project.org<mailto:bioconductor@r-project.org> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor [[alternative HTML version deleted]]

ADD REPLY • link 11.8 years ago Pankaj Agarwal ▴ 100

0

Entering edit mode

Hi Pankaj On 16.04.2014 21:59, Pankaj Agarwal wrote: > I followed the vignette for the section on htseq-count and your suggestion as best as I could and got some results. > But I am not sure if I did everything right, so I would really appreciate if you could verify if the following steps is the correct way to analyze this data set: They all look correct. > I did not get the normalization factors in the output, whereas it does show up in the results in the vignette, so I am not sure if I missed some step. Just type "sizeFactors(dds)" to see them. > Also, would you expect to get meaningful results with just two matched pairs of samples with this analysis? I am wondering if it is better to just report the FC rather than the p-values. That depends a lot on the biology and the treatment of the samples. You would need to tell us much more. In general, you should disregard genes without significant adjusted p values even if they have seemingly strong fold changes. On the other hand, if you have very many genes with significant adjusted p values, then you will find the most interesting among these not by looking for the smallest p value but rather by looking for the strongest (shrunken) fold changes. See our preprint for more on this. Simon

ADD REPLY • link 11.8 years ago Simon Anders ★ 3.8k

0

Entering edit mode

Pankaj Agarwal ▴ 100

@pankaj-agarwal-6492

Last seen 7.8 years ago

United States

Hi, I am trying to use DESeq2 for a set of counts obtained via summerizedOverlaps and have the counts for 4 samples in one file. Earlier I had used DESeq2 with counts from htseq-count, which reports the counts for each samples in a separate file. Now that I have these in one file, is there a method analogous to "DESeqDataSetFromHTSeqCount" for constructing the dds object. With htseq-count I had used the following set of commands: "directory" contains the 4 count files. > sampleFiles=list.files(directory) > sampleFiles > [1] "abc.htseq.count" "def.htseq.count" > [3] "pqr.htseq.count" "xyz.htseq.count" > > sampleCondition=c("Tumor","Tumor","Normal","Normal") > samplePatient=c("A","B","A","B") > sampleTable=data.frame(sampleName=sampleFiles, fileName=sampleFiles, condition=sampleCondition, patient=samplePatient) > dds=DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, directory=directory, design= ~ patient + condition) Thanks, - Pankaj ________________________________ From: Michael Love [michaelisaiahlove@gmail.com] Sent: Thursday, April 17, 2014 9:18 AM To: Pankaj Agarwal Subject: RE: [BioC] DESeq2: Paired Test Hi Pankaj For future questions to Bioconductor maintainers, always cc the mailing list. This helps avoids duplicate questions, which we get a lot of. Answer below, On Apr 17, 2014 8:49 AM, "Pankaj Agarwal" <p.agarwal@duke.edu<mailto:p.agarwal@duke.edu>> wrote: > > Thanks for your help and response. > > >> I did not get the normalization factors in the output, whereas it does > >> show up in the results in the vignette, so I am not sure if I missed some step. > > >I'm confused, which section of the vignette are you referencing? > > I was referring to Section 1.5 which report "sizeFactor" as the third column in the table, but when I execute the same function I dont see that column: You have to run DESeq or estimateSizeFactor first, see ?DESeq > > colData(dds) > DataFrame with 4 rows and 2 columns > condition patient > <factor> <factor> > abc.htseq.count Tumor A > def.htseq.count Tumor B > pqr.htseq.count Normal A > xyz.htseq.countt Normal B > > Is the third column in Section 1.5 the output of estimateSizeFactors( dds )? Yes > > Thanks, > > - Pankaj > > -----Original Message----- > From: Michael Love [mailto:michaelisaiahlove@gmail.com<mailto:michae lisaiahlove@gmail.com="">] > Sent: Wednesday, April 16, 2014 4:37 PM > To: Pankaj Agarwal > Subject: Re: [BioC] DESeq2: Paired Test > > hi Pankaj, > > On Wed, Apr 16, 2014 at 3:59 PM, Pankaj Agarwal <p.agarwal@duke.edu<mailto:p.agarwal@duke.edu>> wrote: > > Hi Michael, > > > > I followed the vignette for the section on htseq-count and your > > suggestion as best as I could and got some results. > > But I am not sure if I did everything right, so I would really > > appreciate if you could verify if the following steps is the correct > > way to analyze this data set: > > > > There are 4 samples for two patients: > > Patient A: abc.htseq.count (Tumor), pqr.htseq.count (Normal) Patient > > B: def.htseq.count (Tumor), xyz.htseq.count (Normal) > > > >> sampleFiles=list.files(directory) > >> sampleFiles > > [1] "abc.htseq.count" "def.htseq.count" > > [3] "pqr.htseq.count" "xyz.htseq.count" > >> > >> sampleCondition=c("Tumor","Tumor","Normal","Normal") > >> samplePatient=c("A","B","A","B") > >> sampleTable=data.frame(sampleName=sampleFiles, fileName=sampleFiles, > >> condition=sampleCondition, patient=samplePatient) > >> dds=DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, > >> directory=directory, design= ~ patient + condition) > >> > >> colData(dds) > > DataFrame with 4 rows and 2 columns > > condition patient > > <factor> <factor> > > abc.htseq.count Tumor A > > def.htseq.count Tumor B > > pqr.htseq.count Normal A > > xyz.htseq.countt Normal B > >> > >> dds$condition > > [1] Tumor Tumor Normal Normal > > Levels: Normal Tumor > >> > >> dds$patient > > [1] A B A B > > Levels: A B > >> > >> dds = DESeq(dds) > > estimating size factors > > estimating dispersions > > gene-wise dispersion estimates > > mean-dispersion relationship > > final dispersion estimates > > fitting model and testing > >> > >> result=results(dds) > >> > >> resOrdered2=result[order(result$padj),] > >> > >> head(resOrdered2,20) > > DataFrame with 20 rows and 6 columns > > baseMean log2FoldChange lfcSE stat pvalue > > padj > > <numeric> <numeric> <numeric> <numeric> <numeric> > > <numeric> > > GAL1 873531.57 -11.645190 0.7175710 -16.22863 3.164692e-59 > > 5.627771e-55 > > ... > > ... > > > > Your lines of code look correct. > > > I did not get the normalization factors in the output, whereas it does > > show up in the results in the vignette, so I am not sure if I missed some step. > > I'm confused, which section of the vignette are you referencing? > > > > > Also, would you expect to get meaningful results with just two matched > > pairs of samples with this analysis? I am wondering if it is better > > to just report the FC rather than the p-values. > > The analysis has 4 samples and 3 parameters to estimate (intercept, patient, condition), so you have one residual degree of freedom. So you can estimate the dispersion and the p-values should be meaningful. > > One drawback with having only two matched pairs is that there are too few samples to look for outliers. DESeq2 looks for outliers when there are at least 3 samples per condition (actually, per unique combination of factor variables in the design). I would take a look at the genes with the highest counts and see if these should be filtered manually. > For example, the first gene in your ordered results table looks to me like it just contains an outliers count. However with no replicates, you have to perform such filtering manually. > > Mike > > > > > Thank you, > > > > - Pankaj > > > > > > ________________________________ > > From: Michael Love [michaelisaiahlove@gmail.com<mailto:michaelisaiahlove@gmail.com>] > > Sent: Thursday, April 10, 2014 4:08 PM > > To: Pankaj Agarwal > > Cc: bioconductor@r-project.org<mailto:bioconductor@r-project.org> > > Subject: Re: [BioC] DESeq2: Paired Test > > > > hi Pankaj, > > > > You should follow the multifactor analysis section of the vignette. > > You would include the patient effect in the design like so: > > > > design(dds) <- ~ patient + condition > > > > Where 'patient' and 'condition' are factors which are columns in > > colData(dds), and condition has levels: normal, tumor (where normal is > > the base level). > > > > Then results(dds) will build a result table for tumor vs normal, > > controlling for the patient effect. > > > > Mike > > > > > > On Thu, Apr 10, 2014 at 3:52 PM, Pankaj Agarwal <p.agarwal@duke.edu<mailto:p.agarwal@duke.edu>> wrote: > >> > >> Hi, > >> > >> This query is in reference to the following post > >> https://stat.ethz.ch/pipermail/bioconductor/2010-April/032869.html > >> where it was mentioned that "DESeq does not support paired tests (yet)". > >> I have a need for performing rna-seq data analysis with matched > >> tumor/normal pairs of data and was wondering if this capability is > >> now available in the latest version of DESeq2. I searched the list > >> site and also the latest vignette, but could not find anything > >> indicating this capability is now available in DEseq. I would be > >> grateful if, in case this feature is not yet available, someone could > >> advice me how to do the analysis. I have samples from two patients, and from each patient there is a "normal" and a "tumor" > >> rna-seq sample data. > >> > >> Thanks, > >> > >> - Pankaj > >> > >> [[alternative HTML version deleted]] > >> > >> _______________________________________________ > >> Bioconductor mailing list > >> Bioconductor@r-project.org<mailto:bioconductor@r-project.org> > >> https://stat.ethz.ch/mailman/listinfo/bioconductor > >> Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > [[alternative HTML version deleted]]

ADD COMMENT • link 11.8 years ago Pankaj Agarwal ▴ 100

0

Entering edit mode

hi Pankaj, The output of summarizeOverlaps() is a SummarizedExperiment, which can be used in object construction with: dds <- DESeqDataSet(se, ~ condition) for a SummarizedExperiment, se. See the first part of either of the vignettes, and the manual page for ?DESeqDataSet. Or if you can read in any arbitrary counts from a file (using read.table, read.csv, etc.), then check the manual page for ?DESeqDataSetFromMatrix Mike On Wed, Apr 30, 2014 at 4:37 PM, Pankaj Agarwal <p.agarwal at="" duke.edu=""> wrote: > Hi, > > I am trying to use DESeq2 for a set of counts obtained via > summerizedOverlaps and have the counts for 4 samples in one file. Earlier I > had used DESeq2 with counts from htseq-count, which reports the counts for > each samples in a separate file. Now that I have these in one file, is > there a method analogous to "DESeqDataSetFromHTSeqCount" for constructing > the dds object. With htseq-count I had used the following set of commands: > > "directory" contains the 4 count files. > >> sampleFiles=list.files(directory) >> sampleFiles >> [1] "abc.htseq.count" "def.htseq.count" >> [3] "pqr.htseq.count" "xyz.htseq.count" >> >> sampleCondition=c("Tumor","Tumor","Normal","Normal") >> samplePatient=c("A","B","A","B") >> sampleTable=data.frame(sampleName=sampleFiles, fileName=sampleFiles, >> condition=sampleCondition, patient=samplePatient) >> dds=DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, >> directory=directory, design= ~ patient + condition) > > Thanks, > > - Pankaj > ________________________________ > From: Michael Love [michaelisaiahlove at gmail.com] > > Sent: Thursday, April 17, 2014 9:18 AM > To: Pankaj Agarwal > Subject: RE: [BioC] DESeq2: Paired Test > > Hi Pankaj > > For future questions to Bioconductor maintainers, always cc the mailing > list. This helps avoids duplicate questions, which we get a lot of. Answer > below, > > On Apr 17, 2014 8:49 AM, "Pankaj Agarwal" <p.agarwal at="" duke.edu=""> wrote: >> >> Thanks for your help and response. >> >> >> I did not get the normalization factors in the output, whereas it does >> >> show up in the results in the vignette, so I am not sure if I missed >> >> some step. >> >> >I'm confused, which section of the vignette are you referencing? >> >> I was referring to Section 1.5 which report "sizeFactor" as the third >> column in the table, but when I execute the same function I don?t see that >> column: > > You have to run DESeq or estimateSizeFactor first, see ?DESeq > >> >> colData(dds) >> DataFrame with 4 rows and 2 columns >> condition patient >> <factor> <factor> >> abc.htseq.count Tumor A >> def.htseq.count Tumor B >> pqr.htseq.count Normal A >> xyz.htseq.countt Normal B >> >> Is the third column in Section 1.5 the output of estimateSizeFactors( dds >> )? > > Yes > >> >> Thanks, >> >> - Pankaj >> >> -----Original Message----- >> From: Michael Love [mailto:michaelisaiahlove at gmail.com] >> Sent: Wednesday, April 16, 2014 4:37 PM >> To: Pankaj Agarwal >> Subject: Re: [BioC] DESeq2: Paired Test >> >> hi Pankaj, >> >> On Wed, Apr 16, 2014 at 3:59 PM, Pankaj Agarwal <p.agarwal at="" duke.edu=""> >> wrote: >> > Hi Michael, >> > >> > I followed the vignette for the section on htseq-count and your >> > suggestion as best as I could and got some results. >> > But I am not sure if I did everything right, so I would really >> > appreciate if you could verify if the following steps is the correct >> > way to analyze this data set: >> > >> > There are 4 samples for two patients: >> > Patient A: abc.htseq.count (Tumor), pqr.htseq.count (Normal) Patient >> > B: def.htseq.count (Tumor), xyz.htseq.count (Normal) >> > >> >> sampleFiles=list.files(directory) >> >> sampleFiles >> > [1] "abc.htseq.count" "def.htseq.count" >> > [3] "pqr.htseq.count" "xyz.htseq.count" >> >> >> >> sampleCondition=c("Tumor","Tumor","Normal","Normal") >> >> samplePatient=c("A","B","A","B") >> >> sampleTable=data.frame(sampleName=sampleFiles, fileName=sampleFiles, >> >> condition=sampleCondition, patient=samplePatient) >> >> dds=DESeqDataSetFromHTSeqCount(sampleTable = sampleTable, >> >> directory=directory, design= ~ patient + condition) >> >> >> >> colData(dds) >> > DataFrame with 4 rows and 2 columns >> > condition patient >> > <factor> <factor> >> > abc.htseq.count Tumor A >> > def.htseq.count Tumor B >> > pqr.htseq.count Normal A >> > xyz.htseq.countt Normal B >> >> >> >> dds$condition >> > [1] Tumor Tumor Normal Normal >> > Levels: Normal Tumor >> >> >> >> dds$patient >> > [1] A B A B >> > Levels: A B >> >> >> >> dds = DESeq(dds) >> > estimating size factors >> > estimating dispersions >> > gene-wise dispersion estimates >> > mean-dispersion relationship >> > final dispersion estimates >> > fitting model and testing >> >> >> >> result=results(dds) >> >> >> >> resOrdered2=result[order(result$padj),] >> >> >> >> head(resOrdered2,20) >> > DataFrame with 20 rows and 6 columns >> > baseMean log2FoldChange lfcSE stat pvalue >> > padj >> > <numeric> <numeric> <numeric> <numeric> <numeric> >> > <numeric> >> > GAL1 873531.57 -11.645190 0.7175710 -16.22863 3.164692e-59 >> > 5.627771e-55 >> > ... >> > ... >> > >> >> Your lines of code look correct. >> >> > I did not get the normalization factors in the output, whereas it does >> > show up in the results in the vignette, so I am not sure if I missed >> > some step. >> >> I'm confused, which section of the vignette are you referencing? >> >> > >> > Also, would you expect to get meaningful results with just two matched >> > pairs of samples with this analysis? I am wondering if it is better >> > to just report the FC rather than the p-values. >> >> The analysis has 4 samples and 3 parameters to estimate (intercept, >> patient, condition), so you have one residual degree of freedom. So you can >> estimate the dispersion and the p-values should be meaningful. >> >> One drawback with having only two matched pairs is that there are too few >> samples to look for outliers. DESeq2 looks for outliers when there are at >> least 3 samples per condition (actually, per unique combination of factor >> variables in the design). I would take a look at the genes with the highest >> counts and see if these should be filtered manually. >> For example, the first gene in your ordered results table looks to me like >> it just contains an outliers count. However with no replicates, you have to >> perform such filtering manually. >> >> Mike >> >> > >> > Thank you, >> > >> > - Pankaj >> > >> > >> > ________________________________ >> > From: Michael Love [michaelisaiahlove at gmail.com] >> > Sent: Thursday, April 10, 2014 4:08 PM >> > To: Pankaj Agarwal >> > Cc: bioconductor at r-project.org >> > Subject: Re: [BioC] DESeq2: Paired Test >> > >> > hi Pankaj, >> > >> > You should follow the multifactor analysis section of the vignette. >> > You would include the patient effect in the design like so: >> > >> > design(dds) <- ~ patient + condition >> > >> > Where 'patient' and 'condition' are factors which are columns in >> > colData(dds), and condition has levels: normal, tumor (where normal is >> > the base level). >> > >> > Then results(dds) will build a result table for tumor vs normal, >> > controlling for the patient effect. >> > >> > Mike >> > >> > >> > On Thu, Apr 10, 2014 at 3:52 PM, Pankaj Agarwal <p.agarwal at="" duke.edu=""> >> > wrote: >> >> >> >> Hi, >> >> >> >> This query is in reference to the following post >> >> https://stat.ethz.ch/pipermail/bioconductor/2010-April/032869.html >> >> where it was mentioned that "DESeq does not support paired tests >> >> (yet)". >> >> I have a need for performing rna-seq data analysis with matched >> >> tumor/normal pairs of data and was wondering if this capability is >> >> now available in the latest version of DESeq2. I searched the list >> >> site and also the latest vignette, but could not find anything >> >> indicating this capability is now available in DEseq. I would be >> >> grateful if, in case this feature is not yet available, someone could >> >> advice me how to do the analysis. I have samples from two patients, >> >> and from each patient there is a "normal" and a "tumor" >> >> rna-seq sample data. >> >> >> >> Thanks, >> >> >> >> - Pankaj >> >> >> >> [[alternative HTML version deleted]] >> >> >> >> _______________________________________________ >> >> Bioconductor mailing list >> >> Bioconductor at r-project.org >> >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> >> Search the archives: >> >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > >> >

ADD REPLY • link 11.8 years ago Michael Love 43k

Login before adding your answer.