filtering before using DESeq

Akula, Nirmala NIH/NIMH [C] ▴ 190

@akula-nirmala-nihnimh-c-5007

Hi,

We counted the reads in our RNA-seq data using HTSeq and removed any isoforms that have <5 reads/sample. We then used DESeq for differential expression analysis.

Here's an example of a transcript with the following read counts:

GeneA_cases counts: 85.78942, 19.11753, 1471.813, 61.71464
GeneA_control counts: 2088.722, 2681.746, 2413.892, 1628.187

The DESeq p-value for GeneA is 10^-4. Do we have to filter out transcripts that have high variance between samples, as in the above example, before giving the data to DESeq, or will DESeq take this into account while calculating the normalization?

Thank you very much.

Regards,
Nirmala

Sean Davis 21k

@sean-davis-490

United States

On Fri, Dec 14, 2012 at 2:42 PM, Akula, Nirmala (NIH/NIMH) [C] wrote:

> Do we have to filter out transcripts (that have high variance between samples as shown in the above example) before giving the data to DESeq or will DESeq take this into account while calculating the normalization?

Hi, Nirmala.

If you mean filtering out transcripts that show one or more outliers within a given group, then you should ABSOLUTELY NOT do that, as this will bias your statistical results. If you mean filtering based on overall variance (across groups) to find highly variable transcripts, that is a different story and is acceptable.

Sean
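To make the acceptable kind of filter concrete, here is a minimal sketch in plain Python (not R, and not part of DESeq). It computes each transcript's variance across all samples without ever looking at the case/control labels, which is what keeps the filter from favouring either group. The function name, the toy table, and the threshold of 1000 are all hypothetical; the GeneA values are the thread's example rounded by hand purely for illustration.

```python
from statistics import variance


def filter_by_overall_variance(table, min_variance):
    """Keep transcripts whose variance across ALL samples is >= min_variance.

    Group labels are never consulted, so the filter is blind to which
    samples are cases and which are controls.
    """
    return {
        name: counts
        for name, counts in table.items()
        if variance(counts) >= min_variance
    }


# Hypothetical toy table: four cases followed by four controls per transcript.
table = {
    "GeneA": [86, 19, 1472, 62, 2089, 2682, 2414, 1628],
    "GeneFlat": [100, 101, 99, 100, 100, 102, 98, 100],
}

# The threshold is an arbitrary illustration, not a recommended cut-off.
kept = filter_by_overall_variance(table, min_variance=1000.0)
print(sorted(kept))
```

Because the statistic ignores the labels, reordering the samples does not change which transcripts are kept. Dropping a transcript because one group contains an outlier would not have this property.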

Thanks, Sean, for your response.

Regards,
Nirmala

Dear Akula, Sean,

Besides overall variance, overall sum is also a good filter statistic.

Akula, please note that DESeq expects counts, which need to be non-negative integer values. The values you state are not integers.

Best wishes
Wolfgang

On Dec 14, 2012, at 10:45 PM, Sean Davis wrote:

> If you mean filtering based on overall variance (across groups) to find highly-variable transcripts, that is a different story and is acceptable.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
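The overall-sum filter mentioned above can be sketched the same way. This is a plain-Python illustration under stated assumptions, not DESeq code: the function name, the toy transcripts, and the minimum total of 10 are hypothetical, and the thread itself gives no recommended cut-off.

```python
def filter_by_overall_sum(table, min_total):
    """Keep transcripts whose count summed over ALL samples is >= min_total.

    Like the overall variance, the sum is computed without reference to
    the group labels.
    """
    return {
        name: counts
        for name, counts in table.items()
        if sum(counts) >= min_total
    }


# Hypothetical integer counts: four cases followed by four controls.
table = {
    "GeneA": [86, 19, 1472, 62, 2089, 2682, 2414, 1628],
    "GeneLow": [0, 1, 0, 2, 1, 0, 0, 1],
}

# min_total=10 is an arbitrary illustration, not a recommended cut-off.
kept = filter_by_overall_sum(table, min_total=10)
print(sorted(kept))
```

A transcript with almost no reads in any sample is removed, while one with many reads stays regardless of how those reads are distributed between cases and controls.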

Hi,

What would be a reasonable or widely used cut-off for overall variance and overall sum?

Thanks for pointing out the number format. The example I gave is from the eXpress software, and I rounded the numbers to the closest integer before inputting them into DESeq.

Regards,
Nirmala

On Saturday, December 15, 2012 11:05 AM, Wolfgang Huber wrote:

> besides overall variance, overall sum is also a good filter statistic.

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

EMBL European Molecular Biology Laborat…

On Dec 15, 2012, at 5:53 PM, "Akula, Nirmala (NIH/NIMH) [C]" wrote:

> What would be a reasonable/widely used cut-off for overall variance and overall sum?
>
> Thanks for pointing out the number format. The example I gave is from eXpress software and I rounded the numbers to closest integer before I input into DESeq

Nirmala,

it's a bit more subtle than that. DESeq expects actual counts of fragments; please do read the DESeq vignette. I have no experience with combining eXpress and DESeq, or with whether what you are doing is scientifically valid. Unless you are comfortable with making your own statistical models and strategies, I'd recommend following an established path rather than cutting your own, where you would be on your own.

Best wishes
Wolfgang
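As a rough aid, the requirement that the input be actual counts can at least be screened for mechanically. The sketch below is plain Python with a hypothetical helper name; it only flags entries that are not non-negative whole numbers. Passing this check does not show that rounded abundance estimates are statistically appropriate input, which is the subtler point made above.

```python
def non_count_entries(table):
    """List (transcript, sample_index, value) for every entry that is
    negative or not a whole number."""
    problems = []
    for name, values in table.items():
        for index, value in enumerate(values):
            if value < 0 or not float(value).is_integer():
                problems.append((name, index, value))
    return problems


# The example values from this thread: estimates, not raw fragment counts.
estimates = {
    "GeneA": [85.78942, 19.11753, 1471.813, 61.71464,
              2088.722, 2681.746, 2413.892, 1628.187],
}
# Hypothetical raw counts for comparison.
raw_counts = {"GeneB": [12, 0, 7, 3]}

print(len(non_count_entries(estimates)))   # every value is fractional
print(non_count_entries(raw_counts))       # nothing flagged
```

Running such a check before the analysis makes it obvious when a table came from an abundance estimator rather than from a read counter.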
