(edgeR) Statistical justification of partitioning a dataset for multiple analyses


Adriaan Sticker ▴ 90

@adriaan-sticker-6368

Last seen 9.6 years ago

Dear all,

I'm analysing already-mapped reads from sequencing data for differential expression with edgeR. My experimental setup is as follows: I have samples from 4 different subjects. Material from each subject was treated with 2 different treatments (and a control) at 2 time points.

I want to analyse the effect of the treatments (compared to control and compared to each other).

In edgeR I used the following design:

model.matrix(~ subject + Treatment + Time + Treatment:Time)

I considered 2 strategies to analyse the data:

Estimate the parameters of the design above from all the data (all samples) and use different contrasts to get the differentially expressed genes I want.

OR

Fit the parameters using only the samples of the two treatments I want to compare (e.g. control vs treatment 1, treatment 1 vs treatment 2). Repeat this 3 times until I have compared all 3 treatments with each other. So actually 3 different analyses, each using only a subset (2/3) of the data.

I noticed that I could find considerably more significantly differentially expressed genes between 2 treatments with the second approach. But I wonder how correct this approach is. Will I, for example, have problems with multiple testing? (I control each analysis at an FDR of 5% with Benjamini-Hochberg.)

Thanks in advance,
Kind regards

Sequencing edgeR • 986 views

ADD COMMENT • link updated 10.2 years ago by Ryan C. Thompson ★ 7.9k • written 10.2 years ago by Adriaan Sticker ▴ 90


Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

Hi Adriaan,

If I understand correctly, you have 3 different treatments, i.e. control, treatment 1, and treatment 2. You have fit the same model formula to the full dataset as well as to all 3 combinations of only 2 treatments, and you are getting substantially different results between the 3-treatment fit and the 2-treatment fits.

I think the first thing you need to do is look at the result of plotBCV for each analysis. It is possible that one of your treatments has substantially more biological variability across all genes than the others. edgeR assumes that each gene has the same BCV across all conditions, so that it can more robustly estimate a single dispersion value for each gene. So look at the plotBCV output from all your analyses and see whether the BCV estimates differ substantially. If they do, that would surely explain what you are seeing.

You may also want to estimate dispersions from each treatment group individually (drop Treatment from the model formula in this case). The tagwise dispersions will not be very robust in this case, but the trended and common dispersions can help you figure out which treatment has the most biological variability.

If the dispersion estimates don't explain your differing p-values, ask back here and maybe someone else will have another idea.

Good luck,

-Ryan
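The dispersion comparison suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the sample sheet, the column names (`subject`, `treatment`, `time`), the helper `fit_disp`, and the simulated counts are all hypothetical stand-ins for the real data, and `estimateDisp` is used as one convenient way to get common, trended and tagwise dispersions in a single call.

```r
library(edgeR)

set.seed(1)

# Hypothetical sample sheet: 4 subjects x 3 treatments x 2 time points = 24 samples
samples <- expand.grid(
  subject   = factor(paste0("s", 1:4)),
  treatment = factor(c("control", "treat1", "treat2")),
  time      = factor(c("t1", "t2"))
)

# Simulated negative binomial counts stand in for the real count matrix
n_genes <- 500
counts <- matrix(rnbinom(n_genes * nrow(samples), mu = 200, size = 100),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Hypothetical helper: estimate dispersions on a subset of samples
fit_disp <- function(keep, formula) {
  s <- droplevels(samples[keep, ])
  design <- model.matrix(formula, data = s)
  y <- DGEList(counts = counts[, keep])
  y <- calcNormFactors(y)
  estimateDisp(y, design)
}

full_formula <- ~ subject + treatment * time

# Common BCV from the full dataset
y_all   <- fit_disp(rep(TRUE, nrow(samples)), full_formula)
bcv_all <- sqrt(y_all$common.dispersion)

# Common BCV for each pair of treatments (the "split" analyses)
pairs <- combn(levels(samples$treatment), 2, simplify = FALSE)
bcv_pairs <- sapply(pairs, function(p) {
  y <- fit_disp(samples$treatment %in% p, full_formula)
  sqrt(y$common.dispersion)
})
names(bcv_pairs) <- sapply(pairs, paste, collapse = " + ")

# Common BCV within each treatment alone (treatment dropped from the formula)
bcv_single <- sapply(levels(samples$treatment), function(tr) {
  y <- fit_disp(samples$treatment == tr, ~ subject + time)
  sqrt(y$common.dispersion)
})

# Visual check of the common, trended and tagwise estimates for one fit
plotBCV(y_all)
```

Comparing `bcv_all`, `bcv_pairs` and `bcv_single` side by side, together with the corresponding `plotBCV` plots, gives a quick impression of whether one treatment group is noticeably more variable than the others.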

ADD COMMENT • link 10.2 years ago Ryan C. Thompson ★ 7.9k


Dear Ryan,

Thanks for your input. I did as you suggested.

For all treatment groups combined I got a common BCV of 0.08.

When I split my dataset into 3 treatment groups and calculate the BCV for each separately, I get the following common BCVs:

control: 0.081
treatment 1: 0.085
treatment 2: 0.096

When I split the data for each pairwise analysis, I get the following common BCVs:

control + treatment 1: 0.078
control + treatment 2: 0.084
treatment 1 + treatment 2: 0.082

So it seems that treatment 2 has some extra BCV compared to the others, but these differences are not so big when you look at each pairwise analysis. I also don't think the BCVs for each analysis look much different when you look at the BCV plots themselves (attached).

I have to revise my statement about finding more genes after splitting the dataset compared to an analysis on the full dataset:

I find more genes (almost double) for treatment 1 vs control when I split the dataset.
I find fewer genes (almost half) for treatment 2 vs control when I split the dataset.
I find more or fewer genes (depending on which time point you look at) for treatment 2 vs treatment 1 when I split the dataset.

This puzzles me a bit. But in general, when all BCVs are more or less the same, would you gain something by splitting the dataset, or doesn't that make much sense statistically?

Best regards,
Adriaan

Attachments (scrubbed from the mailing-list archive): bcv_all.png, bcv_control_treat1.png, bcv_control_treat2.png, bcv_treat1_treat2.png

ADD REPLY • link 10.2 years ago Adriaan Sticker ▴ 90


On Fri Jan 31 07:01:51 2014, Adriaan Sticker wrote:

> when all BCVs are more or less the same. Would you gain something by splitting the dataset or doesn't that make much sense statistically?

No. When the BCVs are consistent across treatments, you want to combine all of them into one dataset to get the most robust BCV estimates possible.
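For the single-fit strategy, the design from the original question and the contrasts of interest can be written out in base R as below. This is a minimal sketch, assuming a balanced layout and hypothetical factor levels (`s1` to `s4`, `control`/`treat1`/`treat2`, `t1`/`t2`); the helper `make_contrast` and the contrast names are illustrative only.

```r
# Hypothetical balanced sample sheet: 4 subjects x 3 treatments x 2 time points
samples <- expand.grid(
  subject   = factor(paste0("s", 1:4)),
  Treatment = factor(c("control", "treat1", "treat2")),
  Time      = factor(c("t1", "t2"))
)

# The design from the question, fitted once to all samples
design <- model.matrix(~ subject + Treatment + Time + Treatment:Time,
                       data = samples)

# Hypothetical helper: build a contrast vector from named coefficient weights
make_contrast <- function(weights) {
  v <- setNames(numeric(ncol(design)), colnames(design))
  v[names(weights)] <- weights
  v
}

# With control and t1 as reference levels, the treatment main effects are the
# effects at t1; the interaction terms add the extra effect at t2
contrasts <- cbind(
  treat1_vs_control_t1 = make_contrast(c(Treatmenttreat1 = 1)),
  treat2_vs_control_t1 = make_contrast(c(Treatmenttreat2 = 1)),
  treat2_vs_treat1_t1  = make_contrast(c(Treatmenttreat2 = 1,
                                         Treatmenttreat1 = -1)),
  treat2_vs_treat1_t2  = make_contrast(c(Treatmenttreat2 = 1,
                                         Treatmenttreat1 = -1,
                                         "Treatmenttreat2:Timet2" = 1,
                                         "Treatmenttreat1:Timet2" = -1))
)

# Sanity check: a contrast should equal the difference between the design
# rows of the two conditions being compared (same subject)
row_of <- function(tr, tm) {
  design[samples$subject == "s1" &
           samples$Treatment == tr &
           samples$Time == tm, ]
}
check <- row_of("treat2", "t2") - row_of("treat1", "t2")
```

Each column of `contrasts` could then be tested against the one model fitted to all samples, for example with edgeR's `glmFit(y, design)` followed by `glmLRT(fit, contrast = contrasts[, "treat2_vs_treat1_t2"])`, so that every comparison shares the same dispersion estimates.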

ADD REPLY • link 10.2 years ago Ryan C. Thompson ★ 7.9k
