Question

Questions on doing EdgeR Analysis of Timeseries Data

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 9 minutes ago

WEHI, Melbourne, Australia

Hi Mark, This is an analysis question rather than a devel suggestion, so I've transferred it from Bioc-sig-seq to Bioconductor. Hope you don't mind. If you have divided your samples into two groups, and you have 7 samples, then I recommend you set prior.n=4. See https://stat.ethz.ch/pipermail/bioc-sig- sequencing/2011-June/002042.html for an explanation. I can't answer your other questions without knowing whether you have biological replicates in your experiment. You have 7 samples. Do these samples all correspond to distinct lifecycle stages, or are some stages observed more than once? How many stages are there? Best wishes Gordon > Date: Thu, 9 Jun 2011 18:19:42 +0000 > From: "Lawson, Mark (mjl3p)" <mjl3p at="" virginia.edu=""> > To: "bioc-sig-sequencing at r-project.org" > <bioc-sig-sequencing at="" r-project.org=""> > Subject: [Bioc-sig-seq] Questions on doing EdgeR Analysis of > Timeseries Data > > (I apologize if this message was sent more than once) > > Dear Analysis Gurus > > I am currently performing a gene expression analysis on a plant > parasite. I have mapped Illumina read counts for various stages in this > parasites lifecycle. Of interest for us in this analysis are genes that > are differentially expressed during these lifecycles. To determine this, > I have focused on two types of differential expression: "peaks" and > "cliffs." "Peaks" occur when a gene is differentially expressed in one > time sample (either higher or lower than the remaining samples) and > "cliffs" occur when a gene is differentially expressed between two > groups of sample (for instance higher expression in the first three > samples than the last three). > > To determine these peaks and cliffs, I have been creating groups in > which the desired peak/cliff is "case" and the remaining samples are > "control." I then run common dispersion and/or tagwise dispersion and > extract those reads with an FDR of less than 0.1. So, my questions: > > 1.) How much filtering of data should I do? Right now I have a fair > amount of genes that are expressed in 0, 1, 2 etc. samples. It seems > logical that I would filter out genes that have no expression, but at > what level should it stop? Also, should there be different filtering > depending on the analysis (peak or cliff)? > > 2.) When doing tagwise dispersion, what should I set my prior.n to (I > currently have 7 samples)? Does it depend on the type of analysis? > > 3.) Should I investigate using a more advanced glm based analysis? Any > advice on crafting a design for this? > > 4.) Any other ideas on analyses to perform on a set of timeseries data > with EdgeR? > > I greatly appreciate any help/advice and thank you in advance! > > > Mark J. Lawson, Ph.D. > Bioinformatics Research Scientist > Center for Public Health Genomics, UVA > mlawson at virginia.edu ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

edgeR edgeR • 1.4k views

ADD COMMENT • link 12.9 years ago Gordon Smyth 50k

score 0 · Answer 1 · 2011-06-17

0

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 9 minutes ago

WEHI, Melbourne, Australia

Dear Mark, I'd suggest that you filter out genes that fail to achieve a reasonable count in any library. I can't give you a firm cutoff for this, it depends on the sequencing depth, and the analysis should not be sensitive to this anyway. In my lab, we have tended to keep genes that achieve at least 1 count per thousand reads in at least a minimum number of libraries. If y is a DGEList object, you might use cpm <- t(t(y$counts)/y$samples$lib.size) keep <- rowSums(cpm>1)>0 y.filtered <- y[keep,] Since you have two replicates of one lifecycle, you could base your estimate of variability (the biological coefficient of variation) on these replicates. design <- model.matrix(~lifecycle) y.filtered <- estimateGLMCommonDisp(y.filtered,design) Perhaps also y.filtered <- estimateGLMTrendedDisp(y.filtered,design) Then fit <- glmFit(y.filtered,design) Then you could test for genes that change at all through the 6 lifecycles by lrt <- glmLRT(y.filtered,fit,coef=2:6) topTags(lrt) This will perform a likelihood ratio test for each gene on 5 degrees of freedom. Then you could examine the significant genes to see what patterns they show across the lifecycles: is.sig <- p.adjust(lrt$table$p.value,method="BH") < 0.1 library(limma) plotlines(lrt$coef[is.sig,],first.column.origin=TRUE) Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Tue, 14 Jun 2011, Lawson, Mark (mjl3p) wrote: > Hi Gordon > > Thanks for getting back to me on this. I don't mind the transfer, > whatever leads to answers to my questions is fine by me. > > I am sorry that I forgot to classify the lifecycles of the samples. They > represent 6 distinct lifecycles (the first measured lifecycle has two > samples). Will this affect the prior.n you suggested? > > ~ Mark > > On Jun 10, 2011, at 9:14 PM, Gordon K Smyth wrote: > >> Hi Mark, >> >> This is an analysis question rather than a devel suggestion, so I've >> transferred it from Bioc-sig-seq to Bioconductor. Hope you don't mind. >> >> If you have divided your samples into two groups, and you have 7 >> samples, then I recommend you set prior.n=4. See >> >> https://stat.ethz.ch/pipermail/bioc-sig- sequencing/2011-June/002042.html >> >> for an explanation. >> >> I can't answer your other questions without knowing whether you have >> biological replicates in your experiment. You have 7 samples. Do >> these samples all correspond to distinct lifecycle stages, or are some >> stages observed more than once? How many stages are there? >> >> Best wishes >> Gordon >> >>> Date: Thu, 9 Jun 2011 18:19:42 +0000 >>> From: "Lawson, Mark (mjl3p)" <mjl3p at="" virginia.edu=""> >>> To: "bioc-sig-sequencing at r-project.org" >>> <bioc-sig-sequencing at="" r-project.org=""> >>> Subject: [Bioc-sig-seq] Questions on doing EdgeR Analysis of >>> Timeseries Data >>> >>> (I apologize if this message was sent more than once) >>> >>> Dear Analysis Gurus >>> >>> I am currently performing a gene expression analysis on a plant >>> parasite. I have mapped Illumina read counts for various stages in >>> this parasites lifecycle. Of interest for us in this analysis are >>> genes that are differentially expressed during these lifecycles. To >>> determine this, I have focused on two types of differential >>> expression: "peaks" and "cliffs." "Peaks" occur when a gene is >>> differentially expressed in one time sample (either higher or lower >>> than the remaining samples) and "cliffs" occur when a gene is >>> differentially expressed between two groups of sample (for instance >>> higher expression in the first three samples than the last three). >>> >>> To determine these peaks and cliffs, I have been creating groups in >>> which the desired peak/cliff is "case" and the remaining samples are >>> "control." I then run common dispersion and/or tagwise dispersion and >>> extract those reads with an FDR of less than 0.1. So, my questions: >>> >>> 1.) How much filtering of data should I do? Right now I have a fair >>> amount of genes that are expressed in 0, 1, 2 etc. samples. It seems >>> logical that I would filter out genes that have no expression, but at >>> what level should it stop? Also, should there be different filtering >>> depending on the analysis (peak or cliff)? >>> >>> 2.) When doing tagwise dispersion, what should I set my prior.n to (I >>> currently have 7 samples)? Does it depend on the type of analysis? >>> >>> 3.) Should I investigate using a more advanced glm based analysis? Any >>> advice on crafting a design for this? >>> >>> 4.) Any other ideas on analyses to perform on a set of timeseries data >>> with EdgeR? >>> >>> I greatly appreciate any help/advice and thank you in advance! >>> >>> >>> Mark J. Lawson, Ph.D. >>> Bioinformatics Research Scientist >>> Center for Public Health Genomics, UVA >>> mlawson at virginia.edu ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD COMMENT • link 12.9 years ago Gordon Smyth 50k

0

Entering edit mode

Hi Dr. Smith, Sorry if I am coming back to an issue previously discussed. I am following up on Dr. Lawson's question because it can be useful to him as well, I think. Can you ,please, clarify how to interpret a topTag table containing more than one coefficient? As an example, those are the first two lines of the table I have. I understand why there are the coefficients for day34 and groupNT. What does locConc, LR P.value and FDR refers to? > topTags(lrt_glmfit_ENS_exon_data_c_TMM) Coefficient: day34 groupNT logConc day34 groupNT LR P.Value FDR ENSBTAG00000020056 -7.499402 6.832396 0.16466282 559.8360 2.711126e-122 4.432692e-118 ENSBTAG00000010050 -8.214241 7.723259 -0.23952379 531.3709 4.114163e-116 3.363329e-112 Thanks so much, sir Fernando > sessionInfo() R version 2.13.0 (2011-04-13) Platform: x86_64-pc-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] splines grid stats graphics grDevices utils datasets methods base other attached packages: [1] VennDiagram_1.0.1 baySeq_1.6.0 DESeq_1.4.1 locfit_1.5-6 lattice_0.19-26 akima_0.5-4 edgeR_2.2.5 Biobase_2.12.1 loaded via a namespace (and not attached): [1] annotate_1.30.0 AnnotationDbi_1.14.1 DBI_0.2-5 genefilter_1.34.0 geneplotter_1.30.0 limma_3.8.2 RColorBrewer_1.0-2 RSQLite_0.9-4 [9] survival_2.36-9 tools_2.13.0 xtable_1.5-6 > -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor- bounces@r-project.org] On Behalf Of Gordon K Smyth Sent: Thursday, June 16, 2011 11:51 PM To: Lawson, Mark (mjl3p) Cc: Bioconductor mailing list Subject: Re: [BioC] Questions on doing EdgeR Analysis of Timeseries Data Dear Mark, I'd suggest that you filter out genes that fail to achieve a reasonable count in any library. I can't give you a firm cutoff for this, it depends on the sequencing depth, and the analysis should not be sensitive to this anyway. In my lab, we have tended to keep genes that achieve at least 1 count per thousand reads in at least a minimum number of libraries. If y is a DGEList object, you might use cpm <- t(t(y$counts)/y$samples$lib.size) keep <- rowSums(cpm>1)>0 y.filtered <- y[keep,] Since you have two replicates of one lifecycle, you could base your estimate of variability (the biological coefficient of variation) on these replicates. design <- model.matrix(~lifecycle) y.filtered <- estimateGLMCommonDisp(y.filtered,design) Perhaps also y.filtered <- estimateGLMTrendedDisp(y.filtered,design) Then fit <- glmFit(y.filtered,design) Then you could test for genes that change at all through the 6 lifecycles by lrt <- glmLRT(y.filtered,fit,coef=2:6) topTags(lrt) This will perform a likelihood ratio test for each gene on 5 degrees of freedom. Then you could examine the significant genes to see what patterns they show across the lifecycles: is.sig <- p.adjust(lrt$table$p.value,method="BH") < 0.1 library(limma) plotlines(lrt$coef[is.sig,],first.column.origin=TRUE) Best wishes Gordon --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. Tel: (03) 9345 2326, Fax (03) 9347 0852, smyth at wehi.edu.au http://www.wehi.edu.au http://www.statsci.org/smyth On Tue, 14 Jun 2011, Lawson, Mark (mjl3p) wrote: > Hi Gordon > > Thanks for getting back to me on this. I don't mind the transfer, > whatever leads to answers to my questions is fine by me. > > I am sorry that I forgot to classify the lifecycles of the samples. > They represent 6 distinct lifecycles (the first measured lifecycle has > two samples). Will this affect the prior.n you suggested? > > ~ Mark > > On Jun 10, 2011, at 9:14 PM, Gordon K Smyth wrote: > >> Hi Mark, >> >> This is an analysis question rather than a devel suggestion, so I've >> transferred it from Bioc-sig-seq to Bioconductor. Hope you don't mind. >> >> If you have divided your samples into two groups, and you have 7 >> samples, then I recommend you set prior.n=4. See >> >> https://stat.ethz.ch/pipermail/bioc-sig- sequencing/2011-June/002042.h >> tml >> >> for an explanation. >> >> I can't answer your other questions without knowing whether you have >> biological replicates in your experiment. You have 7 samples. Do >> these samples all correspond to distinct lifecycle stages, or are >> some stages observed more than once? How many stages are there? >> >> Best wishes >> Gordon >> >>> Date: Thu, 9 Jun 2011 18:19:42 +0000 >>> From: "Lawson, Mark (mjl3p)" <mjl3p at="" virginia.edu=""> >>> To: "bioc-sig-sequencing at r-project.org" >>> <bioc-sig-sequencing at="" r-project.org=""> >>> Subject: [Bioc-sig-seq] Questions on doing EdgeR Analysis of >>> Timeseries Data >>> >>> (I apologize if this message was sent more than once) >>> >>> Dear Analysis Gurus >>> >>> I am currently performing a gene expression analysis on a plant >>> parasite. I have mapped Illumina read counts for various stages in >>> this parasites lifecycle. Of interest for us in this analysis are >>> genes that are differentially expressed during these lifecycles. To >>> determine this, I have focused on two types of differential >>> expression: "peaks" and "cliffs." "Peaks" occur when a gene is >>> differentially expressed in one time sample (either higher or lower >>> than the remaining samples) and "cliffs" occur when a gene is >>> differentially expressed between two groups of sample (for instance >>> higher expression in the first three samples than the last three). >>> >>> To determine these peaks and cliffs, I have been creating groups in >>> which the desired peak/cliff is "case" and the remaining samples are >>> "control." I then run common dispersion and/or tagwise dispersion >>> and extract those reads with an FDR of less than 0.1. So, my questions: >>> >>> 1.) How much filtering of data should I do? Right now I have a fair >>> amount of genes that are expressed in 0, 1, 2 etc. samples. It seems >>> logical that I would filter out genes that have no expression, but >>> at what level should it stop? Also, should there be different >>> filtering depending on the analysis (peak or cliff)? >>> >>> 2.) When doing tagwise dispersion, what should I set my prior.n to >>> (I currently have 7 samples)? Does it depend on the type of analysis? >>> >>> 3.) Should I investigate using a more advanced glm based analysis? >>> Any advice on crafting a design for this? >>> >>> 4.) Any other ideas on analyses to perform on a set of timeseries >>> data with EdgeR? >>> >>> I greatly appreciate any help/advice and thank you in advance! >>> >>> >>> Mark J. Lawson, Ph.D. >>> Bioinformatics Research Scientist >>> Center for Public Health Genomics, UVA mlawson at virginia.edu ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:9}}

ADD REPLY • link 12.9 years ago Biase, Fernando ▴ 150

0

Entering edit mode

Counts per million should have been: cpm <- 1e6 * t(t(y$counts)/y$samples$lib.size) Gordon On Fri, 17 Jun 2011, Gordon K Smyth wrote: > Dear Mark, > > I'd suggest that you filter out genes that fail to achieve a reasonable count > in any library. I can't give you a firm cutoff for this, it depends on the > sequencing depth, and the analysis should not be sensitive to this anyway. > In my lab, we have tended to keep genes that achieve at least 1 count per > thousand reads in at least a minimum number of libraries. If y is a DGEList > object, you might use > > cpm <- t(t(y$counts)/y$samples$lib.size) > keep <- rowSums(cpm>1)>0 > y.filtered <- y[keep,] > > Since you have two replicates of one lifecycle, you could base your estimate > of variability (the biological coefficient of variation) on these replicates. > > design <- model.matrix(~lifecycle) > y.filtered <- estimateGLMCommonDisp(y.filtered,design) > > Perhaps also > > y.filtered <- estimateGLMTrendedDisp(y.filtered,design) > > Then > > fit <- glmFit(y.filtered,design) > > Then you could test for genes that change at all through the 6 lifecycles by > > lrt <- glmLRT(y.filtered,fit,coef=2:6) > topTags(lrt) > > This will perform a likelihood ratio test for each gene on 5 degrees of > freedom. Then you could examine the significant genes to see what patterns > they show across the lifecycles: > > is.sig <- p.adjust(lrt$table$p.value,method="BH") < 0.1 > library(limma) > plotlines(lrt$coef[is.sig,],first.column.origin=TRUE) > > Best wishes > Gordon > > --------------------------------------------- > Professor Gordon K Smyth, > Bioinformatics Division, > Walter and Eliza Hall Institute of Medical Research, > 1G Royal Parade, Parkville, Vic 3052, Australia. > Tel: (03) 9345 2326, Fax (03) 9347 0852, > smyth at wehi.edu.au > http://www.wehi.edu.au > http://www.statsci.org/smyth > > On Tue, 14 Jun 2011, Lawson, Mark (mjl3p) wrote: > >> Hi Gordon >> >> Thanks for getting back to me on this. I don't mind the transfer, whatever >> leads to answers to my questions is fine by me. >> >> I am sorry that I forgot to classify the lifecycles of the samples. They >> represent 6 distinct lifecycles (the first measured lifecycle has two >> samples). Will this affect the prior.n you suggested? >> >> ~ Mark >> >> On Jun 10, 2011, at 9:14 PM, Gordon K Smyth wrote: >> >>> Hi Mark, >>> >>> This is an analysis question rather than a devel suggestion, so I've >>> transferred it from Bioc-sig-seq to Bioconductor. Hope you don't mind. >>> >>> If you have divided your samples into two groups, and you have 7 samples, >>> then I recommend you set prior.n=4. See >>> >>> https://stat.ethz.ch/pipermail/bioc-sig- sequencing/2011-June/002042.html >>> >>> for an explanation. >>> >>> I can't answer your other questions without knowing whether you have >>> biological replicates in your experiment. You have 7 samples. Do these >>> samples all correspond to distinct lifecycle stages, or are some stages >>> observed more than once? How many stages are there? >>> >>> Best wishes >>> Gordon >>> >>>> Date: Thu, 9 Jun 2011 18:19:42 +0000 >>>> From: "Lawson, Mark (mjl3p)" <mjl3p at="" virginia.edu=""> >>>> To: "bioc-sig-sequencing at r-project.org" >>>> <bioc-sig-sequencing at="" r-project.org=""> >>>> Subject: [Bioc-sig-seq] Questions on doing EdgeR Analysis of >>>> Timeseries Data >>>> >>>> (I apologize if this message was sent more than once) >>>> >>>> Dear Analysis Gurus >>>> >>>> I am currently performing a gene expression analysis on a plant parasite. >>>> I have mapped Illumina read counts for various stages in this parasites >>>> lifecycle. Of interest for us in this analysis are genes that are >>>> differentially expressed during these lifecycles. To determine this, I >>>> have focused on two types of differential expression: "peaks" and >>>> "cliffs." "Peaks" occur when a gene is differentially expressed in one >>>> time sample (either higher or lower than the remaining samples) and >>>> "cliffs" occur when a gene is differentially expressed between two groups >>>> of sample (for instance higher expression in the first three samples than >>>> the last three). >>>> >>>> To determine these peaks and cliffs, I have been creating groups in which >>>> the desired peak/cliff is "case" and the remaining samples are "control." >>>> I then run common dispersion and/or tagwise dispersion and extract those >>>> reads with an FDR of less than 0.1. So, my questions: >>>> >>>> 1.) How much filtering of data should I do? Right now I have a fair >>>> amount of genes that are expressed in 0, 1, 2 etc. samples. It seems >>>> logical that I would filter out genes that have no expression, but at >>>> what level should it stop? Also, should there be different filtering >>>> depending on the analysis (peak or cliff)? >>>> >>>> 2.) When doing tagwise dispersion, what should I set my prior.n to (I >>>> currently have 7 samples)? Does it depend on the type of analysis? >>>> >>>> 3.) Should I investigate using a more advanced glm based analysis? Any >>>> advice on crafting a design for this? >>>> >>>> 4.) Any other ideas on analyses to perform on a set of timeseries data >>>> with EdgeR? >>>> >>>> I greatly appreciate any help/advice and thank you in advance! >>>> >>>> >>>> Mark J. Lawson, Ph.D. >>>> Bioinformatics Research Scientist >>>> Center for Public Health Genomics, UVA >>>> mlawson at virginia.edu > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}

ADD REPLY • link 12.9 years ago Gordon Smyth 50k

0

Entering edit mode

Thanks for you help on this Gordon. I am still playing around with the analysis but just to make sure I am on the right track, I want to ensure that I created my design correctly. lifecycles <- c("1G", "1G", "2G", "3G", "4G", "61G", "62G") lifecycles <- factor(lifecycles, levels=c("1G", "2G", "3G", "4G", "61G", "62G")) design <- model.matrix(~0+lifecycles) colnames(design) <- levels(lifecycles) (Recall, the first lifecycle has two sets of data) ~ Mark On Jun 17, 2011, at 7:45 PM, Gordon K Smyth wrote: > Counts per million should have been: > > cpm <- 1e6 * t(t(y$counts)/y$samples$lib.size) > > Gordon > > On Fri, 17 Jun 2011, Gordon K Smyth wrote: > >> Dear Mark, >> >> I'd suggest that you filter out genes that fail to achieve a reasonable count in any library. I can't give you a firm cutoff for this, it depends on the sequencing depth, and the analysis should not be sensitive to this anyway. In my lab, we have tended to keep genes that achieve at least 1 count per thousand reads in at least a minimum number of libraries. If y is a DGEList object, you might use >> >> cpm <- t(t(y$counts)/y$samples$lib.size) >> keep <- rowSums(cpm>1)>0 >> y.filtered <- y[keep,] >> >> Since you have two replicates of one lifecycle, you could base your estimate of variability (the biological coefficient of variation) on these replicates. >> >> design <- model.matrix(~lifecycle) >> y.filtered <- estimateGLMCommonDisp(y.filtered,design) >> >> Perhaps also >> >> y.filtered <- estimateGLMTrendedDisp(y.filtered,design) >> >> Then >> >> fit <- glmFit(y.filtered,design) >> >> Then you could test for genes that change at all through the 6 lifecycles by >> >> lrt <- glmLRT(y.filtered,fit,coef=2:6) >> topTags(lrt) >> >> This will perform a likelihood ratio test for each gene on 5 degrees of freedom. Then you could examine the significant genes to see what patterns they show across the lifecycles: >> >> is.sig <- p.adjust(lrt$table$p.value,method="BH") < 0.1 >> library(limma) >> plotlines(lrt$coef[is.sig,],first.column.origin=TRUE) >> >> Best wishes >> Gordon >> >> --------------------------------------------- >> Professor Gordon K Smyth, >> Bioinformatics Division, >> Walter and Eliza Hall Institute of Medical Research, >> 1G Royal Parade, Parkville, Vic 3052, Australia. >> Tel: (03) 9345 2326, Fax (03) 9347 0852, >> smyth at wehi.edu.au >> http://www.wehi.edu.au >> http://www.statsci.org/smyth >> >> On Tue, 14 Jun 2011, Lawson, Mark (mjl3p) wrote: >> >>> Hi Gordon >>> Thanks for getting back to me on this. I don't mind the transfer, whatever leads to answers to my questions is fine by me. >>> I am sorry that I forgot to classify the lifecycles of the samples. They represent 6 distinct lifecycles (the first measured lifecycle has two samples). Will this affect the prior.n you suggested? >>> ~ Mark >>> On Jun 10, 2011, at 9:14 PM, Gordon K Smyth wrote: >>>> Hi Mark, >>>> This is an analysis question rather than a devel suggestion, so I've transferred it from Bioc-sig-seq to Bioconductor. Hope you don't mind. >>>> If you have divided your samples into two groups, and you have 7 samples, then I recommend you set prior.n=4. See >>>> https://stat.ethz.ch/pipermail/bioc-sig- sequencing/2011-June/002042.html >>>> for an explanation. >>>> I can't answer your other questions without knowing whether you have biological replicates in your experiment. You have 7 samples. Do these samples all correspond to distinct lifecycle stages, or are some stages observed more than once? How many stages are there? >>>> Best wishes >>>> Gordon >>>>> Date: Thu, 9 Jun 2011 18:19:42 +0000 >>>>> From: "Lawson, Mark (mjl3p)" <mjl3p at="" virginia.edu=""> >>>>> To: "bioc-sig-sequencing at r-project.org" >>>>> <bioc-sig-sequencing at="" r-project.org=""> >>>>> Subject: [Bioc-sig-seq] Questions on doing EdgeR Analysis of >>>>> Timeseries Data >>>>> (I apologize if this message was sent more than once) >>>>> Dear Analysis Gurus >>>>> I am currently performing a gene expression analysis on a plant parasite. I have mapped Illumina read counts for various stages in this parasites lifecycle. Of interest for us in this analysis are genes that are differentially expressed during these lifecycles. To determine this, I have focused on two types of differential expression: "peaks" and "cliffs." "Peaks" occur when a gene is differentially expressed in one time sample (either higher or lower than the remaining samples) and "cliffs" occur when a gene is differentially expressed between two groups of sample (for instance higher expression in the first three samples than the last three). >>>>> To determine these peaks and cliffs, I have been creating groups in which the desired peak/cliff is "case" and the remaining samples are "control." I then run common dispersion and/or tagwise dispersion and extract those reads with an FDR of less than 0.1. So, my questions: >>>>> 1.) How much filtering of data should I do? Right now I have a fair amount of genes that are expressed in 0, 1, 2 etc. samples. It seems logical that I would filter out genes that have no expression, but at what level should it stop? Also, should there be different filtering depending on the analysis (peak or cliff)? >>>>> 2.) When doing tagwise dispersion, what should I set my prior.n to (I currently have 7 samples)? Does it depend on the type of analysis? >>>>> 3.) Should I investigate using a more advanced glm based analysis? Any advice on crafting a design for this? >>>>> 4.) Any other ideas on analyses to perform on a set of timeseries data with EdgeR? >>>>> I greatly appreciate any help/advice and thank you in advance! >>>>> Mark J. Lawson, Ph.D. >>>>> Bioinformatics Research Scientist >>>>> Center for Public Health Genomics, UVA >>>>> mlawson at virginia.edu >> > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:17}}

ADD REPLY • link 12.9 years ago Lawson, Mark mjl3p ▴ 10

score 0 · Answer 2 · 2011-06-22

Dear Mark, In principle, you can work with any valid design matrix, and there's nothing wrong in principle with the one you have defined. However edgeR, as we've currently programmed it, is only able to do the likelihood ratio test on 5 df that I suggested if you define the design matrix as in my suggested code: model.matrix(~lifecycles) in which each coefficient is compared back to the first lifecycle. In this parametrization, the first coefficient is the level for 1G, the others are 2G-1G, 3G-1G and so on. Best wishes Gordon On Tue, 21 Jun 2011, Lawson, Mark (mjl3p) wrote: > Thanks for you help on this Gordon. > > I am still playing around with the analysis but just to make sure I am on the right track, I want to ensure that I created my design correctly. > > lifecycles <- c("1G", "1G", "2G", "3G", "4G", "61G", "62G") > lifecycles <- factor(lifecycles, levels=c("1G", "2G", "3G", "4G", "61G", "62G")) > design <- model.matrix(~0+lifecycles) > colnames(design) <- levels(lifecycles) > > (Recall, the first lifecycle has two sets of data) > > ~ Mark > > On Jun 17, 2011, at 7:45 PM, Gordon K Smyth wrote: > >> Counts per million should have been: >> >> cpm <- 1e6 * t(t(y$counts)/y$samples$lib.size) >> >> Gordon >> >> On Fri, 17 Jun 2011, Gordon K Smyth wrote: >> >>> Dear Mark, >>> >>> I'd suggest that you filter out genes that fail to achieve a reasonable count in any library. I can't give you a firm cutoff for this, it depends on the sequencing depth, and the analysis should not be sensitive to this anyway. In my lab, we have tended to keep genes that achieve at least 1 count per thousand reads in at least a minimum number of libraries. If y is a DGEList object, you might use >>> >>> cpm <- t(t(y$counts)/y$samples$lib.size) >>> keep <- rowSums(cpm>1)>0 >>> y.filtered <- y[keep,] >>> >>> Since you have two replicates of one lifecycle, you could base your estimate of variability (the biological coefficient of variation) on these replicates. >>> >>> design <- model.matrix(~lifecycle) >>> y.filtered <- estimateGLMCommonDisp(y.filtered,design) >>> >>> Perhaps also >>> >>> y.filtered <- estimateGLMTrendedDisp(y.filtered,design) >>> >>> Then >>> >>> fit <- glmFit(y.filtered,design) >>> >>> Then you could test for genes that change at all through the 6 lifecycles by >>> >>> lrt <- glmLRT(y.filtered,fit,coef=2:6) >>> topTags(lrt) >>> >>> This will perform a likelihood ratio test for each gene on 5 degrees of freedom. Then you could examine the significant genes to see what patterns they show across the lifecycles: >>> >>> is.sig <- p.adjust(lrt$table$p.value,method="BH") < 0.1 >>> library(limma) >>> plotlines(lrt$coef[is.sig,],first.column.origin=TRUE) >>> >>> Best wishes >>> Gordon >>> >>> --------------------------------------------- >>> Professor Gordon K Smyth, >>> Bioinformatics Division, >>> Walter and Eliza Hall Institute of Medical Research, >>> 1G Royal Parade, Parkville, Vic 3052, Australia. >>> Tel: (03) 9345 2326, Fax (03) 9347 0852, >>> smyth at wehi.edu.au >>> http://www.wehi.edu.au >>> http://www.statsci.org/smyth >>> >>> On Tue, 14 Jun 2011, Lawson, Mark (mjl3p) wrote: >>> >>>> Hi Gordon >>>> Thanks for getting back to me on this. I don't mind the transfer, whatever leads to answers to my questions is fine by me. >>>> I am sorry that I forgot to classify the lifecycles of the samples. They represent 6 distinct lifecycles (the first measured lifecycle has two samples). Will this affect the prior.n you suggested? >>>> ~ Mark >>>> On Jun 10, 2011, at 9:14 PM, Gordon K Smyth wrote: >>>>> Hi Mark, >>>>> This is an analysis question rather than a devel suggestion, so I've transferred it from Bioc-sig-seq to Bioconductor. Hope you don't mind. >>>>> If you have divided your samples into two groups, and you have 7 samples, then I recommend you set prior.n=4. See >>>>> https://stat.ethz.ch/pipermail/bioc-sig- sequencing/2011-June/002042.html >>>>> for an explanation. >>>>> I can't answer your other questions without knowing whether you have biological replicates in your experiment. You have 7 samples. Do these samples all correspond to distinct lifecycle stages, or are some stages observed more than once? How many stages are there? >>>>> Best wishes >>>>> Gordon >>>>>> Date: Thu, 9 Jun 2011 18:19:42 +0000 >>>>>> From: "Lawson, Mark (mjl3p)" <mjl3p at="" virginia.edu=""> >>>>>> To: "bioc-sig-sequencing at r-project.org" >>>>>> <bioc-sig-sequencing at="" r-project.org=""> >>>>>> Subject: [Bioc-sig-seq] Questions on doing EdgeR Analysis of >>>>>> Timeseries Data >>>>>> (I apologize if this message was sent more than once) >>>>>> Dear Analysis Gurus >>>>>> I am currently performing a gene expression analysis on a plant parasite. I have mapped Illumina read counts for various stages in this parasites lifecycle. Of interest for us in this analysis are genes that are differentially expressed during these lifecycles. To determine this, I have focused on two types of differential expression: "peaks" and "cliffs." "Peaks" occur when a gene is differentially expressed in one time sample (either higher or lower than the remaining samples) and "cliffs" occur when a gene is differentially expressed between two groups of sample (for instance higher expression in the first three samples than the last three). >>>>>> To determine these peaks and cliffs, I have been creating groups in which the desired peak/cliff is "case" and the remaining samples are "control." I then run common dispersion and/or tagwise dispersion and extract those reads with an FDR of less than 0.1. So, my questions: >>>>>> 1.) How much filtering of data should I do? Right now I have a fair amount of genes that are expressed in 0, 1, 2 etc. samples. It seems logical that I would filter out genes that have no expression, but at what level should it stop? Also, should there be different filtering depending on the analysis (peak or cliff)? >>>>>> 2.) When doing tagwise dispersion, what should I set my prior.n to (I currently have 7 samples)? Does it depend on the type of analysis? >>>>>> 3.) Should I investigate using a more advanced glm based analysis? Any advice on crafting a design for this? >>>>>> 4.) Any other ideas on analyses to perform on a set of timeseries data with EdgeR? >>>>>> I greatly appreciate any help/advice and thank you in advance! >>>>>> Mark J. Lawson, Ph.D. >>>>>> Bioinformatics Research Scientist >>>>>> Center for Public Health Genomics, UVA >>>>>> mlawson at virginia.edu >>> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the addressee. >> You must not disclose, forward, print or use it without the permission of the sender. >> ______________________________________________________________________ > > Mark J. Lawson, Ph.D. > Bioinformatics Research Scientist > Center for Public Health Genomics, UVA > mlawson at virginia.edu > > > > > > > > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}