Interspecies differential expression of orthologs with edgeR


Gordon Smyth 53k

@gordon-smyth

Last seen 7 hours ago

WEHI, Melbourne, Australia

Dear Assaf,

You are getting the sort of results that I would expect you to get when you try to compare two RNA sources that are very different.

The diagonal lines in the MA plot are simply a result of having low counts (0, 1, 2, etc.) in one species and high counts in the other for the same genes.

When you compare different species, I'd intuitively expect almost every gene to be differentially expressed to some degree. So I'm not surprised that a large proportion of genes are assessed as DE.

That's about as much help as I can give you. I can't give advice that would allow you to get the same sort of results as you might be used to, because comparing different species isn't a normal thing to do.

Best wishes
Gordon

> Date: Fri, 5 Sep 2014 23:22:28 +0300
> From: assaf www
> To: Gordon K Smyth
> Cc: Bioconductor mailing list
> Subject: Re: [BioC] Interspecies differential expression of orthologs with Edger
>
> Thanks Gordon,
>
> To summarize the results I got on the cross-species data, after embedding the length effect in the GLM offset matrix, as in the code you sent, please see the attached MA plot:
>
> 1) for >5 and <-5 log fold change, genes' logFC is positively correlated with mean log CPM, something I haven't seen before in standard edgeR runs.
> 2) most genes with fold change around > 1.3, or < -1.3, are significant, which looks to me too "liberal". Please note that each group contains 6 true biological replicates (variance within each group is large).
>
> The first problem worries me most; any idea is very welcome.
>
> Many thanks,
> Assaf
>
> On Wed, Sep 3, 2014 at 2:08 AM, Gordon K Smyth wrote:
>> On Tue, 2 Sep 2014, assaf www wrote:
>>> Is the edgeR DE analysis built on the assumption that most genes are not differentially expressed, and that only a small portion of them are (say <20%)?
>>
>> Only the calcNormFactors() step of edgeR makes any assumption of this sort. calcNormFactors assumes either that most genes are not DE or that the DE is reasonably symmetric.
>>
>>> I mean, in cross-species studies, or when comparing different tissues of the same organism, if this assumption doesn't hold, should it be a serious concern?
>>
>> In a cross-species comparison there will be many DE genes, but some will be up and some will be down. The DE will not be all in one direction, so I would guess that normalization will not be a serious concern.
>>
>> Of all the concerns with cross-species comparisons, this seems to me to be far from the most serious.
>>
>> Best wishes
>> Gordon
>
> [Attachment scrubbed by the list: crossspecies.png (image/png, 65085 bytes), https://stat.ethz.ch/pipermail/bioconductor/attachments/20140905/c599392b/attachment-0001.png]
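The diagonal bands described above can be reproduced with a little arithmetic. The following is a minimal sketch in base R, not edgeR's own logFC/logCPM computation: the function name `ma_coords`, the pseudo-count `prior = 0.5`, and the example counts are all assumptions made for illustration. It holds the count in one species fixed at 0, 1 or 2, varies the count in the other species, and checks how the log ratio (M) moves with the average log expression (A).

```r
# Simplified MA-plot coordinates for one gene measured in two species.
# 'prior' is an illustrative pseudo-count, not edgeR's internal value.
ma_coords <- function(count_a, count_b, prior = 0.5) {
  la <- log2(count_a + prior)
  lb <- log2(count_b + prior)
  data.frame(A = (la + lb) / 2, M = lb - la)
}

# Made-up high counts in species B.
high <- c(50, 200, 1000, 5000)

# One band per fixed low count k = 0, 1, 2 in species A.
bands <- lapply(0:2, function(k) ma_coords(rep(k, length(high)), high))

# Within a band, M = 2 * A - 2 * log2(k + prior), a straight line of slope 2.
slopes <- sapply(bands, function(b) diff(b$M) / diff(b$A))
slopes
```

Because the low count can only take a few integer values, each value traces its own straight line, which is one way to see why such genes line up diagonally at large positive or negative logFC.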

Normalization Organism edgeR • 2.7k views

ADD COMMENT • link updated 11.4 years ago by assaf www ▴ 140 • written 11.4 years ago by Gordon Smyth 53k


assaf www ▴ 140

@assaf-www-6709

Last seen 6.5 years ago

Dear Gordon,

I am aware of the limitations of cross-species inference. Still, it is critical for me to minimize false positives before the real-time PCR validation stage.

I am just trying to understand some other things that may or may not be related to the cross-species issue. The edgeR manual says that any kind of "genomic feature" may be used, but can a "genomic feature" also be defined as a group of genes? I mean, would it be correct to run edgeR after summing up the counts of genes belonging to specific categories (e.g. gene families), so that instead of having 12,000 genes I end up with, say, 2,000 gene groups? This could also be good for the FDR, etc.

Thanks a lot, all the best,
Assaf

ADD COMMENT • link 11.4 years ago assaf www ▴ 140


On Sun, Sep 7, 2014 at 3:32 PM, assaf www wrote:
> Edger manual says that any kind of "genomic feature" may be used, but can "genomic feature" also be defined as 'groups of genes'? I mean, can it be correct to try Edger after summing up the counts of genes belonging to specific categories (e.g. gene families)? So instead of having 12,000 genes I end up with, say, 2,000 gene groups?
> This can also be good for the FDR, etc.

Hi, Assaf.

edgeR and other related tools will happily use counts from arbitrary genomic features and have been applied to data such as DNase-seq and ChIP-seq. I'm not sure how doing so will "be good for the FDR", but I may misunderstand your point.

Sean
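For readers wondering what "summing up the counts of genes belonging to gene families" looks like in practice, here is a minimal sketch using only base R's `rowsum()`. The count matrix, the gene names and the `family` mapping are made up for illustration; nothing here comes from the poster's data.

```r
# Toy count matrix: 6 genes x 4 samples (values are made up).
counts <- matrix(
  c(10, 12,  0,  1,
     5,  7,  2,  3,
    20, 18, 40, 35,
     0,  1, 15, 12,
     3,  2,  4,  5,
     8,  9,  7,  6),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4))
)

# Hypothetical gene-to-family mapping, one entry per row of 'counts'.
family <- c("famA", "famA", "famB", "famB", "famC", "famC")

# rowsum() adds up the rows that share a family label, sample by sample.
family_counts <- rowsum(counts, group = family)
family_counts
```

The resulting family-by-sample matrix is an ordinary count matrix, so it could be handed to edgeR's `DGEList()` like any other. Note that in this toy example famC sums to the same total in every sample even though its two members drift in opposite directions, so aggregation can hide changes that the individual genes show.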

ADD REPLY • link 11.4 years ago Sean Davis 21k


Hi Sean,

I guess I wasn't clear, sorry.

I mean that in principle it is possible to aggregate genes based on their membership in gene families (or any other criterion), and to compare the sum of read counts per sample per group of genes (usually it would be counts per sample per gene). What I would like to learn is whether such a comparison can be done in edgeR.

About FDR: in the above case, after grouping there are fewer multiple comparisons, and so a lower FDR.

Best,
Assaf

ADD REPLY • link 11.4 years ago assaf www ▴ 140


Hi,

On Mon, Sep 8, 2014 at 1:17 AM, assaf www wrote:
> I mean that in principle it is possible to aggregate genes based on their membership in gene families (or any other criteria), and to compare the sum of read counts per sample per groups of genes (usually it would be counts per sample per genes). What I would be interested to learn is if such comparison can be done in Edger.
>
> About FDR: In the above case, after grouping there are less multiple comparisons, and lower FDR.

Instead of grouping different genes into one "count feature", it sounds like keeping genes separate but doing a gene set enrichment analysis might be more like what you are looking for?

edgeR and limma::voom have these out of the box -- look at the camera and roast functions for further info on that.

HTH,
-steve

--
Steve Lianoglou
Computational Biologist
Genentech

ADD REPLY • link 11.4 years ago Steve Lianoglou ★ 13k


Hi Steve,

I will look into limma::voom (I was not aware of this approach).

Do you mean GO enrichment (e.g., DAVID, goseq, etc.)? If so, then no, it's not what I mean.

I specifically would like to ask whether edgeR (or similar tools) could give a reasonable DE estimate by comparing the sum of counts of groups of genes (instead of single genes). This is a completely different thing: it may allow working around the issue of paralogy/orthology when performing cross-species DE analysis, and I believe it may have multiple other advantages (regardless of cross-species issues). Of course, that is provided it doesn't violate the basic assumptions of these DE analyses and the data can be kept properly normalized; this is my question.

Thanks a lot for the suggestions, I will look into them,
Assaf

ADD REPLY • link 11.4 years ago assaf www ▴ 140


Hi,

On Mon, Sep 8, 2014 at 1:41 PM, assaf www wrote:
> Hi Steve
>
> I will look into limma::voom (was not aware of this approach).

This is "just" another approach to doing differential expression analysis with RNA-seq data, i.e. one could use edgeR, DESeq2, limma::voom, etc. to do "standard" differential expression (count) testing.

> Do you mean GO enrichment (e.g., David/Go-seq/etc), if so then no, it's not what I mean.

No, I do not mean GO enrichment, I really meant "gene set enrichment analysis", it's "a thing" ... look at the help and references listed under ?camera and ?roast. I thought your motivation to group "features" together was to do something along those lines, but from what you say below, it seems not (?)

> I specifically would like to ask if Edger (or similar tools) could give reasonable DE estimation by comparing the sum of counts of groups of genes (instead of single genes).

It's not clear to me what "reasonable" means in this case, to be honest.

> This is a completely different thing - it may possibly allow working around the issue of paralogy-orthology when performing cross-species DE analysis, and may have multiple other advantages I believe (regardless of cross-species things).
> (Of course, in case it doesn't violate the basic assumptions of these DE analyses, and can keep the data properly normalized - this is my question)

Without getting too involved here, my gut feeling is that going about things in this way is making things worse ... not better.

I'm not sure where to start pointing you for some help, and this might be entirely unrelated (but I'll let you figure that one out ;-), but maybe you can take a look at the paper written by the DEXSeq folks:

Drift and conservation of differential exon usage across tissues in primate species
http://www.pnas.org/content/110/38/15377.short

I admit that this isn't directly doing what you are doing, but perhaps they touch upon some issues that might be of help to you ... to be honest, however, I haven't given it a good read through, even though it has been on my "readme" list for some time (over a year, apparently!).

-steve

--
Steve Lianoglou
Computational Biologist
Genentech
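To make the remark about limma::voom being "just another approach" concrete, here is a minimal sketch of a standard voom analysis. It assumes the edgeR and limma packages are installed, and the counts are simulated purely for illustration (no real differential expression is built in), so the particular genes it reports mean nothing.

```r
library(edgeR)   # attaching edgeR also attaches limma

set.seed(42)
ngenes <- 200
group <- factor(rep(c("A", "B"), each = 6))

# Simulated negative binomial counts; illustration only.
counts <- matrix(rnbinom(ngenes * 12, mu = 100, size = 10), nrow = ngenes,
                 dimnames = list(paste0("gene", seq_len(ngenes)), NULL))

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~ group)

# voom() converts counts to log2-CPM values with precision weights,
# after which the usual limma linear-model pipeline applies.
v <- voom(y, design)
fit <- eBayes(lmFit(v, design))
top <- topTable(fit, coef = 2, number = 5)
top
```

The same `DGEList` object and design matrix could equally be passed to edgeR's own GLM functions; the choice between the two is a matter of preference and data characteristics rather than of different input.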

ADD REPLY • link 11.4 years ago Steve Lianoglou ★ 13k


Thanks for the ideas. You probably mean this method?
http://www.broadinstitute.org/gsea/index.jsp

I will also check whether DEXSeq can help here.

Assaf

ADD REPLY • link 11.4 years ago assaf www ▴ 140


Dear Assaf,

Please type

  library(edgeR)
  ?roast.DGEList

to see the roast gene set test that Steve was referring to. It tests whether a set of genes is differentially expressed as a group.

Gordon
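As a rough illustration of the roast test mentioned above, the sketch below runs `roast()` on a `DGEList` for one gene set. It assumes edgeR is installed; the counts are simulated, and the "family" (the first 20 genes, with counts tripled in the second group) is a hypothetical set invented for the example.

```r
library(edgeR)

set.seed(1)
ngenes <- 500
group <- factor(rep(c("speciesA", "speciesB"), each = 6))

# Simulated negative binomial counts; illustration only.
counts <- matrix(rnbinom(ngenes * 12, mu = 50, size = 5), nrow = ngenes)
rownames(counts) <- paste0("gene", seq_len(ngenes))

# Hypothetical gene family: the first 20 genes, shifted up in the second group.
family_idx <- 1:20
counts[family_idx, 7:12] <- counts[family_idx, 7:12] * 3

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~ group)
y <- estimateDisp(y, design)

# Test whether the set as a whole is DE for the group coefficient.
res <- roast(y, index = family_idx, design = design, contrast = 2, nrot = 999)
res
```

Unlike summing counts, this keeps every gene as its own feature and asks the group-level question afterwards, so genes within the set are allowed to differ in abundance and variability.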

ADD REPLY • link 11.4 years ago Gordon Smyth 53k


Dear Assaf,

As Sean and the edgeR manual have already told you, you can define the genomic features any way you like. You can still do a comparison using edgeR. Why do you keep asking?

However, it is your responsibility (not ours) to make sure that the genomic features you have defined make biological sense for your specific problem. The transcripts arising from within each genomic feature need to behave reasonably consistently, or else you need to be interested only in the aggregate behaviour.

I can see why it might make sense to define genomic regions based on ortholog families rather than individual genes. Whether it makes sense to group together large families of genes, I am a bit sceptical. A roast() test would seem more appropriate for that sort of thing.

You are assuming that reducing the number of features by grouping will lower the FDR. This does not necessarily follow.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth

On Mon, 8 Sep 2014, assaf www wrote:

> Hi Sean
>
> I guess I'm not clear, sorry.
>
> I mean that in principle it is possible to aggregate genes based on their membership in gene families (or any other criteria), and to compare the sum of read counts per sample per group of genes (usually it would be counts per sample per gene). What I would be interested to learn is whether such a comparison can be done in edgeR.
>
> About FDR: in the above case, after grouping there are fewer multiple comparisons, and lower FDR.
>
> best
> Assaf
>
> On Mon, Sep 8, 2014 at 12:01 AM, Sean Davis <sdavis2 at mail.nih.gov> wrote:
>
>> On Sun, Sep 7, 2014 at 3:32 PM, assaf www <assafwww at gmail.com> wrote:
>>
>>> Dear Gordon
>>>
>>> I am aware of the limitations of cross-species inference. Still, it is critical for me to minimize false positives before the real-time PCR validation stage.
>>>
>>> Just trying to understand some other things that may, or may not, be related to the cross-species issue: the edgeR manual says that any kind of "genomic feature" may be used, but can a "genomic feature" also be defined as a 'group of genes'? I mean, can it be correct to try edgeR after summing up the counts of genes belonging to specific categories (e.g. gene families), so that instead of having 12,000 genes I end up with, say, 2,000 gene groups? This can also be good for the FDR, etc.
>>>
>>> Thanks a lot, all the best,
>>> Assaf
>>
>> Hi, Assaf.
>>
>> edgeR and other related tools will happily use counts from arbitrary genomic features and have been applied to data such as DNAse-Seq and ChIP-Seq. I'm not sure how doing so will "be good for the FDR", but I may misunderstand your point.
>>
>> Sean
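The two points in this reply, aggregation and FDR, can be illustrated with a small base-R sketch. It is not from the thread: the count matrix, the `family` labels and the p-values are made-up toy values, and `rowsum()` and `p.adjust()` are used here only to show the mechanics, not as the recommended edgeR workflow.

```r
# Toy counts: 6 genes x 4 samples (hypothetical values, illustration only).
# Samples 1-2 stand for one group, samples 3-4 for the other.
counts <- matrix(
  c(10, 12, 30, 28,
     0,  1, 15, 14,
     5,  6, 20, 22,
    20, 18,  4,  3,
     8,  9,  8, 10,
     3,  2, 40, 45),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4))
)

# Hypothetical gene-family membership, one label per gene (row)
family <- c("famA", "famA", "famB", "famB", "famC", "famC")

# Sum read counts within each family, separately for each sample
family_counts <- rowsum(counts, group = family)
print(family_counts)

# Hypothetical p-values: 6 gene-level tests versus 3 family-level tests.
# Aggregation changes the p-values themselves, not just how many there are.
p_gene   <- c(0.001, 0.004, 0.010, 0.020, 0.300, 0.600)
p_family <- c(0.030, 0.400, 0.700)

# Benjamini-Hochberg adjusted p-values for each set of tests
adj_gene   <- p.adjust(p_gene, method = "BH")
adj_family <- p.adjust(p_family, method = "BH")

# Number of features called at a 5% FDR cut-off in each analysis
print(sum(adj_gene < 0.05))
print(sum(adj_family < 0.05))
```

In this toy data the two genes in `famB` change in opposite directions, so the summed family counts are nearly flat across samples; this is the situation where the transcripts within a feature do not behave consistently. The second half shows that running fewer tests does not by itself give smaller adjusted p-values, because the outcome depends on the p-values obtained after aggregation.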

Gordon Smyth, 11.4 years ago


Thanks Gordon and Sean,

OK, I see what you mean now about roast(). Sorry for the mess, but your answers are highly informative for me. I guess that simply aggregating, for example, the 1,500 members of the olfactory receptor gene family would just increase the mess. Let me read up on and check both the roast() approach and the aggregation.

Assaf
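For anyone else following up on the roast() suggestion, here is a minimal sketch of a gene-set test for one gene family, kept at the gene level rather than summing counts. It is not code from this thread: the counts are simulated, the family membership (`family_idx`) and the two-species design are assumptions made up for illustration, and voom() followed by limma's roast() is only one possible route.

```r
library(edgeR)   # DGEList(), calcNormFactors()
library(limma)   # voom(), roast()

set.seed(1)

# Simulated counts: 200 genes x 12 samples, 6 replicates per species
# (illustrative data only, with no real differential expression built in)
counts <- matrix(rnbinom(200 * 12, mu = 50, size = 5), nrow = 200)
rownames(counts) <- paste0("gene", 1:200)

species <- factor(rep(c("A", "B"), each = 6))
design  <- model.matrix(~ species)   # column 2 is the species B vs A effect

# Normalize and convert to log-CPM with precision weights
y <- DGEList(counts = counts)
y <- calcNormFactors(y)
v <- voom(y, design)

# Hypothetical gene family: here simply the first 20 rows
family_idx <- 1:20

# Rotation gene-set test: is the family, as a set, DE between species?
res <- roast(v, index = family_idx, design = design, contrast = 2, nrot = 999)
print(res)
```

The printed result reports p-values for the "Up", "Down", "UpOrDown" and "Mixed" alternatives. The "Mixed" row is the relevant one when family members may change in either direction, which is exactly the case where summing counts across the family would hide the effect.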

assaf www, 11.4 years ago
