goseq analysis
2
0
Entering edit mode
Dave Tang ▴ 210
@dave-tang-4661
Last seen 6.5 years ago
Australia/Perth/UWA
Hello, In the vignette (17th March 2012) of the goseq package (page 6), a list of differentially expressed genes produced by edgeR is used as input into goseq. However if I were interested in over represented GO terms in either UP or DOWN regulated genes, I should just input genes that have a POSITIVE or NEGATIVE fold change (with an adjusted p-value < 0.05) into goseq? It sounds obvious, but I'm not sure. Also I have some questions regarding the graph on page 9. The x-axis is bias.data, which according to the vignette is usually the "gene length" or "number of counts". I can understand "gene length" but I don't understand what "number of counts" refers to. I hand picked two genes and it seems that bias.data is the gene length for these two genes. Therefore my interpretation of the graph on page 9 is that longer genes are proportionally more differentially expressed; is this correct? And lastly I'm working with a list of differentially expressed features (CAGE tags), which can be annotated to genes based on genome mapping. However a small subset of these features cannot be annotated and I have discarded them from the analysis since they cannot be associated to GO terms. Is this potentially disastrous? Many thanks, -- Dave
GO graph edgeR goseq GO graph edgeR goseq • 2.8k views
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 21 months ago
United States
Hi, I'm not the author of this package, but I'll give you my best guess: On Thu, Nov 1, 2012 at 5:18 AM, Dave Tang <davetingpongtang at="" gmail.com=""> wrote: > Hello, > > In the vignette (17th March 2012) of the goseq package (page 6), a list of > differentially expressed genes produced by edgeR is used as input into > goseq. However if I were interested in over represented GO terms in either > UP or DOWN regulated genes, I should just input genes that have a POSITIVE > or NEGATIVE fold change (with an adjusted p-value < 0.05) into goseq? It > sounds obvious, but I'm not sure. Right. > Also I have some questions regarding the graph on page 9. The x-axis is > bias.data, which according to the vignette is usually the "gene length" or > "number of counts". I can understand "gene length" but I don't understand > what "number of counts" refers to. It is likely referring to the number of reads assigned to that gene when it was assessed for DE. By the structure of the data.frame, it seems that you might just add up the number of read counts between the two conditions under test and put it there. While I haven't tried this, I'd suspect that this has some (strong(?)) correlation with the gene's length. > I hand picked two genes and it seems that > bias.data is the gene length for these two genes. Therefore my > interpretation of the graph on page 9 is that longer genes are > proportionally more differentially expressed; is this correct? Also correct. > And lastly I'm working with a list of differentially expressed features > (CAGE tags), which can be annotated to genes based on genome mapping. > However a small subset of these features cannot be annotated and I have > discarded them from the analysis since they cannot be associated to GO > terms. Is this potentially disastrous? Depends on how you define disaster? If I were you, I'd try and assign tags that are intergenic but only slightly 5' upstream from annotated genes to the downstream gene. I'm leaving the definition of "slightly" undefined, though :-) Also, a disaster you might try to avert is whether or not using goseq is appropriate for your analysis to begin with. If you are using goseq to correct for a length bias in detecting differential expression, you might explore whether or not CAGE data is subject to this bias at all. I think the "common" understanding is that tag-based methods generally don't suffer from this, see: Protocol Dependence of Sequencing-Based Gene Expression Measurements http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.001 9287 And in the discussion of: Transcript length bias in RNA-seq data confounds systems biology http://www.biology-direct.com/content/4/1/14 HTH, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT
0
Entering edit mode
Hi Steve, Thank you for your reply. On Thu, 01 Nov 2012 21:41:16 +0900, Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> wrote: [snipped] >> Also I have some questions regarding the graph on page 9. The x-axis is >> bias.data, which according to the vignette is usually the "gene length" >> or >> "number of counts". I can understand "gene length" but I don't >> understand >> what "number of counts" refers to. > > It is likely referring to the number of reads assigned to that gene > when it was assessed for DE. By the structure of the data.frame, it > seems that you might just add up the number of read counts between the > two conditions under test and put it there. While I haven't tried > this, I'd suspect that this has some (strong(?)) correlation with the > gene's length. It was confusing because for my analysis I just gave goseq a vector of all the genes assayed and the DE status, so there wasn't any count information. I guess in this context, I'll just assume that bias.data is just the length of the gene. [snipped] >> And lastly I'm working with a list of differentially expressed features >> (CAGE tags), which can be annotated to genes based on genome mapping. >> However a small subset of these features cannot be annotated and I have >> discarded them from the analysis since they cannot be associated to GO >> terms. Is this potentially disastrous? > > Depends on how you define disaster? Sorry for the lack of clarity; I should avoid using vague words. I guess the worst case scenario is that I get results that are flawed because of the analysis. > If I were you, I'd try and assign tags that are intergenic but only > slightly 5' upstream from annotated genes to the downstream gene. I'm > leaving the definition of "slightly" undefined, though :-) That's almost how I annotated the CAGE tags. So I used the starting position of all Ensembl genes, and if a tag was 400 bp upstream or downstream of this position, I annotated this tag with that gene. > Also, a disaster you might try to avert is whether or not using goseq > is appropriate for your analysis to begin with. I was going to use GOstats for this but thought "ah a newish package, let's try it out!" From the vignette it mentions that using their methods when there is no length bias does no harm. They also have the hypergeometric method implemented, which doesn't take into account the length bias, so I also used that. Actually I got different results when I did the analysis accounting for length bias and not accounting for length bias; I got no enriched GO terms after pvalue adjustment when I did the hypergeometric test compared to 20 enriched GO terms using the Wallenius approximation method. > If you are using goseq to correct for a length bias in detecting > differential expression, you might explore whether or not CAGE data is > subject to this bias at all. I think the "common" understanding is > that tag-based methods generally don't suffer from this, see: > > Protocol Dependence of Sequencing-Based Gene Expression Measurements > http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0 019287 > > And in the discussion of: > > Transcript length bias in RNA-seq data confounds systems biology > http://www.biology-direct.com/content/4/1/14 Thanks for the papers (which I will read in more detail later) and the comments; it was very useful. Much appreciated. Cheers, -- Dave
ADD REPLY
0
Entering edit mode
@alicia-oshlack-4634
Last seen 10.2 years ago
Hi Dave, I¹d just like to mention that I don¹t believe that using gene lengths for CAGE data is the right thing to do because you are not counting the number of reads across the whole transcript, which will correlate with gene length. Therefore I believe you should not use the automatic fetching of gene lengths as your bias data. It seems likely that your power to detect DE for CAGE data will depend on the total counts for each tag so your bias.data would be total counts for each tag which you will have to supply. Hope this makes sense. Cheers, Alicia Hi Steve, Thank you for your reply. On Thu, 01 Nov 2012 21:41:16 +0900, Steve Lianoglou <mailinglist.honeypot at="" gmail.com=""> wrote: [snipped] >> Also I have some questions regarding the graph on page 9. The x-axis is >> bias.data, which according to the vignette is usually the "gene length" >> or >> "number of counts". I can understand "gene length" but I don't >> understand >> what "number of counts" refers to. > > It is likely referring to the number of reads assigned to that gene > when it was assessed for DE. By the structure of the data.frame, it > seems that you might just add up the number of read counts between the > two conditions under test and put it there. While I haven't tried > this, I'd suspect that this has some (strong(?)) correlation with the > gene's length. It was confusing because for my analysis I just gave goseq a vector of all the genes assayed and the DE status, so there wasn't any count information. I guess in this context, I'll just assume that bias.data is just the length of the gene. [snipped] >> And lastly I'm working with a list of differentially expressed features >> (CAGE tags), which can be annotated to genes based on genome mapping. >> However a small subset of these features cannot be annotated and I have >> discarded them from the analysis since they cannot be associated to GO >> terms. Is this potentially disastrous? > > Depends on how you define disaster? Sorry for the lack of clarity; I should avoid using vague words. I guess the worst case scenario is that I get results that are flawed because of the analysis. > If I were you, I'd try and assign tags that are intergenic but only > slightly 5' upstream from annotated genes to the downstream gene. I'm > leaving the definition of "slightly" undefined, though :-) That's almost how I annotated the CAGE tags. So I used the starting position of all Ensembl genes, and if a tag was 400 bp upstream or downstream of this position, I annotated this tag with that gene. > Also, a disaster you might try to avert is whether or not using goseq > is appropriate for your analysis to begin with. I was going to use GOstats for this but thought "ah a newish package, let's try it out!" From the vignette it mentions that using their methods when there is no length bias does no harm. They also have the hypergeometric method implemented, which doesn't take into account the length bias, so I also used that. Actually I got different results when I did the analysis accounting for length bias and not accounting for length bias; I got no enriched GO terms after pvalue adjustment when I did the hypergeometric test compared to 20 enriched GO terms using the Wallenius approximation method. > If you are using goseq to correct for a length bias in detecting > differential expression, you might explore whether or not CAGE data is > subject to this bias at all. I think the "common" understanding is > that tag-based methods generally don't suffer from this, see: > > Protocol Dependence of Sequencing-Based Gene Expression Measurements > http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0 019287 > > And in the discussion of: > > Transcript length bias in RNA-seq data confounds systems biology > http://www.biology-direct.com/content/4/1/14 Thanks for the papers (which I will read in more detail later) and the comments; it was very useful. Much appreciated. Cheers, -- Dave ______________________________________________________________________ This email has been scanned by the Symantec Email Security.cloud service. For more information please visit http://www.symanteccloud.com ______________________________________________________________________ [[alternative HTML version deleted]]
ADD COMMENT
0
Entering edit mode
Hi Alicia, Thank you for your reply and comments. It definitely makes much more sense now. I shall redo the analysis with the tag counts as my bias.data. Cheers, Dave On Fri, 02 Nov 2012 20:40:50 +0900, Alicia Oshlack <alicia.oshlack at="" mcri.edu.au=""> wrote: > Hi Dave, > > I?d just like to mention that I don?t believe that using gene lengths for > CAGE data is the right thing to do because you are not counting the > number > of reads across the whole transcript, which will correlate with gene > length. > Therefore I believe you should not use the automatic fetching of gene > lengths as your bias data. It seems likely that your power to detect DE > for > CAGE data will depend on the total counts for each tag so your bias.data > would be total counts for each tag which you will have to supply. > > Hope this makes sense. > > Cheers, > Alicia > > > Hi Steve, > > Thank you for your reply. > > On Thu, 01 Nov 2012 21:41:16 +0900, Steve Lianoglou > <mailinglist.honeypot at="" gmail.com=""> wrote: > > [snipped] > >>> Also I have some questions regarding the graph on page 9. The x-axis is >>> bias.data, which according to the vignette is usually the "gene length" >>> or >>> "number of counts". I can understand "gene length" but I don't >>> understand >>> what "number of counts" refers to. >> >> It is likely referring to the number of reads assigned to that gene >> when it was assessed for DE. By the structure of the data.frame, it >> seems that you might just add up the number of read counts between the >> two conditions under test and put it there. While I haven't tried >> this, I'd suspect that this has some (strong(?)) correlation with the >> gene's length. > > It was confusing because for my analysis I just gave goseq a vector of > all > the genes assayed and the DE status, so there wasn't any count > information. I guess in this context, I'll just assume that bias.data is > just the length of the gene. > > [snipped] > >>> And lastly I'm working with a list of differentially expressed features >>> (CAGE tags), which can be annotated to genes based on genome mapping. >>> However a small subset of these features cannot be annotated and I have >>> discarded them from the analysis since they cannot be associated to GO >>> terms. Is this potentially disastrous? >> >> Depends on how you define disaster? > > Sorry for the lack of clarity; I should avoid using vague words. I guess > the worst case scenario is that I get results that are flawed because of > the analysis. > >> If I were you, I'd try and assign tags that are intergenic but only >> slightly 5' upstream from annotated genes to the downstream gene. I'm >> leaving the definition of "slightly" undefined, though :-) > > That's almost how I annotated the CAGE tags. So I used the starting > position of all Ensembl genes, and if a tag was 400 bp upstream or > downstream of this position, I annotated this tag with that gene. > >> Also, a disaster you might try to avert is whether or not using goseq >> is appropriate for your analysis to begin with. > > I was going to use GOstats for this but thought "ah a newish package, > let's try it out!" From the vignette it mentions that using their methods > when there is no length bias does no harm. They also have the > hypergeometric method implemented, which doesn't take into account the > length bias, so I also used that. > > Actually I got different results when I did the analysis accounting for > length bias and not accounting for length bias; I got no enriched GO > terms > after pvalue adjustment when I did the hypergeometric test compared to 20 > enriched GO terms using the Wallenius approximation method. > >> If you are using goseq to correct for a length bias in detecting >> differential expression, you might explore whether or not CAGE data is >> subject to this bias at all. I think the "common" understanding is >> that tag-based methods generally don't suffer from this, see: >> >> Protocol Dependence of Sequencing-Based Gene Expression Measurements >> http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone. 0019287 >> >> And in the discussion of: >> >> Transcript length bias in RNA-seq data confounds systems biology >> http://www.biology-direct.com/content/4/1/14 > > Thanks for the papers (which I will read in more detail later) and the > comments; it was very useful. Much appreciated. > > Cheers, > > -- Dave
ADD REPLY

Login before adding your answer.

Traffic: 812 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6