Question

GO enrichment analysis with goana after performing DE analysis with glmTreat

0

Entering edit mode

Raito92 ▴ 60

@raito92-20399

Last seen 2.7 years ago

Italy

Good morning to everyone, I'm trying to use the function goana to run a GO enrichment analysis after performing the workflow RnaSeqGeneEdgeRQL. I obtained a list of DE genes with the function glmQLFTest (on the left), and shortly after I decided to use glmTreat as well (on the right), to obtain a more limited dataset. I picked a threshold of logFC > 1.5.

Here is the result

enter image description here

I performed a further GO enrichment analysis on the DE genes.

I used the result of the function glmTreat at first, but what I obtained is an error:

data frame with 0 columns and 0 rows No DE genes

(is the dataset too small to perform a statistically significant GO enrichment analysis?). This just works fine while I try to compare other conditions from which a higher number of genes is obtained.

I also tried to make the function glmTreat work like glmQLFTest, like suggested in the pipeline. I set

tr <- glmTreat(fit, contrast=B.LvsP, lfc=0)

And I obtain the same number of DE genes I did for glmQLFTest. Everything works fine, till I try to use goana again on this specific object, where I obtain a list of GO terms showing p values of 1 in the columns. I guess the test for some reason doesn't work...

enter image description here

How can I solve this problem and perform a GO enrichment analysis? Did I do anything wrong with the two functions? Thanks in advance!

go DE edger rnaseq glmTreat • 2.7k views

ADD COMMENT • link updated 4.9 years ago by Gordon Smyth 52k • written 4.9 years ago by Raito92 ▴ 60

2

Entering edit mode

Posts have been cross-posted on Biostars ( https://www.biostars.org/p/432527/ ; https://www.biostars.org/p/432528/ )

ADD REPLY • link 4.9 years ago Kevin Blighe ★ 4.0k

Gordon Smyth · Answer 1 · 2020-04-15

2

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 8 minutes ago

WEHI, Melbourne, Australia

Dear Raito92,

I have tried hard to answer your questions over the past year, but I have to say that I don't think you're doing an entirely sensible analysis and, perhaps more importantly, I don't think you are asking questions in a way that helps us to help you.

You have posted several questions about this same analysis over the past 11 hours, at least four to this site and at least two to Biostars. Please don't cross-post your questions to multiple web sites. It is bad manners and annoys the people trying to help you. It would be better to ask one question in a better way.
Your analysis doesn't have a lot to do with the RnaSeqGeneEdgeRQL workflow as far as I can see apart from the fact that you've used the glmQLFit function. You're not following the workflow in any detail.
Your question omits crucial information. It turns out from your other posts that you are actually working with a non-model organism (probably olive) and that you are using a modified version of goana() instead of goana() itself. It also turns out that you are analysing transcripts instead of genes even though the edgeR analysis pipeline you are using is not designed for that. The workflow package RnaSeqGeneEdgeRQL has "Gene" in the name to alert you to the fact that it is designed for gene-level analyses only.
Your question doesn't include any code. You give snippets of output but don't show any of the code that gave rise to that output. The one line of code in your question is copied from the RnaSeqGeneEdgeRQL workflow and could not have been used for your data. How can we help you if we don't know what you are doing?
The logFC threshold you have chosen for glmTreat is much too high and does not correspond to the advice in the RnaSeqGeneEdgeRQL workflow. logFC=1.5 corresponds to a fold-change cuttoff of 2^1.5 = 2.8. Naturally, using such a high fold-change cutoff leaves you with far too few DE genes to be able to do any sensible GO enrichment analysis. Why are you using glmTreat at all here?
I don't believe it is necessary to make your own version goana but, if you do, then I can't help you with it. If your own function is isn't working then that becomes your own responsibility.
I have advised you before that kegga is the easiest way to do a GO analysis using your own database. You can input your own database to kegga without having to making any changes to the edgeR or limma functions. It will give the same results as goana would with the same annotation.

ADD COMMENT • link 4.9 years ago Gordon Smyth 52k

0

Entering edit mode

Hello, thank you for your kind answer!

I have tried hard to answer your questions over the past year, but I have to say that I don't think you're doing a very sensible analysis and I don't think you are asking questions in a way that helps us to help you.

I'm sorry, like I explained in the other topic I'm not so skilled in bioinformatics sadly so I may unwillingly adapt methods I see online and make assumptions which are not the best in my specific case... I'll try to explain my needs better in the future, even though your answers have been pretty claryfing till now and made me rethink some things.

You have posted several questions about this same analysis over the past 11 hours, at least four to this site and at least two to Biostars. Please don't cross-post your questions to multiple web sites. It is bad manners and annoys the people trying to help you. It would be better to ask one question in a better way.

I apologise for this... I didn't know, I thought that, instead, putting the same question in a more specific R-oriented place and a more 'general forum' like Biostars could give me different opinions on the same problem. I'm sorry, this won't re-happen.

Your analysis doesn't have a lot to do with the RnaSeqGeneEdgeRQL workflow as far as I can see apart from the fact that you've used the glmQLFit function. You're not following workflow in any detail

The RnaSeqGeneEdgeRQL workflow is actually where I have found the function you mentioned. I tried to follow anything step-by-step to the best of my skills... More specifically, apparently, the function glmQLFit is used when estimating QL dispersions. After using this function to fit data, the pipeline goes straightly to DE analysis. The function used for DE analysis without filtering is called glmQLFTest if no threshold is applied. An alternative use of glmTreat to apply a threshold and adjust analysis is suggested shortly after.

Your question omits crucial information. It turns out from your other posts that your are actually working with a non-model organism (probably olive) and that you are using your own hacked version of goana() instead of using goana() itself. It also turns out that you are analysing transcripts instead of genes even though the edgeR analysis pipeline you are using is not designed for that. The workflow package RnaSeqGeneEdgeRQL has "Gene" in the name to alert you to the fact that it is designed for gene-level analyses only.

Yes, that's true, I'm working on a domestic variety of olive tree and I tried to use a modified version of goana since I couldn't use my own database. But as you answered in another thread, is there another simple way I'm missing to apply it? As I said assuming the pipeline would do well (the research project was initially intended to be run on genes, not transcripts) was probably an over-semplification by me, and you suggested in another topic to switch to genes instead. Is it possible to adapt these pipelines easily to transcripts (for instance by using Salmon and just going ahead normally) or do I have to rethinking my approach completely? Sorry if this question may sound naive to you.

Your question doesn't include any code. You give snippets of output but don't show any of the code that gave rise to that output. The one line of code in your question is copied from the RnaSeqGeneEdgeRQL workflow and could not have been used for your data. How can we help you if we don't know what you are doing?

That's true, I copied that part because it's what I applied and seemed crucial, and went ahead exactly as suggested but I'll try to be more specific in the future, sorry! I can't copy all the code I used but I'll try to focus on the critical points. (Probably in a second answer since this one is already quite long)

ADD REPLY • link 4.9 years ago Raito92 ▴ 60

0

Entering edit mode

The logFC threshold you have chosen for glmTreat is much too high and does not correspond to the advice in the RnaSeqGeneEdgeRQL workflow. logFC=1.5 corresponds to a fold-change cuttoff of 2^1.5 = 2.8, which seems to me ridiculously high. Naturally, using such a high fold-change cutoff leaves you with far too few DE genes to be able to do any sensible GO enrichment analysis. Why are you using glmTreat at all here?

Maybe I misunderstood something, but apparently this is exactly the treshold suggested in the pipeline... look at this screenshoot.

Here we test whether the differential expression fold changes are significantly greater than 1.5, that is, whether the logFCs are significantly greater than log2(1.5):

https://i.ibb.co/MMdk7NY/Approach.png

I applied it because I got way too many transcripts if I didn't apply a threshold like I reported in my first table.

I don't believe it is sensible to hack goana in this way and you don't provide any evidence that your own hacked function is correct. Your question is basically saying that your own function, which you don't show us, isn't working. How can we help you with your own function?

That's right, I can expand on this of course, but since you mentioned there is an alternative method to acquire an external database which isn't online, maybe I can use goana without the need of using a possibly wrong hacked version...

I have advised you before that kegga is the easiest way to do a GO analysis using your own database. You can input your own database to kegga without having to making any changes to the edgeR or limma functions. It will give the same results as goana would with the same annotation.

And yes, like I have already said in another topic and to the point 6... Could you expand on this and more specifically on how to use external GO/KEGG annotations which aren't found online? Thanks in advance.

Again, I'm sorry if I sound annoying or don't know how to use forums the best way, I didn't want to sound annoying. Since you're the developer of the workflow your help is greatly appreciated, I didn't want to sound assuming or something and I apologise for this.

ADD REPLY • link updated 4.9 years ago by Gordon Smyth 52k • written 4.9 years ago by Raito92 ▴ 60

2

Entering edit mode

you suggested in another topic to switch to genes instead.

Indeed I did and I still do.

Is it possible to adapt these pipelines easily to transcripts (for instance by using Salmon and just going ahead normally)

This is an area of current research. I am not willing to give you advice on it. I have already explained why a transcript analysis would be unreliable for your data.

or do I have to rethinking my approach completely?

Basically, yes. Analysing transcripts is very, very much noiser and more difficult than analysing genes. The Salmon approach moreover assumes that the transcriptome is comprehensively annotated, which for your genome it clearly is not.

apparently this is exactly the treshold suggested in the pipeline

Read again. We used lfc=log2(1.5). You used lfc=1.5. Not the same thing.

Also, the edgeR workflow did not advise you to apply a glmTreat threshold for every dataset. We used it in the workflow only because the non-thresholded analysis give rise to such a large number of DE genes (over 5000). For other datasets with less DE we would use a smaller threshold or not use a threshold at all.

Have you read the glmTreat() help page? It says:

"In practice, modest values for lfc such as log2(1.1), log2(1.2) or log2(1.5) are usually the most useful."

Could you expand on this and more specifically on how to use external GO/KEGG annotations which aren't found online?

No I can't. Not unless you read the kegga help page and make some sort of attempt yourself.

The purpose of this forum is to help you with Bioconductor R code. Please have a read of the posting guide to see what is expected in a question. You still haven't shown any edgeR code in any of your questions, so there is nothing I can help you with.

ADD REPLY • link 4.9 years ago Gordon Smyth 52k

0

Entering edit mode

Hello, sorry for the delayed answer! Meanwhile, I've figured out some of my doubts! goana finally worked, the problem was somehow silly: I wasn't expecting there could be some very few enriched GO terms in thousands of pvalue = 1... The function had always worked! So it's more a problem of the lfc filter threshold I chose as you pointed out later. My fault.

Indeed I did and I still do.

Definitely, I've rethought my approach and decided to use genes instead as you suggested!

Read again. We used lfc=log2(1.5). You used lfc=1.5. Not the same thing.

I think this mislead me, I realized it shortly after!

No I can't. Not unless you read the kegga help page and make some sort of attempt yourself.

The purpose of this forum is to help you with Bioconductor R code. Please have a read of the posting guide to see what is expected in a question. You still haven't shown any edgeR code in any of your questions, so there is nothing I can help you with

Sure, thanks! Now I still have my heatmap question opened so I'll try to be clearer there, and I haven't really tried kegga yet.

ADD REPLY • link 4.9 years ago Raito92 ▴ 60