Question: Finding DE genes from RNA-seq data
0
6 months ago by
Harrisonesmith0 wrote:

I am having trouble determining what values I need to use for determining “significant differentially expressed genes”

I have been asked to attempt this using two methods.

Method 1 involves the use of the pipeline used in our lab which involves mapping to HISAT2 -> normalizing with cuffnorm -> getting FPKM values -> getting the LOG2FC of those and performing a T.Test for P values -> adjusting P values with the BH method -> isolating only those with P-value < .05 and log2fc >2

I feel like selecting change greater than 2 is arbitrary because it doesn’t take in to account that even a subtle change in gene expression can have a great change biological function. Unless the LOG2FC just means its’s been scaled down to where a noticeable difference must be greater than |2| in that case than fine I would make sure they are >|2|

The other method that I was told to use was EdgeR. I input raw read counts into EdgeR -> calcnorm factors -> estimate dispersion -> filter out low expressed genes -> glmFIT -> glmQLFTEST -> and use top tags to see which are DE which I noticed are sorted by FDR value. At the same time I noticed that EdgeR calculates LogFC rather than Log2FC. Well which one am I supposed to use? Wouldn’t the resulting P-values and FDR values be changed because they were calculated using log2FC rather than logFC?

I don’t understand why my lab would use LOG2FC from FPKM to calculate P-values then adjust them and why is that better or worse than EdgeR which uses logFC to calculate p values and FDR.

My question is I guess which of the two methods are better? Should I calculate the LOG2FC from edgeRs LogFC and perform a T.Test and adjust them myself? Will that give me the Q-value?, Adjusted P value, or FDR value or are all of those 3 the same thing?

There is no clear answer anywhere I have read so much documentation my brain hurts and I feel like I’m asking my coworkers the same questions over and over again only to realize they don’t really understand it themselves. Please do not refer me to any links because I have gone through many only to leave me with the same questions. If you don’t want to provide actual input then please don’t bother commenting unless the link truly explains it perfectly but I would much rather someone attempt to explain it in layman's terms. Thank you in advance.

modified 5 months ago by Steve Lianoglou12k • written 6 months ago by Harrisonesmith0
Answer: Finding DE genes from RNA-seq data
5
6 months ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson7.3k wrote:

You seem to have some of misconceptions about how differential expression testing works. First of all, changing the base used in the logarithm does not change the p-values. If you used base e or base 10, the logFC, logFPKM, and logCPM values would be different, but the p-values would still be identical, because you're changing the scale, not the data. Second, you say that you have read the edgeR documentation, so you should already know that edgeR's logFC values, as well as logCPM values, are all using base 2 logarithms, so edgeR's logFC values are directly comparable to log2 fold change values from another method. This is documented in, for example, the help page for edgeR::glmFit.

In any case, the t-test pipeline you've described is inadequate in a number of ways. First, I don't believe that the usage of cuffnorm you describe will perform any normalization for composition bias (i.e. what edgeR::calcNormFactors does). Second, FPKM values are considered suboptimal at best for use in differential expression testing. Third, an ordinary t-test is not suitable for logFPKM or logCPM values, for several reasons, including non-normality and heteroskedasticity of the data, which violate the assumptions of the test, as well as a lack of statistical power for small sample sizes. Fourth, a log2 fold change threshold of 2 is absurdly high. You are requiring a gene to be 4-fold up- or down-regulated in order to call it significant. Almost no genes will have a true fold change this large, but many false positive genes with very low counts will have very large fold changes if you are not careful about how they are calculated (e.g. division by zero yielding infinite fold changes). So if anything, such a large fold change threshold will discard all real differential expression while retaining many false positives. Finally, if you use separate thresholds for both adjusted p-value and logFC, your adjusted p-values are no longer valid, since they were not calculated with the logFC threshold in mind. There may be other issues with this pipeline, but that is what I can tell from the information you've given.

The edgeR pipeline you describe is much more reasonable, and addresses most of the above concerns (as it should, since addressing those issues is the reason edgeR exists). There are a few things you should fix, though. First, you should filter out low-expression genes before you do anything else, such as calculating normalization factors or estimating dispersions. This will make later steps more reliable. Second, I believe the preferred way to do abundance filtering is now edgeR::filterByExpr. Third, if you really want to implement a minimum required fold change, the correct thing to do would be to use edgeR::glmTreat. This will give you BH-adjsuted p-values that correspond to an interval null hypothesis using your chosen log fold change threshold. However, if, as you say, you care about small but statistically significant changes, you should stick with glmQLFTest as you have already described.

As for your question about the difference between adjusted p-value, FDR, and Q-value, they are all somewhat related terms. FDR stands for false discovery rate, which is pretty much what it sounds like. Calling something an "FDR" tells you that the value is meant to be an estimate or bound on the false discovery rate, but does not tell you the method that was used to compute it, which is why specifying the method you used is critical. Benjamini-Hochberg and q-values are two FDR-estimating methods. (BH actually estimates an upper bound on the FDR, not the FDR itself.) The term "adjusted p-value" is even more general. It covers just about any multiple testing correction, including, for example, methods for estimating the family-wise error rate (FWER) instead of the FDR.

Answer: Finding DE genes from RNA-seq data
2
5 months ago by
Denali
Steve Lianoglou12k wrote:

To supplement/complement Ryan's rather thorough answer, I think it'd still be useful if you read through this edgeR/quasi-likelihood workflow, in case you haven't seen it yet:

From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline

The authors do a good job of walking beginners through the analysis of an RNA-seq dataset step by step, and discuss some of the statistical issues you need to think about along the way.

If what you say is true, that none of your co-workers seem to understand the answers to your questions, I'd forget about trying to analyze your data using the first pipeline (except for using HISAT2 for getting gene-level counts ... if you already have this alignment pipeline setup using that, then just keep it) and follow along with the tutorial to analyze these data downstream from the count-matrix construction.

In the meantime, it can also be helpful to look to other sources of information (books, people, etc) to learn a bit more about (generalized) linear models in general -- which are fundamentally what are used to analyze expression data with edgeR, DESeq2, and limma/voom.

Once you get a better handle on some of the basics from other sources, you can then come back and reread the edgeR (and limma) user guides and hopefully better appreciate their attempts of trying to explain in semi-layman terms many of the problems analysts come across in analyzing high throughput expression data and how they can be addressed using these packages.

From the tenor of your post, it's clear that you're quite frustrated with your efforts so far, which is natural. Given the ubiquitous use of expression data in scientific research and publications, one might be fooled into thinking that analyzing it correctly is easy, but there is a lot of things you need to learn about before you can simply execute the analysis code you see elsewhere successfully, so be patient with yourself.

You should also be patient with the people trying to help you. Demanding that one should only provide links that "explain things perfectly" won't get you too far. Or, it might get you exactly what you're asking for, which will be a link to some page that is thick with mathematical formulae which provide the "perfect" explanation of what's doing what, but won't be helpful to you at all (in this stage of your training) since whenever someone tries to explain math with words, things always get lost in translation.

That having been said, I do think the link I've provided to the f1000 workflow explains things at about the right level to get you off the ground, so ... pretty perfect if you ask me :-)

Good luck!

2

F1000Research edgeR QL article is also available as a Bioconductor workflow:

This version is updated to the current release of Bioconductor and has a few other minor updates or simplications compared to the 2016 journal article version.

Steve Lianogloucan can you  please guide me about  doing differential gene expression analysis using machine learning algorithms.I will be  using Particle swarm intelligenece algorithm for feature selection and after that i will use SVM for classification of cancer related genes from RNA seq data.

Can you please tell me what input should i give to PSO algorithm P values,Z score??or just normalized data?i normalized data using edge R

Hi Maryam,

In short: no, I really can't. It sounds like you are at the early stages of a long journey on a project that needs a lot of care (wether it's worth while, I'll leave that up to you to decide :-)

If you are asking me how to transform RNA-seq data so that it is "best used" as features for some machine learning algorithm, then the most general / naive answer I can give you is to either:

1. Get your counts into a DGEList, run calcNormFactors, then get then use the output cpm(dgelist, prior.count = 5, log = TRUE); or
2. Get your counts into a DESeqDataSet and use the output from a call to is varianceStabilizingTransformation

These approaches will transform your data into log2 space and try to reduce the higher variance observed at lower levels of expression. You want to do this because most algorithms assume that your data isn't heteroscedastic.

You will also likely want to remove lowly expressed / low variance features (genes) prior to feeding these data into some ML algorithm.

... and ... that's all I got for you.

Good luck!