I would like to perform differential expression analysis of lncRNAs using DESeq2 and edgeR.
John ▴ 10
@0ccfb76d
Hong Kong

Hi all,

I read an article titled "Poor Performance of Differential Gene Expression Analysis Tools for Long Non-coding RNA Sequencing Data" (https://pubmed.ncbi.nlm.nih.gov/30041657/). Its results show that many differential expression pipelines do not control the FDR well (Figure 4), and that many of the pipelines that do control the FDR reasonably well have very low TPR. In an earlier search I also came across a response from one of the authors of edgeR (https://www.biostars.org/p/9493810/). According to that response, edgeR is perfectly capable of handling differential expression analysis of lncRNAs. I find it hard to tell which view is more accurate.

Furthermore, filterByExpr tends to filter out a larger share of lncRNAs than of mRNAs, although the low expression of lncRNAs may simply reflect their intrinsic characteristics. Should I filter the mRNA and lncRNA data separately or together?

lncRNA_data <- all_data[lncRNA_list,]

mRNA_data <- all_data[mRNA_list,]

lncRNA_filter <- filterByExpr(lncRNA_data)

mRNA_filter <- filterByExpr(mRNA_data)

or

all_filter <- filterByExpr(all_data)

I'm a bit confused now. First, I'm not sure which software is more suitable for differential expression analysis of lncRNAs. Second, I'm not clear whether I should analyze mRNAs and lncRNAs separately, or analyze them together and split the results at the end. Third, I'm not sure whether the same significance thresholds should apply to mRNAs and lncRNAs, that is, |log2FC| > 1 and FDR < 0.05.

All opinions and experiences are greatly appreciated!

diffGeneAnalysis lncRNA • 3.3k views
@gordon-smyth
WEHI, Melbourne, Australia

The Genome Biology paper that you link to recommends limma as the best performing method. If you're worried about the results of that paper, why not follow their recommendation?

I am the author of limma as well as of edgeR. I wrote the answer you link to from Biostars. I am also the author of the filterByExpr() function. My lab has analyzed over a thousand RNA-seq experiments over the past 20 years. We always include lncRNAs in our analyses and have never observed any problem in doing so. I have never seen any evidence that lncRNAs are systematically more noisy than mRNAs at the same read levels or that they need special treatment.

The only problem with lncRNAs is that they often have low read counts, and it is obviously going to be harder to get significant DE for low count genes than for those with higher counts. That is an intrinsic data limitation rather than an issue of performance of the DE methods, and the same issue is shared with mRNAs that have low counts, of which there always are many. In my opinion, the latest versions of limma (limma-voomLmFit) and edgeR (edgeR v4 QL) are both very reliable for low count genes and are also very robust to filtering (see https://doi.org/10.1093/nar/gkaf018 or https://doi.org/10.1101/2025.04.07.647659 ). I recommend the use of robust empirical Bayes (robust=TRUE) in both cases.
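As a rough illustration of the two pipelines mentioned above with robust empirical Bayes switched on, here is a minimal sketch. The count matrix is simulated purely as a stand-in for real gene-level counts, and the object names (`counts`, `group`, `tab_ql`, `tab_v`) are hypothetical; it is not taken from the answer itself.

```r
library(edgeR)   # attaches limma as well

set.seed(1)
# Simulated stand-in for a real gene-level count matrix (assumption):
# 2000 genes x 6 samples, covering a wide range of expression levels.
ngenes  <- 2000
nsamp   <- 6
group   <- factor(rep(c("ctrl", "trt"), each = 3))
gene_mu <- 2^runif(ngenes, 0, 10)
counts  <- matrix(rnbinom(ngenes * nsamp, mu = rep(gene_mu, times = nsamp), size = 10),
                  nrow = ngenes,
                  dimnames = list(paste0("gene", seq_len(ngenes)),
                                  paste0("s", seq_len(nsamp))))

design <- model.matrix(~ group)

# Shared preprocessing: one DGEList holding all genes
y    <- DGEList(counts = counts, group = group)
keep <- filterByExpr(y, design)
y    <- y[keep, , keep.lib.sizes = FALSE]
y    <- calcNormFactors(y)

# edgeR quasi-likelihood pipeline with robust empirical Bayes
y      <- estimateDisp(y, design)
fit_ql <- glmQLFit(y, design, robust = TRUE)
res_ql <- glmQLFTest(fit_ql, coef = 2)
tab_ql <- topTags(res_ql, n = Inf)$table

# limma pipeline via voomLmFit, again with robust empirical Bayes
fit_v <- voomLmFit(y, design)
fit_v <- eBayes(fit_v, robust = TRUE)
tab_v <- topTable(fit_v, coef = 2, number = Inf)
```

Both result tables cover every gene that passed filtering, so lncRNAs and mRNAs are tested in the same fit and ranked by the same FDR.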

You should analyse mRNA and lncRNAs together, not separately. There is no need to apply any logFC cutoff.

If you have human data with lots of samples, then the default settings of filterByExpr() are admittedly overly conservative. You could apply very little filtering, and limma-voomLmFit and edgeR4-QL will continue to work well. That would allow you to keep all the data in the analysis.
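To make the filtering advice concrete, the following sketch filters mRNAs and lncRNAs together in one DGEList and compares the default filterByExpr() settings with a lighter setting. The data are simulated, the `gene_type` annotation is hypothetical, and the relaxed values `min.count = 1` and `min.total.count = 5` are illustrative assumptions rather than values recommended in the answer.

```r
library(edgeR)

set.seed(2)
# Hypothetical data: 1500 "mRNA" genes plus 500 lower-expressed "lncRNA" genes
n_mrna    <- 1500
n_lnc     <- 500
gene_type <- rep(c("mRNA", "lncRNA"), times = c(n_mrna, n_lnc))
gene_mu   <- c(2^runif(n_mrna, 2, 10), 2^runif(n_lnc, -2, 6))
nsamp     <- 8
group     <- factor(rep(c("A", "B"), each = 4))

all_data <- matrix(rnbinom(length(gene_mu) * nsamp,
                           mu = rep(gene_mu, times = nsamp), size = 10),
                   nrow = length(gene_mu),
                   dimnames = list(paste0(gene_type, "_", seq_along(gene_type)),
                                   paste0("s", seq_len(nsamp))))

# One object for all genes: library sizes reflect the whole experiment
y <- DGEList(counts = all_data, group = group)

# Default filtering versus a much lighter filter (illustrative thresholds)
keep_default <- filterByExpr(y)
keep_light   <- filterByExpr(y, min.count = 1, min.total.count = 5)

# How many genes of each type survive each filter
print(table(gene_type, kept = keep_default))
print(table(gene_type, kept = keep_light))

# Continue with the lightly filtered data; the type labels stay aligned
# with the rows, so results can be split by type after testing.
y_light      <- y[keep_light, , keep.lib.sizes = FALSE]
type_light   <- gene_type[keep_light]
```

After the DE test, a results table ordered like `y_light` can be split with `type_light == "lncRNA"`, which keeps the statistical analysis joint and defers the separation to the reporting step.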


Thanks a lot for your answer!

It addressed most of my questions. I just have one last quick question: I've noticed that the distribution of lncRNA expression seems to differ from that of mRNA, as shown in the figure. As you suggested, we will analyze them together. But we were wondering, does edgeR perform better than other methods when analyzing data with distinct distribution patterns?

[Figure: distributions of lncRNA and mRNA expression]


Having genes with a wide range of expression values in the same experiment causes no problems. In fact, it is almost an advantage because it helps to estimate the mean-variance relationship.

I am however a bit puzzled that you are plotting TPMs, which suggests to me that you might be analyzing transcripts (RNA isoforms) instead of genes. If this is transcript data, you should read https://doi.org/10.1093/nar/gkad1167 . If this is gene data, I wonder how you are computing TPMs for genes. In case you don't already know, log2(TPM) is not suitable input for any of the DE programs.
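One way to keep the two roles apart is to feed raw counts to the DE software and derive a log-scale expression measure from those same counts only for display. The sketch below uses edgeR's `cpm()` with `log = TRUE` for that purpose; the simulated counts and the `gene_type` labels are assumptions made for illustration.

```r
library(edgeR)

set.seed(3)
# Simulated counts standing in for real gene-level data (assumption)
gene_type <- rep(c("mRNA", "lncRNA"), times = c(1500, 500))
gene_mu   <- c(2^runif(1500, 2, 10), 2^runif(500, -2, 6))
nsamp     <- 6
counts    <- matrix(rnbinom(length(gene_mu) * nsamp,
                            mu = rep(gene_mu, times = nsamp), size = 10),
                    nrow = length(gene_mu))

# Raw counts go into the DGEList that the DE analysis will use
y <- DGEList(counts = counts)
y <- calcNormFactors(y)

# log2-CPM is computed from the counts for plotting and summaries only;
# it is not passed back into the DE functions
logcpm <- cpm(y, log = TRUE, prior.count = 2)

# Compare average expression of the two gene types on the log-CPM scale
avg_by_type <- split(rowMeans(logcpm), gene_type)
med_by_type <- sapply(avg_by_type, median)
print(med_by_type)
```

A call such as `boxplot(avg_by_type)` would then show the difference in distributions without involving TPM values at all.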


You're right, I used counts for the differential analysis, but chose TPM for plotting because it makes the distribution differences more clearly visible. I used the RSEM software for quantification, which provides counts, FPKM, and TPM values for both transcripts and genes.
