Visualization of negative log CPM values
1
1
Entering edit mode
Sabiha ▴ 20
@7f93ecd8
Last seen 14 days ago
United States

Hi,

I understand that the presence of negative log-transformed count values is a common occurrence in RNA-seq data analysis and these values are not necessarily problematic for statistical analyses. I used TMM method > log transformed and used limma for identifying significant gene expression (see below). Then, I extracted few genes for (line plot) that is high expressed between the groups of interest, but I see negative log (CPM) values on y-axis (though the expression pattern on the line plot shows the difference). My collaborators were not convinced with this y-axis scale and asked me if there is a way to get the values on the positive scale.

Then, I increased prior.count (started increasing from +150 --> + 250), the minimum values increase to positive value scale. Meaning the log negative values now transformed to the positive values. However, I am not sure if adding larger prior.count value would affect downstream analysis like differential analysis in general. Does, choosing a large prior.count for visualization only seems like a good strategy because with prior.count = 1, the scale is negative.

Alternatively, I tried Deseq2 (VST) as it did not produce log negative counts. I also have TPM values and observed log of TPM also does not produce negative values. But I do not want to use these methods again for the analysis just to justify the visualization needs.

TMM:

y <- calcNormFactors(y, method = "TMM")
logCPM <- cpm(y, prior.count=1, log=TRUE)
Use limma to analyse multi-factor and varaibles

Thank you

Visualization Normalization ggplot2 edgeR • 2.9k views
ADD COMMENT
1
Entering edit mode
@gordon-smyth
Last seen 3 hours ago
WEHI, Melbourne, Australia

There is nothing strange or problematic about negative logCPM values. It just means that the CPM value is less than 1, even after adding a small offset. You should just leave them as they are rather than using extreme parameter settings to make the logCPMs artificially positive. Using very large prior.count values makes it impossible to tell which of the original counts were very small and could be viewed as misrepresenting the data. We suggest a default value of prior.count=2 as a good compromise.

You might ask your collaborators to review their high school math on logarithms :-)

If your collaborators really don't understand log-values, you could consider plotting CPM values instead, which are easy to understand.

ADD COMMENT
0
Entering edit mode

Gordon Smyth thanks. Yes, indeed, plotting CPM values would be an alternative. As discussed, I was discussing about few biologically interesting genes (see below), the question about negative logCPM scale was raised and it was at the moment interpreted that, though the gene is high expressed statistically in particular condition, negative scale might perhaps indicate gene is too low to be really expressed.

Another related question, in addition to the raw count values which I use in edgeR as as input, my RNA-Seq pipeline also generates TPM values. Does it make sense to use to log TPM values only for visualization? Basically, all the analysis would remain same using edgeR and limma for differential expression except for visualization use log TPM values instead of logCPM values.

ADD REPLY
1
Entering edit mode

negative scale might perhaps indicate gene is too low to be really expressed.

Yes, negative logCPMs do mean that the gene is expressed at a low level. On your plots, a logCPM of -7 probably means the gene is not expressed at all.

I don't see how manipulating the data to avoid negative values would solve the problem. It doesn't make the gene any more highly expressed but might give the false impression that it is. My aim is always to communicate correct information.

Does it make sense to use to log TPM values only for visualization?

That's entirely up to you, but I don't generally do that. It depends on your pipeline and the purpose of your study. I prefer to plot quantities that accurately reflect the DE analysis. Given that the edgeR analysis is based on counts, I prefer to plot quantities that are directly related to those counts so I know the DE analysis and plots are in agreement. The logCPM values are counts normalized for library size and then converted to log-scale with minimal correction to avoid infinite values.

You could perhaps plot both logCPM and logTPM to check they agree, but I have less faith in TPM values than many other people on the internet because TPMs are typically estimated with higher technical uncertainty than genewise counts and because TPM don't correctly reflect the TMM normalized library sizes.

ADD REPLY
0
Entering edit mode

Gordon Smyth noted. Related to the graphs above, the table below is the output from limma, and compound B has 2 statistically significant genes that are up-regulated but the logCPM are on negative scale which makes difficult to explain that the genes are indeed up-regulated.

Furthermore, I tried plotting with log2(TPM), now the y-axis shows positive scale of values compared to log2(CPM) which shows negative scale, however the expression trend of these genes in the graphs looks similar.

Gene       logFC.A    logFC.B   logFC.C P.value.A   P.value.B   P.value.C
1 Gene_1 -2.147702e-01  2.2737828 -1.017877 0.6917291 0.000771861 0.075842861
2 Gene_2  7.990000e-15  2.4226954  2.676110 1.0000000 0.015743318 0.008940449
3 Gene_3 -7.126446e-01  1.2080397  1.610900 0.7264708 0.554852894 0.433318355
4 Gene_4  4.805340e-01 -0.2568965  1.079119 0.5912832 0.773214146 0.237587134
ADD REPLY
1
Entering edit mode

Sorry, but some of the things you say are not correct.

compound B has 2 statistically significant genes that are up-regulated

No, there are no significant genes in the table you show. Significance is judged by adjusted p-values and the table only shows raw p-values.

It appears that you are plotting genes that are not actually significant. If you plot these genes and claim that they are DE, then that would explain why the plots appear to be at odds with the DE results.

I tried plotting with log2(TPM), now the y-axis shows positive scale of values compared to log2(CPM) which shows negative scale

logTPM are just as likely to be negative as logCPM values if the two are computed in equivalent ways.

ADD REPLY
0
Entering edit mode

Gordon Smyth Apologies, I used a wrong term, I would have mentioned raw p-value instead. Unfortunately, only 3 gene passed FDR (BH) in the analysis. None of the genes I plotted for instance did not pass the FDR < 0.05 , but are either directly or in-directly associated with research question we are exploring. Ideally, as a soft raw p-value cut-off In sometimes consider p<0.05, for FDR, in the least case, can I consider maybe FDR<0.05 or FDR<0.1.

Regarding TPM, I plotted Gene_4 for instance with two subgroups this time, below is the plot.

ADD REPLY
0
Entering edit mode

Sorry, I was wrong to say that log2(TPM) is always negative, but log2(TPM) is not more likely that log2(CPM) to be positive if both are computed in equivalent ways. I don't know exactly which groups your plots relate to, but it seems impossible for TPM to disagree with CPM to the extent shown in your plots for Gene 4. If CPM = 0, as it seems to be for some groups in a previous plot for Gene 4, then you must have TPM=0 as well.

Anyway, I can only give you help with edgeR and limma usage, and the software seems to be working correctly.

ADD REPLY
0
Entering edit mode

You can visualize whatever is appropriate for your story. If negative logCPMs are a problem (unintuitive I agree) then for visualization simply get log2(cpm+1) and problem solved.

ADD REPLY
0
Entering edit mode

ATpoint OK, are you referring to the log2(cpm+1) or log2(tpm+1)

ADD REPLY
1
Entering edit mode

Applies to both.

ADD REPLY

Login before adding your answer.

Traffic: 949 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6