Question: Log2(RPKM) counts in Bioconductor
1
gravatar for kushshah
6 months ago by
kushshah10
University of North Carolina, Chapel Hill, USA
kushshah10 wrote:

Hi all. I have an expression matrix that consists of log2-transformed RPKM counts. When I read this matrix in to a SingleCellExperiment object, how does Bioconductor know that I am working with RPKM counts? Is there a way to specify that my counts are log2(RPKM) counts, rather than raw reads?

Further, does the calcAverage() function in scater treat raw reads differently than log-transformed RPKM values? This is my main concern. At no point in my pipeline have I specified that these are RPKM counts, so I am wondering which downstream analyses that would impact.

Thank you for your help. I am new to Bioconductor in R so my apologies if this has a fairly obvious answer.

ADD COMMENTlink modified 6 months ago by Aaron Lun25k • written 6 months ago by kushshah10
2

Short answer...throw away your RPKM values. No differential expression software wants them, and converting back to the desired raw gene counts is extremely hard to reverse engineer 100% accurately.

ADD REPLYlink written 6 months ago by swbarnes2330
Answer: Log2(RPKM) counts in Bioconductor
3
gravatar for Aaron Lun
6 months ago by
Aaron Lun25k
Cambridge, United Kingdom
Aaron Lun25k wrote:

how does Bioconductor know that I am working with RPKM counts?

If you read the sales pitch, Bioconductor is a software project. It doesn't know anything about your specific analysis. I presume you mean to say "how does a specific Bioconductor package for the analysis of single-cell data know that I am working with RPKM counts?"

For many of the packages under my control (e.g., scran, some of scater, DropletUtils), the expected type of expression values will be stated in the documentation. If you give it a RPKM matrix, and the function is expecting a raw count matrix or a log-expression matrix, then you're operating outside the specifications of the function. At this point, the function has no obligation to do anything sensible and Bad Things may happen.

In fact, if you're using the SingleCellExperiment container, these functions will automatically look for an assay with the expected name, e.g., "counts", "logcounts". If you trick it into accepting a RPKM matrix - either by storing a RPKM matrix as the "counts" assay, or by explicitly redirecting the function to use an "rpkm" assay - then anything bad is on you.

Also, RPKMs are not counts.

Is there a way to specify that my counts are log2(RPKM) counts, rather than raw reads?

Putting aside your incorrect terminology about "counts" and "raw reads", if the function of interest doesn't work on log2(RPKM) values, then there's no way to specify anything. Even if we were to add an option to allow you to specify something, the only sensible course of action for the function would be to simply throw an error. Not very helpful.

Further, does the calcAverage() function in scater treat raw reads differently than log-transformed RPKM values?

Read ?calcAverage and you'll see that it expects a count matrix. Whatever you give calcAverage, it will assume that they are counts. RPKMs are not counts, so whatever calcAverage computes may or may not be gibberish.

Compare to the wording in ?nexprs, which only requires a general expression matrix. Of course, I would advise against playing too many word games; the only real way to know if a function is appropriate for a data type is to understand what it does.

I am wondering which downstream analyses that would impact.

With RPKMs: mean-variance trend modelling will fail. Any proper differential expression analysis will fail. The log-transformation will do funny things - the contribution of long genes to heterogeneity is implicitly downweighted. Also see my answer here.

ADD COMMENTlink written 6 months ago by Aaron Lun25k

With RPKMs: mean-variance trend modelling will fail. Any proper differential expression analysis will fail. The log-transformation will do funny things - the contribution of long genes to heterogeneity is implicitly downweighted. Also see my answer here.

I agree - I have been posting 'warning' messages against the usage of R/FPKM units in differential expression analyses for months on Biostars, like here: https://www.biostars.org/p/262373/#332353

ADD REPLYlink written 6 months ago by Kevin Blighe200
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 238 users visited in the last hour