Question

Log2(RPKM) counts in Bioconductor

1

Entering edit mode

kushshah ▴ 10

@kushshah-20393

Last seen 20 months ago

University of North Carolina, Chapel Hi…

Hi all. I have an expression matrix that consists of log2-transformed RPKM counts. When I read this matrix in to a SingleCellExperiment object, how does Bioconductor know that I am working with RPKM counts? Is there a way to specify that my counts are log2(RPKM) counts, rather than raw reads?

Further, does the calcAverage() function in scater treat raw reads differently than log-transformed RPKM values? This is my main concern. At no point in my pipeline have I specified that these are RPKM counts, so I am wondering which downstream analyses that would impact.

Thank you for your help. I am new to Bioconductor in R so my apologies if this has a fairly obvious answer.

rpkm scater calcaverage R expression • 3.2k views

ADD COMMENT • link updated 5.0 years ago by Aaron Lun ★ 28k • written 5.0 years ago by kushshah ▴ 10

2

Entering edit mode

Short answer...throw away your RPKM values. No differential expression software wants them, and converting back to the desired raw gene counts is extremely hard to reverse engineer 100% accurately.

ADD REPLY • link 5.0 years ago swbarnes2 ★ 1.3k

score 4 · Accepted Answer · 2019-04-05

how does Bioconductor know that I am working with RPKM counts?

If you read the sales pitch, Bioconductor is a software project. It doesn't know anything about your specific analysis. I presume you mean to say "how does a specific Bioconductor package for the analysis of single-cell data know that I am working with RPKM counts?"

For many of the packages under my control (e.g., scran, some of scater, DropletUtils), the expected type of expression values will be stated in the documentation. If you give it a RPKM matrix, and the function is expecting a raw count matrix or a log-expression matrix, then you're operating outside the specifications of the function. At this point, the function has no obligation to do anything sensible and Bad Things may happen.

In fact, if you're using the SingleCellExperiment container, these functions will automatically look for an assay with the expected name, e.g., "counts", "logcounts". If you trick it into accepting a RPKM matrix - either by storing a RPKM matrix as the "counts" assay, or by explicitly redirecting the function to use an "rpkm" assay - then anything bad is on you.

Also, RPKMs are not counts.

Is there a way to specify that my counts are log2(RPKM) counts, rather than raw reads?

Putting aside your incorrect terminology about "counts" and "raw reads", if the function of interest doesn't work on log2(RPKM) values, then there's no way to specify anything. Even if we were to add an option to allow you to specify something, the only sensible course of action for the function would be to simply throw an error. Not very helpful.

Further, does the calcAverage() function in scater treat raw reads differently than log-transformed RPKM values?

Read ?calcAverage and you'll see that it expects a count matrix. Whatever you give calcAverage, it will assume that they are counts. RPKMs are not counts, so whatever calcAverage computes may or may not be gibberish.

Compare to the wording in ?nexprs, which only requires a general expression matrix. Of course, I would advise against playing too many word games; the only real way to know if a function is appropriate for a data type is to understand what it does.

I am wondering which downstream analyses that would impact.

With RPKMs: mean-variance trend modelling will fail. Any proper differential expression analysis will fail. The log-transformation will do funny things - the contribution of long genes to heterogeneity is implicitly downweighted. Also see my answer here.