I am working with the recount package, which downloads preprocessed RNA-seq datasets. In the vignette, I am struggling to understand this:

"Downloaded count data are first scaled to take into account differing coverage between samples.

## Scale counts by taking into account the total coverage per sample

rse1 <- scale_counts(rse_gene1)

"

-----------------------------------------------------------------------

The scale_counts function is as follows:

scale_counts <- function(rse, by = 'auc', targetSize = 4e7, L = 100,
    factor_only = FALSE, round = TRUE) {
    ...
    ## Scale counts
    if(by == 'auc') {
        # L cancels out:
        # have to multiply by L to get the desired library size,
        # but then divide by L to take into account the read length since the
        # raw counts are the sum of base-level coverage.
        scaleFactor <- targetSize / SummarizedExperiment::colData(rse)$auc
    ...
    scaleMat <- matrix(rep(scaleFactor, each = nrow(counts)),
        ncol = ncol(counts))
    scaledCounts <- counts * scaleMat
    if(round) scaledCounts <- round(scaledCounts, 0)
    SummarizedExperiment::assay(rse, 1) <- scaledCounts
    return(rse)
}

--------------------------------------------------------------------

At first I thought that auc was the library depth (the sum of all read counts in each sample), but I get a different number. What does scaling by auc mean? Is it an alternative to normalization?

Hi,

I don't know why I didn't get an email about this question. In any case, please check the recount workflow (http://bioconductor.org/packages/release/workflows/html/recountWorkflow.html), published in F1000Research (https://f1000research.com/articles/6-1558/v1). That workflow describes in more detail what the numbers we provide in the RangedSummarizedExperiment objects actually are. The scale_counts() function can be used to go from the numbers we provide to actual read counts.
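As a rough illustration of the by = 'auc' branch quoted above, here is a minimal base-R sketch. The matrix, the auc values and the small targetSize are all made-up toy numbers (the function's default is 4e7), and this is not the package code itself. The raw counts are sums of base-level coverage, and auc is the total coverage per sample, so it is not expected to equal the column sums of the gene counts.

```r
# Toy data: 3 genes x 2 samples of summed base-level coverage (hypothetical values)
counts <- matrix(c(1000, 3000, 6000,
                   4000, 4000, 12000),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2")))

# Hypothetical per-sample AUC: total coverage of the sample, which here is
# larger than the column sums of the gene counts
auc <- c(s1 = 20000, s2 = 40000)

# Hypothetical small target library size, chosen only to keep the numbers readable
targetSize <- 100

# Same arithmetic as the 'auc' branch: one scale factor per sample,
# multiplied down each column, then rounded
scaleFactor <- targetSize / auc
scaled <- round(sweep(counts, 2, scaleFactor, `*`), 0)

colSums(counts)  # differs from auc
scaled
```

In this toy case both samples end up on the same target library size, even though s2 had twice the total coverage of s1.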

