Question

newSCESet and SingleCellExperiment basic differences and TMM Normalization

hamza_karakurt ▴ 50

@hamza_karakurt-17704

Turkey

Hello everyone,

I am trying to replicate a single-cell RNA-seq data analysis. The main problem is versioning: the old results were generated with newSCESet and normalized with the normaliseExprs command (TMM method), but I am now on the newest versions of R, Bioconductor and the packages, so I used the SingleCellExperiment command instead.

However, I still do not understand what the differences are between the outputs of newSCESet and SingleCellExperiment. I have looked but could not find any information. As far as I can tell, pData and fData have mostly been converted to colData and rowData.

The two objects look the same, but in the end the results are different.

I also want to do TMM normalisation, but the normalizeExprs command is deprecated as well. Is there another way to do TMM normalization?

Thank you.

singlecellexperiment single-cell rnaseq seurat scater • 2.5k views

ADD COMMENT • link 5.6 years ago hamza_karakurt ▴ 50

score 1 · Answer 1 · 2018-10-09

Aaron Lun ★ 28k

@alun

The city by the bay

There are a number of aspects of your post that need addressing, so let's take them one at a time.

The first is the switch from SCESet to SingleCellExperiment. This happened a while ago, motivated by the superiority of the SummarizedExperiment class as a general data container in terms of stability, flexibility and usability. From a user perspective, this simply involves changing the constructor call (from newSCESet() to SingleCellExperiment()), and the various accessors (e.g., fData() to rowData(), pData() to colData()). Not particularly hard, and it also allows you to interface with any SummarizedExperiment-compatible packages, e.g., iSEE, DESeq2.
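As a rough illustration of that switch, the sketch below builds a SingleCellExperiment from a toy count matrix and uses the new accessors. The matrix, the `batch` and `symbol` columns, and all object names are made up for the example; only the constructor and accessor names come from the post above.

```r
library(SingleCellExperiment)

# Toy genes-by-cells count matrix, simulated purely for illustration.
set.seed(1)
count.mat <- matrix(rpois(50, lambda = 5), nrow = 10, ncol = 5,
                    dimnames = list(paste0("gene", 1:10), paste0("cell", 1:5)))

# Hypothetical per-cell and per-gene annotation.
cell.info <- DataFrame(batch = rep(c("a", "b"), length.out = 5),
                       row.names = colnames(count.mat))
gene.info <- DataFrame(symbol = paste0("SYM", 1:10),
                       row.names = rownames(count.mat))

# Constructor that takes the place of newSCESet().
sce <- SingleCellExperiment(assays = list(counts = count.mat),
                            colData = cell.info,
                            rowData = gene.info)

colData(sce)  # per-cell annotation; takes the place of pData()
rowData(sce)  # per-gene annotation; takes the place of fData()
counts(sce)   # the count matrix itself
```

Because the object is also a SummarizedExperiment, generic accessors such as `assay(sce, "counts")` work on it as well.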

As for TMM normalization: we've known for a while that this is a poor choice of normalization method for single-cell RNA-seq data with lots of zeroes; see https://doi.org/10.1186/s13059-016-0947-7 for a study of this. (Similar criticisms apply to DESeq's default normalization.) Thus, we no longer recommend using TMM normalization and have removed all functions that do so. I would suggest using alternatives like scran::computeSumFactors(); see the simpleSingleCell workflow for how it's done. That said, if you insist on using TMM, you can simply call edgeR::calcNormFactors() directly on your count matrix and multiply the result by the library sizes to get the "TMM size factors". The multiplication is important, as calcNormFactors() alone will only yield the normalization factors; these need to be scaled by the library sizes to obtain the size factors (yes, there's a difference between these two terms!).
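To make the distinction between normalization factors and size factors concrete, here is a minimal sketch on a simulated count matrix. The data and variable names are assumptions for illustration, and the final centring step is an optional convention rather than something required by edgeR.

```r
library(edgeR)

# Toy genes-by-cells count matrix, simulated purely for illustration.
set.seed(100)
count.mat <- matrix(rnbinom(2000 * 10, mu = 20, size = 2),
                    nrow = 2000, ncol = 10)

# TMM normalization factors: one per cell, scaled so their product is 1.
norm.factors <- calcNormFactors(count.mat, method = "TMM")

# Library sizes: total counts per cell.
lib.sizes <- colSums(count.mat)

# Size factors are the normalization factors scaled by the library sizes
# (edgeR refers to this product as the effective library size).
tmm.size.factors <- norm.factors * lib.sizes

# Optional: centre to a mean of 1 so that normalized values stay on
# roughly the same scale as the original counts.
tmm.size.factors <- tmm.size.factors / mean(tmm.size.factors)
```

Using `norm.factors` on their own as size factors would ignore differences in sequencing depth between cells, which is why the multiplication matters.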

The situation with normalizeExprs() is a bit more complicated because it tries to do three things at once: TMM normalization, log-transformation and batch correction. I didn't write this function, but I hated it. It doesn't have a single purpose; it's just cobbled together from three separate functions that might as well be called separately. Separate calls would require a bit more writing, but at least the user (and the reader of the code) understands what is happening. A reader seeing a call to normalizeExprs() would find it hard to figure out what the function does. If we had to use a single function, it should instead be called:

calcTMMFactorsAndNormalizeAndRemoveBatchEffects

... which we can all agree is a stupid name. I deprecated normalizeExprs() because it was better for users to be explicit about what they wanted to do and call the relevant functions directly.

ADD COMMENT • link 5.6 years ago Aaron Lun ★ 28k

Hey Aaron,

Thank you for your answer. I will use calcNormFactors in scater. After this line, I need to multiply my SumFactors with my counts to normalize, right?

Or, after computeSumFactors, does calling normalize(sce) directly not do the job? I am working on uniquely barcoded single-cell RNA-seq data.

Thanks again.

ADD REPLY • link 5.6 years ago hamza_karakurt ▴ 50

For your first question: get your terminology right, otherwise this discussion will be very confusing. calcNormFactors is from edgeR. It returns TMM normalization factors, one per cell. Each factor needs to be multiplied by the library size of its cell to obtain the size factor. You can then save the size factors into the SingleCellExperiment object with sizeFactors(sce) <- tmm.size.factors, and run normalize to compute log-transformed normalized expression values.

For your second question, I'm not sure what you're actually asking. Running computeSumFactors will compute the size factors and store them in the SingleCellExperiment object (assuming that the input was also an SCE object). Running normalize will then compute log-transformed normalized expression values.
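The arithmetic behind "log-transformed normalized expression values" can be sketched in base R. This is an illustration only: it assumes a log2 transform with a pseudo-count of 1 and size factors centred to a mean of 1, and the counts and size factors below are made up. Note that the counts are divided by the size factors, not multiplied.

```r
# Made-up genes-by-cells count matrix for illustration.
count.mat <- matrix(c(10, 0, 5,
                      20, 2, 8,
                       0, 1, 3),
                    nrow = 3,
                    dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3)))

# Hypothetical size factors, one per cell.
size.factors <- c(1.5, 0.5, 1.0)

# Centre the size factors to a mean of 1.
size.factors <- size.factors / mean(size.factors)

# Divide each cell's counts (column) by its size factor.
norm.counts <- sweep(count.mat, 2, size.factors, "/")

# Log-transform with a pseudo-count of 1 (assumed here).
log.norm <- log2(norm.counts + 1)
```

In current scater releases the `normalize()` function has, as far as I know, been replaced by `logNormCounts()`, so check the documentation of your installed version for the exact defaults.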

ADD REPLY • link 5.6 years ago Aaron Lun ★ 28k

Hi Aaron,

I have a similar situation where I am trying to reproduce earlier results for which the normaliseExprs command was used.

Previous code:

sce <- computeSumFactors(sce, sizes=seq(20, 80, 5))
sce <- normaliseExprs(sce, method = "TMM")

Commands using edgeR TMM method: (Approach 1)

norm.factors <- calcNormFactors(assay(sce, "counts"), method = "TMM")
tmm.size.factors <- norm.factors * colSums(assay(sce, "counts"))
sizeFactors(sce) <- tmm.size.factors
sce <- normalize(sce)

I've also tried using normalise command: (Approach 2)

sce <- computeSumFactors(sce, sizes=seq(20, 80, 5))
sce <- normalise(sce)

I get the same expression values using approaches 1 and 2, and they do not match the expression values obtained with the "normaliseExprs" command.

Could you please tell me if I missed anything?

Thanks

Sharvari

ADD REPLY • link 5.3 years ago sharvari gujja ▴ 40

The computeSumFactors call in your previous code does nothing, as normalizeExprs will simply overwrite any computed size factors with TMM-derived size factors.

Approach 1 should be identical to what normalizeExprs used to do in BioC 3.7. (I assume you're comparing the "logcounts" output across the different approaches.) If this is yielding different output to your previous code, then I don't know why. I would suggest you double-check your inputs.

In any case, it doesn't matter because approach 2 is the correct thing to do anyway. I don't see how you could possibly get the same results from approaches 1 and 2; they should not yield the same results on real data.
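One way to double-check this is to compare the size factors from the two approaches directly, since identical log-expression values from the same counts imply identical size factors after centring. The helper below is hypothetical (not part of scater, scran or edgeR), and the numbers are made up for illustration.

```r
# Hypothetical helper: compare two sets of size factors after centring each
# to a mean of 1, which is the scale on which they affect normalized values.
compare_size_factors <- function(sf1, sf2, tolerance = 1e-8) {
  stopifnot(length(sf1) == length(sf2), all(sf1 > 0), all(sf2 > 0))
  sf1 <- sf1 / mean(sf1)
  sf2 <- sf2 / mean(sf2)
  list(
    identical = isTRUE(all.equal(sf1, sf2, tolerance = tolerance)),
    max.log2.ratio = max(abs(log2(sf1 / sf2)))
  )
}

# Made-up size factors for five cells, purely for illustration.
tmm.sf <- c(0.8, 1.1, 0.9, 1.3, 0.9)
sum.sf <- c(0.7, 1.2, 1.0, 1.2, 0.9)

res.different <- compare_size_factors(tmm.sf, sum.sf)      # genuinely different
res.rescaled  <- compare_size_factors(tmm.sf, 2 * tmm.sf)  # same after centring
```

With real data, the two inputs would be the values of `sizeFactors(sce)` saved after running each approach on a fresh copy of the object; if they turn out to be identical, one of the two approaches was probably not applied as intended.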

ADD REPLY • link 5.3 years ago Aaron Lun ★ 28k