ERROR - Size factors should be positive real numbers (using normalize() function)
kushshah

I have a SingleCellExperiment object, and no matter what I do, when I run normalize(filtered.sce), I get the error "size factors should be positive real numbers".

It is my understanding that even though computeSumFactors() coerces size factors to be positive by default if necessary, that doesn't guarantee that normalize() will run without error.

I have performed a number of QC steps on my pancreas dataset (Segerstolpe et al., 2016), starting from the 1308 high-quality cells specified in the metadata. Nothing seems to be working:

  • libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
  • feature.drop <- isOutlier(sce$total_features_by_counts, nmads=3, type="lower", log=TRUE)
  • spike.drop <- isOutlier(sce$pct_counts_ERCC, nmads=3, type="higher")
    • Together, these three methods removed 62, 73, and 143 cells, respectively, from the original 1308. This seems to be a lot.
  • After defining ave.raw.counts <- calcAverage(sce, use_size_factors=FALSE), I reduced the sce object down to the genes with ave.raw.counts >= 1, which is about 14000 of the original 25000 genes (see the combined sketch below).
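
For completeness, here is a sketch of how those filters fit together (the order of the cell- and gene-level filters is illustrative, and assumes QC metrics were added beforehand with calculateQCMetrics()):

    # flag outlier cells on each QC metric
    libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
    feature.drop <- isOutlier(sce$total_features_by_counts, nmads=3, type="lower", log=TRUE)
    spike.drop <- isOutlier(sce$pct_counts_ERCC, nmads=3, type="higher")

    # keep only cells passing all three filters
    filtered.sce <- sce[, !(libsize.drop | feature.drop | spike.drop)]

    # keep genes with an average raw count of at least 1
    ave.raw.counts <- calcAverage(filtered.sce, use_size_factors=FALSE)
    filtered.sce <- filtered.sce[ave.raw.counts >= 1, ]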

When running filtered.sce <- computeSumFactors(filtered.sce), it runs WITHOUT any warning of encountering negative size factor estimates.

However, when running the following two commands, I get a warning and then an error:

  • filtered.sce <- computeSpikeFactors(filtered.sce, type="ERCC", general.use=FALSE)
    • Warning message: zero spike-in counts during spike-in normalization
  • filtered.sce <- normalize(filtered.sce)
    • Error in .local(object,...): size factors should be positive real numbers

I even tried filtering by keep <- ave.raw.counts >= 50 just to see if there was any way I could get it to work, but my final error during normalization was still "size factors should be positive real numbers".

I would appreciate any help as to why this may be happening. I can also provide any more information that is required. Thank you so much.

scater scran singlecellexperiment normalize qc
Aaron Lun

First, calm down.

Secondly, let's have a look at the warning:

zero spike-in counts during spike-in normalization

Sounds pretty straightforward. If you don't have any spike-in counts for a cell, you can't compute a meaningful spike-in size factor for that cell. (Technically, the spike-in size factor is reported as zero, which is meaningless; hence the warning.) This then leads to the error in normalize, because otherwise it would divide the counts for that cell by zero.
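
You can check whether this applies to your object by inspecting the spike-in size factors directly (using the same sizeFactors() accessor with a type argument that appears later in this thread):

    # how many cells ended up with a zero spike-in size factor?
    spike.sf <- sizeFactors(filtered.sce, "ERCC")
    summary(spike.sf)
    table(spike.sf == 0)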

So, depending on what you aim to do, you can either (see the sketches after this list):

  1. If you must have the spike-ins for a downstream analysis step, remove the cells with zero spike-in size factors.
  2. Otherwise, remove the spike-ins and proceed onward with all cells.
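
In code, the two options might look something like this (a sketch using the old spike-in API, where isSpike() marks the spike-in rows; treat the NULL assignment for clearing the stored spike-in factors as an assumption):

    # Option 1: keep only cells with positive spike-in size factors
    filtered.sce.spike <- filtered.sce[, sizeFactors(filtered.sce, "ERCC") > 0]

    # Option 2: drop the spike-in rows (and their size factors), keep all cells
    no.spike <- filtered.sce[!isSpike(filtered.sce), ]
    sizeFactors(no.spike, "ERCC") <- NULL  # assumed to clear the spike-in factors
    no.spike <- normalize(no.spike)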

Of course, you can do both of these steps, e.g., do 1 to estimate the technical mean-variance trend for feature selection, and then do 2 to use all cells for downstream analysis (possibly with the subset of features selected from 1). This is, in fact, exactly what I did with this same data set here.
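
In sketch form, that combined workflow might look like the following (function names from the scran of that era; the details in the linked code will differ):

    # 1. fit the technical trend using cells with usable spike-in factors
    for.hvg <- filtered.sce[, sizeFactors(filtered.sce, "ERCC") > 0]
    for.hvg <- normalize(for.hvg)
    fit <- trendVar(for.hvg, parametric=TRUE)
    dec <- decomposeVar(for.hvg, fit)
    chosen <- rownames(dec)[dec$bio > 0]

    # 2. drop the spike-ins and continue with all cells,
    #    restricting to the features chosen in step 1 where appropriate
    all.cells <- filtered.sce[!isSpike(filtered.sce), ]
    sizeFactors(all.cells, "ERCC") <- NULL  # assumed setter, as above
    all.cells <- normalize(all.cells)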

P.S.

Together, these three methods removed 62, 73, and 143 cells, respectively, from the original 1308. This seems to be a lot.

I lose about 10% of cells in routine experiments, so what you're seeing is not so bad. Keep in mind that the three methods will overlap, so the total number of removed cells is unlikely to be the sum of 62, 73 and 143. Of course, what they consider to be "not-low-quality" may or may not be your definition of "high quality". It's all pretty arbitrary and there's a lot of wiggle room during quality control - I mean, what cell isn't damaged by getting dunked in a foreign buffer and shot through microfluidics? They're all going to be a bit screwed up, but the hope is that there's still something useful in there.

Another factor is that there are strong patient-to-patient differences in sample processing (e.g., in the spike-in percentages if nothing else), which suggests that batch= should be used in isOutlier. Perhaps I should have done so in my code, but frankly, I was so tired from wrangling their "count" matrix into shape that I just moved on ASAP.
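
For example, a batch-aware version of one of the earlier filters would look like this (assuming the donor identity is stored in a colData column, here hypothetically called individual):

    # compute outlier thresholds separately within each donor
    libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower",
                              log=TRUE, batch=sce$individual)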


This is extremely helpful, thank you so much. I've also batched isOutlier() by individual now.

Had a quick question - does "remove cells with zero spike-in size factors" mean "remove cells whose read count for every spike-in is zero"?

If so, I was looking at the code you linked to. Your line for.hvg <- sce.emtab[, sizeFactors(sce.emtab, "ERCC") > 0 & sce.emtab$Donor != "AZ"] seems to accomplish this?

Doing the same with my sce object (specifically, filtered.sce.spike <- filtered.sce[, sizeFactors(filtered.sce, "ERCC") > 0]) results in filtered.sce.spike having zero columns (zero cells). I had defined 72 spike-ins earlier. Am I missing something simple here? Perhaps there is a way I need to denote spike-ins that I have not done properly?


Had a quick question - does "remove cells with zero spike-in size factors" mean "remove cells whose read count for every spike-in is zero"?

Yes.

Perhaps there is a way I need to denote spike-ins that I have not done properly?

You probably filtered them out in your calcAverage filtering step. I would suggest not filtering explicitly, but rather using subset.row to filter within each function as needed. See comments here.
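
A sketch of the subset.row approach (keeping every row in the object, including the spike-ins, and filtering only inside each call; computeSumFactors() accepts subset.row for this purpose):

    # define the abundance filter without subsetting the object
    keep <- calcAverage(sce, use_size_factors=FALSE) >= 1

    # apply the filter within the call, so spike-in rows are never lost
    sce <- computeSumFactors(sce, subset.row=keep)
    sce <- computeSpikeFactors(sce, type="ERCC", general.use=FALSE)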


Hi Aaron,

I am having a similar issue with CITE-seq data. I have one control ("Ig") and I am trying to perform control-based normalization as suggested in the OSCA book. When I do the following:

    controls <- grep("Ig", rownames(altExp(sce)))
    sf.control <- librarySizeFactors(altExp(sce), subset_row=controls)
    sce <- logNormCounts(sce, use.altexps=TRUE)

I get the error, since ~2000 cells have zero counts for the control antibody:

    summary(sf.control)
    #    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    #  0.0000  0.4953  0.9906  1.0000  1.4859  6.4389

I saw in the OSCA book that the control size factors are also zero for some cells. My question is: how should I use the control-based normalization then? I calculated the median-based size factors and made a scatter plot against the control factors, and they don't correlate at all.
