Question

Why does scran's computeSumFactors produce different size factors for cells with the exact same gene counts?

1

Entering edit mode

Friederike ▴ 10

@friederike-12850

Last seen 3.7 years ago

NUC, Weill Cornell Medicine

Hi,

I've been playing around with different normalization strategies for scRNA-seq data. Contrary to the header, I think, I have actually two questions, the first one being: Is the computeSumFactors() philosophy really applicable to current Drop-seq data sets?

The sample that initiated the question was generated with Drop-seq, i.e., it covers around 3,000 cells, but with fairly low coverage:

> pData(sceset)$total_features %>% summary

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     17     415    1144    1304    1988    6617

The average expression values for each gene across all those cells are quite low, of course, so in order to avoid getting size factors of zero, I use a very small subset of genes (around 370) where the average count across all cells is greater than 1. (I do not filter out cells with low gene counts, which may also be worthy of a discussion). I then use these size factors to normalize the entire data set.

While I was somewhat satisfied with the tSNE results of the normalized counts, I eventually noticed that the data set contained a couple of cell pairs that had exactly the same counts for all their genes. This is clearly an unwanted artifact and I am going to exclude these duplicate cells in the future, but I noticed that computeSumFactors() had assigned quite different size factors to these cells.

For example:

> sizeFactors( sceset[, colnames(duplicated_cells))[c(1:2)] ] )

 I1_1     I2_1
2.646779 5.142105

where I1 and I2 are different cells with the exact same counts for about 5,000 genes. The size factors were quite different, regardless of whether I used the cluster parameter of the function or not.

I just have trouble grasping why that would be - again, I will definitely exclude duplicate cells from future analyses, but I was just trying to get a better understanding of why the sum factors would be so different.

I appreciate any insights - also about whether the approach is actually applicable to Drop-seq data, I have the feeling that a lot of the filtering strategies that are shown in tutorials (such as the one for scater and scran) are not really realistic for Drop-seq data with its fairly low coverage of genes, but relatively high numbers of cells.

Thanks a lot!

Friederike

scran scater drop-seq scrnaseq • 4.1k views

ADD COMMENT • link updated 7.0 years ago by Aaron Lun ★ 28k • written 7.0 years ago by Friederike ▴ 10

score 1 · Answer 1 · 2017-04-17

1

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 9 hours ago

The city by the bay

Even if you have two cells with exactly identical counts, they should get slightly different size factors because they get placed into different pools. A (very small) proportion of equations in the linear system will differ between the two cell copies, which results in slight differences to the estimates. However, these differences should indeed be small; I can't think of a reason for why you should see a >2-fold change between copies. I can only suggest:

Remove cells with low read counts. These cells don't contribute much information, while their size factors still require estimation. As a result, the solution of the linear system is less stable, such that the few equations that differ between the cell copies may have a large impact on the final estimates.
The filtering is more aggressive than I would have done, but I doubt this makes much of a difference.
Make sure you haven't set sizes lower than 20; below that, the pooled counts are not stable.

I've used computeSumFactors for a number of 10X data sets, and it seems to work okay. I usually plot the size factors against the library sizes as a diagnostic, most cells should be scattered around the diagonal (depending on the heterogeneity of the data set).

P.S. If none of that works, you can send me a MWE offline - this behaviour is interesting and should be resolved.

Edit: I'll summarize the offline discussion I had with Friederike after looking at the data. Basically, the problem was that, for the majority of cells, the count matrix was simply too sparse. Most cells express fewer than 700 genes, out of around 6000 genes that are expressed at an average count greater than 0.1. More than 90% of the counts are zero, such that the pooled size factors are also likely to be zero.

The solution was to QC more aggressively, e.g., remove cells with fewer than 500 expressed features. This removes the cells that were previously getting negative size factors, which allows us to avoid using positive=TRUE (see my comments below). In addition, the precision of the size factor estimates can be improved by increasing the number of window sizes, e.g., sizes=2:10*10. After applying these two changes, the cell copies have very close (+/- 5%) estimates, which is what we'd expect.

Of course, this means that we lose those cells with very few expressed features. However, I would argue that all normalization methods will struggle with these cells. TMM and DESeq will return gibberish because there are far too many zeroes. Library size normalization will be entirely driven by a handful of highly expressed genes - composition biases aside, the estimates will be highly imprecise.

ADD COMMENT • link 7.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

thanks for the quick reply!

* What would you consider "low read count"? With a different data set I tend to follow Seurat's default of min. 500 genes per cell, but this particular data set we wanted to explore rare cell types which is why I've been trying to keep the filtering to a minimum, but maybe that's just not a viable option. I'll test whether the difference in size factors is still showing up with more stringent filtering of the cells.

* I've used the default parameters of computeSumFactors, so I'm assuming the sizes should be ok.

* The plot that you mentioned looks ok after the aggressive gene filtering that I described (i.e., it didn't look that good without the filtering because there was an abundance of size factors with value zero). Here's the one for the example I described above:

ADD REPLY • link 7.0 years ago Friederike ▴ 10

0

Entering edit mode

Ok, so here's what happened with more stringent filtering of cells (at least 500 genes per cell) - I'll post the code to make sure that it's not just all due to me setting a wrong parameter.

# 1. filtering cells
> mito.drop <- sceset$pct_counts_feature_controls_mito >= 50 | is.na(sceset$pct_counts_feature_controls_mito)
> geneCount.drop <- pData(sceset)$total_features < 500
> sceset.filt3 <- sceset[,!(geneCount.drop | mito.drop)]
> sceset.filt3 <- scater::calculateQCMetrics(sceset.filt3,
                                          feature_controls = list(mito = mito_genes))
> summary(sceset.filt3$total_counts)
  # Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  # 822    2932    5500    6599    8973   49690 

# 2. filtering genes (just for the sum factor calculation)
> ave.counts <- rowMeans(counts(sceset.filt3))
> keep.genes.sf <- ave.counts >= 1
> sceset.sf <- sceset.filt3[keep.genes.sf,]

# 3. calculate sum factors on filtered data
> lun.sf3 <- scran::computeSumFactors(sceset.sf, clusters = NULL, sf.out = TRUE, positive = TRUE)
> summary(lun.sf3)
  #Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.0000  0.4230  0.7655  0.9962  1.3260  8.1900
## btw, without the gene filtering (step 2), this looked like this:
## Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.00000 0.09026 0.49350 0.82060 1.11700 9.07400

# 4. adding all the expression values of interest to me at the moment
> set_exprs(sceset.filt3, "cpm") <- exprs(sceset.filt3)
> sizeFactors(sceset.filt3) <- lun.sf3
> sceset.filt3 <- normalize(sceset.filt3, exprs_values = "counts", recompute_cpm = TRUE)

# 5. remove those with size factor = 0
> lsf.keep <-  sceset.filt3[,!is.na(colSums(exprs(sceset.filt3)))] %>% sampleNames
> sceset.filt3 <- sceset.filt3[, lsf.keep] # removes 8 cells

# 6. plot
png("libSize_vs_sizeFactor.png")
plot( log10(sceset.filt3$total_counts), log10(sizeFactors(sceset.filt3)) )
dev.off()

# 7. check size factors for example cells with exact same counts
sizeFactors(sceset.filt3[, colnames(tt_i2_1dup)])
## I1_1     I2_1
## 2.734696 3.993735

The plot generated in step 6 looks like this:

ADD REPLY • link 7.0 years ago Friederike ▴ 10

1

Entering edit mode

The documentation for computeSumFactors should probably adopt a sharper tone against the use of positive=TRUE. It is intended as a measure of last resort, when all reasonable attempts at quality control have been attempted and you're still getting negative size factors. In this case, it seems as if you've still got two cells with size factors below 0.01, which is pretty small; I would have a look at what these cells are, and whether they are outliers for particular parameters, e.g., total counts, total features. The aim would be to remove enough cells until you don't have to use positive=TRUE. This may well be doing funny things to cause the differences in size factors between cell copies, though I don't know enough about linear inverse models to say for sure.

ADD REPLY • link 7.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

You have some cells with crazy size factors in there; the ones hanging off the bottom (e.g., library size of ~1000, size factor of < 0.001) are particularly concerning. These probably correspond to cells where a handful of genes (probably mitochondrial or ribosomal) are dominating the total counts. The size factors are computed from the majority of genes, so if most genes have near-zero counts, the size factors will end up being really small even if the library size is large.

Filtering on the number of expressed features should avoid this. Check out this workflow for how I usually do my quality control, but setting a fixed threshold a la Seurat should also work, provided you have some "feel" for what an appropriate threshold is for Drop-seq data. In general, the size factors shouldn't span more than two orders of magnitude. No matter how accurate my scaling normalization might be, I don't really feel comfortable comparing expression values if the original counts were 1000 times smaller in one cell compared to another.

ADD REPLY • link 7.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

thanks, will look into that!

PS: I was loosely following your workflow - there you specifically just check for the lower tail when excluding cells based on library size and gene counts, so I didn't pay enough attention to the other end of the spectrum, I guess

ADD REPLY • link 7.0 years ago Friederike ▴ 10

0

Entering edit mode

Having cells with lots of counts is fine - if I had two cells with plenty of reads, then order of magnitude differences in total coverage wouldn't concern me as much, because I still have enough reads. However, the problem occurs at the low end where we might be scaling up cells with low counts by over a thousand-fold during normalization. This is less reliable because my original counts were so low.

ADD REPLY • link 7.0 years ago Aaron Lun ★ 28k

0

Entering edit mode

sorry, just saw this now.

alright, so let's see if I get this right: I should try to find cells that have inconspicuous counts, but low gene numbers. In the example posted today, I excluded cells with less than 500 genes, so I thought I was on the somewhat safe side.
Let me check what's going on in the cells with extreme size factors.

ADD REPLY • link 7.0 years ago Friederike ▴ 10

0

Entering edit mode

alright. here's the problem: the extreme values do not stem from contaminations. they are, indeed, the marker gene(s) of these particular cells.

ADD REPLY • link 7.0 years ago Friederike ▴ 10