Why does scran's computeSumFactors produce different size factors for cells with the exact same gene counts?
1
1
Entering edit mode
Friederike ▴ 10
@friederike-12850
Last seen 14 months ago
NUC, Weill Cornell Medicine

Hi,

I've been playing around with different normalization strategies for scRNA-seq data. Contrary to the header, I think, I have actually two questions, the first one being: Is the computeSumFactors() philosophy really applicable to current Drop-seq data sets?

The sample that initiated the question was generated with Drop-seq, i.e., it covers around 3,000 cells, but with fairly low coverage:

> pData(sceset)$total_features %>% summary  Min. 1st Qu. Median Mean 3rd Qu. Max. 17 415 1144 1304 1988 6617 The average expression values for each gene across all those cells are quite low, of course, so in order to avoid getting size factors of zero, I use a very small subset of genes (around 370) where the average count across all cells is greater than 1. (I do not filter out cells with low gene counts, which may also be worthy of a discussion). I then use these size factors to normalize the entire data set. While I was somewhat satisfied with the tSNE results of the normalized counts, I eventually noticed that the data set contained a couple of cell pairs that had exactly the same counts for all their genes. This is clearly an unwanted artifact and I am going to exclude these duplicate cells in the future, but I noticed that computeSumFactors() had assigned quite different size factors to these cells. For example: > sizeFactors( sceset[, colnames(duplicated_cells))[c(1:2)] ] ) I1_1 I2_1 2.646779 5.142105 where I1 and I2 are different cells with the exact same counts for about 5,000 genes. The size factors were quite different, regardless of whether I used the cluster parameter of the function or not. I just have trouble grasping why that would be - again, I will definitely exclude duplicate cells from future analyses, but I was just trying to get a better understanding of why the sum factors would be so different. I appreciate any insights - also about whether the approach is actually applicable to Drop-seq data, I have the feeling that a lot of the filtering strategies that are shown in tutorials (such as the one for scater and scran) are not really realistic for Drop-seq data with its fairly low coverage of genes, but relatively high numbers of cells. Thanks a lot! Friederike scran scater drop-seq scrnaseq • 2.3k views ADD COMMENT 1 Entering edit mode Aaron Lun ★ 27k @alun Last seen 4 hours ago The city by the bay Even if you have two cells with exactly identical counts, they should get slightly different size factors because they get placed into different pools. A (very small) proportion of equations in the linear system will differ between the two cell copies, which results in slight differences to the estimates. However, these differences should indeed be small; I can't think of a reason for why you should see a >2-fold change between copies. I can only suggest: • Remove cells with low read counts. These cells don't contribute much information, while their size factors still require estimation. As a result, the solution of the linear system is less stable, such that the few equations that differ between the cell copies may have a large impact on the final estimates. • The filtering is more aggressive than I would have done, but I doubt this makes much of a difference. • Make sure you haven't set sizes lower than 20; below that, the pooled counts are not stable. I've used computeSumFactors for a number of 10X data sets, and it seems to work okay. I usually plot the size factors against the library sizes as a diagnostic, most cells should be scattered around the diagonal (depending on the heterogeneity of the data set). P.S. If none of that works, you can send me a MWE offline - this behaviour is interesting and should be resolved. Edit: I'll summarize the offline discussion I had with Friederike after looking at the data. Basically, the problem was that, for the majority of cells, the count matrix was simply too sparse. Most cells express fewer than 700 genes, out of around 6000 genes that are expressed at an average count greater than 0.1. More than 90% of the counts are zero, such that the pooled size factors are also likely to be zero. The solution was to QC more aggressively, e.g., remove cells with fewer than 500 expressed features. This removes the cells that were previously getting negative size factors, which allows us to avoid using positive=TRUE (see my comments below). In addition, the precision of the size factor estimates can be improved by increasing the number of window sizes, e.g., sizes=2:10*10. After applying these two changes, the cell copies have very close (+/- 5%) estimates, which is what we'd expect. Of course, this means that we lose those cells with very few expressed features. However, I would argue that all normalization methods will struggle with these cells. TMM and DESeq will return gibberish because there are far too many zeroes. Library size normalization will be entirely driven by a handful of highly expressed genes - composition biases aside, the estimates will be highly imprecise. ADD COMMENT 0 Entering edit mode Hi Aaron, thanks for the quick reply! * What would you consider "low read count"? With a different data set I tend to follow Seurat's default of min. 500 genes per cell, but this particular data set we wanted to explore rare cell types which is why I've been trying to keep the filtering to a minimum, but maybe that's just not a viable option. I'll test whether the difference in size factors is still showing up with more stringent filtering of the cells. * I've used the default parameters of computeSumFactors, so I'm assuming the sizes should be ok. * The plot that you mentioned looks ok after the aggressive gene filtering that I described (i.e., it didn't look that good without the filtering because there was an abundance of size factors with value zero). Here's the one for the example I described above: ADD REPLY 0 Entering edit mode Ok, so here's what happened with more stringent filtering of cells (at least 500 genes per cell) - I'll post the code to make sure that it's not just all due to me setting a wrong parameter. # 1. filtering cells > mito.drop <- sceset$pct_counts_feature_controls_mito >= 50 | is.na(sceset$pct_counts_feature_controls_mito) > geneCount.drop <- pData(sceset)$total_features < 500
> sceset.filt3 <- sceset[,!(geneCount.drop | mito.drop)]
> sceset.filt3 <- scater::calculateQCMetrics(sceset.filt3,
feature_controls = list(mito = mito_genes))
> summary(sceset.filt3$total_counts) # Min. 1st Qu. Median Mean 3rd Qu. Max. # 822 2932 5500 6599 8973 49690 # 2. filtering genes (just for the sum factor calculation) > ave.counts <- rowMeans(counts(sceset.filt3)) > keep.genes.sf <- ave.counts >= 1 > sceset.sf <- sceset.filt3[keep.genes.sf,] # 3. calculate sum factors on filtered data > lun.sf3 <- scran::computeSumFactors(sceset.sf, clusters = NULL, sf.out = TRUE, positive = TRUE) > summary(lun.sf3) #Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0000 0.4230 0.7655 0.9962 1.3260 8.1900 ## btw, without the gene filtering (step 2), this looked like this: ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.00000 0.09026 0.49350 0.82060 1.11700 9.07400 # 4. adding all the expression values of interest to me at the moment > set_exprs(sceset.filt3, "cpm") <- exprs(sceset.filt3) > sizeFactors(sceset.filt3) <- lun.sf3 > sceset.filt3 <- normalize(sceset.filt3, exprs_values = "counts", recompute_cpm = TRUE) # 5. remove those with size factor = 0 > lsf.keep <- sceset.filt3[,!is.na(colSums(exprs(sceset.filt3)))] %>% sampleNames > sceset.filt3 <- sceset.filt3[, lsf.keep] # removes 8 cells # 6. plot png("libSize_vs_sizeFactor.png") plot( log10(sceset.filt3$total_counts), log10(sizeFactors(sceset.filt3)) )
dev.off()

# 7. check size factors for example cells with exact same counts
sizeFactors(sceset.filt3[, colnames(tt_i2_1dup)])
## I1_1     I2_1
## 2.734696 3.993735

The plot generated in step 6 looks like this:

1
Entering edit mode

The documentation for computeSumFactors should probably adopt a sharper tone against the use of positive=TRUE. It is intended as a measure of last resort, when all reasonable attempts at quality control have been attempted and you're still getting negative size factors. In this case, it seems as if you've still got two cells with size factors below 0.01, which is pretty small; I would have a look at what these cells are, and whether they are outliers for particular parameters, e.g., total counts, total features. The aim would be to remove enough cells until you don't have to use positive=TRUE. This may well be doing funny things to cause the differences in size factors between cell copies, though I don't know enough about linear inverse models to say for sure.

0
Entering edit mode

You have some cells with crazy size factors in there; the ones hanging off the bottom (e.g., library size of ~1000, size factor of < 0.001) are particularly concerning. These probably correspond to cells where a handful of genes (probably mitochondrial or ribosomal) are dominating the total counts. The size factors are computed from the majority of genes, so if most genes have near-zero counts, the size factors will end up being really small even if the library size is large.

Filtering on the number of expressed features should avoid this. Check out this workflow for how I usually do my quality control, but setting a fixed threshold a la Seurat should also work, provided you have some "feel" for what an appropriate threshold is for Drop-seq data. In general, the size factors shouldn't span more than two orders of magnitude. No matter how accurate my scaling normalization might be, I don't really feel comfortable comparing expression values if the original counts were 1000 times smaller in one cell compared to another.

0
Entering edit mode

thanks, will look into that!

PS: I was loosely following your workflow - there you specifically just check for the lower tail when excluding cells based on library size and gene counts, so I didn't pay enough attention to the other end of the spectrum, I guess

0
Entering edit mode

Having cells with lots of counts is fine - if I had two cells with plenty of reads, then order of magnitude differences in total coverage wouldn't concern me as much, because I still have enough reads. However, the problem occurs at the low end where we might be scaling up cells with low counts by over a thousand-fold during normalization. This is less reliable because my original counts were so low.

0
Entering edit mode

sorry, just saw this now.

alright, so let's see if I get this right: I should try to find cells that have inconspicuous counts, but low gene numbers. In the example posted today, I excluded cells with less than 500 genes, so I thought I was on the somewhat safe side.
Let me check what's going on in the cells with extreme size factors.

0
Entering edit mode

alright. here's the problem: the extreme values do not stem from contaminations. they are, indeed, the marker gene(s) of these particular cells.