Hi,
I've been playing around with different normalization strategies for scRNA-seq data. Contrary to the header, I think, I have actually two questions, the first one being: Is the computeSumFactors() philosophy really applicable to current Drop-seq data sets?
The sample that initiated the question was generated with Drop-seq, i.e., it covers around 3,000 cells, but with fairly low coverage:
> pData(sceset)$total_features %>% summary
Min. 1st Qu. Median Mean 3rd Qu. Max. 17 415 1144 1304 1988 6617
The average expression values for each gene across all those cells are quite low, of course, so in order to avoid getting size factors of zero, I use a very small subset of genes (around 370) where the average count across all cells is greater than 1. (I do not filter out cells with low gene counts, which may also be worthy of a discussion). I then use these size factors to normalize the entire data set.
While I was somewhat satisfied with the tSNE results of the normalized counts, I eventually noticed that the data set contained a couple of cell pairs that had exactly the same counts for all their genes. This is clearly an unwanted artifact and I am going to exclude these duplicate cells in the future, but I noticed that computeSumFactors() had assigned quite different size factors to these cells.
For example:
> sizeFactors( sceset[, colnames(duplicated_cells))[c(1:2)] ] ) I1_1 I2_1 2.646779 5.142105
where I1 and I2 are different cells with the exact same counts for about 5,000 genes. The size factors were quite different, regardless of whether I used the cluster parameter of the function or not.
I just have trouble grasping why that would be - again, I will definitely exclude duplicate cells from future analyses, but I was just trying to get a better understanding of why the sum factors would be so different.
I appreciate any insights - also about whether the approach is actually applicable to Drop-seq data, I have the feeling that a lot of the filtering strategies that are shown in tutorials (such as the one for scater and scran) are not really realistic for Drop-seq data with its fairly low coverage of genes, but relatively high numbers of cells.
Thanks a lot!
Friederike
Hi Aaron,
thanks for the quick reply!
* What would you consider "low read count"? With a different data set I tend to follow Seurat's default of min. 500 genes per cell, but this particular data set we wanted to explore rare cell types which is why I've been trying to keep the filtering to a minimum, but maybe that's just not a viable option. I'll test whether the difference in size factors is still showing up with more stringent filtering of the cells.
* I've used the default parameters of
computeSumFactors
, so I'm assuming the sizes should be ok.* The plot that you mentioned looks ok after the aggressive gene filtering that I described (i.e., it didn't look that good without the filtering because there was an abundance of size factors with value zero). Here's the one for the example I described above:
Ok, so here's what happened with more stringent filtering of cells (at least 500 genes per cell) - I'll post the code to make sure that it's not just all due to me setting a wrong parameter.
The plot generated in step 6 looks like this:
The documentation for
computeSumFactors
should probably adopt a sharper tone against the use ofpositive=TRUE
. It is intended as a measure of last resort, when all reasonable attempts at quality control have been attempted and you're still getting negative size factors. In this case, it seems as if you've still got two cells with size factors below 0.01, which is pretty small; I would have a look at what these cells are, and whether they are outliers for particular parameters, e.g., total counts, total features. The aim would be to remove enough cells until you don't have to usepositive=TRUE
. This may well be doing funny things to cause the differences in size factors between cell copies, though I don't know enough about linear inverse models to say for sure.You have some cells with crazy size factors in there; the ones hanging off the bottom (e.g., library size of ~1000, size factor of < 0.001) are particularly concerning. These probably correspond to cells where a handful of genes (probably mitochondrial or ribosomal) are dominating the total counts. The size factors are computed from the majority of genes, so if most genes have near-zero counts, the size factors will end up being really small even if the library size is large.
Filtering on the number of expressed features should avoid this. Check out this workflow for how I usually do my quality control, but setting a fixed threshold a la Seurat should also work, provided you have some "feel" for what an appropriate threshold is for Drop-seq data. In general, the size factors shouldn't span more than two orders of magnitude. No matter how accurate my scaling normalization might be, I don't really feel comfortable comparing expression values if the original counts were 1000 times smaller in one cell compared to another.
thanks, will look into that!
PS: I was loosely following your workflow - there you specifically just check for the lower tail when excluding cells based on library size and gene counts, so I didn't pay enough attention to the other end of the spectrum, I guess
Having cells with lots of counts is fine - if I had two cells with plenty of reads, then order of magnitude differences in total coverage wouldn't concern me as much, because I still have enough reads. However, the problem occurs at the low end where we might be scaling up cells with low counts by over a thousand-fold during normalization. This is less reliable because my original counts were so low.
sorry, just saw this now.
alright, so let's see if I get this right: I should try to find cells that have inconspicuous counts, but low gene numbers. In the example posted today, I excluded cells with less than 500 genes, so I thought I was on the somewhat safe side.
Let me check what's going on in the cells with extreme size factors.
alright. here's the problem: the extreme values do not stem from contaminations. they are, indeed, the marker gene(s) of these particular cells.