I've been experimenting with different normalization strategies for scRNA-seq data. Contrary to what the title suggests, I actually have two questions. The first one is: is the computeSumFactors() philosophy really applicable to current Drop-seq data sets?
The sample that initiated the question was generated with Drop-seq, i.e., it covers around 3,000 cells, but with fairly low coverage:
> pData(sceset)$total_features %>% summary
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     17     415    1144    1304    1988    6617
The average expression of each gene across all those cells is, of course, quite low. To avoid getting size factors of zero, I use a very small subset of genes (around 370) whose average count across all cells is greater than 1. (I do not filter out cells with low gene counts, which may also be worth discussing.) I then use these size factors to normalize the entire data set.
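To make the filtering step concrete, here is a minimal base-R sketch on a made-up toy matrix. The gene and cell names are hypothetical, and the library-size-based factors are only a stand-in for what computeSumFactors() would return; they are not the deconvolution method itself.

```r
# Toy count matrix (invented numbers): genes in rows, cells in columns
counts <- rbind(
  geneA = c(0, 0, 1, 0, 0, 0),
  geneB = c(3, 5, 2, 4, 6, 1),
  geneC = c(1, 0, 0, 2, 0, 0),
  geneD = c(2, 2, 1, 3, 2, 2)
)
colnames(counts) <- paste0("cell", 1:6)

# Keep only genes whose average count across all cells is greater than 1
keep <- rowMeans(counts) > 1
counts_hi <- counts[keep, , drop = FALSE]

# Stand-in size factors (ASSUMPTION: simple library size on the kept genes,
# scaled to a mean of 1; in the real analysis these would come from scran)
sf <- colSums(counts_hi) / mean(colSums(counts_hi))

# Apply the factors to the entire data set, not just the filtered genes
normalized <- sweep(counts, 2, sf, "/")
```

In the real workflow, the same logical vector `keep` could presumably be handed to scran through the `subset.row` argument of computeSumFactors() rather than subsetting the object by hand.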
While I was somewhat satisfied with the tSNE results of the normalized counts, I eventually noticed that the data set contained a couple of cell pairs that had exactly the same counts for all their genes. This is clearly an unwanted artifact and I am going to exclude these duplicate cells in the future, but I noticed that computeSumFactors() had assigned quite different size factors to these cells.
> sizeFactors( sceset[, colnames(duplicated_cells)[c(1:2)] ] )
    I1_1     I2_1
2.646779 5.142105
where I1 and I2 are different cells with the exact same counts for about 5,000 genes. The size factors were quite different, regardless of whether I used the cluster parameter of the function or not.
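For reference, duplicate cells like these could be located with something along the following lines. This is a sketch on an invented toy matrix: only the names I1_1 and I2_1 come from my data, and the other cells and all counts are hypothetical.

```r
# Toy count matrix (invented numbers): genes in rows, cells in columns
counts <- cbind(
  I1_1 = c(0, 2, 1, 0),
  I2_1 = c(0, 2, 1, 0),  # identical to I1_1
  I3_1 = c(1, 0, 0, 3),
  I4_1 = c(0, 1, 1, 0)
)
rownames(counts) <- paste0("gene", 1:4)

# duplicated() compares rows of a matrix, so transpose to compare cells
cell_profiles <- t(counts)

# Flag every member of a duplicated group, not only the later copies
is_dup <- duplicated(cell_profiles) | duplicated(cell_profiles, fromLast = TRUE)
dup_cells <- colnames(counts)[is_dup]

# Keep one representative per identical count profile
counts_unique <- counts[, !duplicated(cell_profiles), drop = FALSE]
```

Here `dup_cells` lists both members of each pair, while `counts_unique` retains the first cell of each pair; dropping both members instead would just mean subsetting with `!is_dup`.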
I just have trouble grasping why that would be. Again, I will definitely exclude duplicate cells from future analyses; I am just trying to understand why the sum factors would be so different.
I appreciate any insights, including on whether the approach is actually applicable to Drop-seq data. I have the feeling that many of the filtering strategies shown in tutorials (such as those for scater and scran) are not really realistic for Drop-seq data, with its fairly low coverage of genes but relatively high number of cells.
Thanks a lot!