Hello! I am following the Aaron T Lun et al. 2016 paper to analyze my brain scRNA-seq data, but I am blocked in the step of gene filtering. I have my data in an sce object and I have already filtered outlier cells (isOutlier) and genes based on low-abundance genes (with threshold 0.1 because we used UMIs and this is what they recommended). The last part is like this:
ave.counts <- calcAverage(sce)
hist(log10(ave.counts), breaks=100, main="", col="grey",xlab=expression(Log~"average count"))
abline(v=log10(0.1), col="red", lwd=2, lty=2)
Here it happens that I already see that gene expression of my cells is very low (bad quality?), because almost all genes appeared around -1 in the log10-average-count X axis.
So I lose most genes with the 0.1 threshold. Nevertheless, I continued removing genes with 0 average counts:
rowData(sce)$ave.count <- ave.counts
to.keep <- ave.counts > 0
sce <- sce[to.keep,]
And finally I tried to calculate size factors with deconvolution with computeSumFactors, firstly doing a quick clustering with quickCluster because I think I understood it is better when having a big dataset as mine (10000 cells, after filtering around 5000; 24000 genes, after filtering zero average counts around 17000). The problem came when I entered:
high.ave <- rowData(sce)$ave.count >= 0.1
clusters <- quickCluster(sce, subset.row=high.ave, method="igraph")
sce <- computeSumFactors(sce, cluster=clusters, subset.row=high.ave, min.mean=NULL)
I obtained this warning:
Warning message: In .computeSumFactors(assay(x, i = assay.type), subset.row = subset.row, : encountered negative size factor estimates
To be continue below...
Sorry for the split, but the full text did not fit in one post, so I did not see another way to continue writing it. Now I see "ADD COMMENT". Really sorry. I am not a bioinformatic person, so this is all new to me, and I am learning to analyze this data by reading and trying to understand as much as possible. I am very sorry if I commit many mistakes.
I have to say that my scRNA-seq data is composed by tdtomato-sorted cells that are supposed to label a unique neuronal subpopulation, but I know that some other cells have been sorted into the chip too (tdtomato unspecificity). So, at any step (I think after gene filtering), I need to select only cells expressing the specific gene that characterize this subpopulation (or do I maybe have to do it before any filtering?)
So here is the code I used since the beginning, I was basically following your publication:
#I found that some wells were empty in my chip (no cell had fall into them- it was FAC-sorting into the plate), so there is no expression of any gene in those wells, and I decided to remove first all cells with total counts=0; otherwise I find NaN in next instructions and can not continue.
#So I filtered with these three parameters:
And this is all I have done before the code I wrote in the first post, from:
I am now looking at size factors:
There are 87 cells with negative size factors (more than 1% of the total number of cells). If I look at library sizes of two of the cells, i.e. 129 and 609 position:
I do not know if this is what you meant by "have a look at the libraries". It seems that they have no zeroes in gene expression. That is what I meant when I said I do not understand why I have zeroes if I have removed cells with 0 expressed genes and genes with average counts=0 before computing size factors.
Is this saying anything relevant/new to you? Thanks very much for your answers and time. I really appreciate this.
You're throwing away 3430 cells based on the spike-in proportions. This is very odd. You need to figure out why this is happening, as losing 40% of the cells would be a very bad attrition rate during the quality control step. Moreover, the QC metrics usually agree with each other, i.e., cells with high spike-in proportions should also have low numbers of expressed features. This is not the case with your data set, as you are discarding the bulk of cells based on the spike-in proportions only. I also find this very unusual - at best it is biological, more likely there is something very strange with your QC distributions.
In short, I don't believe that your quality control step is working as it should. The outlier strategy described in the paper is usually sensible, but you may need a hard threshold on the QC metrics if the MAD is very large. For example, what is the lowest number of expressed features in the set of cells that are retained after QC? If it's something very low (e.g., below 50), you're effectively keeping cells that only have non-zero counts for several genes, compared to the tens of thousands of other genes that have counts of zero. This will result in problems with normalization, as I discussed above. In such cases, some manual intervention is required to choose a "sensible" threshold, e.g., keep only cells with total number of features greater than 500.
Regarding the zeroes; looking at the total counts is largely uninformative. For example, for cell 609, are the 1472 counts present in one gene only, or spread evenly across 1472 genes? The former will have a lot more genes with zero expression than the latter. This is what I mean by having a look at the libraries; you need to inspect the count profile across genes for some of these cells. You want to find out how many genes are expressed in these cells; which genes are most expressed; and whether they correspond to any particular biological feature. In my RBC example, I looked at the most expressed genes in the cells with negative size factors, and found Hbb-b1, which pretty much told me what was going on here. Consider taking the subset of problematic cells and running
plotHighestExprsor other scater diagnostic functions.
Obviously, removing cells with no expressed genes and genes with average counts of zero will not get rid of the zeroes in the expression matrix. I hope this would be self-evident, but here's a simple example:
... where every column has non-zero sum and every row has a non-zero mean.
I am trying to figure out the problem with the suggestions you made. I found that, after my QC metrics filtering (with nmad=2.5 as I wrote before), the min value for total_features is 1! And the maximum is 6885. I think now the problem is in total_features filtering, right? It seems to me that total_features=1 might be an outlier, unless all cells were in the same situation (this is what happens here, I think). I have about 600 cells with total_features<250, where 461 have <50, and about 350 <5. I do not know either why cells are not correlated as you mentioned: low total_features with high ERCC proportions. So, as you said too, cells with value 1, 5, 50 or 250 for total_features have almost all genes with value zero, and problems will arise later.
I have also looked at the total_features of several cells that have negative size factors and I have found that, for example, for a cell that have total_counts of 8, this value is expressed by 8 genes (meaning that all 8 genes have a count of 1 molecule); for another cell that have 80 total_counts, these counts are spread across 58 genes... To sum up, I have lot of zeroes in these cells and here is the problem.
I am now trying to subset the negative-size-factor population and run plotHighestExprs function, but I can not find any argument to do this inside this function. I do not know where the result of sizeFactors is kept, but I will try to figure out how to do this subset inside sce.
Anyway, I think I must do a manual intervention, as you proposed, to choose a threshold to remove cells with total_features lower than a value (I think I could choose this value based on what most cells have), instead of using isOutlier function for total_features filtering. This is the idea, right? Am I correct?
Thanks a lot!
Yes, I would suggest you set a manual threshold on the total number of expressed features. The difficulty is choosing a "sensible" threshold for your experimental system and protocol. I would say somewhere around 500 would be sensible, but no lower than 100. (Obviously, we wouldn't have to guess a threshold if
isOutlierwas working properly, but clearly it isn't, probably because of the high MAD in your data.)
Regarding the use of
plotHighestExprs, just do something simple:
... though based on your other results, I would guess that the poor quality control is the real problem, not any extreme biology like RBCs. So you may not get an obvious biological signal in your cells with negative size factors.
Thanks a lot, Aaron! I will try this manual filtering, just for total_features or also for percentage of ERCCs and total_counts, to see all possibilities and results. Hope I could find a proper quality control before normalization :)
I am very grateful for all your help, sincerely. Thank you very much :)
Hi again Aaron,
I already did the filtering with your recommendations and it works very good! I could finished that part happily, but it came another doubt to me. If you remembered, I need to select only cells with specific expression of one gene inside my sce object and remove the rest. But I can not achieve the script to do it... Could you please explain me how can I create another SCE object (from my sce one) with only cells that express this gene and with all the information that I already have in sce (size factors, log-norm counts, PCs...) for these cells? I mean, I know I have to select counts for that gene>0 but I do not know where to add the name of the gene. I am trying several scripts:
but it removes genes, not cells, or it returns errors.
Anything to take into account? Thanks again.
Let's break it down:
I hope this helps.
Thanks!! Just one more doubt: from which level of expression it is considered that a gene is expressed in a cell? I mean, maybe Pvalb>0 is not sufficient because 1 or 5 counts in a cell is not saying that this cell is PV (possible contamination?). So, is there any threshold in count level to select that?
And also, can I keep this new Pvalb.sce in a file in order to not having to read all the script from the beginning every time? Something like write.table and then read it again as an singleCellExperiment directly in a new R session/script?
Your first question is unanswerable. Maybe the gene is genuinely lowly expressed; or maybe it's highly expressed but difficult to capture/sequence, resulting in low counts; or maybe it's caused by contamination or cross-mapping. Nobody knows. Unless you have a clear bimodal distribution of counts, or some prior knowledge about which cells are expressing the gene, I don't think there's any good way of setting a threshold here.
More generally, we just use the phrase "number of expressed genes" as a proxy for the number of detected genes, mostly for quality control purposes. Whether a gene is detected or not in a particular cell is usually not that informative due to the technical noise in the single-cell RNA-seq protocol (UMIs or not). I would argue that such knowledge by itself isn't even interesting; it is the comparisons between cells that matters most, e.g., in DE analyses, clustering and so on.
For your second question, see the
You are right. I completely understand and agree with you: we can not decide this threshold because we do not know which is the cause of low expression, and obviously comparisons is what matters. I am now starting with this part :) Thanks so much again!
One new doubt has come to me. In the step of gene filtering, when we calculated ave.counts and plot histogram with limit line in 0.1 (in the case of UMIs), you run in your publication:
ave.counts <- calcAverage(sce)
hist(log10(ave.counts), breaks=100, main="", col="grey",
abline(v=log10(0.1), col="blue", lwd=2, lty=2)
Then you only filtered genes > 0, but you still kept genes with ave.counts between 0 and 0.1, although you only used genes with ave.counts > 0.1 to calculate size factors and normalize. So my question is: Don't we need to keep not genes with ave.counts > 0 but genes with ave.counts > 0.1, as this is our limit for UMIs counts? In my case, following your publication, I only discarded genes with average counts < 0, so these between 0 and 0.1 are still in the sce object and will be use for further analysis (for example, var.fit, var.out, PCA, tSNE...). Am I doing wrong or this is the correct way?
Yes, the documentation is a bit confusing as I changed the workflow a while ago to avoid explicit filtering of the
SingleCellExperimentobject. This is because, for heterogeneous datasets with uncommon cell types, marker genes up-regulated in those cell types will have low average counts. Filtering would eliminate these genes, which would not be good for resolving those populations in downstream analysis.
That being said, there is still filtering on the abundance for specific steps such as in
computeSumFactors, as this fails at very large numbers of zero counts. However, this filter is only applied within the function, and the size factors that are returned can be applied to all genes (as it is just a scaling effect). Various other scran functions do their own filtering (e.g.,
trendVar) but this is mostly hidden from the user and can be ignored.
There were some other statistical arguments for filtering, such as reducing the severity of the FDR correction by eliminating uninteresting genes. However, on reflection, this doesn't really matter for exploratory single-cell analyses. Explicit error control is not necessary for HVGs (no one is happy with just a list of HVGs, even if the FDR is controlled) and difficult for DEGs between empirically identified clusters (a classic case of double-dipping into the data).
You might think that a lack of an abundance filter is a bad idea, as it would allow the low-abundance, noisy genes to affect the PCA and clustering, etc. However, keep in mind that the HVG calling step is itself a filter, ensuring that genes are only retained (e.g., for use in
denoisePCA) if they have positive biological components. This is a more direct approach to ensuring that only relevant genes are used in downstream analyses.
P.S. The workflows are migrating to a new location, which is more updated: