Question

Warning: negative size factors

0

Entering edit mode

Marina V.V. • 0

@marina-vv-14943

Last seen 6.0 years ago

Spain

Hello! I am following the Aaron T Lun et al. 2016 paper to analyze my brain scRNA-seq data, but I am blocked in the step of gene filtering. I have my data in an sce object and I have already filtered outlier cells (isOutlier) and genes based on low-abundance genes (with threshold 0.1 because we used UMIs and this is what they recommended). The last part is like this:

ave.counts <- calcAverage(sce)
hist(log10(ave.counts), breaks=100, main="", col="grey",xlab=expression(Log[10]~"average count"))
abline(v=log10(0.1), col="red", lwd=2, lty=2)

Here it happens that I already see that gene expression of my cells is very low (bad quality?), because almost all genes appeared around -1 in the log10-average-count X axis.

So I lose most genes with the 0.1 threshold. Nevertheless, I continued removing genes with 0 average counts:

rowData(sce)$ave.count <- ave.counts
to.keep <- ave.counts > 0
sce <- sce[to.keep,]
summary(to.keep)

And finally I tried to calculate size factors with deconvolution with computeSumFactors, firstly doing a quick clustering with quickCluster because I think I understood it is better when having a big dataset as mine (10000 cells, after filtering around 5000; 24000 genes, after filtering zero average counts around 17000). The problem came when I entered:

library("scran")
high.ave <- rowData(sce)$ave.count >= 0.1
clusters <- quickCluster(sce, subset.row=high.ave, method="igraph")
sce <- computeSumFactors(sce, cluster=clusters, subset.row=high.ave, min.mean=NULL)

I obtained this warning:

Warning message:
In .computeSumFactors(assay(x, i = assay.type), subset.row = subset.row,  :
  encountered negative size factor estimates

To be continue below...

size factor negative scran singlecellexperiment scater • 3.9k views

ADD COMMENT • link updated 6.1 years ago by Aaron Lun ★ 28k • written 6.1 years ago by Marina V.V. • 0

0

Entering edit mode

Marina V.V. • 0

@marina-vv-14943

Last seen 6.0 years ago

Spain

Continuation...

As it is recommended in the help for deconvolution methods when to deal with negative size factors, I have tried more restrictive filters for cells (like nmad=2, that I think is the most one) and/or also applied more restrictive gene filter (like average counts >=0.2, 0.5 or 1) and it always returned the same warning message. So I do not know anymore if the problem has to be with filtering. But what I do not understand is why I am supposed to have lot of zero counts if I have already removed cells with library sizes=0 and gene average counts=0? When computeSumFactors pooled cells together and sum their expression values, no cell pool has to 0, right? How is this happening then? Maybe I missunderstood something.

ADD COMMENT • link 6.1 years ago Marina V.V. • 0

0

Entering edit mode

Marina V.V. • 0

@marina-vv-14943

Last seen 6.0 years ago

Spain

Continuation 2...

In the plot, then, I obtained negative size factors like below and I think that if I do not remove these cells, I will have problems in further analysis (I can not attached the plot, sorry, very big).

summary(sizeFactors(sce))

Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.01834  0.26210  0.78675  0.98255  1.42705  9.21388

Anything that may help me, please? I have been trying different ideas and reading many papers but I can not find the solution. Although the conclusion was that my expression data is poor, it may be any alternative/approach to solve this and continue, right? :(

Thanks a lot! Any help would be great!

Marina

ADD COMMENT • link 6.1 years ago Marina V.V. • 0

score 2 · Accepted Answer · 2018-02-28

2

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 10 hours ago

The city by the bay

Firstly, don't split up your post. Don't use "add answer" unless you've solved your own problem.

Now, onto your question. You don't actually show what you've done; please post the relevant code you used for quality control. Did you filter by the number of expressed features? Just saying that you used isOutlier is not informative. The little code you have shown looks correct, but it is not possible to judge without the context.

It is instructive to have a look at the libraries with negative size factors. These occur when there are zeroes in the vast majority of genes, with only a few genes with (very high) expression. For example, even after filtering, I often get negative size factors for red blood cells, as these express only a small handful of genes. These semi-systematic zeroes cannot be overcome by pooling; to continue with the example above, no matter how many libraries for RBCs you pool together, the sum of zeroes is still zero. Such genes will not be removed by filtering either, if they are expressed in populations other than the RBCs such that the average expression across the entire population is still high.

A few months ago, I modified computeSumFactors to filter on the mean abundance within each cluster via the min.mean= argument. This should reduce the effect of the potential problems discussed above, where high-abundance genes for the entire population of cells are downregulated in a particular cluster, resulting in lots of zeroes and possibly negative size factors. However, this functionality requires access to the BioC-devel branch of scran, which you may not be willing to upgrade to.

In the meantime; I would suggest (i) understanding the origin of the negative size factors - as these are often problematic libraries anyway, and (ii) if there are not many of them (<<1% of the total number of cells) and the negative values are small relative to the positive values, just removing them and proceeding with the analysis. The latter is valid provided you only have a few small negative values, as this means that the estimates of the other positive size factors will not be distorted (much).

ADD COMMENT • link 6.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

Sorry for the split, but the full text did not fit in one post, so I did not see another way to continue writing it. Now I see "ADD COMMENT". Really sorry. I am not a bioinformatic person, so this is all new to me, and I am learning to analyze this data by reading and trying to understand as much as possible. I am very sorry if I commit many mistakes.

I have to say that my scRNA-seq data is composed by tdtomato-sorted cells that are supposed to label a unique neuronal subpopulation, but I know that some other cells have been sorted into the chip too (tdtomato unspecificity). So, at any step (I think after gene filtering), I need to select only cells expressing the specific gene that characterize this subpopulation (or do I maybe have to do it before any filtering?)

So here is the code I used since the beginning, I was basically following your publication:

AttFinal<-read.csv(file = "AttributesFinal.csv",sep=";", header=TRUE, row.names=1)#Attributes
ExprFinal<-read.csv(file = "ExpressionFinal.csv",sep = ";", header=TRUE, row.names=1)#Gene data

sce<-SingleCellExperiment(list(counts=as.matrix(ExprFinal)), colData=t(AttFinal))

dim(sce)
[1] 24487  9568

is.spike <- grepl("^ERCC", rownames(sce))
isSpike(sce, "ERCC") <- is.spike
summary(is.spike)
   Mode   FALSE    TRUE 
logical   24395      92 
is.mito <- grepl("^mt-", rownames(sce))
summary(is.mito)
   Mode   FALSE    TRUE 
logical   24450      37

#Quality control on cells
sce <- calculateQCMetrics(sce, feature_controls=list(ERCC=is.spike, Mt=is.mito))
head(colnames(colData(sce)))
[1] "Sample"               "Barcode"              "Plate"                "Genotype"            
[5] "total_features"       "log10_total_features"

#I found that some wells were empty in my chip (no cell had fall into them- it was FAC-sorting into the plate), so there is no expression of any gene in those wells, and I decided to remove first all cells with total counts=0; otherwise I find NaN in next instructions and can not continue.

sce <- sce[,!((sce$total_counts == 0))

dim(sce)
[1] 24487  8742

#When I saw histograms for libsizes, nºexpr genes, ERCC% and mit%, I decided to use an nmad=2.5 (moderatly conservative) because I see in that most cells have low library sizes (1-10 thousands), low number of expressed genes (1-4000) and low ERCC % (0-1.5%).

libsize.out <- isOutlier(sce$total_counts, nmads=2.5, type="lower", log=TRUE)
feature.out <- isOutlier(sce$total_features, nmads=2.5, type="lower", log=TRUE)
spike.out <- isOutlier(sce$pct_counts_ERCC, nmads=2.5, type="higher")

#So I filtered with these three parameters:

sce <- sce[,!(libsize.out | feature.out | spike.out)]
data.frame(ByLibSize=sum(libsize.out), ByFeature=sum(feature.out), BySpike=sum(spike.out), Remaining=ncol(sce))

 ByLibSize ByFeature BySpike Remaining
1       199       378    3430      4989

And this is all I have done before the code I wrote in the first post, from:

ave.counts <- calcAverage(sce) ...

I am now looking at size factors:

which(sizeFactors(sce)<0)
 [1]  129  319  320  453  466  609  688  701  718 1024 1302 1380 1381 1382 1539 1701 1981 1984
[19] 2034 2818 2906 3019 3043 3059 3109 3122 3141 3145 3156 3182 3186 3191 3213 3231 3269 3277
[37] 3340 3351 3353 3363 3365 3367 3370 3372 3377 3378 3379 3380 3448 3490 3492 3547 3651 3696
[55] 3737 3774 3776 3821 3848 3870 3907 3980 3981 4066 4169 4191 4210 4232 4236 4258 4483 4526
[73] 4546 4550 4551 4574 4598 4627 4671 4693 4694 4854 4930 4949 4973 4975 4977

There are 87 cells with negative size factors (more than 1% of the total number of cells). If I look at library sizes of two of the cells, i.e. 129 and 609 position:

> sce$total_counts[609]
[1] 1472
> sce$total_counts[129]
[1] 309

I do not know if this is what you meant by "have a look at the libraries". It seems that they have no zeroes in gene expression. That is what I meant when I said I do not understand why I have zeroes if I have removed cells with 0 expressed genes and genes with average counts=0 before computing size factors.

Is this saying anything relevant/new to you? Thanks very much for your answers and time. I really appreciate this.

ADD REPLY • link 6.1 years ago Marina V.V. • 0

1

Entering edit mode

You're throwing away 3430 cells based on the spike-in proportions. This is very odd. You need to figure out why this is happening, as losing 40% of the cells would be a very bad attrition rate during the quality control step. Moreover, the QC metrics usually agree with each other, i.e., cells with high spike-in proportions should also have low numbers of expressed features. This is not the case with your data set, as you are discarding the bulk of cells based on the spike-in proportions only. I also find this very unusual - at best it is biological, more likely there is something very strange with your QC distributions.

In short, I don't believe that your quality control step is working as it should. The outlier strategy described in the paper is usually sensible, but you may need a hard threshold on the QC metrics if the MAD is very large. For example, what is the lowest number of expressed features in the set of cells that are retained after QC? If it's something very low (e.g., below 50), you're effectively keeping cells that only have non-zero counts for several genes, compared to the tens of thousands of other genes that have counts of zero. This will result in problems with normalization, as I discussed above. In such cases, some manual intervention is required to choose a "sensible" threshold, e.g., keep only cells with total number of features greater than 500.

Regarding the zeroes; looking at the total counts is largely uninformative. For example, for cell 609, are the 1472 counts present in one gene only, or spread evenly across 1472 genes? The former will have a lot more genes with zero expression than the latter. This is what I mean by having a look at the libraries; you need to inspect the count profile across genes for some of these cells. You want to find out how many genes are expressed in these cells; which genes are most expressed; and whether they correspond to any particular biological feature. In my RBC example, I looked at the most expressed genes in the cells with negative size factors, and found Hbb-b1, which pretty much told me what was going on here. Consider taking the subset of problematic cells and running plotHighestExprs or other scater diagnostic functions.

Obviously, removing cells with no expressed genes and genes with average counts of zero will not get rid of the zeroes in the expression matrix. I hope this would be self-evident, but here's a simple example:

... where every column has non-zero sum and every row has a non-zero mean.

ADD REPLY • link 6.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

I am trying to figure out the problem with the suggestions you made. I found that, after my QC metrics filtering (with nmad=2.5 as I wrote before), the min value for total_features is 1! And the maximum is 6885. I think now the problem is in total_features filtering, right? It seems to me that total_features=1 might be an outlier, unless all cells were in the same situation (this is what happens here, I think). I have about 600 cells with total_features<250, where 461 have <50, and about 350 <5. I do not know either why cells are not correlated as you mentioned: low total_features with high ERCC proportions. So, as you said too, cells with value 1, 5, 50 or 250 for total_features have almost all genes with value zero, and problems will arise later.

I have also looked at the total_features of several cells that have negative size factors and I have found that, for example, for a cell that have total_counts of 8, this value is expressed by 8 genes (meaning that all 8 genes have a count of 1 molecule); for another cell that have 80 total_counts, these counts are spread across 58 genes... To sum up, I have lot of zeroes in these cells and here is the problem.

I am now trying to subset the negative-size-factor population and run plotHighestExprs function, but I can not find any argument to do this inside this function. I do not know where the result of sizeFactors is kept, but I will try to figure out how to do this subset inside sce.

Anyway, I think I must do a manual intervention, as you proposed, to choose a threshold to remove cells with total_features lower than a value (I think I could choose this value based on what most cells have), instead of using isOutlier function for total_features filtering. This is the idea, right? Am I correct?

Thanks a lot!

ADD REPLY • link 6.1 years ago Marina V.V. • 0

1

Entering edit mode

Yes, I would suggest you set a manual threshold on the total number of expressed features. The difficulty is choosing a "sensible" threshold for your experimental system and protocol. I would say somewhere around 500 would be sensible, but no lower than 100. (Obviously, we wouldn't have to guess a threshold if isOutlier was working properly, but clearly it isn't, probably because of the high MAD in your data.)

Regarding the use of plotHighestExprs, just do something simple:

plotHighestExprs(sce[,sizeFactors(sce)<0])

... though based on your other results, I would guess that the poor quality control is the real problem, not any extreme biology like RBCs. So you may not get an obvious biological signal in your cells with negative size factors.

ADD REPLY • link 6.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks a lot, Aaron! I will try this manual filtering, just for total_features or also for percentage of ERCCs and total_counts, to see all possibilities and results. Hope I could find a proper quality control before normalization :)

I am very grateful for all your help, sincerely. Thank you very much :)

ADD REPLY • link 6.1 years ago Marina V.V. • 0

0

Entering edit mode

Hi again Aaron,

I already did the filtering with your recommendations and it works very good! I could finished that part happily, but it came another doubt to me. If you remembered, I need to select only cells with specific expression of one gene inside my sce object and remove the rest. But I can not achieve the script to do it... Could you please explain me how can I create another SCE object (from my sce one) with only cells that express this gene and with all the information that I already have in sce (size factors, log-norm counts, PCs...) for these cells? I mean, I know I have to select counts for that gene>0 but I do not know where to add the name of the gene. I am trying several scripts:

Pvalb.sce <- sce[counts(sce)["Pvalb",]>0]

Pvalb.sce <- sce[counts(sce)["Pvalb">0]]

but it removes genes, not cells, or it returns errors.

Anything to take into account? Thanks again.

ADD REPLY • link 6.1 years ago Marina V.V. • 0

0

Entering edit mode

Let's break it down:

pvalb.exprs <- counts(sce)["Pvalb",] # counts for this gene
keep <- pvalb.exprs > 0 # which cells express this gene?
pvalb.sce <- sce[,keep] # keeping those cells in the SCE object

I hope this helps.

ADD REPLY • link 6.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

Thanks!! Just one more doubt: from which level of expression it is considered that a gene is expressed in a cell? I mean, maybe Pvalb>0 is not sufficient because 1 or 5 counts in a cell is not saying that this cell is PV (possible contamination?). So, is there any threshold in count level to select that?

And also, can I keep this new Pvalb.sce in a file in order to not having to read all the script from the beginning every time? Something like write.table and then read it again as an singleCellExperiment directly in a new R session/script?

Thank you.

ADD REPLY • link 6.1 years ago Marina V.V. • 0

0

Entering edit mode

Your first question is unanswerable. Maybe the gene is genuinely lowly expressed; or maybe it's highly expressed but difficult to capture/sequence, resulting in low counts; or maybe it's caused by contamination or cross-mapping. Nobody knows. Unless you have a clear bimodal distribution of counts, or some prior knowledge about which cells are expressing the gene, I don't think there's any good way of setting a threshold here.

More generally, we just use the phrase "number of expressed genes" as a proxy for the number of detected genes, mostly for quality control purposes. Whether a gene is detected or not in a particular cell is usually not that informative due to the technical noise in the single-cell RNA-seq protocol (UMIs or not). I would argue that such knowledge by itself isn't even interesting; it is the comparisons between cells that matters most, e.g., in DE analyses, clustering and so on.

For your second question, see the save and saveRDS functions.

ADD REPLY • link 6.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

You are right. I completely understand and agree with you: we can not decide this threshold because we do not know which is the cause of low expression, and obviously comparisons is what matters. I am now starting with this part :) Thanks so much again!

ADD REPLY • link 6.1 years ago Marina V.V. • 0

0

Entering edit mode

Hi Aaron,

One new doubt has come to me. In the step of gene filtering, when we calculated ave.counts and plot histogram with limit line in 0.1 (in the case of UMIs), you run in your publication:

ave.counts <- calcAverage(sce) hist(log10(ave.counts), breaks=100, main="", col="grey", xlab=expression(Log[10]~"average count")) abline(v=log10(0.1), col="blue", lwd=2, lty=2)

Then you only filtered genes > 0, but you still kept genes with ave.counts between 0 and 0.1, although you only used genes with ave.counts > 0.1 to calculate size factors and normalize. So my question is: Don't we need to keep not genes with ave.counts > 0 but genes with ave.counts > 0.1, as this is our limit for UMIs counts? In my case, following your publication, I only discarded genes with average counts < 0, so these between 0 and 0.1 are still in the sce object and will be use for further analysis (for example, var.fit, var.out, PCA, tSNE...). Am I doing wrong or this is the correct way?

rowData(sce)$ave.count <- ave.counts
to.keep <- ave.counts > 0
sce <- sce[to.keep,]
summary(to.keep)

high.ave <- rowData(sce)$ave.count >= 0.1
clusters <- quickCluster(sce, subset.row=high.ave, method="igraph")
sce <- computeSumFactors(sce, cluster=clusters, 
    subset.row=high.ave, min.mean=NULL)
summary(sizeFactors(sce))

Thanks!

ADD REPLY • link 6.0 years ago Marina V.V. • 0

0

Entering edit mode

Yes, the documentation is a bit confusing as I changed the workflow a while ago to avoid explicit filtering of the SingleCellExperiment object. This is because, for heterogeneous datasets with uncommon cell types, marker genes up-regulated in those cell types will have low average counts. Filtering would eliminate these genes, which would not be good for resolving those populations in downstream analysis.

That being said, there is still filtering on the abundance for specific steps such as in computeSumFactors, as this fails at very large numbers of zero counts. However, this filter is only applied within the function, and the size factors that are returned can be applied to all genes (as it is just a scaling effect). Various other scran functions do their own filtering (e.g., trendVar) but this is mostly hidden from the user and can be ignored.

There were some other statistical arguments for filtering, such as reducing the severity of the FDR correction by eliminating uninteresting genes. However, on reflection, this doesn't really matter for exploratory single-cell analyses. Explicit error control is not necessary for HVGs (no one is happy with just a list of HVGs, even if the FDR is controlled) and difficult for DEGs between empirically identified clusters (a classic case of double-dipping into the data).

You might think that a lack of an abundance filter is a bad idea, as it would allow the low-abundance, noisy genes to affect the PCA and clustering, etc. However, keep in mind that the HVG calling step is itself a filter, ensuring that genes are only retained (e.g., for use in denoisePCA) if they have positive biological components. This is a more direct approach to ensuring that only relevant genes are used in downstream analyses.

P.S. The workflows are migrating to a new location, which is more updated:

https://www.bioconductor.org/packages/devel/workflows/html/simpleSingleCell.html

ADD REPLY • link 6.0 years ago Aaron Lun ★ 28k