Discrepancy between cell type and cluster using SingleCellExperiment and SingleR
Entering edit mode
Last seen 6 months ago


I am using singlecellexperiment with the proposed pipeline in the bioconductor book, and plot the UMAP plot with clusters as labels. I also used singleR to predict cell types and colored the cells in the UMAP according to their prediction.

Interestingly, different clusters which are apart from each other in the UMAP plot contain similar cell types (neurons, for example) from the singleR prediction. In addition, different cell types appear in the same cluster (neurons and B cell), which is strange too. When I compared the differential expression between similar cell types in different clusters I got very few to none differentially expressed genes (which makes sense), yet the cells of similar types cluster separately and are located apart in the UMAP plot.

I used the same code and clustering methods as suggested in the single cell bioconductor book (https://osca.bioconductor.org/)

Any help would be appreciated.


Here is the code I am using. If you run the code, even without clustering you can see that the B cells are all over the place without being located in one cluster:

# dataset
sce.zeisel <- ZeiselBrainData()

#  QC, normalization, PCA
stats <- perCellQCMetrics(sce.zeisel, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
sce.zeisel <- sce.zeisel[,!high.mito]
clusters <- quickCluster(sce.zeisel)
sce.zeisel <- computeSumFactors(sce.zeisel, cluster=clusters)
sce.zeisel <- logNormCounts(sce.zeisel)
dec.sce.zeisel <- modelGeneVarByPoisson(sce.zeisel)
top.sce.zeisel <- getTopHVGs(dec.sce.zeisel, prop=0.1)
sce.zeisel <- denoisePCA(sce.zeisel, subset.row=top.sce.zeisel, technical=dec.sce.zeisel)

sce.zeisel <- runUMAP(sce.zeisel, dimred="PCA")

#cell type prediction

pred <- SingleR(test = sce.zeisel, ref= hpca.se, labels=hpca.se$label.main)
sce.zeisel$celltypes <- pred$pruned.labels

plotUMAP(sce.zeisel, colour_by="celltypes")

# clustering 
g <- buildSNNGraph(sce.zeisel, k=10, use.dimred = 'PCA') # choosing K affects the seperation between the clusters
clust <- igraph::cluster_walktrap(g)$membership
colLabels(sce.zeisel) <- factor(clust)

# UMAP with clusters
plotUMAP(sce.zeisel, colour_by="label", text_by = "label")
singlecell singlecellexperiment umap singleR • 337 views
Entering edit mode

Without a reproducible example, who knows?

Entering edit mode

I added the code using a publicly available dataset, and getting the same issue. Similarly predicted cell types by singleR (in the case B cells) are dominating different clusters, which are separated in the UMAP space.

Entering edit mode
Aaron Lun ★ 26k
Last seen 52 minutes ago
The city by the bay

I had a hunch about what the cause would be, and if your actual dataset is anything like the public one, then I would say my hunch was correct, because...

ZeiselBrainData() is mouse, and hpca.se (from HumanPrimaryCellAtlasData(), presumably) is... human.

SingleR()'s annotation is performed using the intersection of the row names of the reference and test datasets. The function will throw an error if the intersection is empty, which leads one to think that it would error out if given a human reference and a mouse test. Unfortunately, this is not the case when gene symbols are used, because there are certain genes that have the same name across both annotations. These are usually some wacky lncRNAs, predicted genes and other useless bits and pieces that are not informative for annotation. Little wonder, then, that the SingleR() output doesn't match up with the clustering.

This would not be a problem if the references were generated with stable identifiers like Ensembl IDs; there's no way to mix up ENSG000XXX with ENSMUSG00000XXX. In such cases, SingleR() will error out and it will be pretty clear that something is wrong (and should be fixed by the user). But for various historical reasons, the references use gene symbols by default, resulting in the potential for mistakes like these.

I suppose I could add some kind of warning if more than X% of genes are lost in the intersection, but that can be a false warning in many cases. And besides, it's not SingleR's responsibility to check that the user is specifying the correct species. I mean, it's a pretty low bar to expect that an analyst is aware of the species of their data!

Entering edit mode

As much as I want your hunch to be right, it is unfortunately wrong. The dataset I am using is human, and not mouse (like the one in my code above).... If you (or anyone else) have another other suggestion, I would be very grateful.


Entering edit mode

Then I have no idea, beyond suggesting that you try a different reference.


Login before adding your answer.

Traffic: 282 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6