Question

Discrepancy between cell type and cluster using SingleCellExperiment and SingleR

0

Entering edit mode

lirongrossmann ▴ 80

@lirongrossmann-13938

Last seen 3.8 years ago

Hi,

I am using singlecellexperiment with the proposed pipeline in the bioconductor book, and plot the UMAP plot with clusters as labels. I also used singleR to predict cell types and colored the cells in the UMAP according to their prediction.

Interestingly, different clusters which are apart from each other in the UMAP plot contain similar cell types (neurons, for example) from the singleR prediction. In addition, different cell types appear in the same cluster (neurons and B cell), which is strange too. When I compared the differential expression between similar cell types in different clusters I got very few to none differentially expressed genes (which makes sense), yet the cells of similar types cluster separately and are located apart in the UMAP plot.

I used the same code and clustering methods as suggested in the single cell bioconductor book (https://osca.bioconductor.org/)

Any help would be appreciated.

Thanks!

Here is the code I am using. If you run the code, even without clustering you can see that the B cells are all over the place without being located in one cluster:

# dataset
sce.zeisel <- ZeiselBrainData()

#  QC, normalization, PCA
stats <- perCellQCMetrics(sce.zeisel, subsets=list(Mito=which(location=="MT")))
high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
sce.zeisel <- sce.zeisel[,!high.mito]
set.seed(1000)
clusters <- quickCluster(sce.zeisel)
sce.zeisel <- computeSumFactors(sce.zeisel, cluster=clusters)
sce.zeisel <- logNormCounts(sce.zeisel)
set.seed(1001)
dec.sce.zeisel <- modelGeneVarByPoisson(sce.zeisel)
top.sce.zeisel <- getTopHVGs(dec.sce.zeisel, prop=0.1)
set.seed(10000)
sce.zeisel <- denoisePCA(sce.zeisel, subset.row=top.sce.zeisel, technical=dec.sce.zeisel)

set.seed(1000000)
sce.zeisel <- runUMAP(sce.zeisel, dimred="PCA")


#cell type prediction

pred <- SingleR(test = sce.zeisel, ref= hpca.se, labels=hpca.se$label.main)
table(pred$labels)
sce.zeisel$celltypes <- pred$pruned.labels

# UMAP
plotUMAP(sce.zeisel, colour_by="celltypes")

# clustering 
g <- buildSNNGraph(sce.zeisel, k=10, use.dimred = 'PCA') # choosing K affects the seperation between the clusters
clust <- igraph::cluster_walktrap(g)$membership
colLabels(sce.zeisel) <- factor(clust)

# UMAP with clusters
plotUMAP(sce.zeisel, colour_by="label", text_by = "label")

singlecell singlecellexperiment umap singleR • 2.8k views

ADD COMMENT • link updated 3.9 years ago by Aaron Lun ★ 28k • written 3.9 years ago by lirongrossmann ▴ 80

0

Entering edit mode

Without a reproducible example, who knows?

ADD REPLY • link 3.9 years ago Aaron Lun ★ 28k

0

Entering edit mode

I added the code using a publicly available dataset, and getting the same issue. Similarly predicted cell types by singleR (in the case B cells) are dominating different clusters, which are separated in the UMAP space.

ADD REPLY • link 3.9 years ago lirongrossmann ▴ 80

score 0 · Answer 1 · 2020-09-02

I had a hunch about what the cause would be, and if your actual dataset is anything like the public one, then I would say my hunch was correct, because...

ZeiselBrainData() is mouse, and hpca.se (from HumanPrimaryCellAtlasData(), presumably) is... human.

SingleR()'s annotation is performed using the intersection of the row names of the reference and test datasets. The function will throw an error if the intersection is empty, which leads one to think that it would error out if given a human reference and a mouse test. Unfortunately, this is not the case when gene symbols are used, because there are certain genes that have the same name across both annotations. These are usually some wacky lncRNAs, predicted genes and other useless bits and pieces that are not informative for annotation. Little wonder, then, that the SingleR() output doesn't match up with the clustering.

This would not be a problem if the references were generated with stable identifiers like Ensembl IDs; there's no way to mix up ENSG000XXX with ENSMUSG00000XXX. In such cases, SingleR() will error out and it will be pretty clear that something is wrong (and should be fixed by the user). But for various historical reasons, the references use gene symbols by default, resulting in the potential for mistakes like these.

I suppose I could add some kind of warning if more than X% of genes are lost in the intersection, but that can be a false warning in many cases. And besides, it's not SingleR's responsibility to check that the user is specifying the correct species. I mean, it's a pretty low bar to expect that an analyst is aware of the species of their data!