Question

DropletUtils read10xCounts adds the 12 Cell Multiplexing Oligo tags as gene names

pcantalupo ▴ 10

@pcantalupo-8617

United States

Hello,

I'm using DropletUtils 1.20.0 in Bioconductor 3.17. I used cellranger multi (version 7.1.0) to demultiplex my samples. I loaded one of the H5 files into R with read10xCounts(h5file). When I look at the rownames of the sce object (see below), the last 12 gene names are CMO301, CMO302, and so on up to CMO312. These are the 10x Genomics CMO tags used to label cells for cell multiplexing. The CMOs should not be added to the sce object as gene names. I could not find an option in read10xCounts to exclude these rows, and Google searches were not productive either.

Why are CMO tags added as genes? Is this a bug or expected?

> h5file
                                    h2 
"sample_filtered_feature_bc_matrix.h5" 
> sce = read10xCounts(h5file)
> tail(rownames(sce),n=15)
 [1] "ENSMUSG00000094855" "ENSMUSG00000095019" "ENSMUSG00000095041" "CMO301"            
 [5] "CMO302"             "CMO303"             "CMO304"             "CMO305"            
 [9] "CMO306"             "CMO307"             "CMO308"             "CMO309"            
[13] "CMO310"             "CMO311"             "CMO312"            
>

Thank you

DropletUtils droplet • 1.1k views

written 8 months ago by pcantalupo ▴ 10

score 1 · Answer 1 · 2023-11-23

ATpoint ★ 4.2k

@atpoint-13662

Germany

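In rowData(sce) you'll find a Type column. For CMOs from cellranger multi it is "Multiplexing Capture" (antibody-based hashtag oligos are labelled "Antibody Capture" instead). You can simply move those rows to an altExp.

library(SingleCellExperiment)  # attaches rowData() and altExp()

# read data
sce <- DropletUtils::read10xCounts(...)

# check which feature types are present before filtering on one
table(rowData(sce)$Type)

# identify which rows are the CMOs
isCMO <- rowData(sce)$Type == "Multiplexing Capture"

# move CMOs to altExp
altExp(sce, "CMO") <- sce[isCMO, ]

# clean the main sce from CMOs
sce <- sce[!isCMO, ]

Untested, but something like this.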

Edit: Or just use the dedicated splitAltExps function from SingleCellExperiment for it.
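For anyone landing here later, a minimal sketch of the splitAltExps route. It runs on a toy object rather than a real H5 file, and it assumes the CMO rows carry the Type label "Multiplexing Capture" (the label reported for this file further down the thread). The gene IDs and counts are made up for illustration.

```r
library(SingleCellExperiment)

# Toy stand-in for the output of read10xCounts(): 3 genes + 2 CMO rows, 4 cells.
feature_ids <- c("ENSMUSG00000094855", "ENSMUSG00000095019",
                 "ENSMUSG00000095041", "CMO301", "CMO302")
counts <- matrix(1:20, nrow = 5, dimnames = list(feature_ids, NULL))

sce <- SingleCellExperiment(
    assays = list(counts = counts),
    rowData = DataFrame(
        Type = rep(c("Gene Expression", "Multiplexing Capture"), c(3, 2))
    )
)

# Split rows by feature type. 'ref' names the type that stays in the main
# experiment; every other type becomes an alternative experiment.
sce <- splitAltExps(sce, rowData(sce)$Type, ref = "Gene Expression")

rownames(sce)                                   # gene rows only
altExpNames(sce)                                # names of the split-off types
rownames(altExp(sce, "Multiplexing Capture"))   # the CMO rows
```

With a real dataset, the same splitAltExps call should work directly on the object returned by read10xCounts, provided the Type values match what table(rowData(sce)$Type) shows.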

written 8 months ago by ATpoint ★ 4.2k

Thank you. That should be the default behavior.

The problem is: how many people are going to know about this? The documentation doesn't describe this behavior and needs to be updated. Keeping the CMO tags as genes is going to affect downstream analyses.

written 8 months ago by pcantalupo ▴ 10

This doesn't happen by default because it would create instability in the behaviour of the package when new types of feature are introduced - we don't especially want to privilege "CMO" type features in case of future changes that would break data interactions with older versions of DropletUtils.

Thanks to ATpoint for the very helpful answers you have provided here.

written 8 months ago by Jonathan Griffiths ▴ 90

I understand that it can't be the default, but I believe it should be added as a parameter. How do I submit an issue for this? Is this the GitHub repo: https://github.com/MarioniLab/DropletUtils ? I couldn't find it under the Bioconductor GitHub. Thank you

written 8 months ago by pcantalupo ▴ 10

You can submit an issue there, but since I am the maintainer I will not promise you that I will implement it! I would welcome the suggestion there though, so please add it if you would like :)

Certainly I wouldn't implement the splitting by default, because many existing pieces of code will be set up to run as things stand, and we don't want to break them. We could add an argument to split the matrices off, but it's hard to understand why a user wouldn't just use splitAltExps themselves.

written 8 months ago by Jonathan Griffiths ▴ 90

Anyone who reads ?read10xCounts and knows that their dataset has hashtags would naturally start checking where they end up, no? Thousands of people use the function, so I think it's not all bad.
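As a rough sketch of that check: tabulating the Type column of rowData shows at a glance whether anything other than genes ended up in the rows. The object below is a made-up toy stand-in for what read10xCounts returns, and the "Multiplexing Capture" label is an assumption about how the CMO rows are annotated.

```r
library(SingleCellExperiment)

# Toy stand-in for read10xCounts() output: 3 genes + 2 CMO rows, 4 cells.
feature_ids <- c("ENSMUSG00000094855", "ENSMUSG00000095019",
                 "ENSMUSG00000095041", "CMO301", "CMO302")
sce <- SingleCellExperiment(
    assays = list(counts = matrix(0L, nrow = 5, ncol = 4,
                                  dimnames = list(feature_ids, NULL))),
    rowData = DataFrame(
        Type = rep(c("Gene Expression", "Multiplexing Capture"), c(3, 2))
    )
)

# How many rows of each feature type are present?
type_counts <- table(rowData(sce)$Type)
print(type_counts)

# Which rows are not genes?
non_gene <- rownames(sce)[rowData(sce)$Type != "Gene Expression"]
print(non_gene)
```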

written 8 months ago by ATpoint ★ 4.2k

I read this in the docs and assumed that only genes were in the rows: "A SingleCellExperiment object containing count data for each gene (row) and cell (column) across all samples." Unfortunately, in this context "gene" is ambiguous. Honestly, I wasn't even thinking about where the CMO counts were going.

I come from a Seurat background and rarely use DropletUtils. When I use Seurat's Read10X_h5 function it returns a matrix by default. If there are multiple modalities in the h5 file, it will return a list of matrices and give the user useful feedback. Maybe I expected this when using DropletUtils.

Example

> dat = Seurat::Read10X_h5(h5file)
Genome matrix has multiple modalities, returning a list of matrices for this genome
> names(dat)
[1] "Gene Expression"      "Multiplexing Capture"

Thank you for your help

written 8 months ago by pcantalupo ▴ 10