DropletUtils::swappedDrops more cells than using Read10x with filtered_feature_bc_matrix files
1
0
Entering edit mode
@gilstelzer-23042
Last seen 4.7 years ago

Four samples were sequenced on the Illumina NovaSeq machine (which is prone to index hopping). I used DropletUtils::swappedDrops to correct this issue.

#An xlsx file with filepath details for every sample
samples.swapped.reads = rio::import(sample.file.map.swapped.reads)
#creating filepath for each sample
h5.matrix.filepath.incl.swapped.reads = paste0(samples.incl.swapped.reads$matrix.files.dir, samples.incl.swapped.reads$sample.dir.name, samples.incl.swapped.reads$h5.filt.matrix.filepath)
cleaned.reads.analysis = swappedDrops(h5.matrix.filepath.incl.swapped.reads, min.frac = 0.9, get.swapped = TRUE, get.diagnostics = TRUE)

for(i in 1:length(samples.incl.swapped.reads$sample.name))
{
    import.samples.cleaned.reads[[i]] =  CreateSeuratObject(counts = cleaned.reads.analysis$cleaned[[i]], project = names(import.samples.cleaned.reads)[i], min.cells = minimum.cells, min.features = minimum.features)
}

I then merged the import.samples.cleaned.reads objects.

object.list.cleaned.reads$general.filt = merge(x = import.samples.cleaned.reads[[1]], y = import.samples.cleaned.reads[2:length(import.samples.cleaned.reads)], add.cell.ids = samples$sample.name[1:length(samples$sample.name)])

table(object.list.cleaned.reads$general.filt@meta.data$orig.ident)
#output
sh1_condition   sh1_normal   sh2_condition   sh2_normal 
7365            5858         5670            9461 

When I compare this to the cell count WITHOUT using swappedDrops (using the Read10x function in Seurat and the files within filtered-feature-bc-matrix folder, that is created by Cell Ranger), I receive LESS cells.

sh1_condition   sh1_normal   sh2_condition   sh2_normal 
6395            5162         4932            7387

I repeated the same process of using Read10x but this time with the rawfeaturebc_matrix files

sh1_condition   sh1_normal   sh2_condition   sh2_normal 
8241            7199         6704            10696

It seems that molecule-info.h5 contains raw counts prior to filtering. The Seurat package recommends using the filtered-feature-bc-matrix files as input (this was implemented in multiple analyses in our group) since it removes barcodes that contain only ambient RNA, but uses the emptyDrops function from this package to include barcodes that contain cells with low RNA expression that are not ambient RNA. I tried using the filtered-feature-bc-matrix.h5 as input for the swappedDrops function but obviously this didn't work.

How can I use the swappedDrops as described (molecule_info.h5 etc...) but still filter as Cell Ranger would, in order to receive similar results to the previous use of the Seurat package (using the filtered-feature-bc-matrix files)?

Please advise

10x genomics DropletUtils single cell RNA seq • 1.8k views
ADD COMMENT
0
Entering edit mode

For starters, you could format your post properly. See https://commonmark.org/help/. Some punctuation would also be in order.

ADD REPLY
0
Entering edit mode

Done. Formatted. Punctuated.

ADD REPLY
0
Entering edit mode

Excellent.

I also guess I should say some more things otherwise the system doesn't let me make a comment with just the word "Excellent". So that's that.

ADD REPLY
0
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 1 hour ago
The city by the bay

First, I'm going to put aside all of the Seurat stuff, because I don't know if they do their own filtering in their import functions. If they do, then it's hard to conclude anything from your numbers.

Instead, to simplify matters, let us consider the following approaches:

  1. You ran swappedDrops() to get matrices that you can use in downstream analyses.
  2. You ran swappedDrops() to get matrices that are passed through emptyDrops() and then used in downstream analyses.
  3. You loaded count matrices from the filtered_feature_bc_matrix/ directory and used them in downstream analyses.

1 is obviously crazy because you didn't do any cell calling. You're basically doing analyses on every cell barcode that has a non-zero count, and you're probably lucky that one of your downstream steps did some filtering to make this error less obvious. (Otherwise you'd be trying to analyze >700,000 "cells".)

2 is the recommended approach and is closest to what CellRanger is doing. This should be reasonably similar to but will not give the exact same number of cells as CellRanger, most obviously because you removed swapped molecules. However, even if we were to put that difference aside, there is also the fact that the cell calling algorithm is different between CellRanger and emptyDrops. The former is based on the latter but the 10X Genomics folks put their own spin on the algorithm. You'll have to ask them why they did that, but that's what they did, and I assume they had reasons for doing so.

3 is the most expedient approach if you just want to go with the flow and don't care about swapping or fiddling with the cell calling algorithm.

ADD COMMENT
0
Entering edit mode

Great. I used emptyDrops to find the barcodes that pass an FDR threshold I would like to use. How do I subset a cleaned dgCMatrix with the desired barcodes?

> cleaned.reads.analysis$cleaned[[i]][,desired.barcodes]

Error in intI(j, n = x@Dim[2], dn[[2]], give.dn = FALSE) : 
  invalid character indexing

I also tried to first convert the dgCMatrix into a matrix (as.matrix) or data frame (as.data.frame) and then select columns but got an error

Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Thanks

ADD REPLY
0
Entering edit mode

At moments like these, it is best to take a deep breath and try to think your way through the problem.

Error in intI(j, n = x@Dim[2], dn[[2]], give.dn = FALSE) : invalid character indexing

Well, I don't know what desired.barcodes is, but I bet that if you looked at it, you would see a bunch of NAs. Why are there NAs? Well, from the ?emptyDrops documentation:

'NA' values in the results:

We assume that barcodes with total UMI counts below ‘lower’ correspond to empty droplets. These are used to estimate the ambient expression profile against which the remaining barcodes are tested. Under this definition, these low-count barcodes cannot be cell-containing droplets and are excluded from the hypothesis testing.

... [not copying the entire thing, go and look it up]

Then it's a simple matter of just figuring out how to avoid getting NAs. If you're picking the desired barcodes based on FDR values below a certain threshold, one way is to do:

your.matrix.here[,desired.barcodes & !is.na(desired.barcodes),drop=FALSE]

Or for less typing you could do:

your.matrix.here[,which(desired.barcodes),drop=FALSE]

I also tried to first convert the dgCMatrix into a matrix (as.matrix) or data frame (as.data.frame) and then select columns but got an error

As you have found out, that is unwise.

ADD REPLY
0
Entering edit mode

Thank you for your kind reply

ADD REPLY

Login before adding your answer.

Traffic: 699 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6