Question

Merging only those samples sequenced over multiple batches - scRNA-seq

camerond • 0

I have a large 10X scRNA-seq dataset spanning five brain regions collected from three donors. I have 18 samples in total, sequenced across two batches, and I'm using scater and scDblFinder for my QC.

I need to keep track of the sampleID, brain region and batch number which I've coded into the column names:

##  Load data  ------------------------------------------------------------------------
sce <- DropletUtils::read10xCounts(
  c(
    paste0(DATA_DIR, "510_Cer_B2/filtered_feature_bc_matrix"),
    paste0(DATA_DIR, "510_Hip_B2/filtered_feature_bc_matrix"),
    paste0(DATA_DIR, "510_PFC_B1/filtered_feature_bc_matrix"),
    paste0(DATA_DIR, "510_PFC_B2/filtered_feature_bc_matrix"),
    ...
  ),
  sample.names = SAMPLES,
  type = "auto"
)

##  Adding metadata  -------------------------------------------------------------------
# Sample names look like "510_PFC_B1": donor, region, batch.
sce$batch <- ifelse(grepl("B1", sce$Sample), "B1", "B2")
sce$region <- substr(sce$Sample, 5, 7)
# Prefix each barcode with donor and region so that column names identify the prep.
colnames(sce) <- paste(substr(sce$Sample, 1, 7), sce$Barcode, sep = "_")
rownames(sce) <- rowData(sce)$Symbol

However, as demonstrated with 510_PFC above, three of the samples were sequenced over both batches (B1 and B2), meaning that reads from the same test tube were sequenced twice. We had to do this to increase the number of reads per cell in those samples.

The issue I have is that, for those samples that were sequenced in both batches, I'm currently counting their reads as if they derived from cells from a different prep, when in fact they come from cells from the same prep.

I understand that 10X barcodes are reused, and before I added the sampleIDs to the column names, I did have duplicate barcodes.

So my question is, how should I deal with merging the reads for these three samples?

Will it suffice to just omit the batch ID from the column names as they stand, which would leave duplicate column names/cell IDs? If so, I assume that scater and downstream Bioconductor packages would process reads from duplicate columns as deriving from the same cell?

Or do these duplicated columns need to be merged in some way so that I have a single column for each cell?

Any advice you could offer regarding this would be greatly appreciated.

SingleCell SingleCellExperiment scater • 2.2k views

Posted 3.2 years ago by camerond

score 3 · Accepted Answer · 2021-02-06

I run into this situation on occasion when I process public data, e.g., from GEO. In these situations, the authors typically have multiple sequencing runs of the same 10X cDNA library. If I'm not paying attention, I sometimes accidentally treat the FASTQ files from each run as a separate sample. This is doubly incorrect as (i) the same cell is artificially duplicated across multiple count matrices, and (ii) the sequencing depth for that cell is split across all the copies.

Once you're at the count matrices, there is no clean way to resolve this situation. You can't just add the count matrices together because the cell calling is different between the two runs, causing slight to moderate differences in the identities of the columns. And even if the set of called cells was exactly the same, you can't add these UMI counts together, because each run's UMI counts are deduplicated without considering the other run. A molecule that was sequenced in both runs (i.e., PCR duplicates of the same UMI) would be counted once in each matrix and therefore twice in the sum.
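To make the double-counting concrete, here is a minimal base R sketch. The reads, barcodes and UMIs are made up, and counting distinct (barcode, gene, UMI) combinations is only a simplified stand-in for what an aligner/counting pipeline actually does, so treat it as an illustration rather than a description of Cell Ranger's algorithm.

```r
# Toy reads from two sequencing runs of the SAME library (hypothetical data).
# Each row is one read, identified by cell barcode, gene and UMI.
run1 <- data.frame(
  barcode = c("AAAC", "AAAC", "AAAC",   "TTTG"),
  gene    = c("Gad1", "Gad1", "Snap25", "Gad1"),
  umi     = c("u1",   "u1",   "u2",     "u3")
)
run2 <- data.frame(
  barcode = c("AAAC", "AAAC", "TTTG", "TTTG"),
  gene    = c("Gad1", "Gad1", "Gad1", "Gad1"),
  umi     = c("u1",   "u4",   "u3",   "u5")
)

# Count distinct UMIs per (barcode, gene); a simplified stand-in for
# UMI deduplication, not the real pipeline.
count_umis <- function(reads) {
  molecules <- unique(reads)
  table(paste(molecules$barcode, molecules$gene, sep = ":"))
}

counts1 <- count_umis(run1)
counts2 <- count_umis(run2)
pooled  <- count_umis(rbind(run1, run2))

# Adding the per-run matrices counts UMIs u1 and u3 twice.
summed_gad1 <- as.integer(counts1["AAAC:Gad1"]) + as.integer(counts2["AAAC:Gad1"])
pooled_gad1 <- as.integer(pooled["AAAC:Gad1"])

summed_gad1  # 3
pooled_gad1  # 2
```

In this toy case, cell AAAC has two distinct Gad1 molecules (u1 and u4), but summing the separately deduplicated counts reports three, because u1 was observed in both runs.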

If you really only have the count matrices, then you'll just have to pick one sequencing run for each 10X sample. However, the much better solution is just to reprocess it from the top. Pool the FASTQ files together from all sequencing runs for the same sample and run Cellranger again; this will solve all of the problems that I just described. There is another problem of batch effects from sequencing on different days, but this is the lesser of two evils. From what I remember, it's the library preparation that is most susceptible to batch effects, so if you're using the same prep, the run-to-run variation in the sequencer should be pretty small.
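If reprocessing isn't possible, the fallback of picking one run per sample could look something like the base R sketch below. The data frame is a hypothetical stand-in for colData(sce), the "total" column is an assumed per-cell UMI total, and choosing the deeper run is my own assumption about which run to keep; any other rule would slot in the same way.

```r
# Hypothetical per-cell metadata, standing in for colData(sce).
cells <- data.frame(
  Sample  = c("510_PFC_B1", "510_PFC_B1", "510_PFC_B2", "510_PFC_B2", "510_Hip_B2"),
  Barcode = c("AAAC-1", "TTTG-1", "AAAC-1", "TTTG-1", "AAAC-1"),
  total   = c(1200, 900, 2500, 2100, 1800)  # assumed total UMI count per cell
)

# Donor + region identifies the library prep, independent of batch.
cells$prep <- substr(cells$Sample, 1, 7)

# Total depth of each sequencing run.
depth <- aggregate(total ~ prep + Sample, data = cells, FUN = sum)

# For each prep, keep only the run with the greatest depth (an assumed rule).
depth  <- depth[order(depth$prep, -depth$total), ]
chosen <- as.character(depth$Sample[!duplicated(depth$prep)])

keep <- cells$Sample %in% chosen

# With a SingleCellExperiment, the equivalent subset would be sce[, keep].
cells[keep, ]
```

Here 510_PFC was sequenced in both batches, so only its deeper run (510_PFC_B2) survives, while 510_Hip_B2 is kept as-is.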

> Will it suffice to just omit the batch ID from the column names as they stand, which would leave duplicate column names/cell IDs? If so, I assume that Scater and downstream Bioconductor packages would process reads from duplicate columns as deriving from the same cell?

Most packages will not think of duplicate column names as an indicator that the two columns come from the same cell. Non-unique column names for different cells are pretty common when you're cbinding SingleCellExperiment objects together, e.g., the same cell barcode in different 10X runs, the same plate position across multiple plates for Smart-seq2. Your case is definitely unusual.

In fact, come to think of it, I don't think I use the column names for anything in my packages. If there's any important information, it'll usually be pulled out of the colData somewhere.
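As a rough illustration of the point about column names, here is a base R sketch using plain matrices as an analogue (the counts and barcodes are made up, and I'm assuming SingleCellExperiment objects behave the same way when combined by column). Repeated names do not cause columns to be merged; if you want to know which columns might be the same cell, you have to track that explicitly in the per-cell metadata.

```r
# Two toy count matrices for the same two barcodes (hypothetical values),
# standing in for two sequencing runs of one 10X library.
genes    <- c("Gad1", "Snap25")
barcodes <- c("AAAC-1", "TTTG-1")
run1 <- matrix(c(5, 0, 2, 1), nrow = 2, dimnames = list(genes, barcodes))
run2 <- matrix(c(7, 1, 3, 0), nrow = 2, dimnames = list(genes, barcodes))

combined <- cbind(run1, run2)

n_cols  <- ncol(combined)                          # columns are kept side by side
has_dup <- anyDuplicated(colnames(combined)) > 0   # names repeat, nothing is merged

# Identity is tracked in explicit metadata instead, as colData does for an SCE.
cell_info <- data.frame(
  Barcode = colnames(combined),
  Sample  = rep(c("510_PFC_B1", "510_PFC_B2"), each = 2),
  stringsAsFactors = FALSE
)

# Barcodes seen in both runs of the same prep are candidates for being the same cell.
shared <- intersect(
  cell_info$Barcode[cell_info$Sample == "510_PFC_B1"],
  cell_info$Barcode[cell_info$Sample == "510_PFC_B2"]
)

n_cols
has_dup
shared
```

The combined matrix keeps all four columns, so anything downstream would treat them as four cells regardless of the shared names.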