I have a large 10X scRNA-seq dataset spanning 5 brain regions collected from three donors. I have 18 samples in total that were sequenced in 2 batches and I'm using Scater and scDblFinder for my QC.
I need to keep track of the sampleID, brain region and batch number which I've coded into the column names:
## Load data ------------------------------------------------------------------------
sce <- DropletUtils::read10xCounts(
  c(
    paste0(DATA_DIR, "510_Cer_B2/filtered_feature_bc_matrix"),
    paste0(DATA_DIR, "510_Hip_B2/filtered_feature_bc_matrix"),
    paste0(DATA_DIR, "510_PFC_B1/filtered_feature_bc_matrix"),
    paste0(DATA_DIR, "510_PFC_B2/filtered_feature_bc_matrix"),
    ...
  ),
  sample.names = SAMPLES, type = "auto"
)
## Adding metadata -------------------------------------------------------------------
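# Batch label parsed from the sample name (e.g. "510_PFC_B1" -> "B1")
colData(sce)$batch <- ifelse(grepl("B1", colData(sce)$Sample), "B1", "B2")
# Region is characters 5-7 of the sample name (e.g. "PFC")
colData(sce)$region <- substr(colData(sce)$Sample, 5, 7)
# Cell IDs: characters 1-7 of the sample name (e.g. "510_PFC") plus the barcode
colnames(sce) <- paste(substr(colData(sce)$Sample, 1, 7), colData(sce)$Barcode, sep = "_")
# Use gene symbols as row names (note that symbols are not guaranteed to be unique)
rownames(sce) <- rowData(sce)$Symbol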
However, as demonstrated with 510_PFC above, three of the samples were sequenced over both batches (B1 and B2), meaning that reads from the same test tube were sequenced twice. We had to do this to increase the number of reads per cell in those samples.
The issue I have is that, for those samples that were sequenced in both batches, I'm currently counting their reads as if they derived from cells from a different prep, when in fact they come from cells from the same prep.
I understand that 10X barcodes are reused and before I added the sampleIDs to the column names I did have duplicate barcodes.
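To make the duplicate-barcode point concrete, here is a minimal sketch on toy data of how I would list the cell IDs that show up in both runs. The barcodes are made up, and the data frame `cd` is only a stand-in for `colData(sce)`; it assumes the sample names follow the donor_region_batch pattern used above.

```r
# Toy stand-in for colData(sce); sample names follow the pattern above,
# barcodes are invented for illustration.
cd <- data.frame(
  Sample  = c("510_PFC_B1", "510_PFC_B1", "510_PFC_B2", "510_PFC_B2", "510_Cer_B2"),
  Barcode = c("AAAC-1", "AAAG-1", "AAAC-1", "AAAT-1", "AAAC-1"),
  stringsAsFactors = FALSE
)

# Cell ID without the batch suffix: donor + region + barcode.
cd$cell_id <- paste(substr(cd$Sample, 1, 7), cd$Barcode, sep = "_")

# IDs that occur more than once are the same prep and barcode seen in both runs.
dup_ids <- unique(cd$cell_id[duplicated(cd$cell_id)])
print(dup_ids)
```

Note that the same barcode in a different prep (510_Cer here) is not flagged, because the donor and region prefix keeps those IDs distinct.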
So my question is, how should I deal with merging the reads for these three samples?
Will it suffice to just omit the batch ID from the column names as they stand, which would leave duplicate column names/cell IDs? If so, I assume that Scater and downstream Bioconductor packages would process reads from duplicate columns as deriving from the same cell?
Or do these duplicated columns need to be merged in some way so that I have a single column for each cell?
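To show what I mean by merging, here is a minimal sketch on a toy dense matrix that sums the counts of columns sharing a cell ID, using base R's `rowsum()`. The matrix and its values are invented; a real `counts(sce)` is usually a sparse matrix and would need a different approach. I am also not sure that summing is valid at all: if the same library was sequenced twice, the same molecules (UMIs) could be counted in both runs, so a simple sum might double-count them.

```r
# Toy count matrix: 3 genes x 4 columns. Columns 1 and 3 are the same cell
# (same prep, same barcode) seen in sequencing runs B1 and B2.
counts <- matrix(
  c(1, 0, 2,
    0, 3, 1,
    2, 1, 0,
    4, 0, 0),
  nrow = 3,
  dimnames = list(
    c("geneA", "geneB", "geneC"),
    c("510_PFC_AAAC-1", "510_PFC_AAAG-1", "510_PFC_AAAC-1", "510_Cer_AAAC-1")
  )
)

# rowsum() sums rows by group, so transpose, sum per cell ID, and transpose back.
# reorder = FALSE keeps the cell IDs in order of first appearance.
merged <- t(rowsum(t(counts), group = colnames(counts), reorder = FALSE))
print(merged)
```

After this step each cell ID appears once, and the duplicated cell carries the summed counts from both runs.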
Any advice you could offer regarding this would be greatly appreciated.
Many thanks for taking the time to provide a very clear and detailed answer. I have the FASTQ files, so it is no problem to run them through Cell Ranger again. The reason for coding sample details into the column names is that Seurat does this, so it was force of habit really. I've only just started exploring the Bioconductor single-cell packages in the last few weeks. Thanks again!