Some of you may remember my post asking for help loading a very large time-series single-cell data set from GEO for differential expression analysis. I'm now struggling with a very different issue: I believe I may be performing the DE incorrectly, and I'm hoping someone can help me. I have a time series of single-cell expression data, and what I'm trying to do is filter for the genes that are most variable across the time series, i.e. the genes that are highly differentially expressed between any pair of time points. That's what the code below is trying to do.
library(DropletUtils)
library(SingleCellExperiment)
library(scuttle)
library(scran)
library(edgeR)
set.seed(123)
bFiles <- list.files(path="GSE242423_RAW", pattern = ".*\\.barcodes\\.tsv\\.gz$")
mFiles <- list.files(path="GSE242423_RAW", pattern = ".*\\.matrix\\.mtx\\.gz$")
sce <- list()
myColCount <- 0
for (bFile in bFiles) {
  # strip ".barcodes.tsv.gz" (16 characters) to get the GEO prefix, then keep
  # the last "_"-separated token as the sample key (the time point)
  myKey <- substr(bFile, 1, nchar(bFile) - 16)
  mFile <- paste(myKey, ".matrix.mtx.gz", sep = "")
  myKey <- strsplit(myKey, split = "_")[[1]]
  myKey <- myKey[[length(myKey)]]
  if (!(mFile %in% mFiles)) {
    next
  }
  # assemble a temporary 10x-style directory (barcodes/matrix/genes) via symlinks
  t_dir <- file.path(tempdir(), myKey)
  dir.create(t_dir)
  file.symlink(file.path("~/Documents/oskm project/GSE242423_RAW", bFile),
               file.path(t_dir, "barcodes.tsv.gz"))
  file.symlink(file.path("~/Documents/oskm project/GSE242423_RAW", mFile),
               file.path(t_dir, "matrix.mtx.gz"))
  file.symlink(file.path("~/Documents/oskm project/", "GSE242423_scRNA_genes.tsv.gz"),
               file.path(t_dir, "genes.tsv.gz"))
  sce[[myKey]] <- read10xCounts(t_dir)
  sce[[myKey]]$time <- myKey
  # quality control of cells
  cellQC <- perCellQCMetrics(sce[[myKey]])
  qclib <- cellQC$sum < 700
  qcnexprs <- cellQC$detected < 500
  qcdiscard <- qclib | qcnexprs
  sce[[myKey]] <- sce[[myKey]][, !qcdiscard]
  # labelling cells into pseudo-samples
  t <- sapply(1:20, function(j) paste0("Sample", myKey, "_", j))
  t <- rep_len(t, length(colData(sce[[myKey]])$time))
  colData(sce[[myKey]])$sample <- t
  print(myKey)
  myColCount <- myColCount + ncol(counts(sce[[myKey]]))
}
# --- Step 2: Pseudo-bulk aggregation (sum counts per gene per sample) ---
print("step 2")
sce <- do.call(cbind, sce)
if (myColCount != ncol(counts(sce))) {
print("warning column loss")
}
pb_counts <- aggregateAcrossCells(sce, ids = colData(sce)[, c("sample")])
# --- Step 3: Create edgeR object ---
print("step 3")
group <- factor(colData(pb_counts)$time)
dge <- DGEList(counts = convertTo(pb_counts,type="edgeR"), group = group)
dge <- calcNormFactors(dge)
# --- Step 4: Design matrix for time series ---
print("step 4")
design <- model.matrix(~0 + group) # no intercept, one coef per time
colnames(design) <- levels(group)
# --- Step 5: Differential expression analysis ---
print("step 5")
dge <- estimateDisp(dge, design)
fit <- glmQLFit(dge, design, dispersion = NULL)
lrt <- glmLRT(fit,coef="D0") #is this coef right?
# --- Step 6: View results ---
print("step 6")
my_gene_id <- topTags(lrt, n=30, sort.by = "PValue")
print(my_gene_id)
my_gene_id <- my_gene_id$table
my_gene_id <- rownames(my_gene_id)
print(sum(colSums(counts(sce))!=0)) # i get 156531
sce <- sce[my_gene_id,] # something really weird going on in filtering process
print(sum(colSums(counts(sce))!=0)) # i get 1121
gene_table <- cbind((seq(0,length(rownames(sce))-1)),(rownames(sce)))
write.table(gene_table, file = "panel_genes.txt", sep = "\t", row.names = FALSE,
col.names = FALSE, quote = FALSE)
my_time <- substr(sce$time, 2, nchar(sce$time))
my_time <- replace(my_time, my_time=="PSC", "-1")
my_time <- as.integer(my_time)
my_time <- matrix(my_time, nrow = 1, byrow = TRUE,
dimnames=list("time",colnames(sce)))
panel_real <- rbind(my_time,counts(sce))
gene_labels <- matrix(c(0,seq(0,length(rownames(sce))-1,1)),ncol=1)
panel_real <- cbind(gene_labels,panel_real)
panel_real <- as.matrix(panel_real)
write.table(panel_real, file = "panel_real.txt", sep = "\t", row.names = FALSE,
col.names = FALSE)
However, when I filter for the returned genes, the remaining counts are incredibly sparse. Looking at the first 30 'best' genes, they all have p-values of 0, which seems very unlikely, and less than 1% of cells have any counts left. I did do some crude filtering of 'bad' cells before performing the pseudo-bulking. Can anybody suggest what I might be doing wrong?
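As a rough sanity check (just a sketch using lrt and pb_counts from the script above), this is the sort of thing I've been looking at on the pseudo-bulk side rather than in the raw cells:
# sketch: inspect the 'top' genes in the pseudo-bulk matrix
top <- topTags(lrt, n = 30)$table
summary(top$logCPM)                        # are the top genes expressed at a sensible level?
head(counts(pb_counts)[rownames(top), ])   # their summed pseudo-bulk counts
summary(colSums(counts(pb_counts)))        # pseudo-bulk library sizes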

The use case is kinetic model parameterisation. I'm actually a systems biologist, not a bioinformatician (which is why I'm struggling here). I'm just trying to clean the data (removing less relevant genes and bad cells) to reduce its size for a completely different kind of analysis. As you say, there are no replicates; I'm doing the pseudo-bulking in such a way as to create several fake replicates. I was recommended pseudo-bulking for a number of reasons, not least of which is that the previous process I was trying to use forced conversion to a dense matrix, which massively exceeded my available RAM. ... As for the edgeR v4 QL pipeline, I guess I'm just going off older tutorials, so I've not come across it yet.
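To make the fake-replicate idea concrete: per time point the intent is simply to split the cells into 20 arbitrary groups and treat each group as one pseudo-bulk sample, roughly like this sketch (sce_t here is a hypothetical object holding a single time point's cells; my actual loop cycles the labels rather than sampling them):
# sketch: randomly assign one time point's cells to 20 pseudo-replicates
# (sce_t is a hypothetical SingleCellExperiment for a single time point)
n_reps <- 20
rep_id <- sample(rep_len(seq_len(n_reps), ncol(sce_t)))
sce_t$sample <- paste0("Sample", unique(sce_t$time), "_", rep_id)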
If you have fake replicates, then I think my code answers your question about how to run the edgeR time-course analysis.
Regarding tutorials, you might consider the edgeR User's Guide:
https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
The "Quick start" section is only a few lines long and would already move you along quite a bit.