How to do differential expression across a time series with edgeR?
@61b63a9f
United Kingdom

Some of you may remember my post requesting help loading a very large time-series single-cell data set from GEO for DE. I'm now struggling with a very different issue. I believe I may be performing the DE wrong and I'm hoping someone can help me. I have a time series of single-cell expression data, and I'm trying to filter for the genes that are most variable across the time series, i.e. the genes that are highly differentially expressed between time points when all pairwise comparisons are considered. That's what the code below is trying to do.

library(DropletUtils)
library(SingleCellExperiment)
library(scuttle)
library(scran)
library(edgeR)

set.seed(123)

bFiles <- list.files(path="GSE242423_RAW", pattern = ".*\\.barcodes\\.tsv\\.gz$")
mFiles <- list.files(path="GSE242423_RAW", pattern = ".*\\.matrix\\.mtx\\.gz$")
sce <- list()
myColCount <- 0
for (bFile in bFiles) {
    myKey <- substr(bFile, 1, nchar(bFile)-16)
    mFile <- paste(myKey, ".matrix.mtx.gz", sep="")
    myKey <- strsplit(myKey, split = "_")[[1]]
    myKey <- myKey[[length(myKey)]]
    if (!(mFile %in% mFiles))
    {
     next
    }
    t_dir <- file.path(tempdir(), myKey)
    dir.create(t_dir)
    file.symlink(file.path("~/Documents/oskm project/GSE242423_RAW", bFile),
                file.path(t_dir, "barcodes.tsv.gz"))
    file.symlink(file.path("~/Documents/oskm project/GSE242423_RAW", mFile),
                file.path(t_dir, "matrix.mtx.gz"))
    file.symlink(file.path("~/Documents/oskm project/","GSE242423_scRNA_genes.tsv.gz"),
                file.path(t_dir, "genes.tsv.gz"))
    sce[[myKey]] <- read10xCounts(t_dir)
    sce[[myKey]]$time <- myKey
    # quality control of cells
    cellQC <- perCellQCMetrics(sce[[myKey]])
    qclib <- cellQC$sum < 700
    qcnexprs <- cellQC$detected < 500
    qcdiscard <- qclib | qcnexprs
    sce[[myKey]] <- sce[[myKey]][,!qcdiscard]
    # labelling cells into pseudo-samples
    t <- sapply(1:20,function(j){paste0("Sample", myKey,"_",j)})
    t <- rep_len(t, length(colData(sce[[myKey]])$time))
    colData(sce[[myKey]])$sample <- t
    print(myKey)
    myColCount <- myColCount + ncol(counts(sce[[myKey]]))
}

# --- Step 2: Pseudo-bulk aggregation (sum counts per gene per sample) ---
print("step 2")
sce <- do.call(cbind, sce)
if (myColCount != ncol(counts(sce))) {
    print("warning column loss")
}
pb_counts <- aggregateAcrossCells(sce, id=colData(sce)[,c("sample")])

# --- Step 3: Create edgeR object ---
print("step 3")
group <- factor(colData(pb_counts)$time)
dge <- DGEList(counts = convertTo(pb_counts,type="edgeR"), group = group)
dge <- calcNormFactors(dge)

# --- Step 4: Design matrix for time series ---
print("step 4")
design <- model.matrix(~0 + group) # no intercept, one coef per time
colnames(design) <- levels(group)

# --- Step 5: Differential expression analysis ---
print("step 5")
dge <- estimateDisp(dge, design)
fit <- glmQLFit(dge, design, dispersion = NULL)
lrt <- glmLRT(fit,coef="D0") #is this coef right?

# --- Step 6: View results ---
print("step 6")
my_gene_id <- topTags(lrt, n=30, sort.by = "PValue")
print(my_gene_id)
my_gene_id <- my_gene_id$table
my_gene_id <- rownames(my_gene_id)

print(sum(colSums(counts(sce))!=0)) # I get 156531
sce <- sce[my_gene_id,] # something really weird going on in the filtering process
print(sum(colSums(counts(sce))!=0)) # I get 1121

gene_table <- cbind((seq(0,length(rownames(sce))-1)),(rownames(sce)))
write.table(gene_table, file = "panel_genes.txt", sep = "\t", row.names = FALSE,
            col.names = FALSE, quote = FALSE)
my_time <- replace(sce$time, sce$time == "PSC", "D-1") # recode PSC before stripping the leading letter
my_time <- substr(my_time, 2, nchar(my_time))
my_time <- as.integer(my_time)
my_time <- matrix(my_time, nrow = 1, byrow = TRUE,
          dimnames=list("time",colnames(sce)))
panel_real <- rbind(my_time,counts(sce))
gene_labels <- matrix(c(0,seq(0,length(rownames(sce))-1,1)),ncol=1)
panel_real <- cbind(gene_labels,panel_real) 
panel_real <- as.matrix(panel_real)
write.table(panel_real, file = "panel_real.txt", sep = "\t", row.names = FALSE,
            col.names = FALSE)

However, when I filter for the returned genes the remaining counts are incredibly sparse. Looking at the first 30 'best' genes, they all have p-values of 0, which seems very unlikely, and less than 1% of cells have any counts left. I did do some crude filtering of 'bad' cells before performing the pseudo-bulking. Can anybody suggest what I might be doing wrong?

edgeR timecoursedata
@gordon-smyth
WEHI, Melbourne, Australia

You actually haven't done any DE analysis at all; you've just selected for genes with either zero or large D0 counts.
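
To make that concrete: with the no-intercept design model.matrix(~0 + group), each coefficient estimates the mean log-expression of one time point, so coef="D0" tests whether the D0 mean is zero, not whether D0 differs from the other time points. A comparison between two time points would need an explicit contrast, along these lines (a minimal sketch reusing the design and fit objects from the question; D0 and PSC are level names that appear in the question's code, and makeContrasts comes from limma, which edgeR loads):

con <- makeContrasts(D0 - PSC, levels = design)  # hypothetical pairwise comparison of two time points
res <- glmQLFTest(fit, contrast = con)           # requires fit from glmQLFit()
topTags(res)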

Another issue is that you are using convertTo() and DGEList() incorrectly. convertTo() creates a DGEList object, not a matrix, so there is no need for DGEList().

Also, you are conducting an old-style edgeR v2 LRT test instead of using the newer edgeR v4 QL pipeline designed to deal with sparse data.

If you had biological replicates, then a simpler and better edgeR pipeline would be:

dge <- convertTo(pb_counts,type="edgeR")
dge <- normLibSizes(dge)
design <- model.matrix(~group)
fit <- glmQLFit(dge, design)
ftest <- glmQLFTest(fit, coef=2:8)

which would immediately make comparisons between all the time-points. But I'm not even sure that you have any replicates, because the GEO series has only one sample for each time point. So it would be impossible to do a simple time course analysis like this, and it is very much unclear to me what analysis you are trying to do.
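
As a small follow-up to the snippet above, the ranked results of that joint F-test would be extracted in the usual way (a sketch; the n = 30 cutoff is arbitrary and only mirrors the question):

topTags(ftest, n = 30)       # genes showing evidence of any change across the time points
summary(decideTests(ftest))  # number of genes significant at the default 5% FDR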

The edgeR User's Guide has advice on how to do time series analyses, including series without biological replicates.
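
In that spirit, one approach sometimes used for unreplicated time courses is to fit a smooth trend in time with a natural spline: with fewer spline degrees of freedom than distinct time points, there are residual degrees of freedom left to estimate the dispersion even with one sample per time point. A minimal sketch along those lines, assuming the pb_counts object from the question and time labels that look like "PSC", "D0", "D2", ... (the df = 3 choice and the label handling are assumptions, not part of the original post):

library(splines)  # ns() provides the natural spline basis

# Hypothetical conversion of the time labels to numbers, mirroring the question's export code
time_lab <- pb_counts$time
time_lab <- replace(time_lab, time_lab == "PSC", "D-1")
sample_time <- as.numeric(substr(time_lab, 2, nchar(time_lab)))

# Smooth trend in time; df must be smaller than the number of distinct time points
design <- model.matrix(~ ns(sample_time, df = 3))

dge <- normLibSizes(convertTo(pb_counts, type = "edgeR"))
fit <- glmQLFit(dge, design)
trend <- glmQLFTest(fit, coef = 2:4)  # joint test of all spline terms
topTags(trend, n = 30)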

You seem to be doing a pseudo bulk analysis, then trying to subset the original single cell data according to the pseudo bulk DE results. That doesn't make a lot of sense to me.

@61b63a9f

The use case is kinetic model parameterisation. I'm actually a systems biologist, not a bioinformatician (which is why I'm struggling here). I'm just trying to clean the data (in terms of less relevant genes and bad cells) to reduce its size for a completely different kind of analysis. As you say, there are no replicates; I'm doing the pseudo-bulking in such a way as to create several fake replicates. I was recommended pseudo-bulking for a number of reasons, not least of which was that the previous process I was trying to use forced conversion to a dense matrix, which massively exceeded my available RAM. As for edgeR v4 QL, I guess I'm just going off older tutorials, so I've not come across it yet.

@gordon-smyth

If you have fake replicates, then I think my code answers your question about how to run the edgeR time-course analysis.

Regarding tutorials, you might consider the edgeR User's Guide:
https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
The "Quick start" section is only a few lines long and would already move you along quite a bit.
