Hi! I'm new to differential expression analysis and I don't understand quite well the edgeR's output. Maybe my questions will be basic knowledge but I will really appreciate any help.
I've run a clustering analysis with Single-cell RNA seq data. Now I would like to know the marker genes for each of the obtained clusters. So I'm doing pairwise DE with edgeR, that is, comparing each cluster against the others. My experimental design is something like this:
> exp.des
cellName sample cluster sample.cluster
cellA x c1 x.c1
cellB x c1 x.c1
cellC x c2 x.c2
cellD x c2 x.c2
cellE y c1 y.c1
cellF y c1 y.c1
cellG y c2 y.c2
cellH y c2 y.c2
That is, I have clusters grouping cells coming from different samples (as I expected). To put it another way, the samples contain the same cell types (clusters). My first question is, should I do a pseudobulk analysis and aggregate the counts coming from cells belonging to the same cluster and sample? If that's the case, how should I aggregate the counts? just summing up? In this way I will not treat cells as replicates and my experimental design will be something like this:
> exp.des.pseudobulk
sample cluster sample.cluster
x c1 x.c1
x c2 x.c2
y c1 y.c1
y c2 y.c2
For simplicity, I've followed the edgeR analysis just with one sample so I can understand it better. My code so far:
library(edgeR)
# INPUT
x <- read.table("~/Downloads/DE_test/count_table_ILCs.txt", row.names = 1, header=TRUE)
exp.design <- read.table("~/Downloads/DE_test/exp_design.txt", row.names = 1, header = TRUE)
> head(exp.design)
clusters
SRR2088319 cluster_1
SRR2088517 cluster_1
SRR2088318 cluster_1
SRR2088317 cluster_1
SRR2088515 cluster_1
SRR2088554 cluster_1
# ... and so on, there's 649 cells and 10 clusters
clusters <- exp.design$clusters
y <- DGEList(counts = x, group = clusters)
# FILTERING
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes=FALSE]
# NORMALIZATION FACTORS
y <- calcNormFactors(y)
# DE WITH GLM APPROACH
# ~0 because there's no reference cluster
design <- model.matrix(~0+clusters, data = y$samples)
colnames(design) <- levels(y$samples$group)
> head(design)
cluster_1 cluster_10 cluster_2 cluster_3 cluster_4 cluster_5 cluster_6 cluster_7 cluster_8 cluster_9
SRR2088319 1 0 0 0 0 0 0 0 0 0
SRR2088439 1 0 0 0 0 0 0 0 0 0
SRR2088318 1 0 0 0 0 0 0 0 0 0
SRR2088317 1 0 0 0 0 0 0 0 0 0
SRR2088438 1 0 0 0 0 0 0 0 0 0
SRR2088559 1 0 0 0 0 0 0 0 0 0
y <- estimateDisp(y, design = design)
fit <- glmQLFit(y, design)
# test cluster_1 vs cluster_10
de <- glmQLFTest(fit, coef = c(1, 2))
> head(de$table)
logFC.clusterscluster_1 logFC.clusterscluster_10 logCPM F PValue
ARL14EPP1 -19.11290 -16.60276 2.611280 470.6668 9.267717e-127
DHFR -15.12338 -15.13977 4.869757 953.6638 2.338677e-193
AC064836.3 -15.37740 -16.65843 4.845060 286.1980 1.089586e-89
MBD3 -15.55926 -16.89455 3.736475 332.5589 5.724872e-100
MBD4 -13.70132 -14.81308 6.274161 227.3210 1.922395e-75
MBD5 -13.28945 -13.74994 7.104998 229.5596 5.185558e-76
My second question is: why is there two logFC, one for each cluster? I thought that de <- glmQLFTest(fit, coef = c(1, 2))
tests the expression in cluster 1 in comparison to cluster 10, so I expected just one logFC. The second one is the logFC of cluster 10 in comparison with cluster 1? If that's the case, why are both of them negative? shouldn't one be positive and the other negative?
de <- glmQLFTest(fit, coef = c(2, 1))
> head(de$table)
logFC.clusterscluster_10 logFC.clusterscluster_1 logCPM F PValue
ARL14EPP1 -16.60276 -19.11290 2.611280 470.6668 9.267717e-127
DHFR -15.13977 -15.12338 4.869757 953.6638 2.338677e-193
AC064836.3 -16.65843 -15.37740 4.845060 286.1980 1.089586e-89
MBD3 -16.89455 -15.55926 3.736475 332.5589 5.724872e-100
MBD4 -14.81308 -13.70132 6.274161 227.3210 1.922395e-75
MBD5 -13.74994 -13.28945 7.104998 229.5596 5.185558e-76
Third question: why do I obtain the same results if I test cluster 10 vs cluster 1 instead?
Thank you very much for reading my post, Marta.
Hi Gordon! you're right, I messed up with the argument
pair=c(1,2)
of theexactTest()
function. Now I did it with themakeContrasts
function and the output makes sense:I've also seen your answer in my previous post and it responds to my first question here. Thank you very much for your help in both posts.
Now I'm wondering, if there's only one p-value for each pairwise comparison, why is it necessary to adjust p-values for multiple testing in the
topTags
function?sorry for the bother and thanks again for the answer!
There are around 10,000 p-values, one for each gene.
okay now I get it, thank you!