edgeR output doubts
1
0
Entering edit mode
mbenegas • 0
@1deca0b7
Last seen 7 weeks ago
Spain

Hi! I'm new to differential expression analysis and I don't understand quite well the edgeR's output. Maybe my questions will be basic knowledge but I will really appreciate any help.

I've run a clustering analysis with Single-cell RNA seq data. Now I would like to know the marker genes for each of the obtained clusters. So I'm doing pairwise DE with edgeR, that is, comparing each cluster against the others. My experimental design is something like this:

> exp.des
cellName    sample cluster sample.cluster
cellA      x      c1           x.c1
cellB      x      c1           x.c1
cellC      x      c2           x.c2
cellD      x      c2           x.c2
cellE      y      c1           y.c1
cellF      y      c1           y.c1
cellG      y      c2           y.c2
cellH      y      c2           y.c2


That is, I have clusters grouping cells coming from different samples (as I expected). To put it another way, the samples contain the same cell types (clusters). My first question is, should I do a pseudobulk analysis and aggregate the counts coming from cells belonging to the same cluster and sample? If that's the case, how should I aggregate the counts? just summing up? In this way I will not treat cells as replicates and my experimental design will be something like this:

> exp.des.pseudobulk
sample  cluster sample.cluster
x   c1  x.c1
x   c2  x.c2
y   c1  y.c1
y   c2  y.c2


For simplicity, I've followed the edgeR analysis just with one sample so I can understand it better. My code so far:

library(edgeR)

# INPUT
clusters
SRR2088319 cluster_1
SRR2088517 cluster_1
SRR2088318 cluster_1
SRR2088317 cluster_1
SRR2088515 cluster_1
SRR2088554 cluster_1
# ... and so on, there's 649 cells and 10 clusters
clusters <- exp.design$clusters y <- DGEList(counts = x, group = clusters) # FILTERING keep <- filterByExpr(y) y <- y[keep, , keep.lib.sizes=FALSE] # NORMALIZATION FACTORS y <- calcNormFactors(y) # DE WITH GLM APPROACH # ~0 because there's no reference cluster design <- model.matrix(~0+clusters, data = y$samples)
colnames(design) <-  levels(y$samples$group)
cluster_1 cluster_10 cluster_2 cluster_3 cluster_4 cluster_5 cluster_6 cluster_7 cluster_8 cluster_9
SRR2088319         1          0         0         0         0         0         0         0         0         0
SRR2088439         1          0         0         0         0         0         0         0         0         0
SRR2088318         1          0         0         0         0         0         0         0         0         0
SRR2088317         1          0         0         0         0         0         0         0         0         0
SRR2088438         1          0         0         0         0         0         0         0         0         0
SRR2088559         1          0         0         0         0         0         0         0         0         0

y <- estimateDisp(y, design = design)
fit <- glmQLFit(y, design)

# test cluster_1 vs cluster_10
de <- glmQLFTest(fit, coef = c(1, 2))
> head(de$table) logFC.clusterscluster_1 logFC.clusterscluster_10 logCPM F PValue ARL14EPP1 -19.11290 -16.60276 2.611280 470.6668 9.267717e-127 DHFR -15.12338 -15.13977 4.869757 953.6638 2.338677e-193 AC064836.3 -15.37740 -16.65843 4.845060 286.1980 1.089586e-89 MBD3 -15.55926 -16.89455 3.736475 332.5589 5.724872e-100 MBD4 -13.70132 -14.81308 6.274161 227.3210 1.922395e-75 MBD5 -13.28945 -13.74994 7.104998 229.5596 5.185558e-76  My second question is: why is there two logFC, one for each cluster? I thought that de <- glmQLFTest(fit, coef = c(1, 2)) tests the expression in cluster 1 in comparison to cluster 10, so I expected just one logFC. The second one is the logFC of cluster 10 in comparison with cluster 1? If that's the case, why are both of them negative? shouldn't one be positive and the other negative? de <- glmQLFTest(fit, coef = c(2, 1)) > head(de$table)
logFC.clusterscluster_10 logFC.clusterscluster_1   logCPM        F        PValue
ARL14EPP1                 -16.60276               -19.11290 2.611280 470.6668 9.267717e-127
DHFR                      -15.13977               -15.12338 4.869757 953.6638 2.338677e-193
AC064836.3                -16.65843               -15.37740 4.845060 286.1980  1.089586e-89
MBD3                      -16.89455               -15.55926 3.736475 332.5589 5.724872e-100
MBD4                      -14.81308               -13.70132 6.274161 227.3210  1.922395e-75
MBD5                      -13.74994               -13.28945 7.104998 229.5596  5.185558e-76


Third question: why do I obtain the same results if I test cluster 10 vs cluster 1 instead?

Thank you very much for reading my post, Marta.

logFC edgeR DifferentialExpression scRNAseq • 210 views
0
Entering edit mode
@gordon-smyth
Last seen 3 hours ago
WEHI, Melbourne, Australia

The way you are defining comparisons is not correct. Do you mind me asking what documentation you are following? What would lead you to think that coef = c(1,2) compares between groups 1 and 2?

In your previous post ( differential expression for Single-cell RNAseq data using edgeR ) you were defining contrasts using makeContrasts. Why did you stop doing that?

0
Entering edit mode

Hi Gordon! you're right, I messed up with the argument pair=c(1,2) of the exactTest() function. Now I did it with the makeContrasts function and the output makes sense:

> contrast <- makeContrasts(cluster_1 - cluster_10, levels = design)
> de <- glmQLFTest(fit,contrast = contrast)
> head(de$table) logFC logCPM F PValue ARL14EPP1 -2.51013769 2.611280 13.822981090 0.0002183349 DHFR 0.01639287 4.869757 0.002178002 0.9627914292 AC064836.3 1.28102818 4.845060 1.637685708 0.2011035236 MBD3 1.33529217 3.736475 1.818092986 0.1780138455 MBD4 1.11175497 6.274161 1.018621922 0.3132246764 MBD5 0.46048469 7.104998 0.296306380 0.5863954829 > de$comparison
[1] "1*cluster_1 -1*cluster_10"
> contrast <- makeContrasts(cluster_10 - cluster_1, levels = design)
> de <- glmQLFTest(fit,contrast = contrast)
> head(de$table) logFC logCPM F PValue ARL14EPP1 2.51013769 2.611280 13.822981090 0.0002183349 DHFR -0.01639287 4.869757 0.002178002 0.9627914292 AC064836.3 -1.28102818 4.845060 1.637685708 0.2011035236 MBD3 -1.33529217 3.736475 1.818092986 0.1780138455 MBD4 -1.11175497 6.274161 1.018621922 0.3132246764 MBD5 -0.46048469 7.104998 0.296306380 0.5863954829 > de$comparison
[1] "-1*cluster_1 1*cluster_10"


I've also seen your answer in my previous post and it responds to my first question here. Thank you very much for your help in both posts.

Now I'm wondering, if there's only one p-value for each pairwise comparison, why is it necessary to adjust p-values for multiple testing in the topTags function?

sorry for the bother and thanks again for the answer!

2
Entering edit mode

if there's only one p-value for each pairwise comparison, why is it necessary to adjust p-values for multiple testing?

There are around 10,000 p-values, one for each gene.

0
Entering edit mode

okay now I get it, thank you!