edgeR output doubts
1
0
Entering edit mode
mbenegas ▴ 10
@1deca0b7
Last seen 3.3 years ago
Spain

Hi! I'm new to differential expression analysis and I don't understand quite well the edgeR's output. Maybe my questions will be basic knowledge but I will really appreciate any help.

I've run a clustering analysis with Single-cell RNA seq data. Now I would like to know the marker genes for each of the obtained clusters. So I'm doing pairwise DE with edgeR, that is, comparing each cluster against the others. My experimental design is something like this:

> exp.des
cellName    sample cluster sample.cluster
cellA      x      c1           x.c1
cellB      x      c1           x.c1
cellC      x      c2           x.c2
cellD      x      c2           x.c2
cellE      y      c1           y.c1
cellF      y      c1           y.c1
cellG      y      c2           y.c2
cellH      y      c2           y.c2

That is, I have clusters grouping cells coming from different samples (as I expected). To put it another way, the samples contain the same cell types (clusters). My first question is, should I do a pseudobulk analysis and aggregate the counts coming from cells belonging to the same cluster and sample? If that's the case, how should I aggregate the counts? just summing up? In this way I will not treat cells as replicates and my experimental design will be something like this:

> exp.des.pseudobulk
sample  cluster sample.cluster
x   c1  x.c1
x   c2  x.c2
y   c1  y.c1
y   c2  y.c2

For simplicity, I've followed the edgeR analysis just with one sample so I can understand it better. My code so far:

library(edgeR)

# INPUT
x <- read.table("~/Downloads/DE_test/count_table_ILCs.txt", row.names = 1, header=TRUE)
exp.design <- read.table("~/Downloads/DE_test/exp_design.txt", row.names = 1, header = TRUE)
> head(exp.design)
            clusters
SRR2088319 cluster_1
SRR2088517 cluster_1
SRR2088318 cluster_1
SRR2088317 cluster_1
SRR2088515 cluster_1
SRR2088554 cluster_1
# ... and so on, there's 649 cells and 10 clusters
clusters <- exp.design$clusters
y <- DGEList(counts = x, group = clusters)

# FILTERING
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes=FALSE]

# NORMALIZATION FACTORS
y <- calcNormFactors(y)

# DE WITH GLM APPROACH
# ~0 because there's no reference cluster
design <- model.matrix(~0+clusters, data = y$samples)
colnames(design) <-  levels(y$samples$group)
> head(design)
           cluster_1 cluster_10 cluster_2 cluster_3 cluster_4 cluster_5 cluster_6 cluster_7 cluster_8 cluster_9
SRR2088319         1          0         0         0         0         0         0         0         0         0
SRR2088439         1          0         0         0         0         0         0         0         0         0
SRR2088318         1          0         0         0         0         0         0         0         0         0
SRR2088317         1          0         0         0         0         0         0         0         0         0
SRR2088438         1          0         0         0         0         0         0         0         0         0
SRR2088559         1          0         0         0         0         0         0         0         0         0

y <- estimateDisp(y, design = design)
fit <- glmQLFit(y, design)

# test cluster_1 vs cluster_10  
de <- glmQLFTest(fit, coef = c(1, 2))
> head(de$table)
           logFC.clusterscluster_1 logFC.clusterscluster_10   logCPM        F        PValue
ARL14EPP1                -19.11290                -16.60276 2.611280 470.6668 9.267717e-127
DHFR                     -15.12338                -15.13977 4.869757 953.6638 2.338677e-193
AC064836.3               -15.37740                -16.65843 4.845060 286.1980  1.089586e-89
MBD3                     -15.55926                -16.89455 3.736475 332.5589 5.724872e-100
MBD4                     -13.70132                -14.81308 6.274161 227.3210  1.922395e-75
MBD5                     -13.28945                -13.74994 7.104998 229.5596  5.185558e-76

My second question is: why is there two logFC, one for each cluster? I thought that de <- glmQLFTest(fit, coef = c(1, 2)) tests the expression in cluster 1 in comparison to cluster 10, so I expected just one logFC. The second one is the logFC of cluster 10 in comparison with cluster 1? If that's the case, why are both of them negative? shouldn't one be positive and the other negative?

de <- glmQLFTest(fit, coef = c(2, 1))
> head(de$table)
           logFC.clusterscluster_10 logFC.clusterscluster_1   logCPM        F        PValue
ARL14EPP1                 -16.60276               -19.11290 2.611280 470.6668 9.267717e-127
DHFR                      -15.13977               -15.12338 4.869757 953.6638 2.338677e-193
AC064836.3                -16.65843               -15.37740 4.845060 286.1980  1.089586e-89
MBD3                      -16.89455               -15.55926 3.736475 332.5589 5.724872e-100
MBD4                      -14.81308               -13.70132 6.274161 227.3210  1.922395e-75
MBD5                      -13.74994               -13.28945 7.104998 229.5596  5.185558e-76

Third question: why do I obtain the same results if I test cluster 10 vs cluster 1 instead?

Thank you very much for reading my post, Marta.

logFC edgeR DifferentialExpression scRNAseq • 1.6k views
ADD COMMENT
0
Entering edit mode
@gordon-smyth
Last seen 5 hours ago
WEHI, Melbourne, Australia

The way you are defining comparisons is not correct. Do you mind me asking what documentation you are following? What would lead you to think that coef = c(1,2) compares between groups 1 and 2?

In your previous post ( differential expression for Single-cell RNAseq data using edgeR ) you were defining contrasts using makeContrasts. Why did you stop doing that?

ADD COMMENT
0
Entering edit mode

Hi Gordon! you're right, I messed up with the argument pair=c(1,2) of the exactTest() function. Now I did it with the makeContrasts function and the output makes sense:

> contrast <- makeContrasts(cluster_1 - cluster_10, levels = design)
> de <- glmQLFTest(fit,contrast = contrast)
> head(de$table)
                 logFC   logCPM            F       PValue
ARL14EPP1  -2.51013769 2.611280 13.822981090 0.0002183349
DHFR        0.01639287 4.869757  0.002178002 0.9627914292
AC064836.3  1.28102818 4.845060  1.637685708 0.2011035236
MBD3        1.33529217 3.736475  1.818092986 0.1780138455
MBD4        1.11175497 6.274161  1.018621922 0.3132246764
MBD5        0.46048469 7.104998  0.296306380 0.5863954829
> de$comparison
[1] "1*cluster_1 -1*cluster_10"
> contrast <- makeContrasts(cluster_10 - cluster_1, levels = design)
> de <- glmQLFTest(fit,contrast = contrast)
> head(de$table)
                 logFC   logCPM            F       PValue
ARL14EPP1   2.51013769 2.611280 13.822981090 0.0002183349
DHFR       -0.01639287 4.869757  0.002178002 0.9627914292
AC064836.3 -1.28102818 4.845060  1.637685708 0.2011035236
MBD3       -1.33529217 3.736475  1.818092986 0.1780138455
MBD4       -1.11175497 6.274161  1.018621922 0.3132246764
MBD5       -0.46048469 7.104998  0.296306380 0.5863954829
> de$comparison
[1] "-1*cluster_1 1*cluster_10"

I've also seen your answer in my previous post and it responds to my first question here. Thank you very much for your help in both posts.

Now I'm wondering, if there's only one p-value for each pairwise comparison, why is it necessary to adjust p-values for multiple testing in the topTags function?

sorry for the bother and thanks again for the answer!

ADD REPLY
2
Entering edit mode

if there's only one p-value for each pairwise comparison, why is it necessary to adjust p-values for multiple testing?

There are around 10,000 p-values, one for each gene.

ADD REPLY
0
Entering edit mode

okay now I get it, thank you!

ADD REPLY

Login before adding your answer.

Traffic: 636 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6