Question

How to model a comparison between comparisons in edgeR?

0

Entering edit mode

Nick N ▴ 60

@nick-n-6370

Last seen 8.6 years ago

United Kingdom

I am working with two subtypes of immune cells. For each of them I have two states - naive and activated. I am interested in the differential expression between these states in each cell type. I'll call these comparisons Contrast A and Contrast B. I am also interested to find out what is the difference between these two contrasts, i.e. to "compare the comparisons" (an illustration can be found here: http://bit.ly/1A1o6ZP). How can I model such comparison in edgeR?

My sample metadata looks like this: http://bit.ly/1EBZ9Kv

Modelling contrasts A and B is trivial (e.g. exactTest(d, pair=c("A-n", "A-act")). Can you write down the model for the "comparison of the comparisons"?

edger rnaseq • 4.0k views

ADD COMMENT • link updated 9.1 years ago by Ryan C. Thompson ★ 7.9k • written 9.1 years ago by Nick N ▴ 60

score 2 · Answer 1 · 2015-03-07

2

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 8 months ago

Scripps Research, La Jolla, CA

You're going to need to switch from the exactTest function to edgeR's glm functionality, which I believe is the recommended method in any case. You're going to want to read Section 3.3 of the edgeR User's Guide: "Experiments with all combinations of multiple fac-
tors", which explains how to do all the analyses you want to do.

ADD COMMENT • link 9.1 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Ryan - thanks for your fast reply. I, actually, read that chapter and have previously used GLM models to account for batch effects and paired sample design. But I still have difficulty expressing the type of contrast that I am aiming at here. The problem is that the examples in the guide don't seem to include the type of comparison that I am aiming at. The closest example in the guide is 3.5 ("Comparison both between and within subjects"). I could map my case onto that case ("cell type" maps onto "Disease", "Activation" maps onto "Treatment"). The problem is that my samples are all independent whereas the example has paired patient samples. I am not quite sure how to change the model matrix to reflect that. I am sure this is something simple - but I still can't get my head round it.

ADD REPLY • link 9.1 years ago Nick N ▴ 60

2

Entering edit mode

You really just need a design of ~0+Condition, since your two experimental factors are already combined into a single factor in your sample metadata. Then, your contrast A will be (A.act - A.n) and your contrast B will be (B.act - B.n). So, to construct your contrast of contrasts, just subtract one from the other: (B.act - B.n) - (A.act - A.n). This is demonstrated in Subsection 3.3.1 "Defining each treatment combination as a group".

(I have replaced the dashes with dots in your condition names using make.names so that they will be syntactically valid. Otherwse you will not be able to construct contrasts with them.)

ADD REPLY • link 9.1 years ago Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Ryan - thanks. This was really helpful. However, I have one question. It occurs to me that if the contrast is defined like this the test is based on the absolute differences between the treatments. I'll give an example below:

Say, there is a gene which in naive cells of type A has on average 10 reads. Upon activation it becomes 20. In cell type B the naive cells have on average 200 reads which, upon activation, become 400. If the contrast is defined like you suggest than what happens is that edgeR finds that in one case the absolute difference is just 10 (i.e. 20 minus 10), in the other it's 200. From this edgeR computes a big FC (20) and the gene gets a really low FDR. But, in fact, the gene works in both cell types in exactly the same way - it increases 2 fold as result of the activation. In essence, this contrast mixes the impact of the differences between the cell types with those caused by the activation. If there are big differences between the cell types these can dominate as in the example I gave.

I reckon what I was looking for was genes that work differently in these different cell types. I thought I could define the contrasts like this (B.act/B.n) - (A.act/A.n). I tried it but I got:

Error in qr.default(contrast) : NA/NaN/Inf in foreign function call (arg 1)

I reckon there are probably zero values somewhere which cause a problem. I could add some small pseudocounts but I am not sure this would be methodologically sound. In your opinion, what is the best way to approach the task that I am trying to solve?

ADD REPLY • link 9.1 years ago Nick N ▴ 60

2

Entering edit mode

In edgeR, coefficients are on a log scale, so subtraction is actually computing log ratios. A 2-fold change is always 1 on the log2 scale, regardless of the absolute expression.

ADD REPLY • link 9.1 years ago Ryan C. Thompson ★ 7.9k

2

Entering edit mode

Of course, some care is still required for zeros. In such cases, the log-fold change becomes undefined, so any contrast between log-fold changes will behave strangely. Consider the example below:

> y <- rbind(c(1000, 0, 10, 0),
+     c(1000, 1, 10, 0),
+     c(1000, 0, 10, 1),
+     c(1000, 1, 10, 1))
> design <- model.matrix(~0+factor(c("A.act", "A.n", "B.act", "B.n"))
> fit <- glmFit(y, design, offset=0, dispersion=0.05)
> lrt <- glmLRT(fit, contrast=c(1, -1, -1, 1))
> lrt$table$PValue
[1] 0.999429030 0.887898066 0.002791298 0.012825046

The first and second rows give large p-values as there is no strong evidence for differences in the log-FCs between A and B. However, the p-values for the third and fourth rows are much smaller, as more evidence is available when the appropriate log-FCs are defined. You can see that the size of the p-value is very sensitive to the gain or loss of a single read count. This complicates the interpretation of the results somewhat.

ADD REPLY • link 9.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

Ryan - thanks for your fast reply!

I am still a bit confused. I have posted a spreadsheet containing the example of two genes at http://bit.ly/1xbA7LS.

Gene X has the following values for type A activated: 0, 0.038634506, 0. In naive cells type A it has the following values: 0.032516233, 0, 0.036664022. In activated cells of type B it has the following values: 38.19672174, 40.45134583, 32.8059597. In naive cells of type B: 73.65871935, 78.06736892, 72.95833963.

In short, this gene behaves in both cell types quite similarly - its expression decreases by appr. 2 times upon activation. Its FDR is computed as 0 - this is one of the few genes that have this lowest possible FDR.

And there is gene Y. In activated cells of type A it has the following values: 7.384365199, 6.451962448, 7.640117933. In naive cells of type A its values are: 38.5967687, 43.05284305, 40.44041616. In activated cells of type B the values are: 16.14960138, 13.47144545, 15.79395832. In naive cells of type B its values are: 31.18923199, 37.51382043, 36.63212231.

In short, in cell type A the expression of gene Y decreases appr. 5 times upon activation, in cell type B it decreases only 2 times. Its FDR is computed as 0.0496034551314143 - thus it is considered "borderline" - very close to being "not significantly different".

Intuitively for me gene Y seems to change more than gene X. Yet, for edgeR the opposite is true. By the way, logFC computed are 11.3730230557303 for gene X and -0.210034235403313 for gene Y.

I am a bit puzzled by this result. Can you, perhaps, help to interpret it?

ADD REPLY • link 9.1 years ago Nick N ▴ 60

2

Entering edit mode

Are these CPM values? If so, did you compute it with normalized.lib.sizes=TRUE? The size of the log-FC indicates that these values are not what edgeR is looking at internally.

ADD REPLY • link 9.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

I think so. Here is my code:

############################# #read in DGE ############################# d <- readDGE(targets$Files) ############################# #Filter & Library Size Re-set ############################# keep <- rowSums(cpm(d)>1) >=3 d <- d[keep,] d$samples$lib.size <- colSums(d$counts) ############################# #Normalisation ############################# d <- calcNormFactors(d) ############################# #Recording the normalized counts ############################# all_cpm=cpm(d, normalized.lib.size=TRUE) all_counts <- cbind(rownames(all_cpm), all_cpm) colnames(all_counts)[1] <- "Ensembl.Gene.ID" rownames(all_counts) <- NULL ############################# #Dispersion estimation ############################# y <- estimateGLMCommonDisp(d, design) y <- estimateGLMTrendedDisp(y, design) y <- estimateGLMTagwiseDisp(y, design)

############################# #Model fit & Test ############################# fit <- glmFit(y, design) my.contrasts <- makeContrasts(..) lrt <- glmLRT(fit, contrast=my.contrasts[, "Activation-AvsActivation-B"]) ############################# #Results Exploration & Export ############################# detags <- rownames(topTags(lrt, n=20)) cpm(d)[detags,] summary(de <- decideTestsDGE(lrt, p=0.05, adjust="BH")) tt <- topTags(lrt, n=nrow(d)) ############################# #Get the gene metadata ############################# gene_hits <- cbind(rownames(tt$table), tt$table) colnames(gene_hits)[1] <- "Ensembl.Gene.ID" rownames(gene_hits) <- NULL

# with ens75

mm_ens75_biomart <- read.table(file="biomaRt.csv",header=TRUE,sep=",") gene_hits_biomart <- merge(gene_hits, mm_ens75_biomart, by="Ensembl.Gene.ID") ############################# #Merge the gene metadata and the counts data ############################# gene_hits_biomart_counts <- merge(gene_hits_biomart, all_counts, by="Ensembl.Gene.ID") ############################# #Filter, sort and store the merged table ############################# numCols <- length(colnames(gene_hits_biomart_counts)) gene_hits_biomart_counts_filtered <- gene_hits_biomart_counts[gene_hits_biomart_counts$FDR < 0.05, c(1,2,5:numCols)] merged_data_fdr_sorted <- gene_hits_biomart_counts_filtered[order(gene_hits_biomart_counts_filtered[,3]),] write.csv(merged_data_fdr_sorted, file="Activation-AvsActivation-B.csv", row.names = FALSE)

ADD REPLY • link 9.1 years ago Nick N ▴ 60

2

Entering edit mode

How were design and my.contrasts formulated?

ADD REPLY • link 9.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

Here are the design and contrasts:

> design A.act B.act C.act A.n B.n C.n 1 1 0 0 0 0 0 2 1 0 0 0 0 0 3 1 0 0 0 0 0 4 0 1 0 0 0 0 5 0 1 0 0 0 0 6 0 1 0 0 0 0 7 0 0 1 0 0 0 8 0 0 1 0 0 0 9 0 0 1 0 0 0 10 0 0 0 1 0 0 11 0 0 0 1 0 0 12 0 0 0 1 0 0 13 0 0 0 0 1 0 14 0 0 0 0 1 0 15 0 0 0 0 1 0 16 0 0 0 0 0 1 17 0 0 0 0 0 1 18 0 0 0 0 0 1

my.contrasts <- makeContrasts(Activation-AvsActivation-B = (A.act - A.n) - (B.act - B.n), levels=design)

ADD REPLY • link 9.1 years ago Nick N ▴ 60

2

Entering edit mode

Can you show the counts and tagwise dispersion estimates for the offending rows, along with the normalization factors and library sizes? Also, you should quote Activation-AvsActivation-B in the call to makeContrasts, otherwise an error is raised.

ADD REPLY • link 9.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

Apologies - the actual my.contrast line is:

my.contrasts <- makeContrasts(ActivationR71vsActivationR72 = (AR71 - NR71) - (AR72 - NR72))

which doesn't raise an exception.

AR71 stands for A.act, NR71 for A.n. AR72 is B.act, NR72 is B.n.

I just tried to stick with the original names that I used in the first post in order to avoid confusion.

Are you asking about the raw counts? The normalised are given in my post.

As for library sizes and norm factors - how can I show them? I haven't done this in quite a while so I am not sure how to get these values.

ADD REPLY • link 9.1 years ago Nick N ▴ 60

2

Entering edit mode

Yeah, the raw counts for the relevant genes would be nice. As for the library sizes and normalization factors; just show y$samples$lib.size and y$samples$norm.factors. As for the dispersions, show y$tagwise.dispersion[...], subsetted for the relevant genes.

ADD REPLY • link 9.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

I put the raw counts in a spreadsheet: http://bit.ly/1E4k2KL.

Here are the lib sizes and the norm factors:

y$samples$lib.size [1] 29330462 26357396 22566633 28454909 26533897 23963450 33220755 27540983 26982561 31371013 26908984 27572867 21458072 28155268 26135268 34088199 25877118 19837215 > y$samples$norm.factors [1] 0.9695880 0.9820241 0.9802117 1.0249482 1.0183244 1.0278016 0.9980380 0.9981807 0.9967784 0.9803276 0.9943822 0.9891860 1.0369650 0.9931721 1.0006368 1.0083422 1.0191137 0.9849640

As for the dispersion - how shall I do the subsetting? The gene IDs are ENSMUSG00000094525 and ENSMUSG00000022848.

ADD REPLY • link 9.1 years ago Nick N ▴ 60

2

Entering edit mode

I'm getting sensible results with the data that you've provided. I've stored the library sizes and normalization factors in "norm.txt", and the counts in "counts.txt", and I've run the following:

> grouping <- factor(rep(c("A.act", "B.act", "C.act", "A.n", "B.n", "C.n"), each=3))
> design <- model.matrix(~0+grouping)
> colnames(design) <- levels(grouping)
>
> d <- read.table("counts.txt", header=TRUE, row.names=1)
> norms <- read.table("norm.txt")
> y <- DGEList(d, lib.size=as.numeric(norms[1,]), norm.factors=as.numeric(norms[2,]))
> y$common.dispersion <- 0.05
> fit <- glmFit(y, design, prior.count=3)
>
> my.contrast <- makeContrasts(A.act - A.n - (B.act - B.n), levels=design)
> lrt <- glmLRT(fit, contrast=my.contrast)
> lrt$table
                        logFC   logCPM          LR     PValue
ENSMUSG00000094525  0.8927441 4.500039 0.007924653 0.92906560
ENSMUSG00000022848 -0.9128559 4.711778 5.790652591 0.01611161

See if it behaves like this on your machine (feed your own y into the commands for making fit and lrt). My best guess is that you've somehow specified the wrong contrast in your original code.

ADD REPLY • link 9.1 years ago Aaron Lun ★ 28k

0

Entering edit mode

You are right Aaron - it must have been some sort of code mistake on my part. I couldn't reproduce what I saw before. Thank you very much for your help!

ADD REPLY • link 9.1 years ago Nick N ▴ 60