Question

Obtaining Differentially expressed gene lists at a Fold change and FDR cut-off

3

Entering edit mode

candida.vaz ▴ 50

@candidavaz-6923

Last seen 7.0 years ago

Singapore

Hi, My name is Candida Vaz, a post doctoral research fellow at the Bioinformatics Institute (BII) Singapore.

I am using edgeR package for my RNASeq data analysis. The User guide is extremely helpful, and I have been following it well. However, I am not able to find the R command to obtain the list of Differentially expressed genes at a Fold change (log-fold change) and FDR cut-off. The R command summary:

summary(de <- decideTestsDGE(et, p=0.05, adjust="BH"))

only gives the summary/number of DE genes. And the command topTags:

topTags(et, n=10, adjust.method="BH", sort.by="PValue")

gives the information/list of the top "n" number of DE genes.

I am interested in getting the full list of DE genes at the log fold change cut-off of (+-1) and FDR cut-off of <0.05. Would appreciate your help with this query.

Thanks a lot.

Candida Vaz

edger rnaseq • 18k views

ADD COMMENT • link updated 11.3 years ago by James W. MacDonald 68k • written 11.3 years ago by candida.vaz ▴ 50

4

Entering edit mode

Yunshun Chen ▴ 900

@yunshun-chen-5451

Last seen 14 months ago

Australia

There is a formal statistical test to test for DE against a certain fold-change threshold. It is called "TREAT" and It was originally implemented in the limma package for microarray data. Now the TREAT method has been extended to RNA-Seq data and it is implemented in the treatDGE function in the latest release of edgeR. Read the documentation of that function for more detail.

Note that testing for |logFC| > 1 by TREAT is not the same as selecting genes with |logFC| > 1. Genes will need to exceed this threshold by some margin, depending on the data, before being declared statistically significant. It would be better to interpret the threshold as ‘the fold-change below which we are definitely not interested in the gene’ rather than ‘the fold-change above which we are interested in the gene’.

ADD COMMENT • link 11.3 years ago Yunshun Chen ▴ 900

score 6 · Accepted Answer · 2014-10-27

6

Entering edit mode

Aaron Lun ★ 29k

@alun

Last seen 9 hours ago

The city by the bay

To get the list of DE genes with those criteria, you'd need to do a bit more work. Something like the code below would probably suffice.

out <- topTags(et, n=Inf, adjust.method="BH")
keep <- out$table$FDR <= 0.05 & abs(out$table$logFC) >= 1
out[keep,]

However, keep in mind that the selection of genes by log-fold change is ad hoc and may compromise FDR control.

ADD COMMENT • link 11.3 years ago Aaron Lun ★ 29k

1

Entering edit mode

It would be useful if topTags had similar options to topTable in limma, such as p.value and lfc.

ADD REPLY • link 11.3 years ago Dario Strbenac ★ 1.6k

3

Entering edit mode

It is actually a deliberate omission as we discourage people from cutting by p.value and logFC simultaneously. RNA-seq logFC are potentially too unstable for that. We would prefer people to use treatDGE().

Having said that, a fold change cutoff becomes more acceptable if a substantial prior.count has been used.

ADD REPLY • link 11.3 years ago Gordon Smyth 53k

0

Entering edit mode

Dear Gordon Smyth,

Thanks for pointing this out. However, on reading about treatDGE(), I understand that it can be applied to linear models (experiments that have more than one factor) as it is applied on the output from glmFit. Please correct me if I'm understanding it wrongly.

My project involves a single factor. In this case, I suppose I should use the classic edgeR. So, for classic edgeR, how should I obtain the list of DE genes at a particular p-value and logFC cut-off.

Thanks once again for your kind help!

Candida

ADD REPLY • link 11.3 years ago candida.vaz ▴ 50

1

Entering edit mode

The GLM framework in edgeR can easily handle a simple experimental design involving just one factor. Just parameterize your design matrix appropriately, e.g., for an experiment with two replicates in each of two groups "a" and "b":

design <- model.matrix(~factor(c("a", "a", "b", "b")))

For this type of analysis, the GLM methods will give performance similar to classic edgeR. However, the use of GLMs will provide a lot more flexibility, e.g., you can now apply treatDGE. More generally, the fact that the classic methods are restricted to one-factor designs doesn't mean that they're superior to the GLM methods in analyzing those designs.

ADD REPLY • link 11.3 years ago Aaron Lun ★ 29k

0

Entering edit mode

Dear Aaron,

Thank you so much. I had been away for a few weeks.

This is the code I am using: I would be grateful if you could let me know if there is anything wrong with it.

library(edgeR)
x <- read.delim("known-miR-expression-profiling-max-read-counts.txt",row.names="Symbol")
group <- factor(c("a","a","a","b","b","b","c","c","c","d","d","d","e","e","e"))
y <- DGEList(counts=x,group=group)
data.frame(Sample=colnames(y),group)
y <- calcNormFactors(y)
write.table(y$pseudo.counts, file="TMM_edgeR_Normalized.txt", row.names = TRUE, col.names = TRUE, sep = "\t")
design <- model.matrix(~group)
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
tr <- treatDGE(fit, coef=b, lfc=1)
topTags(tr)

I have two specific questions:-

1. How can I export the TMM normalised data? The following command doesn't work.

write.table(y$pseudo.counts, file="TMM_edgeR_Normalized.txt", row.names = TRUE, col.names = TRUE, sep = "\t")

2. I am able to use the treatDGE function, but not sure of how to get the entire list of DE genes at the logFC and p-value cut-off.

Really appreciate your help!

ADD REPLY • link 11.2 years ago candida.vaz ▴ 50

1

Entering edit mode

In the GLM framework, pseudo-counts no longer exist. TMM-normalized data can be obtained as CPM values instead:

norm.vals <- cpm(y)

This will account for any normalization factors that are present in y, assuming that y is a DGEList object.

As for treatDGE, the log-FC threshold is already considered in computing the p-value. This means that you only need to decide on a threshold for the p-value (or, more practically, the FDR) to get DE genes. You don't have to explicitly filter on log-FC. I'd so something like this:

out <- topTags(tr, n=Inf, adjust.method="BH")
keep <- out$table$FDR <= 0.05 
out[keep,]

Be aware, though, that a gene with a log-FC of 1, or just above 1, will probably not be detected by treatDGE. This is because such a log-FC value is not significantly different from the specified threshold (of 1). See Yunshun's answer for more details.

Finally, what's coef=b in your treatDGE call? I don't see b defined anywhere.

ADD REPLY • link 11.2 years ago Aaron Lun ★ 29k

0

Entering edit mode

Dear Aaron,

Thank you so much for your prompt help!

I'm sorry it should be coef=2 instead of "b".

Please correct me if I'm wrong: My reference sample is "a" (3 replicates) in the first 3 columns. I want to compare all the others (samples "b" to "e" 3 replicates each) to this reference sample. So I need to change the coef to 2 then 3,4,5 for comparing b vs a, c vs a, d vs a and e vs a; am I right?

Secondly, for obtaining the TMM normalised data, should I replace

y <- calcNormFactors(y) with norm.vals <- cpm(y) ?

Thirdly, how can I get the output in the form of a tab delimited file using the write.table()?

Thanks again for all your help! Really appreciate your effort!

Regards,

Candida

ADD REPLY • link 11.2 years ago candida.vaz ▴ 50

1

Entering edit mode

The parametrization of your contrasts is correct.
Don't replace anything. You need the call to calcNormFactors to perform TMM normalization. The calculation of norm.vals is only suggested to provide some normalized values for exporting; they are not involved in the rest of the pipeline.
I usually set sep="\t" and quote=FALSE in the call to write.table. It may also be useful to set col.names=NA, to add an empty first field to the header row (in order to accommodate row names in the first column of the file).

ADD REPLY • link 11.2 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thanks a lot Aaron,

Just a last bit of clarification:-

1. To get the TMM normalised values in a file. Are the following lines ok?

y <- calcNormFactors(y)
norm.vals <- cpm(y)
write.table(norm.vals, file="TMM_edgeR_Normalized.txt", row.names = TRUE, col.names = NA, quote=FALSE, sep = "\t")---This works!

2. Similarly, to get the DE gene list using treatDGE function, how should the write.table command be?

tr <- treatDGE(fit, coef=2, lfc=1)
out <- topTags(tr, n=Inf, adjust.method="BH")
keep <- out$table$FDR <= 0.05
write.table(keep, file="DE-mirs-1-2.txt", row.names = TRUE, col.names = NA, quote=FALSE, sep = "\t")----This isn't working

Thanks a lot for your help.

Regards,

Candida

ADD REPLY • link 11.2 years ago candida.vaz ▴ 50

1

Entering edit mode

Saving of TMM-normalized values looks okay. To get the DE gene list, you need:

write.table(out$table[keep,], file="DE-mirs-1-2.txt", row.names = TRUE, col.names = NA, quote=FALSE, sep = "\t")

ADD REPLY • link 11.2 years ago Aaron Lun ★ 29k

0

Entering edit mode

Thank you very much Aaron. I guess all my doubts are cleared now!

Just one last check:-

So once I run the first contrast (2 vs 1) and get the DE list as follows:

fit <- glmFit(y, design)
tr <- treatDGE(fit, coef=2, lfc=1)
out <- topTags(tr, n=Inf, adjust.method="BH")
keep <- out$table$FDR <= 0.05
write.table(out$table[keep,], file="DE-mirs-1-2.txt", row.names = TRUE, col.names = NA, quote=FALSE, sep = "\t")

For the next contrasts (3 vs 1 and others vs 1) I just need to alter the commands as follows right?
tr <- treatDGE(fit, coef=3, lfc=1)
out <- topTags(tr, n=Inf, adjust.method="BH")
keep <- out$table$FDR <= 0.05
write.table(out$table[keep,], file="DE-mirs-1-3.txt", row.names = TRUE, col.names = NA, quote=FALSE, sep = "\t")

Really appreciate your kind help. Thanks once again!

Regards,

Candida

ADD REPLY • link 11.1 years ago candida.vaz ▴ 50