Hi, My name is Candida Vaz, a post doctoral research fellow at the Bioinformatics Institute (BII) Singapore.
I am using edgeR package for my RNASeq data analysis. The User guide is extremely helpful, and I have been following it well. However, I am not able to find the R command to obtain the list of Differentially expressed genes at a Fold change (log-fold change) and FDR cut-off. The R command summary:
summary(de <- decideTestsDGE(et, p=0.05, adjust="BH"))
only gives the summary/number of DE genes. And the command topTags:
topTags(et, n=10, adjust.method="BH", sort.by="PValue")
gives the information/list of the top "n" number of DE genes.
I am interested in getting the full list of DE genes at the log fold change cut-off of (+-1) and FDR cut-off of <0.05. Would appreciate your help with this query.
Thanks a lot.
Candida Vaz
It would be useful if
topTags
had similar options totopTable
in limma, such asp.value
andlfc
.It is actually a deliberate omission as we discourage people from cutting by p.value and logFC simultaneously. RNA-seq logFC are potentially too unstable for that. We would prefer people to use
treatDGE()
.Having said that, a fold change cutoff becomes more acceptable if a substantial prior.count has been used.
Dear Gordon Smyth,
Thanks for pointing this out. However, on reading about treatDGE(), I understand that it can be applied to linear models (experiments that have more than one factor) as it is applied on the output from glmFit. Please correct me if I'm understanding it wrongly.
My project involves a single factor. In this case, I suppose I should use the classic edgeR. So, for classic edgeR, how should I obtain the list of DE genes at a particular p-value and logFC cut-off.
Thanks once again for your kind help!
Candida
The GLM framework in
edgeR
can easily handle a simple experimental design involving just one factor. Just parameterize your design matrix appropriately, e.g., for an experiment with two replicates in each of two groups"a"
and"b"
:For this type of analysis, the GLM methods will give performance similar to classic
edgeR
. However, the use of GLMs will provide a lot more flexibility, e.g., you can now applytreatDGE
. More generally, the fact that the classic methods are restricted to one-factor designs doesn't mean that they're superior to the GLM methods in analyzing those designs.Dear Aaron,
Thank you so much. I had been away for a few weeks.
This is the code I am using: I would be grateful if you could let me know if there is anything wrong with it.
library(edgeR)
x <- read.delim("known-miR-expression-profiling-max-read-counts.txt",row.names="Symbol")
group <- factor(c("a","a","a","b","b","b","c","c","c","d","d","d","e","e","e"))
y <- DGEList(counts=x,group=group)
data.frame(Sample=colnames(y),group)
y <- calcNormFactors(y)
write.table(y$pseudo.counts, file="TMM_edgeR_Normalized.txt", row.names = TRUE, col.names = TRUE, sep = "\t")
design <- model.matrix(~group)
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design)
tr <- treatDGE(fit, coef=b, lfc=1)
topTags(tr)
I have two specific questions:-
1. How can I export the TMM normalised data? The following command doesn't work.
write.table(y$pseudo.counts, file="TMM_edgeR_Normalized.txt", row.names = TRUE, col.names = TRUE, sep = "\t")
2. I am able to use the treatDGE function, but not sure of how to get the entire list of DE genes at the logFC and p-value cut-off.
Really appreciate your help!
In the GLM framework, pseudo-counts no longer exist. TMM-normalized data can be obtained as CPM values instead:
This will account for any normalization factors that are present in
y
, assuming thaty
is aDGEList
object.As for
treatDGE
, the log-FC threshold is already considered in computing the p-value. This means that you only need to decide on a threshold for the p-value (or, more practically, the FDR) to get DE genes. You don't have to explicitly filter on log-FC. I'd so something like this:Be aware, though, that a gene with a log-FC of 1, or just above 1, will probably not be detected by
treatDGE
. This is because such a log-FC value is not significantly different from the specified threshold (of 1). See Yunshun's answer for more details.Finally, what's
coef=b
in yourtreatDGE
call? I don't seeb
defined anywhere.Dear Aaron,
Thank you so much for your prompt help!
I'm sorry it should be coef=2 instead of "b".
Please correct me if I'm wrong: My reference sample is "a" (3 replicates) in the first 3 columns. I want to compare all the others (samples "b" to "e" 3 replicates each) to this reference sample. So I need to change the coef to 2 then 3,4,5 for comparing b vs a, c vs a, d vs a and e vs a; am I right?
Secondly, for obtaining the TMM normalised data, should I replace
y <- calcNormFactors(y) with norm.vals <- cpm(y) ?
Thirdly, how can I get the output in the form of a tab delimited file using the write.table()?
Thanks again for all your help! Really appreciate your effort!
Regards,
Candida
calcNormFactors
to perform TMM normalization. The calculation ofnorm.vals
is only suggested to provide some normalized values for exporting; they are not involved in the rest of the pipeline.sep="\t"
andquote=FALSE
in the call towrite.table
. It may also be useful to setcol.names=NA
, to add an empty first field to the header row (in order to accommodate row names in the first column of the file).Thanks a lot Aaron,
Just a last bit of clarification:-
1. To get the TMM normalised values in a file. Are the following lines ok?
y <- calcNormFactors(y)
norm.vals <- cpm(y)
write.table(norm.vals, file="TMM_edgeR_Normalized.txt", row.names = TRUE, col.names = NA, quote=FALSE, sep = "\t")---This works!
2. Similarly, to get the DE gene list using treatDGE function, how should the write.table command be?
tr <- treatDGE(fit, coef=2, lfc=1)
out <- topTags(tr, n=Inf, adjust.method="BH")
keep <- out$table$FDR <= 0.05
write.table(keep, file="DE-mirs-1-2.txt", row.names = TRUE, col.names = NA, quote=FALSE, sep = "\t")----This isn't working
Thanks a lot for your help.
Regards,
Candida
Saving of TMM-normalized values looks okay. To get the DE gene list, you need:
Thank you very much Aaron. I guess all my doubts are cleared now!
Just one last check:-
So once I run the first contrast (2 vs 1) and get the DE list as follows:
fit <- glmFit(y, design)
tr <- treatDGE(fit, coef=2, lfc=1)
out <- topTags(tr, n=Inf, adjust.method="BH")
keep <- out$table$FDR <= 0.05
write.table(out$table[keep,], file="DE-mirs-1-2.txt", row.names = TRUE, col.names = NA, quote=FALSE, sep = "\t")
For the next contrasts (3 vs 1 and others vs 1) I just need to alter the commands as follows right?
tr <- treatDGE(fit, coef=3, lfc=1)
out <- topTags(tr, n=Inf, adjust.method="BH")
keep <- out$table$FDR <= 0.05
write.table(out$table[keep,], file="DE-mirs-1-3.txt", row.names = TRUE, col.names = NA, quote=FALSE, sep = "\t")
Really appreciate your kind help. Thanks once again!
Regards,
Candida