Question: edgeR_robust: where are moderately expressed genes?
0
3.8 years ago by
Poland
marek_koter0 wrote:

Dear all,

I'm running this edgeR-robust code to analyse my miRNA-seq DE sequences. But I receive only highly/lowly expressed miRNAs with logFC values higher than 4/lower than -4. But nothing in between... Should I change  some parametres to get the whole logFC spectrum???

Thank you, Marek D. Koter

My code:

rm(list = ls())
library(edgeR)
group <- factor(c(1,1,1,2,2,2))
keep<-rowSums(cpm(y)>5) >=1
y<-y[keep, ]
y <- DGEList(counts=y,group=group)
design <- model.matrix(~group)
y <- calcNormFactors(y)
y <- estimateGLMRobustDisp(y, design)
fit <- glmFit(y,design)
lrt <- glmLRT(fit,coef=2)
topTags(lrt, n = 50)

Fragment of my results:

 logFC logCPM LR PValue FDR CTACCCCTGAACTTATTTTAT -5,28003 2,902426 4,407249 0,035787 0,362601 TTTCGTTACTCTGAGGCACA -4,83913 2,484342 3,876514 0,048966 0,362601 AAAGAGACATAACTTTGGGCT 4,173604 1,771509 3,934848 0,047295 0,362601 AACAGGACACTCTAAATCGGG 4,173604 1,771509 3,934848 0,047295 0,362601
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250
[3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C
[5] LC_TIME=Polish_Poland.1250
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] edgeR_3.10.5  limma_3.24.15
loaded via a namespace (and not attached):
[1] tools_3.2.2

rnaseq edger microrna • 855 views
modified 3.8 years ago by Aaron Lun24k • written 3.8 years ago by marek_koter0

You're using an old version of Bioconductor. I suggest you upgrade to BioC 3.2 and edgeR 3.12.0 and use that for your analyses, to make sure this behaviour isn't present in the latest version.

Answer: edgeR_robust: where are moderately expressed genes?
2
3.8 years ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

Obviously, if you take the top set of DE genes with topTags, you'll be looking at the genes that have the strongest differences between conditions. So it's no surprise that you get lots of genes with extreme log-fold changes. If you want to get all the genes (including those with smaller log-FCs), you should set n=Inf when running topTags. This will report all genes in the data set, ordered by p-value.

As an aside, it seems that none of your top genes are significant at a FDR of 5%, even though they have strong log-fold changes. I'd suggest having a closer look at the counts and seeing if your large fold changes are being driven by an outlier sample. This would also inflate the dispersion and reduce the significance of any DE. If there is an outlier, you could try removing it to see whether it improves DE detection. Of course, the other possibility is that your data is just inherently variable, in which case there's not much that can be done.

You forgot to mention that there is something you can do with variable data! The OP could try using voomWithQualityWeights!

While that's true, I was thinking more of a situation where you have high biological variability across all samples, rather than variability induced by a number of poor-quality samples in each condition. In such cases, I don't think sample-based weighting would provide any benefit, as each sample would be of the same (relative) quality such that the weights would all be the same. Besides, you'd have to switch from edgeR to voom, which may or may not be a problem for miRNA data depending on the size of the counts, the mean-variance relationship, etc. I've never worked with that stuff, so I wouldn't know the details - I suspect that voom would be okay for miRNA counts, but the proof is in the pudding.

Thanks Steve :-)