edgeR_robust: where are moderately expressed genes?
Entering edit mode
Last seen 7.0 years ago

Dear all,

I'm running this edgeR-robust code to analyse my miRNA-seq DE sequences. But I receive only highly/lowly expressed miRNAs with logFC values higher than 4/lower than -4. But nothing in between... Should I change  some parametres to get the whole logFC spectrum???

Thank you, Marek D. Koter

My code:

rm(list = ls())
y <- read.delim("3DPI_All_bez_K2_I2.txt", row.names="seq", header = T)
group <- factor(c(1,1,1,2,2,2))
keep<-rowSums(cpm(y)>5) >=1
y<-y[keep, ]
y <- DGEList(counts=y,group=group)
design <- model.matrix(~group)
y <- calcNormFactors(y)
y <- estimateGLMRobustDisp(y, design)
fit <- glmFit(y,design)
lrt <- glmLRT(fit,coef=2)
topTags(lrt, n = 50)

Fragment of my results:

  logFC logCPM LR PValue FDR
CTACCCCTGAACTTATTTTAT -5,28003 2,902426 4,407249 0,035787 0,362601
TTTCGTTACTCTGAGGCACA -4,83913 2,484342 3,876514 0,048966 0,362601
AAAGAGACATAACTTTGGGCT 4,173604 1,771509 3,934848 0,047295 0,362601
AACAGGACACTCTAAATCGGG 4,173604 1,771509 3,934848 0,047295 0,362601
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250   
[3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                  
[5] LC_TIME=Polish_Poland.1250    
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
other attached packages:
[1] edgeR_3.10.5  limma_3.24.15
loaded via a namespace (and not attached):
[1] tools_3.2.2





edger rnaseq microrna • 1.2k views
Entering edit mode

You're using an old version of Bioconductor. I suggest you upgrade to BioC 3.2 and edgeR 3.12.0 and use that for your analyses, to make sure this behaviour isn't present in the latest version.

Entering edit mode
Aaron Lun ★ 27k
Last seen 4 hours ago
The city by the bay

Obviously, if you take the top set of DE genes with topTags, you'll be looking at the genes that have the strongest differences between conditions. So it's no surprise that you get lots of genes with extreme log-fold changes. If you want to get all the genes (including those with smaller log-FCs), you should set n=Inf when running topTags. This will report all genes in the data set, ordered by p-value.

As an aside, it seems that none of your top genes are significant at a FDR of 5%, even though they have strong log-fold changes. I'd suggest having a closer look at the counts and seeing if your large fold changes are being driven by an outlier sample. This would also inflate the dispersion and reduce the significance of any DE. If there is an outlier, you could try removing it to see whether it improves DE detection. Of course, the other possibility is that your data is just inherently variable, in which case there's not much that can be done.

Entering edit mode

You forgot to mention that there is something you can do with variable data! The OP could try using voomWithQualityWeights!

Entering edit mode

While that's true, I was thinking more of a situation where you have high biological variability across all samples, rather than variability induced by a number of poor-quality samples in each condition. In such cases, I don't think sample-based weighting would provide any benefit, as each sample would be of the same (relative) quality such that the weights would all be the same. Besides, you'd have to switch from edgeR to voom, which may or may not be a problem for miRNA data depending on the size of the counts, the mean-variance relationship, etc. I've never worked with that stuff, so I wouldn't know the details - I suspect that voom would be okay for miRNA counts, but the proof is in the pudding.

Entering edit mode

Thanks Steve :-)

Entering edit mode

Many thanks Aaron :-)


Login before adding your answer.

Traffic: 504 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6