Hello,
I've sequenced microRNA from 32 samples (16 treatment vs 16 control), and I've got a count matrix. While using DESeq2 to looking for DE, I found the padj for all microRNAs are as high as > 0.99. Following is my code, any help would be appreciated!
library(DESeq2)
dds <- DESeqDataSetFromMatrix(myframe, coldata, ~ treatment)
dds <- DESeq(dds)
res <- results(dds)
resultsNames(dds)
res.df <- as.data.frame(res)
miRNA baseMean log2FoldChange lfcSE stat pvalue adj
MIMAT0027586 4.805481e+00 3.1333849 0.9929389 3.1556673 0.001601314 0.9523321
MIMAT0003294 7.621532e+02 0.8122833 0.2640110 3.0767030 0.002093038 0.9523321
MIMAT0000510 3.344650e+03 -1.4582643 0.5118796 -2.8488424 0.004387861 0.9972360
MIMAT0027689 1.848324e+01 3.0956418 1.1009402 2.8118165 0.004926259 0.9972360
MIMAT0004980 4.053776e+00 -5.8138759 2.1070047 -2.7593084 0.005792384 0.9972360
MIMAT0005924 4.901990e+00 -3.8081033 1.4067856 -2.7069535 0.006790376 0.9972360
MIMAT0025479 2.370199e+01 2.0295897 0.7704835 2.6341769 0.008434152 0.9972360
MIMAT0003312 3.352260e+00 -4.0317022 1.5723891 -2.5640615 0.010345522 0.9972360
MIMAT0010133 1.468150e+03 -0.9922154 0.3941768 -2.5171838 0.011829708 0.9972360
MIMAT0000261 6.857970e+02 -1.1344064 0.4678614 -2.4246635 0.015322584 0.9972360
MIMAT0004594 3.627429e+01 -1.7125782 0.7255511 -2.3603824 0.018256105 0.9972360
Thanks!
Dear Kevin,
Thanks for your advice! Like you said, the two groups didn't separate well on PCA bi-plot. I was trying to find some potential biomarkers and the microRNAs were extracted from plasma. As plasma is a mix of microRNA from all kinds of tissue, difference between treatment and control group may not as significant as some specific tissue.
As for your advice, I still have some questions,
cnts <- matrix(rnbinom(n=10000, mu=100, size=1/0.5), ncol=10)
). Even for a randomly generated dataset, the padj could be as low as 0.11. Does that mean that my data is more random than a randomly generated dataset?Thanks again! Looking forward to your reply.
Hey, I actually recently analysed a micro RNA-seq dataset from plasma and I found good signal between the diseased and healthy condition. I do think that pre-filtering the raw counts will help; so, try the value of 10, to start.
Regarding the generation of a random negative binomial, you would have to bootstrap that many times in order to find statistically significant findings. Try it briefly on your terminal by running it a few times - you should eventually notice one or more variables passing FDR.
I would be hesitant to make any assumption here. I will leave it open for anyone else to comment regarding this part.
Hi Kevin,
I've tried your suggestion of pre-filtering low-expressed microRNAs, and I chose a cut-off of 10 counts in each group (which means a miRNA will be filtered out if its expression is below 10 counts in both treatment and control group). After filtering, I have 379 microRNAs left, originally it was 1228. The situation seems better, but still no padj below 0.05. The smallest padj is 0.689.
I'm wondering may I just use the unadjusted p-value instead of padj? Maybe FDR is too strict for my data, and I'll do more qPCR to validate the microRNA.
By the way, may I ask how many microRNAs did you get from your plasma sequencing experiment? And how many DE microRNAs did you find? Thanks a lot! :)
I can only reply from experience, in this case, and say that there will probably be some reviewer who focuses on this very part as being the 'crux of the matter', i.e., your work being published or not. It may be easier to relax the FDR threshold to 10% FDR - I have even seen people using 20% FDR.
You could also explore other avenues via PCA - see my own package, PCAtools.
For whatever reason, it may be that the capture from plasma did not work too well. In our case, we used a Qiagen kit and identified >100 differentially expressed miRs. Years ago, I was working in breast cancer cfDNA isolation from plasma, and we noted great differences between different capture kits / sample prep ( https://www.ncbi.nlm.nih.gov/pubmed/24205045 ).
Hi Kevin, Thanks for the package and article, it really helped a lot :) Since I have 16 replicates in each group, and there are many subgroups according to the patients' outcomes or other variants, I finally decided to analyze in subgroups. This time DESeq2 ran more reasonable. I've got microRNAs with padj less than 0.05. I think that the reason why I couldn't get significant padj may be the heterogeneity within each group, since the patients received different kind of cardiac surgery, which may lead to a giant disturb to circulating miRNAs. Really appreciate your help, wish you have a good day!
I see - yes - a large amount of un-controlled heterogeneity can affect the statistical inferences on the data, in which case a stratified analysis can help.