High padj in DESeq2
1
0
Entering edit mode
saulwang0102 ▴ 10
@saulwang0102-23530
Last seen 4.5 years ago

Hello,

I've sequenced microRNA from 32 samples (16 treatment vs 16 control), and I've got a count matrix. While using DESeq2 to looking for DE, I found the padj for all microRNAs are as high as > 0.99. Following is my code, any help would be appreciated!

library(DESeq2)
dds <- DESeqDataSetFromMatrix(myframe, coldata, ~ treatment)
dds <- DESeq(dds)
res <- results(dds)
resultsNames(dds)
res.df <- as.data.frame(res)

miRNA baseMean log2FoldChange lfcSE stat pvalue adj
MIMAT0027586    4.805481e+00    3.1333849   0.9929389   3.1556673   0.001601314 0.9523321
MIMAT0003294    7.621532e+02    0.8122833   0.2640110   3.0767030   0.002093038 0.9523321
MIMAT0000510    3.344650e+03    -1.4582643  0.5118796   -2.8488424  0.004387861 0.9972360
MIMAT0027689    1.848324e+01    3.0956418   1.1009402   2.8118165   0.004926259 0.9972360
MIMAT0004980    4.053776e+00    -5.8138759  2.1070047   -2.7593084  0.005792384 0.9972360
MIMAT0005924    4.901990e+00    -3.8081033  1.4067856   -2.7069535  0.006790376 0.9972360
MIMAT0025479    2.370199e+01    2.0295897   0.7704835   2.6341769   0.008434152 0.9972360
MIMAT0003312    3.352260e+00    -4.0317022  1.5723891   -2.5640615  0.010345522 0.9972360
MIMAT0010133    1.468150e+03    -0.9922154  0.3941768   -2.5171838  0.011829708 0.9972360
MIMAT0000261    6.857970e+02    -1.1344064  0.4678614   -2.4246635  0.015322584 0.9972360
MIMAT0004594    3.627429e+01    -1.7125782  0.7255511   -2.3603824  0.018256105 0.9972360

Thanks!

deseq2 RNA-seq microRNA • 2.2k views
ADD COMMENT
0
Entering edit mode
Kevin Blighe ★ 4.0k
@kevin
Last seen 3 days ago
Republic of Ireland

I think that it is literally as stated, i.e., no micro RNA has passed the p-value adjustment. This is not an issue with the DESeq2 package; rather, it says more about your experiment and its setup. There can be various reasons for this.

You should go back to review your experimental plan. Informatically, you can do things such as:

  • pre-filtering the raw counts for low-expressed micro RNAs (many of those listed in your output have very low base means)
  • checking the range of your raw counts per sample and across the dataset
  • checking the dispersion vs mean plot after normalsiation
  • checking a box-and-whiskers and PCA bi-plot of the transformed, normalised expression levels

What I am guessing is that your treatment groups will not appear as well separated on a PCA bi-plot for PC1 versus PC2 as you would hope, which alludes to a genuine lack of difference in the underlying biology. However, you have not informed us about any extraneous effects, e.g., batch effects, that may exist in your data that could also be biasing your result. The main issue that can be observed is that you have left low-expressed micro RNAs in your data.

By the way, I think that it's not great practice running this command in this way:

results(dds)

You should become accustomed to always specifying a contrast / name in order to ensure that you are generating the results table for the contrast / comparison that you want.

You should also read up on the different log-fold change shrinkage methods that are now recommended. Check the extensive vignette.

Kevin

ADD COMMENT
0
Entering edit mode

Dear Kevin,

Thanks for your advice! Like you said, the two groups didn't separate well on PCA bi-plot. I was trying to find some potential biomarkers and the microRNAs were extracted from plasma. As plasma is a mix of microRNA from all kinds of tissue, difference between treatment and control group may not as significant as some specific tissue.

As for your advice, I still have some questions,

  • You mentioned that I should pre-filtering the raw counts for low-expressed micro RNAs, is there any standard of low-expressed microRNAs? I've checked my count matrix, the highest base mean is nearly 450000. Should I filter out the microRNAs with base mean below 10 or other standard?
  • I've simulated a NB distribution dataset (cnts <- matrix(rnbinom(n=10000, mu=100, size=1/0.5), ncol=10)). Even for a randomly generated dataset, the padj could be as low as 0.11. Does that mean that my data is more random than a randomly generated dataset?

Thanks again! Looking forward to your reply.

ADD REPLY
0
Entering edit mode

Hey, I actually recently analysed a micro RNA-seq dataset from plasma and I found good signal between the diseased and healthy condition. I do think that pre-filtering the raw counts will help; so, try the value of 10, to start.

Regarding the generation of a random negative binomial, you would have to bootstrap that many times in order to find statistically significant findings. Try it briefly on your terminal by running it a few times - you should eventually notice one or more variables passing FDR.

Does that mean that my data is more random than a randomly generated dataset?

I would be hesitant to make any assumption here. I will leave it open for anyone else to comment regarding this part.

ADD REPLY
0
Entering edit mode

Hi Kevin,

I've tried your suggestion of pre-filtering low-expressed microRNAs, and I chose a cut-off of 10 counts in each group (which means a miRNA will be filtered out if its expression is below 10 counts in both treatment and control group). After filtering, I have 379 microRNAs left, originally it was 1228. The situation seems better, but still no padj below 0.05. The smallest padj is 0.689.

I'm wondering may I just use the unadjusted p-value instead of padj? Maybe FDR is too strict for my data, and I'll do more qPCR to validate the microRNA.

By the way, may I ask how many microRNAs did you get from your plasma sequencing experiment? And how many DE microRNAs did you find? Thanks a lot! :)

ADD REPLY
0
Entering edit mode

I'm wondering may I just use the unadjusted p-value instead of padj? Maybe FDR is too strict for my data, and I'll do more qPCR to validate the microRNA.

I can only reply from experience, in this case, and say that there will probably be some reviewer who focuses on this very part as being the 'crux of the matter', i.e., your work being published or not. It may be easier to relax the FDR threshold to 10% FDR - I have even seen people using 20% FDR.

You could also explore other avenues via PCA - see my own package, PCAtools.

For whatever reason, it may be that the capture from plasma did not work too well. In our case, we used a Qiagen kit and identified >100 differentially expressed miRs. Years ago, I was working in breast cancer cfDNA isolation from plasma, and we noted great differences between different capture kits / sample prep ( https://www.ncbi.nlm.nih.gov/pubmed/24205045 ).

ADD REPLY
1
Entering edit mode

Hi Kevin, Thanks for the package and article, it really helped a lot :) Since I have 16 replicates in each group, and there are many subgroups according to the patients' outcomes or other variants, I finally decided to analyze in subgroups. This time DESeq2 ran more reasonable. I've got microRNAs with padj less than 0.05. I think that the reason why I couldn't get significant padj may be the heterogeneity within each group, since the patients received different kind of cardiac surgery, which may lead to a giant disturb to circulating miRNAs. Really appreciate your help, wish you have a good day!

ADD REPLY
0
Entering edit mode

I see - yes - a large amount of un-controlled heterogeneity can affect the statistical inferences on the data, in which case a stratified analysis can help.

ADD REPLY

Login before adding your answer.

Traffic: 589 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6