The voom weights are intended to capture the magnitude of the Poisson noise, or any other heteroskedasticity in the data that is related to sequencing depth per sample, not to the number of samples, so increasing the number of samples while holding the sequencing depth per sample constant should not affect the weights very much. Regardless, as the sample size increases, I would expect voom to converge with the other methods in its estimates of p-values and log fold changes. The voom paper shows that voom and limma-trend are nearly equivalent when sequencing depth is consistent across samples, so voom is primarily needed to account for variation in sequencing depth, which is nearly inevitable in real data. You don't mention whether the methods produced similar p-values or fold changes.
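If you want to check this on your own data, here's a minimal sketch comparing voom against limma-trend, assuming a count matrix 'counts' and a model matrix 'design' (both placeholders for your own objects) and that the coefficient of interest is column 2 of the design:

    library(limma)
    library(edgeR)

    ## 'counts' and 'design' are placeholders for your own data
    dge <- calcNormFactors(DGEList(counts = counts))

    ## limma-voom: observation-level weights from the mean-variance trend
    v <- voom(dge, design)
    fit_voom <- eBayes(lmFit(v, design))

    ## limma-trend: logCPM with a prior count, trend fitted in eBayes
    logcpm <- cpm(dge, log = TRUE, prior.count = 3)
    fit_trend <- eBayes(lmFit(logcpm, design), trend = TRUE)

    ## With consistent sequencing depth, the p-values should agree closely
    plot(-log10(fit_voom$p.value[, 2]), -log10(fit_trend$p.value[, 2]),
         xlab = "voom -log10(p)", ylab = "limma-trend -log10(p)")
    abline(0, 1, col = "red")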
As for edgeR_robust, I have less experience with this method. However, a large sample size doesn't necessarily eliminate outliers. If I recall correctly, the goal of edgeR_robust is to identify and down-weight individual observations that do not follow the distributional assumptions of edgeR. If some mechanism is generating such outliers in the data, increasing the sample size will not make it disappear, and a method that successfully detects and compensates for these outliers will always outperform the same method without outlier detection, no matter how many samples are used. The edgeR_robust paper claims that in the absence of outliers, edgeR_robust performs essentially the same as unweighted edgeR. Again, you don't mention how your comparisons looked at the level of p-values and fold changes, so I can't comment on any differences you may have seen there.
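In case it's useful, here's a sketch of how I'd run that comparison, again with 'counts' and 'design' as placeholders, using edgeR's estimateGLMRobustDisp() for the robust fit and assuming coefficient 2 is the contrast of interest:

    library(edgeR)

    dge <- calcNormFactors(DGEList(counts = counts))

    ## Standard edgeR: dispersion estimation followed by an LRT
    dge_std <- estimateDisp(dge, design)
    lrt_std <- glmLRT(glmFit(dge_std, design), coef = 2)

    ## edgeR_robust: robust dispersion estimation with observation weights
    dge_rob <- estimateGLMRobustDisp(dge, design)
    lrt_rob <- glmLRT(glmFit(dge_rob, design), coef = 2)

    ## In the absence of outliers these should be nearly identical
    plot(-log10(lrt_std$table$PValue), -log10(lrt_rob$table$PValue),
         xlab = "edgeR -log10(p)", ylab = "edgeR_robust -log10(p)")
    abline(0, 1, col = "red")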
On a related note, my recent experience on real data has been that limma-voom and edgeR's quasi-likelihood method perform almost identically for most genes. When I plot their p-values against each other on a log-log plot, they cluster tightly around the identity line. The LRT (and DESeq2's Wald test), in contrast, is way too liberal with most genes: https://mneme.homenet.org/~ryan/resume/examples/Salomon/globin/pval-comparisons.pdf
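A plot like that can be produced with something along these lines (a sketch, not exactly how that PDF was made; 'counts' and 'design' are placeholders and coefficient 2 is assumed to be the contrast):

    library(limma)
    library(edgeR)

    dge <- calcNormFactors(DGEList(counts = counts))

    ## limma-voom p-values
    fit_v <- eBayes(lmFit(voom(dge, design), design))
    p_voom <- fit_v$p.value[, 2]

    ## edgeR quasi-likelihood p-values
    dge <- estimateDisp(dge, design)
    p_ql <- glmQLFTest(glmQLFit(dge, design), coef = 2)$table$PValue

    ## Log-log comparison; abline(0, 1) is the identity line on log axes
    plot(p_voom, p_ql, log = "xy",
         xlab = "limma-voom p-value", ylab = "edgeR QL p-value")
    abline(0, 1, col = "red")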
Thanks for replying. I mentioned that I looked at p-values. The fold changes for voom and edgeR_robust stand apart just as much as the p-values do.
Could you edit your question to elaborate on how limma-voom and edgeR_robust differ in their p-values and logFC? You should also explain how you are simulating the data. These methods were developed to account for specific features of real data, and their agreement with other methods depends strongly on whether those features are present in your simulation.
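For example, here's a minimal negative-binomial simulation sketch; every parameter value (gene count, dispersion, fold change, fraction DE) is illustrative, and the library sizes are equal, which is exactly the regime where voom has little to correct for:

    ## Minimal NB simulation; all parameter values are illustrative
    set.seed(1)
    ngenes <- 5000
    nsamp <- 10
    group <- factor(rep(c("A", "B"), each = nsamp / 2))
    design <- model.matrix(~ group)

    mu <- rexp(ngenes, rate = 1 / 200)      # baseline means
    disp <- 0.1                             # common NB dispersion
    is_de <- seq_len(ngenes) <= 0.1 * ngenes
    fc <- ifelse(is_de, 2, 1)               # 2-fold change for 10% of genes

    counts <- sapply(seq_len(nsamp), function(j) {
      m <- if (group[j] == "B") mu * fc else mu
      rnbinom(ngenes, mu = m, size = 1 / disp)
    })

To stress voom, scale the per-sample means by unequal library sizes; to stress edgeR_robust, corrupt a few random counts with large outliers.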
I compute the average (across features) distance between the p-values, or between the ratios, from each pair of methods.
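Concretely, something like this (p_voom/p_ql and the logFC vectors stand in for whichever pair of methods is being compared):

    ## Mean absolute distance across features; the vectors are placeholders
    mean(abs(p_voom - p_ql))          # on the p-value scale
    mean(abs(lfc_voom - lfc_ql))      # on the logFC scale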
Presumably your simulation had some features differentially expressed and others not? If so, rather than checking which methods agree most closely with each other, you can determine which methods agree most closely with the truth.
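For instance, with BH-adjusted p-values and the true DE status from the simulation (a sketch; 'is_de' is the logical truth vector and 'p' a method's raw p-values):

    ## FDR and TPR at a nominal 5% BH cutoff; 'is_de' and 'p' are placeholders
    fdr_tpr <- function(p, is_de, alpha = 0.05) {
      called <- p.adjust(p, method = "BH") < alpha
      c(FDR = mean(!is_de[called]),                # false calls / all calls
        TPR = sum(called & is_de) / sum(is_de))    # true DE recovered
    }
    fdr_tpr(p_voom, is_de)
    fdr_tpr(p_ql, is_de)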
Ok, I'll try that.