For full disclosure, I posted this question on the phyloseq GitHub page, but perhaps this forum is more appropriate.
I have a 16S dataset from gut mucosa and want to analyse differential abundance according to a factor. I have 200 samples: 10 cases and 190 controls.
Q1: Is it valid to use DESeq2 to test for differential abundance with such a large imbalance between the numbers of cases and controls? I know that DESeq2 is designed to deal with some imbalance in sample sizes, but I'm unclear whether this applies equally to 16S data as it does to RNA-seq data. There is significant inter-individual variation in 16S data that I'm concerned would prevent
Q2: Assuming the above is not a valid way to proceed (i.e. comparing 190 controls with 10 cases), how should this analysis be performed? Should I subsample from my controls (while trying to match other factors between cases and controls)? The problem with this approach is that comparisons using different subsamples produce different results (probably because of the inherently large inter-sample variability in 16S data). Also, on what basis would you decide the size of the control subsample: 10, 20, 30?
Q3: A further alternative could be to select x controls for comparison with the cases, repeat this selection n times, and build a distribution of n fold changes for each taxon between my cases and controls. Is this statistically valid? How could such an approach be applied with DESeq2?
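To make the idea in Q3 concrete, here is a minimal sketch of the resampling scheme I have in mind. It is written in plain Python with simulated counts for a single taxon, and it does not call DESeq2: the fold change is just a log2 ratio of mean counts with a pseudocount, standing in for whatever estimate DESeq2 would return. The function names, the pseudocount of 0.5, the subsample size of 20 and the 1000 repeats are all arbitrary choices for illustration, and the raw counts are not normalised for library size.

```python
import math
import random
import statistics


def log2_fold_change(case_counts, control_counts, pseudocount=0.5):
    """log2 ratio of mean counts (cases over controls) for one taxon.

    Stand-in for a DESeq2 estimate; the pseudocount avoids log of zero.
    """
    case_mean = statistics.mean(case_counts) + pseudocount
    control_mean = statistics.mean(control_counts) + pseudocount
    return math.log2(case_mean / control_mean)


def resampled_fold_changes(case_counts, control_counts, n_controls,
                           n_resamples, seed=0):
    """Draw n_controls controls without replacement, n_resamples times,
    and return the log2 fold change obtained from each draw."""
    rng = random.Random(seed)
    return [
        log2_fold_change(case_counts, rng.sample(control_counts, n_controls))
        for _ in range(n_resamples)
    ]


# Simulated counts for a single taxon: 10 cases, 190 controls (made-up data).
sim = random.Random(42)
cases = [sim.randint(20, 200) for _ in range(10)]
controls = [sim.randint(0, 100) for _ in range(190)]

lfcs = resampled_fold_changes(cases, controls, n_controls=20, n_resamples=1000)

# Empirical 2.5% and 97.5% points of the resampled fold changes.
lfcs_sorted = sorted(lfcs)
lower = lfcs_sorted[int(0.025 * len(lfcs_sorted))]
upper = lfcs_sorted[int(0.975 * len(lfcs_sorted)) - 1]
median = statistics.median(lfcs)
print(f"median log2FC = {median:.2f}, interval = [{lower:.2f}, {upper:.2f}]")
```

As far as I can tell, the spread of this distribution only reflects how much the estimate depends on which controls happen to be drawn, since the same 10 cases are reused in every draw. I am not sure it can be read as a confidence interval, which is part of what I am asking.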
I'd be grateful for any insight anyone might have on this issue. I have researched the question but have not found it discussed anywhere.
Many thanks for your thoughts.