I'm dealing with data sets which come from patient data and I need some advice for the formula design to do a gene expression analysis. I don't want to compare the data sets between each other!
The data sets consist of two groups, sample which have a disease and samples which don't have it. It looks like this:
What I would do know is the following design
dds <- DESeqDataSetFromMatrix(countData = Jcounts, colData = coldata, design = ~ Status)
But I'm quite sure this is not the right way to go, because I'm dealing with human samples and they do not only differ by Status. They have other variances. A PCA confirms this as well. There is not real clustering between healthy and sick. The samples are spread all other the coordinate system and sick and healther cluster within each other. What I do see is a separation between samples on the x-axis. There is a gap between the samples and I have a cluster of healthy/sick samples, then the gap and again healthy/sick cluster. I assume it might be a batch effect. So here are my question.
1. Is it worth to account for unknown batches? Like it is described here: http://www.bioconductor.org/help/workflows/rnaseqGene/#batch
I'm not a big fan trying to see something which might not be there which is why I have my doubts about such procedures.
2. What kind of design can I do? And can I somehow account for samples differences? A design like Sample + Status doesn't work. Or should I do likelihood ratio test with DESeq2?
I have additional information for some of the data sets, like age of the patients, bmi index, disease status but I'm not sure how to work these into a design formula. For example how would I build I factor for Age (where would be the group separation age 25 or 40 or 60...)
One data set is from amplified samples and when I run DESeq2 with the design which I mentioned above, I'll get 1 differentially expressed gene at a FDR of 0.1. What I see is that a lot of genes get filtered due to independent filtering:
out of 52719 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 1, 0.0019% LFC < 0 (down) : 0, 0% outliers  : 0, 0% low counts  : 49725, 94% (mean count < 1012.9)
which seems quite high. If I turn independentFiltering off I get 0 genes. What I'd like to know is why the threshold is so high.