Question

Setting up Experiment Design in baySeq

adityabandla ▴ 10

@adityabandla-11584

Last seen 5.4 years ago

I have a metagenomics dataset of gene counts, with genes as rows and samples as columns. I am trying to identify genes that are differentially abundant across the levels of my categorical variable of interest. I have already performed this analysis using DESeq2; however, I would like to compare my results with baySeq and, in addition, get a log odds for each gene for every pairwise contrast.

My experiment design is as follows: I have one grouping variable (Site) and one categorical variable (with three levels).

I have set up five models for this design. However, is there a way to factor in the grouping variable when defining my models?

bayseq • 1.1k views

updated 7.6 years ago by Thomas J Hardcastle ▴ 180 • written 7.6 years ago by adityabandla ▴ 10

score 1 · Answer 1 · 2016-10-05

No, there is no explicit way to include a grouping variable in a standard baySeq analysis, because the philosophy underlying the baySeq models does not really allow for it - it's not clear to me that there is any reason to expect a (log?-)linear effect on gene expression from a grouping variable. Including an interaction effect removes that objection, but at that point you are effectively constructing every possible model (see the 'allModels' function in baySeq and the consensus = TRUE option of the getPriors function).

I think two approaches make sense here, plus a third that will very rarely be the right thing to do. First, you can analyse the data for each site separately and combine the posterior likelihoods. This finds genes that behave similarly across sites: for example, if a gene shows a high probability of increasing expression with the level of the categorical variable in site A and a high probability of the same in site B, then the product of those probabilities gives a high probability of increasing expression in both sites - though the amplitude of the increase may differ considerably between sites. This is the approach I would generally recommend.
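To make the "product of those probabilities" step concrete, here is a minimal sketch. It is written in Python rather than R and does not call baySeq; the gene names, site names and posterior values are made up for illustration. It assumes you have already run a separate analysis per site, extracted the posterior probability of the same model for each gene on the probability scale (not the log scale), and are willing to treat the sites as independent.

```python
import math


def combine_site_posteriors(per_site_probs):
    """Multiply per-site posterior probabilities for one gene and one model.

    The product is computed as a sum of logs, which is more stable when
    many small probabilities are combined.
    """
    log_total = 0.0
    for p in per_site_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
        if p == 0.0:
            return 0.0  # a single zero makes the whole product zero
        log_total += math.log(p)
    return math.exp(log_total)


# Hypothetical posterior probabilities of "expression increases with level",
# one value per site, taken from separate per-site analyses.
site_posteriors = {
    "geneA": {"siteA": 0.95, "siteB": 0.90},  # consistent across sites
    "geneB": {"siteA": 0.95, "siteB": 0.10},  # supported in site A only
}

combined = {
    gene: combine_site_posteriors(list(by_site.values()))
    for gene, by_site in site_posteriors.items()
}

for gene, prob in combined.items():
    print(f"{gene}: combined posterior = {prob:.3f}")
```

A gene only keeps a high combined probability if every site supports the same pattern, which is the behaviour described above; the size of the effect in each site plays no part in the product.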

Alternatively, you can construct all possible models for the site/variable interaction and run the analysis using consensus priors. This will probably work if you have three or fewer sites; with more than that, you will have to find some way to reduce the total number of models. This analysis discriminates between cases where a gene's expression goes up more or less identically in site A and site B and cases where it goes up in both sites, but at different rates.
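To give a feel for how quickly the model space grows, the sketch below counts the models under one reading of "all possible models": every way of partitioning the site-by-level groups into sets that share the same expression. That reading is an assumption on my part, not a statement of what 'allModels' does internally, and the level names are hypothetical. The code is plain Python and does not use baySeq.

```python
def set_partitions(items):
    """Yield every partition of `items` into non-empty, unordered blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ...or give it a block of its own.
        yield [[first]] + partition


def bell_number(n):
    """Count the partitions of n items (Bell number) using the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


levels = ["low", "mid", "high"]  # hypothetical names for the three levels

# One site, three levels: enumerate the models explicitly.
single_site_models = list(set_partitions(levels))
print(len(single_site_models), "models for one site")

# Several sites: each site-by-level combination is its own group.
model_counts = {}
for n_sites in range(1, 5):
    n_groups = n_sites * len(levels)
    model_counts[n_sites] = bell_number(n_groups)
    print(f"{n_sites} site(s), {n_groups} groups: {model_counts[n_sites]} models")
```

Under this reading, a single site with three levels gives 5 models, two sites give 203, three sites give 21147 and four sites give 4213597, which is consistent with the advice that some filtering of models is needed beyond three sites.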

The last option is to create a new 'densityFunction' object (see the vignette at http://bioconductor.org/packages/release/bioc/vignettes/baySeq/inst/doc/baySeq_generic.pdf) that incorporates grouping variables. For the reasons given above, I don't think this is the right route for this particular data set, but there may occasionally be times when it is the right approach.

Best wishes,

Tom H