I have metagenomic data from soil samples, generated with a sequence capture method: probes were designed for the genes we wanted to capture from the microorganisms in the samples. The reads were assembled and the contigs were functionally annotated with KEGG, so I have three count tables across the samples: one for contigs, one for KEGG Orthologies (KOs), and one for pathways.

I decided to explore how the samples cluster using PCA plots. Since the count data consist mostly of zeros, I looked for a transformation method and tried rlog and vst from DESeq2. These could not be applied to the contig matrix, because every contig has a zero in at least one sample, but this doesn't matter much: the PCA plots for KOs, and especially for pathways, separate the two soil sample groups fairly well.
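As far as I understand it, the contig table fails at the size-factor step, which uses a median-of-ratios estimate built on per-feature geometric means. The sketch below is a simplified pure-Python illustration of that idea, not DESeq2's actual code. The function name and the toy tables are made up for the example.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """Hypothetical helper. counts: list of feature rows, each a list of per-sample counts."""
    # Keep only features with a positive count in every sample: the log
    # geometric mean is undefined as soon as one count in the row is zero.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("every feature has at least one zero count")

    n_samples = len(counts[0])
    # Mean of the log counts per feature = log of the geometric mean.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]

    size_factors = []
    for j in range(n_samples):
        # Log ratio of each usable feature to its geometric mean, in sample j.
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        size_factors.append(math.exp(median(log_ratios)))
    return size_factors


# Toy "KO-like" table (made-up numbers): aggregation leaves rows without zeros.
ko_counts = [[10, 20], [30, 60], [0, 5]]
ko_factors = median_of_ratios_size_factors(ko_counts)

# Toy "contig-like" table (made-up numbers): every row has a zero somewhere.
contig_counts = [[0, 4], [7, 0], [0, 0]]
try:
    median_of_ratios_size_factors(contig_counts)
    contig_ok = True
except ValueError:
    contig_ok = False
```

In this toy case the aggregated table still has zero-free rows to estimate from, while the contig-like table has none, which would explain why only the contig matrix fails. DESeq2's `estimateSizeFactors` also accepts `type = "poscounts"`, which is meant for tables where every feature contains a zero. I have not tried it on my data.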

My problem is that I find it hard to judge whether these data (contig counts aggregated into KOs and pathways) are appropriate for the rlog and vst transformations. Not being very familiar with statistics, I thought it would be enough for my data to follow a negative binomial distribution, but after searching more on forums I found out that this is not the case.

written 16 months ago by Earendil

