Question: DESeq2 to clustered metagenomic data
19 months ago by
Earendil0
Earendil0 wrote:

I have metagenomic data from soil samples, generated by a sequence capture method. That is, probes were designed based on the genes that we wanted to capture from the microorganisms in the samples. The reads were assembled and the contigs were functionally annotated with KEGG, so I have a count table of contigs across the samples, a count table of KEGG Orthologies (KOs), and finally one for pathways.

I decided to explore the clustering of the data with PCA plots, but since my count data consist mostly of zeros, I looked for a transformation method and tried rlog and vst from DESeq2. These methods couldn't be applied to the contig matrix, since every contig had at least one zero in one of the samples, but this doesn't matter much because the PCA plots of the KOs and especially the Pathways seem to separate the 2 soil sample groups somewhat nicely.
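Note on the contig matrix: a likely reason for the failure, assuming the default size factor estimation was used, is that the median-of-ratios method divides each count by the geometric mean of its row, and a row containing any zero has a geometric mean of zero. If every row has a zero, no rows are left to estimate from. The Python sketch below only illustrates that idea; it is not DESeq2's code, and the function name `size_factors` is made up for the example.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a features x samples count table.

    Illustrative sketch only. Rows containing any zero are skipped because
    their geometric mean is zero, so the ratio to it is undefined.
    """
    n_samples = len(counts[0])
    # Keep only the rows where every sample has a positive count.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("every row contains at least one zero")
    # Log of the geometric mean of each usable row.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of sample j to the row-wise geometric mean, per usable row.
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors
```

In DESeq2 itself, `estimateSizeFactors` has a `type = "poscounts"` option intended for tables in which every row contains a zero; whether it suits this particular contig table is a separate question.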

My problem, though, is that I find it challenging to figure out whether these data (contig counts grouped by KO and by Pathway) are appropriate for the rlog and vst transformations. Not being very accustomed to statistics, I thought I would be okay if my data followed a negative binomial distribution, but after searching a bit more on forums I found out that this is not the case.

modified 19 months ago by Michael Love18k • written 19 months ago by Earendil0
19 months ago by
Michael Love18k
United States
Michael Love18k wrote:

Can you describe the distribution of the counts across samples in more detail? The transformations are in fact designed for negative binomial-distributed data. Having many zeros is not in itself a problem, but if you have, within a row, mostly zeros combined with a minority of very large counts, that's not consistent with a negative binomial distribution. In that case I would look for some other transformation.

Dear Mike,

It looks like this for the Pathways (here are 20 of the 91 in total):

                                            G1 G2 G3 G4 G5 W1 W2 W3 W4 W5
ABC transporters                             0  0  0  0  0  0  0  4  0  2
Alanine, aspartate and glutamate metabolism  0  0  0  0  0  0  2  2  0  0
Alcoholism                                   0  0  0  0  0  4  0  0  0  0
Aminoacyl-tRNA biosynthesis                  2  0  0  2  0  4  0  0  0  2
Amino sugar and nucleotide sugar metabolism 10 24 18 13 15 45 15  5 19 31
Arginine and proline metabolism              0  0  0  0  0  9  0  0  0  0
Arginine biosynthesis                        0  0  2  0  0  3  0  0  0  9
Autophagy                                    0  0  0  2  4  3  0  0  0  0
Basal transcription factors                  4  0  0  0  0  0  0  0  0  0
Biosynthesis of unsaturated fatty acids      0  0  5  0  0  0  0  0  0  0
Calcium signaling pathway                    0  0  0  5  0  0  0  0  0  0
Carotenoid biosynthesis                      0  0  0  0  0  2  0  0  0  0
Cell cycle                                   0  0  0  0  0  0  0  0  0  7
Cell cycle - Caulobacter                     0  0  0  0  0  0  0  0  0  3
cGMP-PKG signaling pathway                   0  0  0  0  0  0  6  0  0  0
Citrate cycle (TCA cycle)                    0  0  0  0  0  0  4  0  0  0
Cyanoamino acid metabolism                   0  0  8  5  5 21  9  2  4 11
Cysteine and methionine metabolism           0  0  2  6  0  4  2  0  0  5
Dioxin degradation                           0  0  0  0  0  2  0  0  0  0
DNA replication                              0  0  0  0  0  4  0  0  0  3

That looks fine for the DESeq2 transformations. The transformations offered by DESeq2 simply put the data on the log2 scale, while dealing with the problem that the log of small counts has a lot of unwanted sampling variability. You can look through the section of the DESeq2 vignette on transformations and see which looks best in the diagnostic plots:
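To make the point about small counts concrete, here is a rough Python sketch. It assumes purely Poisson sampling noise and a plain log2(count + 1) transform, which is a simplification and not what rlog or vst compute. It sums over the Poisson probabilities to get the standard deviation of the log-transformed count at a given mean; the function names are made up for the example.

```python
import math


def poisson_pmf(k, mu):
    """Poisson probability of k, computed in log space to avoid overflow."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


def sd_log2_plus1(mu):
    """Standard deviation of log2(X + 1) when X ~ Poisson(mu).

    Sums over k up to a point where the remaining tail is negligible.
    """
    upper = int(mu + 12 * math.sqrt(mu) + 30)
    ks = range(upper + 1)
    probs = [poisson_pmf(k, mu) for k in ks]
    vals = [math.log2(k + 1) for k in ks]
    mean = sum(p * v for p, v in zip(probs, vals))
    var = sum(p * (v - mean) ** 2 for p, v in zip(probs, vals))
    return math.sqrt(var)


# Same kind of sampling noise, very different spread on the log2 scale.
sd_small = sd_log2_plus1(2)      # a row with a mean count around 2
sd_large = sd_log2_plus1(1000)   # a row with a mean count around 1000
```

Under these assumptions `sd_small` comes out many times larger than `sd_large`, which is why rows with small counts dominate a PCA on naively logged data unless the transformation shrinks or stabilizes them.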

https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization

Dear Mike,

I should mention that samples G1-G5 and W1-W5 are not treated as biological replicates in my analysis, since the sample treatment varies from sample to sample within these two groups. However, I would guess that this doesn't change anything, since the default is blind = TRUE. I note that the vst transformation seems to separate the two groups better than rlog. Is it safe to assume that the more loosely clustered matrix for the KOs (323 KOs instead of the 91 Pathways in the previous matrix) is still suitable for these transformations?

Thank you!


I'll just say, the transformations are appropriate anywhere a log transformation would be useful.

The only case in which I've seen the transformations not be useful is the combination of 0's AND very high counts within a row that I described above, when this is the distribution for the majority of genes.
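One way to check a table for that pattern is sketched below in Python. The cutoffs (`zero_frac=0.5` for "mostly zeros" and `large=100` for "very high counts") are arbitrary assumptions chosen for illustration, not values from DESeq2 or from this thread, and the function names are hypothetical. The four rows are copied from the Pathways table posted above.

```python
def is_zero_heavy_with_spikes(row, zero_frac=0.5, large=100):
    """True if a row is mostly zeros but also holds a very large count.

    Both thresholds are illustrative assumptions; adjust them to the data.
    """
    zeros = sum(1 for c in row if c == 0)
    return zeros / len(row) > zero_frac and max(row) >= large


def fraction_flagged(table, **thresholds):
    """Return (fraction of flagged rows, list of flagged row names)."""
    flagged = [name for name, row in table.items()
               if is_zero_heavy_with_spikes(row, **thresholds)]
    return len(flagged) / len(table), flagged


# Rows taken from the Pathways table in this thread (samples G1-G5, W1-W5).
pathways = {
    "ABC transporters": [0, 0, 0, 0, 0, 0, 0, 4, 0, 2],
    "Amino sugar and nucleotide sugar metabolism": [10, 24, 18, 13, 15, 45, 15, 5, 19, 31],
    "Cyanoamino acid metabolism": [0, 0, 8, 5, 5, 21, 9, 2, 4, 11],
    "Cell cycle": [0, 0, 0, 0, 0, 0, 0, 0, 0, 7],
}

frac, names = fraction_flagged(pathways)
```

With these cutoffs none of the four rows is flagged: several are mostly zeros, but their largest counts are small. The transformations would only become questionable if the flagged fraction covered most of the table.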