Question

DESeq2 to clustered metagenomic data

Earendil • 0

@earendil-11962

I have metagenomic data from soil samples, generated by a sequence capture method. That is, probes were designed based on the genes we wanted to capture from the microorganisms in the samples. The reads were assembled and the contigs were functionally annotated with KEGG, so I have three count tables across the samples: one for contigs, one for KEGG Orthologies (KOs), and one for pathways.

I decided to explore the clustering of the data with PCA plots, but since my count data consist mostly of zeros, I looked for a transformation method and tried rlog and vst from DESeq2. These methods could not be applied to the contig matrix, since every contig had at least one zero in one of the samples, but this doesn't matter much because the PCA plots of the KOs, and especially of the pathways, separate the two soil sample groups fairly well.

My problem, though, is that I find it hard to judge whether these data (contig counts grouped into KOs and pathways) are appropriate for the rlog and vst transformations. Not being very familiar with statistics, I thought I would be fine as long as my data followed a negative binomial distribution, but after searching the forums a bit more I found out that this is not the case.

deseq2 metagenomics • 3.0k views

Written 9.1 years ago by Earendil • updated 9.1 years ago by Michael Love

score 1 · Accepted Answer · 2016-12-02

Michael Love 43k

@mikelove

United States

Can you describe the distribution of the counts across samples more? The transformations are in fact designed for negative binomial-distributed data. Many zeros is not in itself a problem, but if you have, within a row, mostly zeros combined with a minority of very large counts, that's not consistent with a negative binomial distribution. So then I would look for some other transformations.

Answer by Michael Love • 9.1 years ago

Dear Mike,

It looks like this for the pathways (20 shown here, out of 91 in total):

                                            G1 G2 G3 G4 G5 W1 W2 W3 W4 W5
ABC transporters                             0  0  0  0  0  0  0  4  0  2
Alanine, aspartate and glutamate metabolism  0  0  0  0  0  0  2  2  0  0
Alcoholism                                   0  0  0  0  0  4  0  0  0  0
Aminoacyl-tRNA biosynthesis                  2  0  0  2  0  4  0  0  0  2
Amino sugar and nucleotide sugar metabolism 10 24 18 13 15 45 15  5 19 31
Arginine and proline metabolism              0  0  0  0  0  9  0  0  0  0
Arginine biosynthesis                        0  0  2  0  0  3  0  0  0  9
Autophagy                                    0  0  0  2  4  3  0  0  0  0
Basal transcription factors                  4  0  0  0  0  0  0  0  0  0
Biosynthesis of unsaturated fatty acids      0  0  5  0  0  0  0  0  0  0
Calcium signaling pathway                    0  0  0  5  0  0  0  0  0  0
Carotenoid biosynthesis                      0  0  0  0  0  2  0  0  0  0
Cell cycle                                   0  0  0  0  0  0  0  0  0  7
Cell cycle - Caulobacter                     0  0  0  0  0  0  0  0  0  3
cGMP-PKG signaling pathway                   0  0  0  0  0  0  6  0  0  0
Citrate cycle (TCA cycle)                    0  0  0  0  0  0  4  0  0  0
Cyanoamino acid metabolism                   0  0  8  5  5 21  9  2  4 11
Cysteine and methionine metabolism           0  0  2  6  0  4  2  0  0  5
Dioxin degradation                           0  0  0  0  0  2  0  0  0  0
DNA replication                              0  0  0  0  0  4  0  0  0  3

Reply by Earendil • 9.1 years ago

That looks fine for the DESeq2 transformations. The transformations offered by DESeq2 simply put the data on the log2 scale while dealing with the problem that the log of small counts has a lot of unwanted sampling variability. You can look through the transformation section of the DESeq2 vignette and see which one looks best in the diagnostic plots:

https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#data-transformations-and-visualization
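
To make the point about small counts concrete, here is a minimal Python sketch. It is not DESeq2 code and does not reproduce how rlog or vst work internally. It assumes pure Poisson sampling noise and a pseudocount of 0.5 (both are illustrative assumptions, not values from this thread) and computes the exact variance of log2(count + 0.5) for a low-mean and a high-mean feature.

```python
import math

def poisson_pmf(k, lam):
    # Poisson probability of observing k, computed in log space for stability.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def var_log2_count(lam, pseudocount=0.5, tail_sds=10):
    # Variance of log2(count + pseudocount) when count ~ Poisson(lam).
    # The pseudocount is an assumption needed to keep log2(0) finite.
    k_max = int(lam + tail_sds * math.sqrt(lam) + 20)  # covers nearly all mass
    ks = range(k_max + 1)
    probs = [poisson_pmf(k, lam) for k in ks]
    vals = [math.log2(k + pseudocount) for k in ks]
    total = sum(probs)  # renormalise the truncated distribution
    mean = sum(p * v for p, v in zip(probs, vals)) / total
    return sum(p * (v - mean) ** 2 for p, v in zip(probs, vals)) / total

var_low = var_log2_count(2.0)     # a feature with an expected count of 2
var_high = var_log2_count(200.0)  # a feature with an expected count of 200

print(f"mean 2:   variance of log2(count + 0.5) = {var_low:.3f}")
print(f"mean 200: variance of log2(count + 0.5) = {var_high:.3f}")
```

Under these assumptions the low-count feature shows far more spread on the log2 scale than the high-count one, even though both carry only sampling noise. That excess spread at low counts is what the diagnostic plots in the vignette let you compare across transformations.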

Reply by Michael Love • 9.1 years ago

Dear Mike,

I think I need to mention that samples G1-G5 and W1-W5 are not treated as biological replicates in my analysis, since the sample treatment varies from sample to sample within these two groups. However, I would guess that this doesn't change anything, since the default is blind = TRUE. I'll note here that the vst transformation seems to separate the two groups better than rlog does. Is it safe to assume that the more loosely clustered matrix for the KOs (323 KOs instead of the 91 pathways in the previous matrix) is still suitable for these transformations?

Thank you!

Reply by Earendil • 9.1 years ago

I'll just say that the transformations are appropriate anywhere a log transformation would be useful.

The only case in which I've seen the transformations not be useful is the combination of zeros AND very high counts within a row that I described above, when this is the distribution for the majority of genes.
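
As a rough way to screen a count table for that pattern, here is a minimal Python sketch. The function name and both thresholds (at least half the samples zero, and a maximum count of at least 100) are hypothetical choices made only for illustration; nothing in this thread defines what counts as "very high". The first four rows are copied from the pathway table posted above, and the last row is invented to show what a flagged row looks like.

```python
def is_zeros_plus_spike(row, min_zero_frac=0.5, large_count=100):
    # Flag a row that is mostly zeros but also contains a very large count.
    # Both thresholds are arbitrary illustrative values, not DESeq2 defaults.
    zero_frac = sum(1 for c in row if c == 0) / len(row)
    return zero_frac >= min_zero_frac and max(row) >= large_count

counts = {
    # Rows taken from the pathway table in this thread (G1-G5, W1-W5).
    "ABC transporters": [0, 0, 0, 0, 0, 0, 0, 4, 0, 2],
    "Alcoholism": [0, 0, 0, 0, 0, 4, 0, 0, 0, 0],
    "Amino sugar and nucleotide sugar metabolism": [10, 24, 18, 13, 15, 45, 15, 5, 19, 31],
    "Cyanoamino acid metabolism": [0, 0, 8, 5, 5, 21, 9, 2, 4, 11],
    # Invented row, NOT from the data, showing the problematic pattern.
    "hypothetical spike row": [0, 0, 0, 0, 0, 0, 0, 950, 0, 1200],
}

flagged = [name for name, row in counts.items() if is_zeros_plus_spike(row)]
frac_flagged = len(flagged) / len(counts)

# The concern only applies when most rows look like this.
majority_problematic = frac_flagged > 0.5

print("flagged rows:", flagged)
print(f"fraction flagged: {frac_flagged:.2f}")
print("majority problematic:", majority_problematic)
```

Rows with many zeros and only small nonzero counts, like most of the posted pathways, are not flagged by this check; only the invented row is.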

Reply by Michael Love • 9.1 years ago