Hello everyone,
I have 18 soil metagenomes sampled from three different conditions (6 samples each), for which I have assembled. I’d like to do a comparative study of the genes that could potentially be expressed in these samples. What I did was that I first estimated the presence of KOs using KEGG’s kofam database on the contigs, and I also got the coverage of each KO in my samples. Next, I calculated the total coverage of each KO (i.e. by summing the coverage of the same KOs on different contigs) in every individual sample. I am wondering if I can use these coverage values as input for DESeq2 as opposed to read counts mapped on each gene?
An example of my input data would be
sample | A1 | A2 | A3 | A4 | A5 | A6 | B1 | B2 | B3 | B4 | B5 | B6 | C1 | C2 | C3 | C4 | C5 | C6 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
K00001 | 36.3978 | 0 | 22.4602 | 74.0771 | 37.8024 | 27.4386 | 23.3456 | 41.8246 | 25.8649 | 51.336 | 27.8538 | 22.8763 | 26.036 | 49.5105 | 24.3947 | 34.8723 | 52.0351 | 33.0193 |
K00003 | 747.666 | 828.64 | 509.787 | 950.148 | 993.473 | 777.476 | 717.928 | 732.745 | 633.716 | 894.339 | 821.97 | 874.446 | 850.249 | 1226.15 | 1222.42 | 1011.75 | 1087.76 | 1050.81 |
K00004 | 10.3978 | 7.96612 | 0 | 9.25023 | 19.4712 | 7.21569 | 5.16952 | 11.9648 | 11.1345 | 29.6271 | 34.3835 | 23.7866 | 26.3798 | 34.4165 | 15.7022 | 29.2394 | 22.2378 | 16.9822 |
K00005 | 48.0351 | 66.329 | 14.0175 | 70.9348 | 12.1143 | 99.1202 | 46.5701 | 91.1882 | 61.639 | 65.1754 | 93.5285 | 135.657 | 290.069 | 315.314 | 279.189 | 211.366 | 318.911 | 200.309 |
K00009 | 8.69086 | 0 | 9.73114 | 0 | 19.1342 | 8.89908 | 0 | 0 | 0 | 0 | 0 | 0 | 15.148 | 27.9347 | 18.6585 | 22.022 | 26.684 | 31.8226 |
K00010 | 126.77 | 119.827 | 172.367 | 203.431 | 297.6 | 155.051 | 40.8878 | 98.7166 | 24.2119 | 63.5848 | 141.801 | 83.471 | 74.2136 | 144.451 | 76.1991 | 36.0218 | 72.0287 | 61.0815 |
K00012 | 1530.33 | 1601.31 | 1312.18 | 2208.2 | 2488.36 | 1669.64 | 992.256 | 1011.06 | 924.257 | 1240.32 | 1182.85 | 915.699 | 719.04 | 1180.61 | 1060.9 | 695.96 | 1201.49 | 1044.33 |
K00013 | 256.634 | 363.695 | 150.823 | 575.562 | 442.416 | 416.417 | 312.265 | 409.185 | 363.236 | 512.297 | 424.308 | 323.398 | 219.852 | 616.609 | 573.806 | 436.281 | 548.883 | 483.997 |
..... the list continues
Thank you for any insights!
Rui
Hello Dr.Love,
Thank you for your prompt reply! I suspect that DESeq2 would work if I can get the actual number of reads mapped to each gene, and use them as input?
Also, would you recommend some other tools to deal with coverage data, for example, I know there is a tool called MaAsLin2 that uses general linear models, but I am not sure if it’s appropriate for coverage data either. Or is it a bad idea, after all, to deal with coverage for gene abundance comparisons? Thanks again!
Yes, the recommended input to DESeq2 is the number of reads mapping to each gene.
I would use limma-voom if you have some normalized data, as it will assess the empirical relationship between mean and variance.
Thanks for the recommendation, I will get it a try!