Question

Can DESeq2 be applied on gene coverages, rather than read counts?

1

Entering edit mode

Rui ▴ 10

@15ef12dc

Last seen 4.3 years ago

United Kingdom

Hello everyone,

I have 18 soil metagenomes sampled from three different conditions (6 samples each), for which I have assembled. I’d like to do a comparative study of the genes that could potentially be expressed in these samples. What I did was that I first estimated the presence of KOs using KEGG’s kofam database on the contigs, and I also got the coverage of each KO in my samples. Next, I calculated the total coverage of each KO (i.e. by summing the coverage of the same KOs on different contigs) in every individual sample. I am wondering if I can use these coverage values as input for DESeq2 as opposed to read counts mapped on each gene?

An example of my input data would be

sample	A1	A2	A3	A4	A5	A6	B1	B2	B3	B4	B5	B6	C1	C2	C3	C4	C5	C6
K00001	36.3978	0	22.4602	74.0771	37.8024	27.4386	23.3456	41.8246	25.8649	51.336	27.8538	22.8763	26.036	49.5105	24.3947	34.8723	52.0351	33.0193
K00003	747.666	828.64	509.787	950.148	993.473	777.476	717.928	732.745	633.716	894.339	821.97	874.446	850.249	1226.15	1222.42	1011.75	1087.76	1050.81
K00004	10.3978	7.96612	0	9.25023	19.4712	7.21569	5.16952	11.9648	11.1345	29.6271	34.3835	23.7866	26.3798	34.4165	15.7022	29.2394	22.2378	16.9822
K00005	48.0351	66.329	14.0175	70.9348	12.1143	99.1202	46.5701	91.1882	61.639	65.1754	93.5285	135.657	290.069	315.314	279.189	211.366	318.911	200.309
K00009	8.69086	0	9.73114	0	19.1342	8.89908	0	0	0	0	0	0	15.148	27.9347	18.6585	22.022	26.684	31.8226
K00010	126.77	119.827	172.367	203.431	297.6	155.051	40.8878	98.7166	24.2119	63.5848	141.801	83.471	74.2136	144.451	76.1991	36.0218	72.0287	61.0815
K00012	1530.33	1601.31	1312.18	2208.2	2488.36	1669.64	992.256	1011.06	924.257	1240.32	1182.85	915.699	719.04	1180.61	1060.9	695.96	1201.49	1044.33
K00013	256.634	363.695	150.823	575.562	442.416	416.417	312.265	409.185	363.236	512.297	424.308	323.398	219.852	616.609	573.806	436.281	548.883	483.997

..... the list continues

Thank you for any insights!

Rui

Coverage DESeq2 • 1.7k views

ADD COMMENT • link 4.3 years ago Rui ▴ 10

score 1 · Answer 1 · 2021-09-28

1

Entering edit mode

Michael Love 43k

@mikelove

Last seen 1 day ago

United States

It's not ideal. Basically, DESeq2 is attempting to model the data given that it consists of observations of reads which have a certain variance property. Normalized counts, or coverage data, do not have that property exactly and so the method won't perform as well as it would with the proper input. I've seen this when people provide normalized counts to DESeq2 instead of original counts -- the method does not have the ability to associate higher variance on the estimated LFC from low counts, when those low counts have actually been scaled up to an artificially high value.

ADD COMMENT • link 4.3 years ago Michael Love 43k

0

Entering edit mode

Hello Dr.Love,

Thank you for your prompt reply! I suspect that DESeq2 would work if I can get the actual number of reads mapped to each gene, and use them as input?

Also, would you recommend some other tools to deal with coverage data, for example, I know there is a tool called MaAsLin2 that uses general linear models, but I am not sure if it’s appropriate for coverage data either. Or is it a bad idea, after all, to deal with coverage for gene abundance comparisons? Thanks again!

ADD REPLY • link 4.3 years ago Rui ▴ 10

1

Entering edit mode

Yes, the recommended input to DESeq2 is the number of reads mapping to each gene.

I would use limma-voom if you have some normalized data, as it will assess the empirical relationship between mean and variance.