TnSeq as ZINB? Alternatives in R?
David R ▴ 90
@david-rengel-6321

Hi,

I have to analyze some TnSeq data. Since I am quite new to this kind of data, I would like to post a couple of questions here, one more “conceptual” than the other. I hope this is the right forum for them.

1- Why do we tend to analyze TnSeq data as ZINB-distributed? What makes it so different from RNA-Seq? I understand that in TnSeq there are loads of zero counts and, crucially, we do not know where those zeros come from, i.e. we do not know whether a count is zero because (i) a given TA site has not been used in the library or (ii) the gene is “essential” and mutants with insertions in that gene have therefore not survived. But in RNA-Seq we also have loads of zeros, and we do not know whether that is because the gene is not expressed or because the sample has not been sequenced deeply enough. So, to be honest, I cannot tell the difference.

2- Regarding practical issues: I know there is TRANSIT in Python for analyzing TnSeq data, including multifactorial designs (which is my case). However, I have not seen similar tools in R. Am I wrong? Could DESeq2 or edgeR be used? Do they provide, for instance, appropriate normalization methods for this type of data?

Thanks a lot for any help or hints on those two questions. Best regards, David R.

DESeq2 R edgeR Python TnSeq
@mikelove

Sorry for the delay. I don't have much insight on the distribution question, or on whether the zeros are truly inflated after accounting for experimental variables.
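(For anyone who wants to check that empirically, here is a minimal sketch of one way to compare the observed zero fraction per feature against what a plain negative binomial fit would predict, using edgeR; `counts` and `group` are placeholders for your own count matrix and condition factor.)

library(edgeR)

## 'counts': TA-site (or gene) by sample count matrix; 'group': condition factor.
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~ group)
y <- estimateDisp(y, design)
fit <- glmFit(y, design)

## Probability of a zero under NB(mu, size = 1/dispersion) for every
## feature/sample, compared with the observed zero fraction per feature.
mu <- fit$fitted.values
size <- 1 / y$tagwise.dispersion
p_zero <- matrix(dnbinom(0, mu = as.vector(mu), size = rep(size, ncol(mu))),
                 nrow = nrow(mu))
expected_zero <- rowMeans(p_zero)
observed_zero <- rowMeans(counts == 0)

plot(expected_zero, observed_zero,
     xlab = "Expected zero fraction (NB fit)",
     ylab = "Observed zero fraction")
abline(0, 1, col = "red")

Points well above the diagonal would suggest excess zeros beyond what the NB model explains.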

chaco001
@chaco001-22993

Hi there,

Regarding your second question, I've been part of a barcoded TnSeq project and we went through this decision process as well. I think DESeq2 and edgeR are both good tools for TnSeq analysis once you boil your data down to a counts table. The problem in our case, which led us down a different path entirely, is that we were interested not just in differential fitness but also in absolute fitness. We had a common "time zero" (T0) sample, and we wanted to determine how the relative frequency of each strain changed between T0 and each treatment, as well as how the treatments differed from one another. Potentially there is a way to model this situation with a smart design matrix, but I couldn't figure it out.
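(For what it's worth, one way such a design is sometimes set up in DESeq2 is to make T0 the reference level of a single condition factor, so each coefficient is a treatment-vs-T0 log2 fold change and treatment-vs-treatment contrasts are also available. A hedged sketch with placeholder object and level names follows; note that this still describes relative strain frequencies, not absolute fitness.)

library(DESeq2)

## 'counts': strain-by-sample count matrix; 'coldata': data frame with a
## 'condition' column containing "T0", "treatmentA", "treatmentB", ...
## All of these names are placeholders.
coldata$condition <- relevel(factor(coldata$condition), ref = "T0")
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData   = coldata,
                              design    = ~ condition)
dds <- DESeq(dds)

## Change in relative strain frequency under treatment A compared to T0
res_A_vs_T0 <- results(dds, contrast = c("condition", "treatmentA", "T0"))

## Difference between two treatments (the "difference of differences")
res_A_vs_B <- results(dds, contrast = c("condition", "treatmentA", "treatmentB"))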

In the end we went a route more akin to a limma-type analysis, although we implemented it manually. Basically, after depth/outlier corrections and other QC filtering, we calculated raw fitness values for each strain in each sample as the log2 fold change of its counts after a treatment versus at T0. These continuous fitness values were then the input to a standard parametric model, and we corrected for multiple hypotheses afterwards.

This largely followed the procedure outlined in the Deutschbauer lab's 2015 paper: https://pubmed.ncbi.nlm.nih.gov/25968644/
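(In case it is useful, here is a rough sketch of that per-strain fitness calculation; the object names are placeholders, depth normalization and QC filtering are assumed to have been done already, and the one-sample t-test shown is just an illustrative parametric test, not necessarily the exact model we used.)

## 'norm_counts': strain-by-sample matrix of depth-corrected counts;
## 't0_cols' and 'trt_cols' index the T0 and treated samples.
pseudo <- 1  # small pseudocount so strains with zero counts are still defined
t0_mean <- rowMeans(norm_counts[, t0_cols, drop = FALSE] + pseudo)
fitness <- log2((norm_counts[, trt_cols, drop = FALSE] + pseudo) / t0_mean)

## Test whether each strain's fitness differs from zero across replicates,
## then correct for multiple hypotheses.
pvals <- apply(fitness, 1, function(x) t.test(x)$p.value)
padj <- p.adjust(pvals, method = "BH")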

To reiterate, I think the reason DESeq2/edgeR don't often get used for TnSeq is that there are usually two levels of "differences" of interest: differences between T0 and the treated samples, and then differences between those differences among the treated samples.
