Question

TnSeq as ZINB? Alternatives in R?

0

David R ▴ 90

@david-rengel-6321

Last seen 15 months ago

European Union

Hi,

I have to analyze some TnSeq data. Since I am quite new to this kind of data, I would like to post a couple of questions here, one more “conceptual” than the other. I hope this is the right forum for them.

1- Why do we tend to analyze TnSeq data as ZINB-distributed (zero-inflated negative binomial) data? What makes it so different from RNASeq? I understand that in TnSeq there are loads of zero counts and, crucially, we do not know where those zeros come from, i.e. we do not know whether a zero arises because (i) a given TA site is not represented in the library or (ii) the gene is “essential” and, therefore, mutants with insertions in that gene have not survived. In RNASeq we also have loads of zeros, and we do not know whether that is because the gene is not expressed or because the sample has not been sequenced deeply enough. Therefore, I cannot tell the difference, to be honest.

2- Regarding practical issues: I know there is TRANSIT in Python to analyze TnSeq data, including multifactorial designs (which is my case). However, I have not seen similar tools in R. Am I wrong? Could DESeq2 or edgeR be used? Do they provide, for instance, appropriate normalization methods for this type of data?

Thanks a lot for any help or hints on those two questions. Best regards, David R.

DESeq2 R edgeR Python TnSeq • 1.4k views

Updated 3.2 years ago by chaco001 • written 3.3 years ago by David R

score 0 · Answer 1 · 2022-11-30

Michael Love 43k

@mikelove

Last seen just now

United States

Sorry for the delay. I don't have much insight into the distribution, or into whether zeros are truly inflated after accounting for experimental variables.
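For readers who want to see what "inflated" would mean here, below is a minimal sketch of the zero probability under a negative binomial (NB) versus a ZINB, compared with the observed zero fraction. The counts, the NB mean `mu`, the size parameter `size` (1/dispersion) and the structural-zero probability `pi` are all made-up values for illustration; in a real analysis they would have to be estimated from the data, and this does not account for experimental variables.

```python
def nb_zero_prob(mu, size):
    """P(count = 0) under an NB with mean mu and size (= 1 / dispersion)."""
    return (size / (size + mu)) ** size


def zinb_zero_prob(mu, size, pi):
    """P(count = 0) under a ZINB: a structural zero with probability pi,
    otherwise a draw from the NB (which can itself be zero)."""
    return pi + (1.0 - pi) * nb_zero_prob(mu, size)


def observed_zero_fraction(counts):
    """Fraction of sites with a zero count."""
    return sum(1 for c in counts if c == 0) / len(counts)


# Toy insertion counts at TA sites (made up, not real data).
counts = [0, 0, 0, 12, 0, 30, 7, 0, 0, 45, 0, 3]

# Assumed parameters, chosen only for illustration.
mu, size, pi = 20.0, 1.0, 0.5

print("observed zero fraction:", observed_zero_fraction(counts))
print("expected under NB:     ", nb_zero_prob(mu, size))
print("expected under ZINB:   ", zinb_zero_prob(mu, size, pi))
```

If the observed zero fraction is far above what the fitted NB predicts, that is the kind of excess a zero-inflation component is meant to absorb.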

score 0 · Answer 2 · 2022-12-14

Hi there,

Regarding your second question, I've been part of a barcode TnSeq project and we went through this decision process as well. I think DESeq2 / edgeR are both good tools for TnSeq analysis once you boil your data down to a counts table. The problem in our case, which led us down a different path entirely, was that we were interested not just in differential fitness but also in absolute fitness. We had a common "time zero" (T0) sample, and we wanted to determine how the relative frequency of each strain changed between T0 and after each treatment, as well as how the treatments differed from one another. Potentially there is a way to model this situation with a smart design matrix, but I couldn't figure it out.

So we went a route more akin to a limma-type analysis, although we implemented it manually. Basically, after depth / outlier corrections and other QC filtering, we calculated a raw fitness value for each strain in each sample as the log2 fold change of its counts after a treatment vs. in the T0 sample. These continuous fitness values were then the input to a standard parametric model, and we corrected for multiple hypotheses after the fact.

This largely followed the procedure outlined in the Deutschbauer lab's 2015 paper: https://pubmed.ncbi.nlm.nih.gov/25968644/

To reiterate, I think the reason DESeq2/edgeR don't often get used for TnSeq is that there are often two levels of "differences" of interest: the difference between T0 and each treated sample, and then the differences between these differences among the treated samples.
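As a small illustration of those two levels, here is a sketch for a single strain on the log2 scale. The condition names and values are made up, and group means stand in for whatever model would actually be fitted.

```python
from statistics import mean

# log2 relative abundance of one strain across replicates (made-up values).
log2_abundance = {
    "T0": [10.0, 10.2, 9.8],
    "treatA": [8.0, 8.2, 7.8],
    "treatB": [9.0, 9.2, 8.8],
}


def fitness_vs_t0(samples, condition, reference="T0"):
    """Level 1: mean log2 abundance in a condition minus the mean at T0."""
    return mean(samples[condition]) - mean(samples[reference])


fit_a = fitness_vs_t0(log2_abundance, "treatA")
fit_b = fitness_vs_t0(log2_abundance, "treatB")

# Level 2: the difference between the two fitness values.
diff_a_vs_b = fit_a - fit_b

print("fitness A vs T0:", round(fit_a, 3))
print("fitness B vs T0:", round(fit_b, 3))
print("A vs B:         ", round(diff_a_vs_b, 3))
```

In this toy setup with a single shared T0, the T0 mean cancels in the level-2 difference, so it equals the direct difference between the treatA and treatB means; the T0 samples only matter for the level-1 (absolute) fitness values.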