I am trying to analyze a barcode sequencing dataset. The experimental technique that generates the data is described in "Chemical genomic profiling via barcode sequencing to predict compound mode of action". Briefly, we sequence a pool of gene knockout mutants of a yeast or bacterium, all growing together. Each mutant has one gene knocked out and is labelled with a unique barcode sequence, and we count those barcodes in each sample. The goal is to determine which knockouts grow better or worse under different conditions, e.g. to assess a gene's effect on fitness. The result is much like an RNA-seq count matrix: we have a count for each gene in each sample.

I have been trying to use limma to analyze the results. However, there is one difference: the knockout library contains many genes that are represented more than once, i.e. two or more barcodes map to the same gene. I am thinking of either summing or averaging these counts to get a single value per gene. This could be done first, before calculating normalization factors in edgeR and running voom. Is this the right thing to do? And if I compute an average count for a gene, do I need to round it to an integer?
It's probably fine to sum the counts for all barcodes corresponding to a single gene prior to running voom. This would be analogous to summing exon counts to get a single gene count per sample in a standard DE analysis. You'll end up with larger counts per gene, which should give you more power to detect differences upon culturing under different conditions. Don't take the mean of counts, as this would make it difficult to model the variances. (The variance of the mean depends on the number of barcodes you added together, which isn't something that voom knows about. While this dependence is also present in the variance of the sum, the size of the sum is proportional to the number of barcodes, so voom can figure it out based on the size of the summed counts.)
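To make the summation concrete, here is a minimal sketch of collapsing a barcode-level count table to gene level. It is written in plain Python purely for illustration; the barcode names, gene names, counts, and the helper `sum_barcodes_by_gene` are all hypothetical. In an actual limma/edgeR workflow you would do the equivalent step in R before normalization, for instance with base R's `rowsum()` on the count matrix.

```python
def sum_barcodes_by_gene(counts, barcode_to_gene):
    """Sum per-sample counts over all barcodes that map to the same gene.

    counts: dict mapping barcode -> list of integer counts, one per sample.
    barcode_to_gene: dict mapping barcode -> gene name.
    Returns a dict mapping gene -> list of summed counts (still integers).
    """
    gene_counts = {}
    for barcode, row in counts.items():
        gene = barcode_to_gene[barcode]
        if gene not in gene_counts:
            gene_counts[gene] = [0] * len(row)
        # Add this barcode's counts to the running total, sample by sample.
        gene_counts[gene] = [a + b for a, b in zip(gene_counts[gene], row)]
    return gene_counts


# Hypothetical toy data: three barcodes, two of which tag the same gene.
counts = {
    "bc1": [10, 12, 3, 4],
    "bc2": [8, 9, 2, 5],
    "bc3": [50, 47, 60, 55],
}
barcode_to_gene = {"bc1": "geneA", "bc2": "geneA", "bc3": "geneB"}

gene_counts = sum_barcodes_by_gene(counts, barcode_to_gene)
print(gene_counts)
```

Because the result is a sum of integers, no rounding is needed, and genes with more barcodes naturally end up with larger counts.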
That said, summation assumes that the barcodes for each gene behave similarly and can be aggregated into a single value. If this isn't the case, you might lose power when you add them together, e.g., because strong DE for one barcode is "diluted" by weak DE for another. There are also strange cases where two barcodes for the same gene respond in opposite directions upon culturing; I'm not sure how to interpret those. I don't know whether such inconsistencies are common in chemical genomics, so it'd be worth checking the behaviour of the individual barcodes for your top set of DE genes.
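One simple way to do that check is to fit the model at the barcode level as well and flag genes whose barcodes have log fold changes of opposite sign. The sketch below assumes you already have one log fold change per barcode; the values, names, and the helper `discordant_genes` are made up for illustration.

```python
def discordant_genes(logfc, barcode_to_gene):
    """Return the genes whose barcodes show log fold changes of both signs.

    logfc: dict mapping barcode -> log fold change for one contrast.
    barcode_to_gene: dict mapping barcode -> gene name.
    """
    signs = {}
    for barcode, lfc in logfc.items():
        if lfc == 0:
            continue  # a zero change carries no direction
        signs.setdefault(barcode_to_gene[barcode], set()).add(lfc > 0)
    # A gene is discordant if both True (up) and False (down) were seen.
    return sorted(gene for gene, seen in signs.items() if len(seen) == 2)


# Hypothetical barcode-level log fold changes.
logfc = {"bc1": 1.8, "bc2": -1.2, "bc3": 0.9, "bc4": 0.4}
barcode_to_gene = {"bc1": "geneA", "bc2": "geneA", "bc3": "geneB", "bc4": "geneB"}

flagged = discordant_genes(logfc, barcode_to_gene)
print(flagged)
```

This only looks at the sign, so in practice you would probably restrict it to barcodes that are individually significant, or inspect the flagged genes by eye.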
Alternatively, you can keep the barcodes separate, analyze each one individually, and then aggregate the per-barcode statistics for each gene at the end of the analysis, e.g., using Simes' method (to test whether any barcode for a gene is DE) or an intersection-union test (to test whether all barcodes for a gene are DE). This is analogous to testing for differential expression of individual exons.
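As a rough sketch of the two ways of combining p-values mentioned above: Simes' method takes the sorted p-values p(1) <= ... <= p(n) for a gene and returns the minimum of n * p(i) / i, while the intersection-union test takes the largest p-value. The p-values below are invented, and the function names `simes` and `intersection_union` are just illustrative helpers, not part of limma or edgeR.

```python
def simes(pvalues):
    """Simes combined p-value: evidence that at least one barcode is DE."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    # min over ranks i of n * p_(i) / i, with ranks starting at 1
    return min(n * p / rank for rank, p in enumerate(ordered, start=1))


def intersection_union(pvalues):
    """Intersection-union p-value: evidence that every barcode is DE."""
    return max(pvalues)


# Hypothetical per-barcode p-values for a single gene with three barcodes.
barcode_pvalues = [0.01, 0.04, 0.30]

p_any = simes(barcode_pvalues)               # candidates: 0.03, 0.06, 0.30
p_all = intersection_union(barcode_pvalues)  # the largest p-value

print(p_any, p_all)
```

Here the gene looks convincing under Simes' method because one barcode is strongly DE, but not under the intersection-union test because the third barcode is not. Either way, the resulting gene-level p-values still need a multiple testing correction across genes.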