Normalization and dispersion estimates in timeseries data with highly variable read numbers

Hello all,

I'm unsure about the analysis of my data and was hoping for some advice/feedback. Allow me to explain my experimental setup first, as it is important for the analysis. In short, I am investigating the chronological lifespan of S. cerevisiae using the barcoded S. cerevisiae deletion collection. To do so, I aged this collection, in duplicate, in medium without any carbon source to limit/prevent growth. At multiple timepoints, I took a small sample of the aging cells and put it into fresh medium for a specific amount of time to enrich for mutants that were still viable. Sequencing the barcodes resulted in the count matrix that I'm using for my analyses.

I'm most interested in the later timepoints, because by then only a smaller set of mutants is still viable. However:

  • A small number of surviving strains also means a relatively higher enrichment for those strains during the regrowth, so they take up a substantial portion of the library size. I was hoping that TMM normalization would tackle this issue, but the scaling factors look quite extreme (1.4 for the early samples, when most strains were viable, compared to 0.05 for the latest timepoints). Moreover, the MDS plot then squeezes all early timepoints together, with only the last 2 timepoints located somewhere else. Without TMM normalization, the MDS plot looks much better, with a clear, time-dependent pattern. Does anyone have an idea how best to proceed with such data?

  • Survival at the final timepoints seems (at least partly) random: the cpm values for my replicates correlate poorly at these timepoints. I'm afraid that this inflates the estimated dispersions tremendously, and plotBCV indeed shows that a substantial portion of tags have a very high BCV. Importantly, there is no clear trend visible in the plotBCV output, which makes me wary of using the trended dispersion for further analyses. Would it be better in this case to use the common dispersion (which is still rather high for S. cerevisiae) and treat all tags the same? Or even the tagwise dispersion (although I did read in the edgeR User's Guide that it does not make sense to use this under a QL framework)?

Any suggestions welcome!

EdgeR dispersion normalization • 324 views
WEHI, Melbourne, Australia

I don't know what you mean by a "barcoded S. cerevisiae deletion collection", so I don't know what sort of data you are analysing and can't give much advice.

I will say:

  • your comments make it seem that the latest timepoints really are extreme so it seems correct that they should get extreme normalization factors.

  • trended dispersion is generally only for RNA-seq. For all other technologies, it is usually better to use a common dispersion or estimateDisp with trend.method="none".

  • you are right that tagwise NB dispersion cannot be used with quasi-likelihood.
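
To illustrate the first point, here is a minimal sketch in plain Python of why a few dominant strains lead to a very small scaling factor. It is not edgeR's TMM implementation: the hypothetical `median_ratio_factor` helper just takes the median of per-strain proportion ratios, which is a simplified stand-in for the trimmed mean of M-values. The counts are made up for illustration, and the assumption is that most strains do not truly change relative to each other.

```python
from statistics import median


def median_ratio_factor(sample, reference):
    """Median of per-strain proportion ratios (sample / reference).

    Simplified stand-in for a TMM-style factor; NOT edgeR's algorithm.
    Strains with a zero count in either library are skipped.
    """
    lib_sample = sum(sample)
    lib_reference = sum(reference)
    ratios = [
        (s / lib_sample) / (r / lib_reference)
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    ]
    return median(ratios)


# Made-up counts for 10 strains, both libraries sum to 10,000 reads.
early = [1000] * 10              # all strains viable, equal share
late = [50] * 8 + [4800] * 2     # two survivors dominate after regrowth

factor = median_ratio_factor(late, early)
effective_lib = sum(late) * factor   # library size after scaling

print(round(factor, 3), round(effective_lib, 1))
```

With these made-up counts the factor is about 0.05, so the effective library size of the late sample shrinks from 10,000 to about 500. Relative to that effective library size, the eight non-dominant strains have the same proportion as in the early sample, which is the sense in which an extreme factor can be the correct answer for an extreme sample.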


Thank you for your reply - I will continue with the common dispersion. And I should have explained: the barcoded S. cerevisiae deletion collection is a collection of ~5000 strains in which each strain has a different gene replaced by a unique 20 bp DNA barcode. The barcodes have common primer sites, so a single PCR reaction is sufficient to amplify the barcodes of the entire pool. Sequencing the barcodes then allows easy tracking of all the strains in parallel. I use the number of reads for each barcode as a proxy for survival of the individual deletion strains throughout my experiment. Would you then still recommend that I use TMM normalization? Should this info give rise to additional advice on my data analysis, I'd be glad to hear it :) Thanks!

