I have a very small shRNA-seq screen dataset that contains only 36 shRNAs (targeting a total of 17 genes), with one shRNA being a control (i.e. no change in its representation is expected over time). Most of the other shRNAs target genes that are thought to be essential for cell survival. It is a time course experiment with two replicates for each time point (4 time points), and we are interested in knowing whether any of these shRNAs is consistently depleted over time.
I have read through the examples provided in http://bioinf.wehi.edu.au/shRNAseq/pooledScreenAnalysis.pdf and noticed that most of the examples do not apply any normalisation (with calcNormFactors) to account for compositional differences between the samples. So I was wondering whether normalisation is unnecessary for shRNA screen data? And am I right to think that sequencing depth is automatically accounted for during the differential representation analysis step?
I would appreciate it if you could advise me on the best way to perform a time course analysis with just 2 replicates on such a small shRNA-seq screen dataset, and on the best way to normalise the data to correct for library size and compositional bias (if needed at all).
With a time-course design, the example in section 4 (see page 16 for the model) of the guide you refer to above is probably the most relevant. In that example, the time course ran over 8 days and there were no replicates, so you will obviously need to adapt the design matrix to suit your experiment; a rough sketch is given below.
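For illustration only, here is a minimal sketch of two possible design matrices, assuming a hypothetical sample annotation data frame with a numeric time variable (the sampling days shown are placeholders) and a replicate factor:

```r
# Hypothetical sample annotation: 4 time points x 2 replicates = 8 samples
targets <- data.frame(
  time      = rep(c(0, 3, 7, 14), each = 2),         # placeholder sampling days
  replicate = factor(rep(c("R1", "R2"), times = 4))
)

# Model 1: common baseline (intercept) and common slope for both replicates
design1 <- model.matrix(~ time, data = targets)

# Model 2: replicate-specific baseline, shared slope
design2 <- model.matrix(~ replicate + time, data = targets)
```

Which of these (or something else) is appropriate is something you can only judge after looking at the data.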
It would be a good idea to first plot your data to get an idea of what the control looks like versus the other hairpins over time, and also what a sensible model to fit might be: do the baseline (intercept) and/or the slope of the regression line differ by replicate, or can you assume a common value for each? I've seen screens where common values suffice, and others where the baseline differs but we could still assume a shared slope parameter. Note that the estimated slope will be reported as 'logFC' in the topTable of results, but it is not a log fold-change at all, since time was included in the model as a numeric variable rather than a factor. In your experiment, you are looking for hairpins with a negative slope, i.e. a loss of representation over time.
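As a rough sketch of the kind of plot I have in mind (assuming a DGEList called dge holding the hairpin counts and the targets data frame above; the control hairpin's row name is a placeholder):

```r
library(edgeR)

# log-CPM values adjust for library size and are convenient for plotting
logcpm <- cpm(dge, log = TRUE)

# Average the two replicates at each time point for each hairpin
avg <- t(apply(logcpm, 1, function(x) tapply(x, targets$time, mean)))
times <- sort(unique(targets$time))

# One grey line per hairpin, with the control highlighted in red
matplot(times, t(avg), type = "l", lty = 1, col = "grey",
        xlab = "Time (days)", ylab = "Average log2-CPM")
lines(times, avg["Control", ], col = "red", lwd = 2)  # placeholder row name for the control hairpin
```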
As far as normalization goes, library size differences are accounted for in the analysis by default, so it's not correct to say that there isn't any normalization in these examples; there is simply no additional normalization. For most of the analyses I've looked at, additional normalization such as TMM wasn't necessary.
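If you want a quick check (a sketch only, again assuming a DGEList called dge), you can compute the TMM factors and see how far they deviate from 1:

```r
library(edgeR)

# Library sizes are used directly in the model fitting, so sequencing depth
# is accounted for even without calcNormFactors
dge$samples$lib.size

# TMM normalization factors close to 1 suggest little composition bias to correct for
dge <- calcNormFactors(dge)
dge$samples$norm.factors
```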
Thank you for your reply. It was really helpful. If you don't mind, I have a few more questions, just to make sure that I understand your suggestions correctly.
- I had a read through section 4 again, and noticed that TMM was applied before model fitting in that example. So I was wondering what the indications are for deciding whether to correct for composition bias in a dataset. Is there a function or plot we can use to see whether we need to apply calcNormFactors before the model fitting step?
- You mentioned plotting the data to see what the representation of the control looks like over time compared to the other hairpins. I presume the best values to plot are CPM values calculated using the cpm function?
- Could you tell me at which step of the analysis the library size differences are accounted for by default? And is there a way to extract the counts after library sizes have been accounted for? Are these values the same as CPM, or are they more complicated than just CPM?
- Would this be a correct design matrix for including replicates in the model?