I have access to 5' nascent RNA-seq data that captures transcription initiation information at basepair resolution. For those of you familiar, the data produced is similar to CAGE-seq/5' PRO-seq/GRO-cap/csRNA-seq.
PROBLEM: I would like to compare a promoter in two different conditions and identify initiation sites that are differentially expressed/used between these two conditions.
BACKGROUND: I grabbed 300bp around all annotated TSS in the human genome. I then wrote a quick script to count the number of reads at each basepair for each gene. The count file for a single condition looks something like this:
PromoterID bp1 bp2 bp3 bp4 ... bp300
GeneA 0 2 0 9 ... 0
GeneB 0 10 2 0 ... 0
...
GeneZ 0 0 0 50 ... 0
I now want to take these count files and compare each bp in GeneA in condition 1 vs condition 2 and identify initiation sites/bps that are substantially 'different'. Where 'different' is defined as having a very large fold-change between two conditions. My initial instinct was to take a very simple approach. Normalize the counts using a transformation function that allows for cross-sample comparison like vst
or rle
and then simply calculate the fold-change of each bp between two conditions. The transformation should still make sense since the assumption that most initiation sites shouldn't be differentially expressed holds true.
I spent a while looking around, but most of what I found either does DGE using the sum of counts, or applies a simpler transformation such as tpm
which I have read is not to be used for cross-sample comparisons such as this.
Thus my question is three-fold:
Does this approach make sense?
And if it does, how do you apply, lets say,
vst
transformation to a matrix of count data like mine? Typically for DGE you have a single count per gene per sample, but in my case I technically have 300 different counts for each gene.What happens when I want to compare a new dataset of initiation sites to an old dataset? Do I have to re-do the transformation completely with the new data included? Is there a way to save the fitted
vst
model parameters to use on new data? I have sort of found the answer to my third question. Information below.
Question 3: The variance stabilizing transformation from a previous dataset can be frozen and reapplied to new samples. See the 'Data quality assessment' section of the vignette for strategies to see if new samples are sufficiently similar to previous datasets. The frozen VST is accomplished by saving the dispersion function accessible with dispersionFunction, assigning this to the DESeqDataSet with the new samples, and running varianceStabilizingTransformation with 'blind' set to FALSE (see example below). Then the dispersion function from the previous dataset will be used to transform the new sample(s).
Thanks in advance!
Cross-posted: https://www.biostars.org/p/9484105/ (posting this so that users do not duplicate efforts)