Question

Normalizing 5' Nascent RNA-seq data to identify differentially expressed transcription start sites

0

Entering edit mode

Carlos • 0

@2bf5e581

Last seen 2.7 years ago

United States

I have access to 5' nascent RNA-seq data that captures transcription initiation information at basepair resolution. For those of you familiar, the data produced is similar to CAGE-seq/5' PRO-seq/GRO-cap/csRNA-seq.

PROBLEM: I would like to compare a promoter in two different conditions and identify initiation sites that are differentially expressed/used between these two conditions.

BACKGROUND: I grabbed 300bp around all annotated TSS in the human genome. I then wrote a quick script to count the number of reads at each basepair for each gene. The count file for a single condition looks something like this:

PromoterID bp1 bp2 bp3 bp4 ... bp300
GeneA      0   2   0   9   ... 0
GeneB      0   10  2   0   ... 0
...
GeneZ      0   0   0   50  ... 0

I now want to take these count files and compare each bp in GeneA in condition 1 vs condition 2 and identify initiation sites/bps that are substantially 'different'. Where 'different' is defined as having a very large fold-change between two conditions. My initial instinct was to take a very simple approach. Normalize the counts using a transformation function that allows for cross-sample comparison like vst or rle and then simply calculate the fold-change of each bp between two conditions. The transformation should still make sense since the assumption that most initiation sites shouldn't be differentially expressed holds true.

I spent a while looking around, but most of what I found either does DGE using the sum of counts, or applies a simpler transformation such as tpm which I have read is not to be used for cross-sample comparisons such as this.

Thus my question is three-fold:

Does this approach make sense?
And if it does, how do you apply, lets say, vst transformation to a matrix of count data like mine? Typically for DGE you have a single count per gene per sample, but in my case I technically have 300 different counts for each gene.
What happens when I want to compare a new dataset of initiation sites to an old dataset? Do I have to re-do the transformation completely with the new data included? Is there a way to save the fitted vst model parameters to use on new data? I have sort of found the answer to my third question. Information below.

Question 3: The variance stabilizing transformation from a previous dataset can be frozen and reapplied to new samples. See the 'Data quality assessment' section of the vignette for strategies to see if new samples are sufficiently similar to previous datasets. The frozen VST is accomplished by saving the dispersion function accessible with dispersionFunction, assigning this to the DESeqDataSet with the new samples, and running varianceStabilizingTransformation with 'blind' set to FALSE (see example below). Then the dispersion function from the previous dataset will be used to transform the new sample(s).

Thanks in advance!

DESeq2 tpm Normalization • 1.0k views

ADD COMMENT • link updated 2.7 years ago by Michael Love 41k • written 2.7 years ago by Carlos • 0

0

Entering edit mode

Cross-posted: https://www.biostars.org/p/9484105/ (posting this so that users do not duplicate efforts)

ADD REPLY • link 2.7 years ago Kevin Blighe ★ 3.9k

score 2 · Accepted Answer · 2021-08-10

If you want to compare at every basepair / sliding window, this sounds more like you should try out derfinder or csaw style approaches, as you will want to merge together adjacent bp with similar signal. You can probably use DESeq2 to find the differences but then other methods provide the merging steps, which you can read about in those other method papers/vignettes.

You can apply the VST regardless here. It doesn't have any assumptions about the independence of the data, it just finds a approximately variance stabilizing function and applies it.

Yes, you can save the VST parameters.