Question: How to handle multi-TSS genes in Repitools?
0
12 months ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson7.2k wrote:

I am looking to analyze my histone modification ChIP-seq data in promoter regions using Repitools, but I am running into some issues with how to properly define promoter regions for genes with multiple isoforms with different transcription start sites. I'm working with the human hg38 Ensembl transcript annotations. Suppose a particular gene has 3 isoforms, A, B, and C. Isoforms A and B have the same TSS and differ only in the splicing of exons farther downstream. Isoform C has a different TSS. The point of Repitools is to characterize ChIP-seq coverage at different offsets relative to the position of the gene's TSS, but this gene has more than one TSS. I can think of a few possible ways to resolve this issue, all of which seem to have some merit:

1. Handle each isoform independently, generating a 3 separate promoter coverage profiles for this gene, based on the offset from each isoform's TSS. (The profiles for isoforms A and B will be identical, of course.)
2. Handle each unique TSS independently as above, generating 2 separate coverage profiles, one for isoforms A&B and the other for isoform C.
3. Choose one TSS as the "representative" TSS for this gene and ignore the others, generating a single coverage profile for the gene. This might be the farthest upstream TSS, the TSS with the most nearby ChIP-seq reads, the TSS of the highest-expressed isoform, or based on some other criterion.

Can anyone provide some guidance as to whether one of these is the most correct or reasonable way to handle things, or whether each one is potentially valid depending on what I want to achieve? Is there a 4th alternative that I haven't thought of that works better than any of the 3 I've suggested here? Does the answer change if the distance between TSS positions is 10 bp or 1 kbp or 10 kbp?

In case it matters, I also have RNA-seq data for the same samples, but the coverage is only deep enough to measure gene expression, not individual isoform expression.

modified 12 months ago • written 12 months ago by Ryan C. Thompson7.2k

Try processing one sample using RSEM and look at a few genes in IGV to see if the transcript with the majority of assigned reads seems plausible. I'm not convinced of the assertion that there are not enough reads to identify the most highly expressed transcript for each gene.

That's a good point. Even if I can't do a full transcript-level analysis, I can still probably identify the dominant TSS for each gene pretty reliably. However, are you saying that the correct strategy is to pick a single TSS for each gene, based on transcript expression levels?

I wouldn't say that it's the correct approach, but if you are creating a heatmap plot with binPlots, it is best to reduce the number of rows in the heatmap. Having one row for each transcript would make its appearance more complex than necessary. I think including TSS unsupported by reads also could add noise to the result. Do they have no read coverage because they are not transcribed, or for other reasons, such as not being poly-A tailed (e.g. many modern gene databases contain thousands of lincRNA and miRNA) ? I assume that you will plot the RNA-seq abundance value alongside the heatmap, as shown in the vignette.