Question: Transcript abundance estimation for time-course RNA-seq data
relathman wrote:

Dear community,

I am currently analyzing a set of time-course (~10 time points), stranded, paired-end RNA-seq data, with the particular objective of identifying time-dependent changes in alternative splicing.

However, I am still undecided whether alpine, Salmon, or another method would be better suited for estimating transcript abundance. So far, I have used STAR to align the data to the reference genome for each individual time point. Is there an optimal workflow that you would recommend for read-depth normalization and isoform-specific transcript abundance estimation in time-course data?

Concerning read-depth normalization, I have read recommendations to "downsample" reads to a common depth so that all samples have comparable numbers, which seems reasonable when library sizes are already roughly equal. However, the read numbers of my samples differ substantially (50-125 million reads), and I do not want to throw away so much data.

I would be very grateful to hear your thoughts on the matter.

Answer: Transcript abundance estimation for time-course RNA-seq data
Michael Love wrote:

alpine is mostly a tool for researching bias itself; for production-level quantification, please use Salmon instead. They implement essentially the same GC bias model, but Salmon is higher-quality software and is massively faster and more efficient.

Salmon provides accurate estimates even when coverage depends on fragment GC content, which is one of the most common biases in Illumina sequencing data. I therefore recommend always running Salmon with the --gcBias flag. See the manuscript and the Salmon home page for more details.
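For reference, a typical paired-end run might look like the following (index path, read file names, and output directory are placeholders):

salmon quant -i salmon_index -l A \
    -1 sample_t1_R1.fastq.gz -2 sample_t1_R2.fastq.gz \
    --gcBias -p 8 -o quants/sample_t1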

Read depth is normalized if you load the Salmon output into R using tximport and then follow the steps for downstream analysis; the downstream packages will normalize for depth. Downsampling is not recommended: it throws away information, and parametric models can handle differences in depth with an offset. edgeR, DESeq2, and limma-voom all take this route rather than discarding data.
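For example, a minimal transcript-level import might look like this (the quantification directory and sample table are placeholders; txOut=TRUE keeps estimates at the transcript rather than gene level):

library(tximport)
# one quant.sf file per sample, named by the ids in the sample table
files <- file.path("quants", samples$id, "quant.sf")
txi <- tximport(files, type = "salmon", txOut = TRUE)

The resulting txi object can then be handed to the downstream packages, e.g. via DESeqDataSetFromTximport() in DESeq2, as described in the tximport vignette.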

You can perform testing at the isoform level if you like. As shown in our latest preprint, there are many Bioconductor packages that work well for DTE (differential transcript expression); see Figure 13 there for a comparison.

https://f1000research.com/articles/7-952/v1
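For a time course like yours, one option (a sketch, assuming the txi object from above and a time column in the sample table) is DESeq2's likelihood ratio test, which tests for any change in expression over time:

library(DESeq2)
dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ time)
dds <- DESeq(dds, test = "LRT", reduced = ~ 1)  # full model with time vs. intercept-only
res <- results(dds)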


Many thanks for your fast reply and your explanation. Since Salmon can either map reads itself or work with precomputed read alignments, I am unsure which approach would be best: does quantification with Salmon work "better" with Salmon-mapped reads, or are these two independent steps?

Reply by relathman

Take a look at the Salmon paper.

I think the alignment-free mode was a bit better on simulated data without bias.
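If you do want to quantify your existing STAR alignments, note that Salmon's alignment-based mode expects alignments to the transcriptome (e.g. STAR's --quantMode TranscriptomeSAM output), roughly along these lines (file names are placeholders):

salmon quant -t transcripts.fa -l A -a sample_t1.Aligned.toTranscriptome.out.bam --gcBias -o quants/sample_t1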

Reply by Michael Love

I've followed your advice and used tximport for read-depth normalization of the Salmon TPM values. In the next step, I want to normalize between samples using TMM from edgeR. I am not necessarily interested in a classical DTU/DGE analysis; rather, I want to visualize and compare expression changes over time. Would you recommend scaledTPM or lengthScaledTPM transcript counts (the countsFromAbundance options of tximport) as input for the edgeR TMM normalization?

Many thanks again!

Reply by relathman

I'm more familiar with the DESeq2 normalization functions, so I'll show those. If you have TPMs and want to normalize them so they are more comparable across samples, you can use the median-ratio method:

library(DESeq2)
sf <- estimateSizeFactorsForMatrix(tpm.mat)  # median-ratio size factor per sample
tpm.norm <- t( t(tpm.mat) / sf )             # divide each column by its size factor
Reply by Michael Love

Thanks a lot! Would tpm.mat then be the matrix of the originally estimated transcript counts (unscaled TPMs, i.e. the countsFromAbundance="no" option of tximport) as output by Salmon?

Reply by relathman

Abundances are never modified by tximport, only counts. So it's the abundance matrix, regardless of the countsFromAbundance option.
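Continuing the sketch from above, that would be (txi as returned by tximport):

tpm.mat <- txi$abundance  # TPM matrix; unaffected by countsFromAbundance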

Reply by Michael Love