Import Salmon data with tximport and pool technical replicates
Vitor • 0
United States


I'm following the vignette on how to import expression estimates from Salmon with tximport and create an offset matrix. I also want to pool technical replicates with sumTechReps(). The technical replicates were quantified separately with Salmon, so they will have different offsets.

I believe edgeR doesn't modify the raw counts and instead uses the offsets in the GLM. Inspecting the sumTechReps() code, it seems that the function sums the counts and computes average normalization factors if you pass a DGEList object. Is that the correct way of doing this?

EDIT: When using sumTechReps(ID = sample_ids), the column names of y$counts and the row names of y$samples are converted to the corresponding sample_ids, but the column names of y$offset still refer to the original replicate IDs (although the matrix is reduced to the same dimensions as the pooled count matrix). Can that be a problem?

Thank you!

tximport edgeR
United States

Sorry I missed this because it wasn't tagged with tximport.

I think summing the raw counts and averaging the normalization factors is appropriate. But I don't know about the mechanics of sumTechReps.
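
To make "summing the raw counts and averaging the normalization factors" concrete, here is a minimal base R sketch on made-up numbers. It only illustrates the arithmetic and is not the sumTechReps() implementation; the matrix, the normalization factors and the sample_ids vector are all hypothetical.

```r
# Toy data (hypothetical): 3 genes x 4 libraries, two technical replicates per sample.
counts <- matrix(c(10, 0, 5,
                   12, 1, 7,
                   30, 2, 9,
                   28, 4, 11),
                 nrow = 3,
                 dimnames = list(paste0("gene", 1:3),
                                 c("A_rep1", "A_rep2", "B_rep1", "B_rep2")))
norm_factors <- c(0.9, 1.1, 1.0, 1.2)   # made-up normalization factors
sample_ids   <- c("A", "A", "B", "B")   # which sample each library belongs to

# Sum raw counts over the libraries that share a sample ID.
# rowsum() groups rows, so transpose, sum, and transpose back.
pooled_counts <- t(rowsum(t(counts), group = sample_ids, reorder = FALSE))

# Average the normalization factors within each sample.
pooled_nf <- tapply(norm_factors,
                    factor(sample_ids, levels = unique(sample_ids)),
                    mean)

print(pooled_counts)
print(pooled_nf)
```

The pooled matrix has one column per sample, named by sample ID, so it is easy to check its column names against any other per-sample matrix you carry along.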

WEHI, Melbourne, Australia

I would prefer that you combine the technical replicates before running Salmon, i.e., by merging the FASTQ files. The EM algorithm by which Salmon assigns reads probabilistically to transcripts will work better if it has all the reads for each sample at once. That would help to resolve read assignment ambiguities, reduce the variability of the TPM estimates that are input to tximport from Salmon, and hence improve the reliability of the edgeR offset matrix.

Salmon's probabilistic algorithm means that merging FASTQ files at the beginning is better than (not the same as) summing the counts from technical replicates downstream.
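
For illustration, merging the FASTQ files of two runs can be sketched in base R with file.append(). The file names and the tiny reads below are hypothetical stand-ins written to a temporary directory; with real data you would point at your own files, and for paired-end data you would merge the read-1 and read-2 files separately, listing the runs in the same order for both.

```r
# Hypothetical FASTQ files for two sequencing runs of the same RNA sample.
tmp  <- tempdir()
run1 <- file.path(tmp, "sampleA_run1.fastq")
run2 <- file.path(tmp, "sampleA_run2.fastq")
writeLines(c("@read1", "ACGT", "+", "IIII"), run1)   # toy record, 4 lines
writeLines(c("@read2", "TTGA", "+", "IIII"), run2)   # toy record, 4 lines

# Start from an empty output file, then append each run in turn.
merged <- file.path(tmp, "sampleA_merged.fastq")
if (file.exists(merged)) file.remove(merged)
file.create(merged)
for (f in c(run1, run2)) {
  stopifnot(file.append(merged, f))   # stop if any append fails
}

# The merged file now holds the records of both runs, in order.
merged_reads <- readLines(merged)
print(merged_reads)
```

The merged file is then quantified once per sample, so Salmon sees all reads for that sample together.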

edgeR's sumTechReps() function is not currently designed to work optimally on a DGEList object that has an offset matrix set. It will just use the offsets of the first technical replicate for each sample instead of combining them. I will look into it and improve the function for this use case. However, if you have true technical replicates (repeat sequencing runs of the same RNA sample), it will always be better to combine them before running Salmon.
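
If merging before Salmon is not an option, one possible way to combine offsets by hand is sketched below in base R. This is an assumption, not what sumTechReps() does and not a method endorsed in this thread: it relies on the log-link mean model, E[count] = exp(offset + X beta), under which the expected pooled count is the sum over replicates of exp(offset) times exp(X beta), so offsets are added on the natural scale and then logged again. The offset values and sample_ids are made up.

```r
# Toy log-scale offsets (hypothetical): 3 genes x 4 technical-replicate libraries.
offset <- log(matrix(c(10, 20, 30,
                       30, 20, 10,
                        5,  5,  5,
                       15, 25, 35),
                     nrow = 3,
                     dimnames = list(paste0("gene", 1:3),
                                     c("A_rep1", "A_rep2", "B_rep1", "B_rep2"))))
sample_ids <- c("A", "A", "B", "B")

# Assumption: with a log link, the pooled offset for a sample is
# log(sum over its replicates of exp(offset)).
pooled_offset <- log(t(rowsum(t(exp(offset)),
                              group = sample_ids, reorder = FALSE)))

# Columns are named by sample ID, so they can be compared directly with the
# column names of a pooled count matrix (e.g. colnames(y$counts)).
print(colnames(pooled_offset))
print(exp(pooled_offset))
```

Because the result is labelled by sample ID rather than replicate ID, a mismatch like the one described in the question's EDIT is easy to spot with identical() on the two sets of column names.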

