Hello,
I have counts from Salmon that I have imported using Tximport for WT and KO samples. I want to use ERCC spike-in data for normalizing these counts. To do this, my strategy involves estimating size factors using only the spike-in data (function: Deseq2::estimateSizefactors) and using those normalize my counts (Mus musculus transcripts).
What I have done so far: step1: I used Salmon for quantification of both Mus Musculus and ERCC RNA simultaneously (using a concatenated cDNA file). I imported these counts into R using the package "Tximport".
step2: I made two ddsTxi objects, one with ERCC spike-in tx2gene file and the other with mouse Ensembl tx2gene file.
step3: I then used the Deseq2::estimateSizeFactors with the Spike-in_ddsTxi object to get size factors. However, I am unable to get sample-wise size factors:
sizeFactors(Spike-in_ddsTxi) = null
I am aware that normalizationFactors(Spike-in_ddsTxi)
gives me a matrix but I am not sure how to use this for normalization.
Can I please get advice on the following: Question1: Is my method above the correct way of going about normalization with spike-in data? Question2: If the answer to question 1 is no, then what method should I use? Question3: What is the difference between using estimateSizeFactors with Salmon data imported using Tximport vs some other count data e.g. from featureCounts? Question4: Lastly, what is the "control genes" parameter in estimateSizeFactors function? Is that what I am supposed to use?
Thank you.
Hi Michael,
Thanks for your response!
To make the dds object, I am having problems with making a tx2gene file that contains information for both my Mouse Ensembl transcripts (EnsDb.Mmusculus.v79) and the spike-in RNA (gtf file for the spike-in transcripts). My current strategy was to rbind the two tx2gene objects into one and using that to make my txi object using Tximport:
but pseudocounts for my spike-in transcripts don't appear in the resulting txi object. I am not really sure what is going on here given that it works when I use the tx2gene files separately.
I have double-checked all my files as well: 1- My quants.sf files contain both spike-in transcripts and the Ensembl transcripts 2- my tx2gene contains both spike-in transcripts and the Ensembl transcripts
What does
Combined_tx2gene
look like for the spike ins?spike-in data
txid geneid
104216 DQ668359 ERCC-00163
104217 DQ516779 ERCC-00164
104218 DQ668363 ERCC-00165
104219 DQ516776 ERCC-00168
104220 DQ516773 ERCC-00170
104221 DQ854994 ERCC-00171
Mouse Ensembl data
1 ENSMUST00000082387 ENSMUSG00000064336
2 ENSMUST00000179436 ENSMUSG00000095742
3 ENSMUST00000082388 ENSMUSG00000064337
4 ENSMUST00000177695 ENSMUSG00000094121
5 ENSMUST00000082389 ENSMUSG00000064338
6 ENSMUST00000082390 ENSMUSG00000064339
And can you confirm that e.g. DQ668359 is the name of a transcript in
quant.sf
?tximport()
is really just looking up rows in the quants and collapsing them based on the table, there isn't much tricky going on. You can poke around to make sure the matching is correct.You are correct.
Basically the quants.sf file is using gene_id for spike-in data and tx_id for Ensemble data. Not sure how to fix that, but maybe I can try re-labelling the gene_id_spikein column as tx_id_spikein.