Question: RNA Seq data analysis using Salmon and tximport
2.0 years ago, tanyabioinfo20 wrote:

Hi

I am following the tutorial to analyze my RNAseq data with Salmon followed by using tximport.

https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html

My question is why txi$counts shows decimal values. As I understand it, these are the numbers of reads corresponding to the gene in the ith row and the sample in the jth column. Should these not be whole numbers?

Hi

Thanks for your replies. I have one more question. While using the tximport package I am getting the following message:

transcripts missing from tx2gene: 22632

I am pasting my R script:

    library("DESeq2")
    library("tximport")
    library("readr")
    library("ReportingTools")
    library("AnnotationDbi")
    library(ensembldb)
    library(EnsDb.Mmusculus.v79)

    # Build the transcript-to-gene table from the EnsDb annotation
    txdf <- transcripts(EnsDb.Mmusculus.v79, return.type="DataFrame")
    tx2gene <- as.data.frame(txdf[,c("tx_id", "gene_id")])

    # Sample sheet and paths to the Salmon output files
    samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
    rownames(samples) <- samples$run
    files <- file.path(dir,samples$run, "quant.sf")
    samples$id <- substring(samples$run, 1, 7)
    samples$finalid <- paste(samples$id,samples$condition,samples$time,samples$replicate)
    names(files) <- samples$finalid
    all(file.exists(files))

    # Import the Salmon quantifications and summarize to gene level
    txi <- tximport(files, type="salmon", tx2gene=tx2gene, ignoreTxVersion=TRUE,dropInfReps=TRUE)

I generated the quant.sf files with Salmon. Does this message mean that those transcripts are missing from tx2gene? How can I rectify this?

Thanks

Tanya

To build tx2gene, you need to use the same source and version of transcripts that you used to create the Salmon index (wherever the transcript FASTA came from).

If the FASTA and GTF are not from Ensembl, get both from the same source, build a TxDb with GenomicFeatures::makeTxDbFromGFF(), and then use the TxDb example code provided in the tximport vignette.
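The TxDb route described above can be sketched roughly as follows. This is a minimal sketch modelled on the TxDb example in the tximport vignette, not a definitive implementation: the function name build_tx2gene and the argument gtf_path are hypothetical, and the GTF is assumed to come from the same source and release as the FASTA used for the Salmon index. Newer Bioconductor releases may expose makeTxDbFromGFF() from the txdbmaker package instead.

```r
# Hypothetical helper: build a transcript-to-gene table from a GTF file.
# gtf_path is an assumed argument; point it at the GTF that matches
# the transcript FASTA used to build the Salmon index.
build_tx2gene <- function(gtf_path) {
  for (pkg in c("GenomicFeatures", "AnnotationDbi")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Package '", pkg, "' is required")
    }
  }
  # Parse the GTF into a TxDb object
  txdb <- GenomicFeatures::makeTxDbFromGFF(gtf_path)
  # All transcript names known to the TxDb
  tx_names <- AnnotationDbi::keys(txdb, keytype = "TXNAME")
  # Two-column data frame: TXNAME, GENEID (the order tximport expects)
  AnnotationDbi::select(txdb, keys = tx_names,
                        columns = "GENEID", keytype = "TXNAME")
}
```

The result would then be passed as the tx2gene argument of tximport(), for example tx2gene <- build_tx2gene("annotation.gtf"), where the file name is a placeholder.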

Hi Michael

Thanks a lot for your replies. I did use the same source; I am working on Mus musculus.

I used ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/cdna/ to get the transcriptome data to run salmon

and

library(EnsDb.Mmusculus.v79) to generate the tx2gene table. It does give me a table, but tximport reports the following message:

transcripts missing from tx2gene: 22632

I was wondering if it is the version of the database which is creating a problem?

Thanks

Tanya

They have to be exactly the same. You have version 90 and version 79 here, right?
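One way to see a mismatch like this is to compare the transcript IDs in a quant.sf file with those in tx2gene. The sketch below uses toy vectors with made-up IDs in place of real data; with real files, the IDs would come from the Name column of quant.sf (for example read.delim(files[1])$Name). The helper strip_version is hypothetical and only approximates what ignoreTxVersion=TRUE does.

```r
# Toy transcript IDs standing in for the Name column of quant.sf
# (hypothetical values, with Ensembl-style version suffixes)
quant_ids <- c("ENSMUST00000000001.4",
               "ENSMUST00000000003.13",
               "ENSMUST00000999999.1")

# Toy tx2gene table standing in for the annotation-derived table
tx2gene <- data.frame(
  tx_id   = c("ENSMUST00000000001", "ENSMUST00000000003"),
  gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000003"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop the trailing ".<version>" from each ID
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Transcripts quantified by Salmon but absent from tx2gene
missing_ids <- setdiff(strip_version(quant_ids), tx2gene$tx_id)
length(missing_ids)
missing_ids
```

A large number of missing IDs usually suggests that the FASTA and the annotation come from different releases, as in the release 90 versus v79 case here.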

Hi Michael

Yes, it now works fine with the same versions. Thanks for your help.

Tanya

Hi Michael

It is now working fine for me. The count matrix I have has transcript IDs in the first column. Is there a way to get gene names instead of transcript IDs in the count matrix?

Thanks

Tanya

Answer: RNA Seq data analysis using Salmon and tximport
2.0 years ago, Michael Love (United States) wrote:

These are the estimated counts, where reads are probabilistically assigned to transcripts (see the Salmon paper for details). There have been a few papers showing that it's useful to work with the estimated counts, which offer advantages over working with unique counts (cited in the vignette and the DESeq2 workflow). They are rounded by DESeq2 when importing, without meaningful loss of precision (consider that the variance from sampling greatly exceeds fractions of a count). edgeR also works with fractional counts without rounding.
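The point about rounding can be illustrated with a small base R sketch. The matrix below is made-up data, not real Salmon output, and the Poisson standard deviation is used only as a rough stand-in for sampling variability.

```r
# Toy matrix of estimated counts (hypothetical values)
est <- matrix(c(12.37, 0.81, 1503.62, 48.05), nrow = 2,
              dimnames = list(c("geneA", "geneB"), c("s1", "s2")))

# Round to whole numbers and store as integers
rounded <- round(est)
storage.mode(rounded) <- "integer"

# The change introduced by rounding is never more than half a count
max_change <- max(abs(rounded - est))

# Rough scale of sampling noise: sd of a Poisson with the same mean
poisson_sd <- sqrt(est)

rounded
max_change
poisson_sd
```

For the larger counts, the approximate sampling noise is tens of counts, so a change of at most 0.5 from rounding is negligible by comparison.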

Hi

I have two more questions

1. I have paired-end reads from four different lanes that have to be given as input to Salmon. Should I concatenate the FASTQ files of all the forward reads from the different lanes, do the same with the reverse reads, and then give these as input to Salmon?

2. Salmon gives its output as TPM. How can I convert this to RPKM?

Thanks

Tanya

1) This is a Salmon question, so you can check out the documentation at the link below. The answer is no: you provide the forward and reverse reads as separate arguments, -1 and -2, to salmon quant.

2) TPM is a replacement for FPKM, and you shouldn't have a problem using it in place of FPKM in plots or otherwise. You can't directly convert TPM to FPKM, although you can easily go the other way: TPM = 1e6 * FPKM / sum(FPKM).
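The FPKM-to-TPM formula above can be checked with a few made-up values. This is only a toy illustration for a single sample; the FPKM numbers are hypothetical.

```r
# Toy FPKM values for three transcripts in one sample (hypothetical)
fpkm <- c(txA = 5, txB = 15, txC = 30)

# TPM = 1e6 * FPKM / sum(FPKM)
tpm <- 1e6 * fpkm / sum(fpkm)

tpm
sum(tpm)  # TPM values in a sample always sum to one million
```

Because TPM always sums to one million within a sample, the original sum(FPKM) is not recoverable from TPM alone, which is why the conversion only goes in one direction.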

Hi Michael

Sorry for not being clear in my question. Yes, I am giving the forward and reverse reads as different arguments. My question is about having data from different lanes. For the forward reads, the files look like this:

xxx_L001_R1_001.fastq.gz

xxx_L002_R1_001.fastq.gz

xxx_L003_R1_001.fastq.gz

xxx_L004_R1_001.fastq.gz

I am concatenating all the forward reads from the different lanes into one input for Salmon. Am I doing this correctly?

Thanks

Tanya

Answer: RNA Seq data analysis using Salmon and tximport
2.0 years ago, James W. MacDonald (United States) wrote:

No, the NumReads column from salmon quantification is the estimated count, and as such is not necessarily an integer.