Question: RNA Seq data analysis using Salmon and tximport
19 months ago, tanyabioinfo wrote:

Hi

 

I am following the tutorial to analyze my RNAseq data with Salmon followed by using tximport.

https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html

 

My question is why txi$counts shows decimal values. As I understand it, these are the numbers of reads assigned to the gene in the ith row for the sample in the jth column. Should these not be whole numbers?

 

 

 

rnaseq tximport • 2.2k views
written 19 months ago by tanyabioinfo

Hi

Thanks for your replies. I have one more question. While using the tximport package, I am getting the following message:

transcripts missing from tx2gene: 22632

I am pasting my R script:

library("DESeq2")
library("tximport")
library("readr")
library("ReportingTools")
library("AnnotationDbi")
library("ensembldb")
library("EnsDb.Mmusculus.v79")

# Transcript-to-gene table taken from the EnsDb annotation package
txdf <- transcripts(EnsDb.Mmusculus.v79, return.type = "DataFrame")
tx2gene <- as.data.frame(txdf[, c("tx_id", "gene_id")])

# 'dir' must already point to the folder holding samples.txt and the Salmon output
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
rownames(samples) <- samples$run
files <- file.path(dir, samples$run, "quant.sf")

# Label each quant.sf file by run prefix, condition, time and replicate
samples$id <- substring(samples$run, 1, 7)
samples$finalid <- paste(samples$id, samples$condition, samples$time, samples$replicate)
names(files) <- samples$finalid
all(file.exists(files))

txi <- tximport(files, type = "salmon", tx2gene = tx2gene,
                ignoreTxVersion = TRUE, dropInfReps = TRUE)

I generated the quant.sf files with Salmon. Does this message mean that those transcripts are missing from tx2gene? How can I rectify this?

Thanks

Tanya

written 19 months ago by tanyabioinfo • modified 19 months ago by Michael Love

You need to build tx2gene from the same source and version of transcripts that you used to create the Salmon index (wherever the transcript FASTA came from).

If the FASTA and GTF are not from Ensembl, get both from the same source, build a TxDb with GenomicFeatures::makeTxDbFromGFF(), and then use the TxDb example code provided in the tximport vignette.
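
A minimal sketch of that route, following the TxDb pattern shown in the tximport vignette. The helper name build_tx2gene and the GTF path are placeholders and do not come from this thread. I believe that in recent Bioconductor releases makeTxDbFromGFF() lives in the txdbmaker package, so the namespace prefix may need adjusting for your installation.

```r
# Hypothetical helper: build a transcript-to-gene table from the GTF that
# matches the transcript FASTA used to build the Salmon index.
build_tx2gene <- function(gtf_path) {
  txdb <- GenomicFeatures::makeTxDbFromGFF(gtf_path)
  tx_names <- AnnotationDbi::keys(txdb, keytype = "TXNAME")
  # Returns a data.frame with transcript IDs first and gene IDs second,
  # which is the column order tximport expects for tx2gene.
  AnnotationDbi::select(txdb, keys = tx_names,
                        columns = "GENEID", keytype = "TXNAME")
}

# Usage (placeholder path):
# tx2gene <- build_tx2gene("path/to/annotation.gtf")
# txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
```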

written 19 months ago by Michael Love

Hi Michael

 

Thanks a lot for your replies. I did use the same source; I am working on Mus musculus.

I used ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/cdna/ to get the transcriptome for running Salmon, and library(EnsDb.Mmusculus.v79) to generate the tx2gene table. This does give me a table, but tximport reports the following message:

transcripts missing from tx2gene: 22632

I was wondering whether the version of the database is causing the problem?

Thanks

Tanya

 

written 19 months ago by tanyabioinfo

The versions have to be exactly the same. You have Ensembl release 90 for the FASTA and version 79 for the EnsDb package here, right?
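
One way to see such a mismatch is to compare the transcript IDs in quant.sf against the first column of tx2gene. The sketch below uses a few toy IDs (the version suffixes and the last ID are made up) and a hypothetical helper, strip_version, that drops the version suffix in the spirit of ignoreTxVersion=TRUE.

```r
# Toy IDs for illustration only. In practice quant_ids would be the Name
# column of a quant.sf file and tx2gene would come from the annotation.
quant_ids <- c("ENSMUST00000000001.4",
               "ENSMUST00000000003.13",
               "ENSMUST00000999999.1")  # made-up ID absent from tx2gene
tx2gene <- data.frame(
  tx_id   = c("ENSMUST00000000001", "ENSMUST00000000003"),
  gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000003"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove a trailing ".N" version suffix
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Transcripts quantified by Salmon that have no row in tx2gene
missing_tx <- setdiff(strip_version(quant_ids), tx2gene$tx_id)
length(missing_tx)
missing_tx
```

A large number of missing IDs usually points to annotation from a different release than the FASTA used for the index.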

written 19 months ago by Michael Love

Hi Michael

 

Yes, it now works fine with the same versions. Thanks for your help.

 

Tanya

written 19 months ago by tanyabioinfo

Hi Michael

 

It is now working fine for me. The count matrix I have has transcript IDs in the first column. Is there a way to get gene names instead of transcript IDs in the count matrix?

 

Thanks

Tanya

 

written 19 months ago by tanyabioinfo
Answer: RNA Seq data analysis using Salmon and tximport
19 months ago, Michael Love (United States) wrote:

These are the estimated counts, where reads are probabilistically assigned to transcripts (see the Salmon paper for details). Several papers have shown that it is useful to work with the estimated counts, which offer advantages over working with unique counts (cited in the vignette and the DESeq2 workflow). DESeq2 rounds them on import, with no meaningful loss of precision (consider that the variance from sampling greatly exceeds a fraction of a count). edgeR also works with fractional counts, without rounding.
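
To make the rounding point concrete, here is a toy illustration in base R. The matrix values are made up. In a real analysis the rounding happens inside DESeq2's import step (DESeqDataSetFromTximport() in the vignette), so this only shows what the operation amounts to.

```r
# Made-up estimated counts for 3 genes x 2 samples (filled column by column)
est_counts <- matrix(
  c(10.4, 250.7, 0.3,
    12.9, 244.2, 1.6),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)

# Round to whole numbers and store as integers
int_counts <- round(est_counts)
storage.mode(int_counts) <- "integer"
int_counts

# The change per entry can never exceed half a count
max(abs(est_counts - int_counts))
```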

written 19 months ago by Michael Love

Hi

I have two more questions

1. I have paired-end reads from four different lanes that need to be given as input to Salmon. Should I concatenate the FASTQ files of all the forward reads from the different lanes, do the same with the reverse reads, and then give these as input to Salmon?

2. Salmon gives its output as TPM. How can I convert this to RPKM?

 

Thanks

 

Tanya

written 19 months ago by tanyabioinfo

1) This is a Salmon question, so you can check out the documentation at the link below. The answer is no: you provide the forward and reverse reads as separate arguments, -1 and -2, to salmon quant.

https://salmon.readthedocs.io/en/latest/
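
As a rough sketch of what such a call looks like, the R snippet below only assembles and prints the arguments for salmon quant. The helper name salmon_quant_args, the index name, the file names and the output folder are all placeholders. The -l A setting (automatic library type detection) is my choice for the example, not something from this thread.

```r
# Hypothetical helper: assemble arguments for `salmon quant` on paired-end reads
salmon_quant_args <- function(index, r1, r2, out_dir) {
  stopifnot(length(r1) == length(r2))  # mates must pair up file by file
  c("quant", "-i", index, "-l", "A", "-1", r1, "-2", r2, "-o", out_dir)
}

args <- salmon_quant_args(
  index   = "mm_index",         # placeholder index directory
  r1      = "xxx_R1.fastq.gz",  # forward reads
  r2      = "xxx_R2.fastq.gz",  # reverse reads
  out_dir = "quant/xxx"
)
cat("salmon", args, "\n")

# To actually run it (requires salmon on the PATH):
# system2("salmon", args)
```

If I read the Salmon documentation correctly, several files can be listed after -1 and after -2 as long as the order matches, which is why r1 and r2 are allowed to be vectors here.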

2) TPM is a replacement for FPKM, and you shouldn't have a problem using it in place of FPKM in plots or otherwise. You can't directly convert TPM to FPKM, although you can easily go the other way: TPM = 1e6 * FPKM / sum(FPKM).
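
A small worked example of that formula in base R, using made-up FPKM values and a hypothetical helper name, fpkm_to_tpm. It also shows why the reverse direction is not possible from TPM alone: rescaling all FPKM values by a constant gives identical TPM, so the original scale is lost.

```r
# Hypothetical helper implementing TPM = 1e6 * FPKM / sum(FPKM)
fpkm_to_tpm <- function(fpkm) 1e6 * fpkm / sum(fpkm)

fpkm <- c(tx1 = 5, tx2 = 15, tx3 = 30)  # made-up values
tpm <- fpkm_to_tpm(fpkm)
tpm
sum(tpm)  # TPM values always sum to one million

# Doubling every FPKM value leaves the TPM values unchanged
all.equal(fpkm_to_tpm(2 * fpkm), tpm)
```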

written 19 months ago by Michael Love

Hi Michael

Sorry for not being clear in my question. Yes, I am giving the forward and reverse reads as separate arguments. My question is about having data from different lanes. The forward reads look like this:

xxx_L001_R1_001.fastq.gz

xxx_L002_R1_001.fastq.gz

xxx_L003_R1_001.fastq.gz

xxx_L004_R1_001.fastq.gz

I am concatenating all the forward reads from the different lanes as input to Salmon. Am I doing this correctly?

Thanks

 

Tanya

 

 

written 19 months ago by tanyabioinfo
Answer: RNA Seq data analysis using Salmon and tximport
19 months ago, James W. MacDonald (United States) wrote:

No, the NumReads column from salmon quantification is the estimated count, and as such is not necessarily an integer.
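
For orientation, this is roughly what that column looks like when a quant.sf file is read into R. The two rows below are made-up toy data in the quant.sf column layout, not real Salmon output.

```r
# Toy quant.sf content (tab-separated); the values are invented
quant_text <- "Name\tLength\tEffectiveLength\tTPM\tNumReads
tx1\t1500\t1321.5\t12.3\t48.75
tx2\t900\t721.5\t3.1\t7.00
"
quant <- read.delim(text = quant_text, stringsAsFactors = FALSE)

quant$NumReads

# Check which estimated counts happen to be whole numbers
is_whole <- quant$NumReads == round(quant$NumReads)
is_whole
```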

written 19 months ago by James W. MacDonald