RNA Seq data analysis using Salmon and tximport
@tanyabioinfo-14091

Hi

I am following the tximport vignette to analyze my RNA-seq data with Salmon followed by tximport:

https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html

My question is why txi$counts shows decimal values. As I understand it, these are the numbers of reads assigned to the gene in the ith row for the sample in the jth column. Should they not be whole numbers?


Hi

Thanks for your replies. I have one more question. While using the tximport package I am getting the following message:

transcripts missing from tx2gene: 22632

I am pasting my R script:

library("DESeq2")
library("tximport")
library("readr")
library("ReportingTools")
library("AnnotationDbi")
library("ensembldb")
library("EnsDb.Mmusculus.v79")

# Build the transcript-to-gene table from the Ensembl v79 annotation
txdf <- transcripts(EnsDb.Mmusculus.v79, return.type = "DataFrame")
tx2gene <- as.data.frame(txdf[, c("tx_id", "gene_id")])

# 'dir' is the directory holding samples.txt and the Salmon output (its definition is not shown in this post)
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
rownames(samples) <- samples$run
files <- file.path(dir, samples$run, "quant.sf")

# Label each quant.sf file by sample id, condition, time and replicate
samples$id <- substring(samples$run, 1, 7)
samples$finalid <- paste(samples$id, samples$condition, samples$time, samples$replicate)
names(files) <- samples$finalid
all(file.exists(files))

txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion = TRUE, dropInfReps = TRUE)

I generated the quant.sf files with Salmon. Does this message mean that those transcripts are missing from tx2gene? How can I rectify this?

Thanks

Tanya


You need to build tx2gene from the same source and version of transcripts that you used to create the Salmon index (wherever the transcript FASTA came from).

If you get the FASTA and GTF from the same source, and they are not from Ensembl, you can build a TxDb using GenomicFeatures::makeTxDbFromGFF() and then use the TxDb example code provided in the tximport vignette.


Hi Michael

Thanks a lot for your replies. I did use the same source; I am working on Mus musculus.

I used ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/cdna/ to get the transcriptome for running Salmon, and library(EnsDb.Mmusculus.v79) to generate the tx2gene table. It does give me a table, but tximport reports this message:

transcripts missing from tx2gene: 22632

I was wondering whether the version of the database is causing the problem?

Thanks

Tanya


They have to be exactly the same. You have version 90 and version 79 here, right?
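
One way to see such a mismatch before running tximport is to compare the transcript IDs that Salmon quantified against the IDs in tx2gene. The following is a minimal base R sketch on toy data, not a definitive check: the IDs are made up for illustration, quant_ids stands in for the Name column of a quant.sf file, and strip_version is a hypothetical helper that drops the ".N" version suffix, similar in spirit to ignoreTxVersion=TRUE.

```r
# Toy transcript IDs standing in for the Name column of a quant.sf file
quant_ids <- c("ENSMUST00000000001.4", "ENSMUST00000000002.1", "ENSMUST00000099999.2")

# Toy tx2gene table standing in for the one built from the annotation package
tx2gene <- data.frame(
  tx_id   = c("ENSMUST00000000001", "ENSMUST00000000002"),
  gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000002"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop a trailing ".N" version suffix from each ID
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# Transcripts quantified by Salmon but absent from tx2gene
missing_tx <- setdiff(strip_version(quant_ids), tx2gene$tx_id)

length(missing_tx)  # how many transcripts have no gene mapping
missing_tx          # which ones they are
```

If the annotation and the FASTA come from the same release, missing_tx should be empty or close to it; a count in the thousands points to a version mismatch.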


Hi Michael

Yes, it now works fine with the same versions. Thanks for your help.

Tanya


Hi Michael

It is now working fine for me. The count matrix I have has transcript IDs in the first column. Is there a way to get gene names instead of transcript IDs in the count matrix?

Thanks

Tanya

@mikelove

It's the estimated counts, where reads are probabilistically assigned (see Salmon paper for details). There have been a few papers showing that it's useful to work with the estimated counts, offering advantages over working with unique counts (cited in the vignette and DESeq2 workflow). They are rounded by DESeq2 when importing, without loss in precision (consider that variance from sampling greatly exceeds fractions of a count). edgeR also works with fractional counts without rounding.
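
To illustrate how little the rounding changes, here is a minimal base R sketch on a made-up matrix of estimated counts. The values and names are assumptions for illustration only; with real data DESeq2 does this rounding itself on import, so this step is not something you need to run.

```r
# Toy matrix of estimated counts (genes in rows, samples in columns)
est_counts <- matrix(
  c(10.37, 250.81, 0.49, 1203.46, 12.02, 249.10, 1.51, 1180.25),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), c("sampleA", "sampleB"))
)

# Round to whole numbers and store them as integers
int_counts <- round(est_counts)
storage.mode(int_counts) <- "integer"

int_counts

# The largest change introduced by rounding is at most half a count
max(abs(int_counts - est_counts))
```

A change of at most half a count per entry is small next to the sampling variability of counts in the tens or hundreds.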


Hi

I have two more questions:

1. I have paired-end reads from four different lanes that have to be given as input to Salmon. Should I concatenate the FASTQ files of all the forward reads from the different lanes, do the same with the reverse reads, and then give these as input to Salmon?

2. Salmon gives its output as TPM. How can I convert this to RPKM?

Thanks

Tanya


1) This is a Salmon question, so you can check out the documentation at the link below. The answer is no: you provide the forward and reverse reads as separate arguments, -1 and -2, to salmon quant.

https://salmon.readthedocs.io/en/latest/

2) TPM is a replacement for FPKM, and you shouldn't have a problem using it in place of FPKM in plots or otherwise. You can't directly convert TPM to FPKM, although you can easily go the other way: TPM = 1e6 * FPKM / sum(FPKM).
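
As a small worked example of the formula above, here is a base R sketch using made-up FPKM values for four transcripts in a single sample (the names and numbers are assumptions for illustration):

```r
# Toy FPKM values for four transcripts in one sample
fpkm <- c(tx1 = 12.5, tx2 = 0, tx3 = 340.2, tx4 = 47.3)

# TPM = 1e6 * FPKM / sum(FPKM), computed within the sample
tpm <- 1e6 * fpkm / sum(fpkm)

tpm
sum(tpm)  # TPM values within a sample sum to one million
```

The conversion only rescales values within a sample, which is why the reverse direction needs the original scaling factor that TPM no longer carries.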


Hi Michael

Sorry for not being clear in my question. Yes, I am giving the forward and reverse reads as separate arguments. My question is about data from different lanes. The forward read files look like this:

xxx_L001_R1_001.fastq.gz

xxx_L002_R1_001.fastq.gz

xxx_L003_R1_001.fastq.gz

xxx_L004_R1_001.fastq.gz

I am concatenating all the forward reads from the different lanes into one input for Salmon. Am I doing this correctly?

Thanks

Tanya

@james-w-macdonald-5106

No, the NumReads column from salmon quantification is the estimated count, and as such is not necessarily an integer.
