Question

RNA Seq data analysis using Salmon and tximport

tanyabioinfo ▴ 20

@tanyabioinfo-14091

Hi

I am following the tutorial to analyze my RNAseq data with Salmon followed by using tximport.

https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html

My question is why txi$counts shows decimal values. As I understand it, these are the numbers of reads corresponding to the gene in the ith row and the sample in the jth column. Should these not be whole numbers?

RNAseq tximport • 7.7k views

ADD COMMENT • link 7.2 years ago tanyabioinfo ▴ 20

Hi

Thanks for your replies. I have one more question. While using the tximport package I am getting the following message:

transcripts missing from tx2gene: 22632

I am pasting my R script:

library("DESeq2")
library("tximport")
library("readr")
library("ReportingTools")
library("AnnotationDbi")
library("ensembldb")
library("EnsDb.Mmusculus.v79")

# Build the transcript-to-gene table from the EnsDb annotation package
txdf <- transcripts(EnsDb.Mmusculus.v79, return.type = "DataFrame")
tx2gene <- as.data.frame(txdf[, c("tx_id", "gene_id")])

# dir is assumed to be set beforehand to the folder that holds samples.txt
# and one Salmon output directory per run
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
rownames(samples) <- samples$run
files <- file.path(dir, samples$run, "quant.sf")

# Label each file with a readable sample ID
samples$id <- substring(samples$run, 1, 7)
samples$finalid <- paste(samples$id, samples$condition, samples$time, samples$replicate)
names(files) <- samples$finalid
all(file.exists(files))

# Import the Salmon quantifications and summarize them to gene level
txi <- tximport(files, type = "salmon", tx2gene = tx2gene,
                ignoreTxVersion = TRUE, dropInfReps = TRUE)

I generated the quant.sf files with Salmon. Does this message mean that those transcripts are missing from tx2gene? How can I fix this?

Thanks

Tanya

ADD REPLY • link updated 7.2 years ago by Michael Love 43k • written 7.2 years ago by tanyabioinfo ▴ 20

You need to build tx2gene from the same source and version of transcripts that you used to create the Salmon index (wherever the transcript FASTA came from).

If the FASTA and GTF are not from Ensembl but do come from the same source, you can build a TxDb using GenomicFeatures::makeTxDbFromGFF() and then use the TxDb example code provided in the tximport vignette.

ADD REPLY • link 7.2 years ago Michael Love 43k

Hi Michael

Thanks a lot for your replies. I did use the same source; I am working on Mus musculus.

I used ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/cdna/ to get the transcriptome data to run salmon

and

library(EnsDb.Mmusculus.v79) to generate the tx2gene file. It does give me a file, but tximport reports the following message:

transcripts missing from tx2gene: 22632

I was wondering whether the version of the database is causing the problem?

Thanks

Tanya

ADD REPLY • link 7.2 years ago tanyabioinfo ▴ 20

They have to be exactly the same. You have version 90 and version 79 here, right?
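
One way to check whether an annotation matches the Salmon index is to compare the transcript IDs in a quant.sf file against those in tx2gene. The sketch below is only an illustration: the ID vectors are small made-up stand-ins for the Name column of quant.sf and the tx_id column of tx2gene, and strip_version is a hypothetical helper that removes everything from the first period onward, which is roughly what ignoreTxVersion=TRUE is meant to do.

```r
# Toy stand-ins: in practice quant_ids would come from the Name column of a
# quant.sf file and tx2gene from the annotation package (values are made up).
quant_ids <- c("ENSMUST00000000001.4", "ENSMUST00000000003.13",
               "ENSMUST00000999999.1")
tx2gene <- data.frame(
  tx_id   = c("ENSMUST00000000001", "ENSMUST00000000003"),
  gene_id = c("ENSMUSG00000000001", "ENSMUSG00000000003"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop the version suffix (first "." onward).
strip_version <- function(ids) sub("\\..*$", "", ids)

# Transcripts quantified by Salmon that have no row in tx2gene.
missing_tx <- setdiff(strip_version(quant_ids), tx2gene$tx_id)
length(missing_tx)
missing_tx
```

A large number of missing IDs, as in the message above, usually points to a release mismatch rather than to a handful of odd transcripts.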

ADD REPLY • link 7.2 years ago Michael Love 43k

Hi Michael

Yes, it now works fine with the same versions. Thanks for your help.

Tanya

ADD REPLY • link 7.1 years ago tanyabioinfo ▴ 20

Hi Michael

It is now working fine for me. The count matrix I have has transcript IDs in the first column. Is there a way to get gene names instead of transcript IDs in the count matrix?

Thanks

Tanya

ADD REPLY • link 7.1 years ago tanyabioinfo ▴ 20

score 1 · Answer 1 · 2017-10-04

Michael Love 43k

@mikelove

These are the estimated counts, where reads are probabilistically assigned (see the Salmon paper for details). There have been a few papers showing that it's useful to work with the estimated counts, offering advantages over working with unique counts (cited in the vignette and DESeq2 workflow). They are rounded by DESeq2 when importing, without meaningful loss in precision (consider that variance from sampling greatly exceeds fractions of a count). edgeR also works with fractional counts without rounding.
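
To make the rounding argument concrete, here is a minimal base R sketch. The matrix is made up and only mimics the shape of txi$counts, and a Poisson standard deviation of sqrt(count) is assumed purely as a lower bound on sampling noise; real RNA-seq data are typically more variable than that.

```r
# Made-up estimated counts (genes x samples), shaped like txi$counts.
est_counts <- matrix(c(10.37, 250.82, 0.49, 1203.6,
                       12.91, 198.04, 1.51, 1150.25),
                     nrow = 4,
                     dimnames = list(paste0("gene", 1:4), c("s1", "s2")))

# Round to integers, which is what happens on import into DESeq2.
int_counts <- round(est_counts)
storage.mode(int_counts) <- "integer"

# Rounding moves each value by at most half a count ...
max_shift <- max(abs(int_counts - est_counts))

# ... while Poisson sampling noise alone has standard deviation sqrt(count)
# (assumed here only as a lower bound on the real variability).
poisson_sd <- sqrt(est_counts)

max_shift
round(poisson_sd, 2)
```

For all but the smallest counts the sampling noise is many times larger than the rounding shift, which is why rounding does not change the downstream inference in any meaningful way.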

ADD COMMENT • link 7.2 years ago Michael Love 43k

Hi

I have two more questions

1. I have paired-end reads from four different lanes that have to be given as input to Salmon. Should I concatenate the FASTQ files of all the forward reads from the different lanes, do the same with the reverse reads, and then give these as input to Salmon?

2. Salmon gives its output as TPM. How can I convert this to RPKM?

Thanks

Tanya

ADD REPLY • link 7.1 years ago tanyabioinfo ▴ 20

1) This is a Salmon question, so you can check out the documentation at the link below. The answer is no, you provide the forward and reverse reads as separate arguments -1 and -2 to salmon quant.

https://salmon.readthedocs.io/en/latest/

2) TPM is a replacement for FPKM, and you shouldn't have a problem using it in place of FPKM in plots or otherwise. You can't directly convert TPM to FPKM, although you can easily go the other way: TPM = 1e6 * FPKM / sum(FPKM).
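
The FPKM-to-TPM direction can be written as a short base R function. This is a sketch with made-up values for a single sample; fpkm_to_tpm is a hypothetical helper name, not a function from any package.

```r
# Convert a vector of FPKM values for one sample to TPM.
fpkm_to_tpm <- function(fpkm) {
  1e6 * fpkm / sum(fpkm)
}

fpkm <- c(geneA = 5, geneB = 15, geneC = 30)  # made-up values
tpm <- fpkm_to_tpm(fpkm)
tpm
sum(tpm)  # TPM values within a sample sum to one million
```

The reverse direction is not available from TPM alone because the scaling factor sum(FPKM) is divided out and cannot be recovered from the TPM values.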

ADD REPLY • link 7.1 years ago Michael Love 43k

Hi Michael

Sorry for not being clear in my question. Yes, I am giving the forward and reverse reads as separate arguments. My question is about data from different lanes. The forward-read files look like this:

xxx_L001_R1_001.fastq.gz

xxx_L002_R1_001.fastq.gz

xxx_L003_R1_001.fastq.gz

xxx_L004_R1_001.fastq.gz

I am concatenating all the forward reads from the different lanes into one input for Salmon. Am I doing this correctly?

Thanks

Tanya

ADD REPLY • link 7.1 years ago tanyabioinfo ▴ 20

score 0 · Answer 2 · 2017-10-04

James W. MacDonald 67k

@james-w-macdonald-5106

No, the NumReads column from salmon quantification is the estimated count, and as such is not necessarily an integer.

ADD COMMENT • link 7.2 years ago James W. MacDonald 67k