txi$counts shows decimal values. As I understand it, these are the numbers of reads corresponding to the gene in the ith row and the sample in the jth column. Should these not be whole numbers?
It's the estimated counts, where reads are probabilistically assigned (see Salmon paper for details). There have been a few papers showing that it's useful to work with the estimated counts, offering advantages over working with unique counts (cited in the vignette and DESeq2 workflow). They are rounded by DESeq2 when importing, without loss in precision (consider that variance from sampling greatly exceeds fractions of a count). edgeR also works with fractional counts without rounding.
1. I have paired-end reads from four different lanes that need to be given as input to Salmon. Should I concatenate the FASTQ files of all the forward reads from the different lanes, do the same with the reverse reads, and then give these as input to Salmon?
2. Salmon gives its output as TPM. How can I convert this to RPKM?
1) This is a Salmon question, so you can check out the documentation at the link below. The answer is no: you provide the forward and reverse reads as separate arguments, -1 and -2, to salmon quant.
2) TPM is a replacement for FPKM, and you shouldn't have a problem using it in place of FPKM in plots or otherwise. You can't directly convert TPM to FPKM, although you can easily go the other way: TPM = 1e6 * FPKM / sum(FPKM).
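As a quick check of the formula above, here is a minimal Python sketch with made-up FPKM values. The function name `fpkm_to_tpm` and the numbers are purely illustrative; nothing here comes from Salmon or tximport.

```python
def fpkm_to_tpm(fpkm):
    """Rescale the FPKM values of one sample so that they sum to one million."""
    total = sum(fpkm)
    if total == 0:
        raise ValueError("FPKM values sum to zero; TPM is undefined")
    return [1e6 * x / total for x in fpkm]


# Hypothetical FPKM values for four transcripts in a single sample.
fpkm = [10.0, 30.0, 0.0, 60.0]
tpm = fpkm_to_tpm(fpkm)
print(tpm)
```

The rescaling divides out the per-sample total sum(FPKM), which is why the reverse direction cannot be recovered from TPM values alone.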
Sorry for not being clear in my question. Yes, I am giving the forward and reverse reads as different arguments. My question is about data from different lanes. The files look like this for the forward reads:
xxx_L001_R1_001.fastq.gz
xxx_L002_R1_001.fastq.gz
xxx_L003_R1_001.fastq.gz
xxx_L004_R1_001.fastq.gz
I am concatenating all the forward reads from the different lanes as one input to Salmon. Am I doing this correctly?
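On the mechanics of merging lanes: gzip-compressed files can be appended byte-for-byte, and the result still decompresses as one continuous stream, so concatenation by itself does not corrupt the FASTQ. What matters is that the R1 and R2 files are merged in the same lane order, so that mates stay in sync. The Python sketch below illustrates this with tiny stand-in files. The reads are invented and `concat_gzip` is a hypothetical helper, not part of Salmon; check the Salmon documentation for how it expects several files per mate on the command line.

```python
import gzip
import os
import shutil
import tempfile


def concat_gzip(parts, dest):
    """Append gzip files byte-for-byte into dest (a valid multi-member gzip file)."""
    with open(dest, "wb") as out:
        for path in parts:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Build two tiny stand-in lane files (invented reads, one record per lane).
tmp = tempfile.mkdtemp()
lanes = []
for lane, seq in [("L001", "ACGT"), ("L002", "TTGA")]:
    path = os.path.join(tmp, f"xxx_{lane}_R1_001.fastq.gz")
    with gzip.open(path, "wt") as fh:
        fh.write(f"@read_{lane}\n{seq}\n+\nIIII\n")
    lanes.append(path)

# Sort so that lane order is deterministic; use the same order for the R2 files.
merged = os.path.join(tmp, "xxx_R1_merged.fastq.gz")
concat_gzip(sorted(lanes), merged)

# Reading the merged file back yields the records of both lanes, in lane order.
with gzip.open(merged, "rt") as fh:
    records = fh.read().splitlines()
print(records)
```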
Hi
Thanks for your replies. I have one more question. While using the tximport package I am getting the following message:
transcripts missing from tx2gene: 22632
I am pasting my R script:
I generated the quant.sf file with Salmon. Does this mean that these transcripts are missing from tx2gene? How can I fix this?
Thanks
Tanya
To build tx2gene, you need to use the same source and version of transcripts that you used to create the Salmon index (wherever the transcript FASTA came from).
If the FASTA and GTF are not from Ensembl, but you get both from the same source, you can build a TxDb using GenomicFeatures::makeTxDbFromGFF() and then follow the TxDb example code provided in the tximport vignette.
Hi Michael
Thanks a lot for your replies. I did use the same source; I am working on Mus musculus.
I used ftp://ftp.ensembl.org/pub/release-90/fasta/mus_musculus/cdna/ to get the transcriptome data to run salmon
and
transcripts missing from tx2gene: 22632
I was wondering if it is the version of the database which is creating a problem?
Thanks
Tanya
They have to be exactly the same. You have version 90 and version 79 here, right?
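One way to see why mismatched releases produce this message is to compare the transcript IDs in quant.sf against those in tx2gene directly. The Python sketch below does that with invented IDs. `missing_transcripts` is a hypothetical helper, not a tximport function, and real IDs would come from the first column of each file.

```python
def missing_transcripts(quant_ids, tx2gene_ids):
    """Return the quantified transcript IDs that have no entry in tx2gene."""
    known = set(tx2gene_ids)
    return [tx for tx in quant_ids if tx not in known]


# Invented IDs: the index was built from a newer release than the annotation,
# so some quantified transcripts do not exist in the older tx2gene table.
quant_ids = ["tx_0001", "tx_0002", "tx_0003", "tx_0004"]
tx2gene_ids = ["tx_0001", "tx_0003"]

missing = missing_transcripts(quant_ids, tx2gene_ids)
print(len(missing), missing)
```

When both tables come from the same release, the list should be empty or close to it.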
Hi Michael
Yes, it now works fine with the same versions. Thanks for your help.
Tanya
Hi Michael
It is now working fine for me. The count matrix I have has transcript IDs in the first column. Is there a way to get gene names instead of transcript IDs in the count matrix?
Thanks
Tanya
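For context on the question above: when tximport is given a tx2gene table, it summarizes the transcript-level estimates to the gene level, so the rows end up labeled with whatever identifiers sit in the gene column of tx2gene (gene IDs or gene names). For the counts, that step is roughly a grouped sum. The Python sketch below shows the idea with invented IDs and counts; `summarize_to_gene` is a hypothetical stand-in, not the tximport implementation.

```python
from collections import defaultdict


def summarize_to_gene(tx_counts, tx2gene):
    """Sum transcript-level estimated counts into gene-level counts."""
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene = tx2gene.get(tx)
        if gene is None:
            # Transcript missing from tx2gene: skipped in this sketch.
            continue
        gene_counts[gene] += count
    return dict(gene_counts)


# Invented transcript-level estimated counts for one sample.
tx_counts = {"txA1": 10.5, "txA2": 4.5, "txB1": 7.0, "txC1": 2.0}
# Invented mapping whose gene column holds gene names.
tx2gene = {"txA1": "GeneA", "txA2": "GeneA", "txB1": "GeneB"}

gene_counts = summarize_to_gene(tx_counts, tx2gene)
print(gene_counts)
```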