Question: RNA-seq with Salmon and tximport
0
7 weeks ago by
woojoy140
woojoy140 wrote:

Hi!

I'm trying to get the TPM matrix with the output of salmon. I've been following this tutorial: https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#salmon-sailfish.

I ran the code below and got the results of 'txi' which includes '"abundance" "counts" "length" "countsFromAbundance". Does the 'counts' mean 'TPM values'?

setwd('C:/Users/wj/Documents/Projects/RNA pipeline/salmon')

library(tximport)
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

dir <- getwd()

samples

files <- file.path(dir, "results", samples$run, "quant.sf") names(files) <- paste0("sample", 1:10) all(file.exists(files)) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene k <- keys(txdb, keytype = "TXNAME") tx2gene <- select(txdb, k, "GENEID", "TXNAME") tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv")) head(tx2gene) txi <- tximport(files, type = "salmon", tx2gene = tx2gene) txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion = TRUE) txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreAfterBar = TRUE) names(txi) head(txi) head(txi$counts)

sample1   sample2  sample3  sample4  sample5   sample6   sample7
ENSG00000000003.14 1013.0136 655.57232 538.1663 321.2297 422.5202 244.36843 229.24551
ENSG00000000005.5     0.0000   0.00000   0.0000   0.0000   0.0000   0.00000   0.00000
ENSG00000000419.12  953.9996 758.00019 599.0003 690.0004 462.0001 369.00012 823.99990
ENSG00000000457.13  209.3692 202.10540 108.3376 210.0156 140.6283 182.69919  68.18909
ENSG00000000460.16  132.8974  75.81567 100.3391 176.4574 125.2161  89.09295  62.00237
ENSG00000000938.12  119.6111 129.81950 131.9999 131.4528 268.7965 135.39350  66.88793
sample8  sample9  sample10
ENSG00000000003.14 305.81635 630.4115 388.70197
ENSG00000000005.5    0.00000   0.0000   6.93319
ENSG00000000419.12 628.00010 582.9998 686.99977
ENSG00000000457.13 176.91460 298.7727 196.90257
ENSG00000000460.16  68.84412 150.1030 130.63998
ENSG00000000938.12  45.17055 161.8891 244.72190


After this, the next step in the tutorial is the code below, and I guess with this step, we can avoid gene-level summarization by setting txOut=TRUE, giving the original transcript level estimates as a list of matrices.

txi.tx <- tximport(files, type = "salmon", txOut = TRUE)

txi.sum <- summarizeToGene(txi.t tx, tx2gene, ignoreAfterBar = TRUE)
all.equal(txi$counts, txi.sum$counts)


I wonder what's the gene-level summarization mean? I see there are 'gene id; ENSG00000000938.12' and should I run this step? I would like to use just TPM values as an expression data to compare this with the GDSC expression data.

rna-seq salmon tximport • 100 views
modified 7 weeks ago by James W. MacDonald51k • written 7 weeks ago by woojoy140
Answer: RNA-seq with Salmon and tximport
2
7 weeks ago by
United States
James W. MacDonald51k wrote:

You could always just check this sort of stuff for yourself. Using the example for tximport:


> txi2 <- tximport(files, "salmon", txOut = TRUE)
1 2 3 4 5 6
> head(txi2$counts) sample1 sample2 sample3 sample4 sample5 sample6 ENST00000456328.2 1.73216 0.000 2.79658 0.230058 7.02211 2.46885 ENST00000450305.2 0.00000 0.000 0.00000 0.000000 0.00000 0.00000 ENST00000488147.1 120.11300 201.888 133.74200 99.882200 166.09100 225.07300 ENST00000619216.1 0.00000 0.000 0.00000 0.000000 0.00000 0.00000 ENST00000473358.1 0.64566 0.000 0.00000 0.000000 0.00000 0.00000 ENST00000469289.1 0.00000 0.000 0.00000 0.000000 0.00000 0.00000 > head(txi2$abundance)
sample1 sample2   sample3    sample4  sample5   sample6
ENST00000456328.2 0.0591243  0.0000 0.0631446 0.00681297 0.199672  0.083908
ENST00000450305.2 0.0000000  0.0000 0.0000000 0.00000000 0.000000  0.000000
ENST00000488147.1 5.4642800  8.1588 3.9969600 3.94843000 6.208390 10.417400
ENST00000619216.1 0.0000000  0.0000 0.0000000 0.00000000 0.000000  0.000000
ENST00000473358.1 0.0679640  0.0000 0.0000000 0.00000000 0.000000  0.000000
ENST00000469289.1 0.0000000  0.0000 0.0000000 0.00000000 0.000000  0.000000
Parsed with column specification:
cols(
Name = col_character(),
Length = col_double(),
EffectiveLength = col_double(),
TPM = col_double(),
)
|=================================================================| 100%    7 MB
1 ENST00000456328.2   1657        1459.420 0.0591243   1.73216
2 ENST00000450305.2    632         433.851 0.0000000   0.00000
3 ENST00000488147.1   1351        1095.010 5.4642800 120.11300
4 ENST00000619216.1     68          22.000 0.0000000   0.00000
5 ENST00000473358.1    712         473.242 0.0679640   0.64566
6 ENST00000469289.1    535         345.627 0.0000000   0.00000


As you can see, the TPM values are (in tximport lingo) the abundances, and the NumReads are the counts.

As for your second question, salmon aligns to the transcriptome, so all the counts and abundances are estimates of the transcript abundances. Each gene can have one or more transcripts, and if you care to do things at the gene level, you need to add up all the transcript counts or abundances from all of the transcripts for each gene.

Agree with James, these simple checks are always easy to try on your data.

It’s also listed in ?tximport:

“abundanceCol = name of column with abundances (e.g. TPM or FPKM)”

Thank you for the help! I tried the code to parse with column specification and didn't get any results. How can I get it?

I ran the code below

txi <- tximport(files, type = "salmon", txOut = TRUE, txIdCol = "Name", abundanceCol = "TPM",
countsCol = "NumReads", lengthCol = "Length", importer = function(x) read_tsv(x, skip = 8), ignoreAfterBar = TRUE)


and got the error as

Error in tximport(files, type = "none", txOut = TRUE, txIdCol = "Name",  :
all(file.exists(files)) is not TRUE


How can I fix the code?

A couple of things:

You don't need to specify the columns, you can just use the code in the vignette. I.e. type="salmon" means you don't have to specify these columns.

Before posting to the support site, please try a little harder to figure out what is going wrong. It will save you time, and save others time as well.

The error says that the statement "all files exist" does not evaluate to TRUE.

This means that the files in files are not where you say they are. R can't find them there.

I wonder why I can only see Gene ID like ENSG00000000003.14 instead of transcript ID like ENST00000456328.2?