Question

RNA-seq with Salmon and tximport

0

Entering edit mode

woojoy14 • 0

@woojoy14-21733

Last seen 4.7 years ago

Hi!

I'm trying to get the TPM matrix with the output of salmon. I've been following this tutorial: https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#salmon-sailfish.

I ran the code below and got the results of 'txi' which includes '"abundance" "counts" "length" "countsFromAbundance". Does the 'counts' mean 'TPM values'?

setwd('C:/Users/wj/Documents/Projects/RNA pipeline/salmon')

library(tximport)
library(GenomicFeatures)
library(readr)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

dir <- getwd()

samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
samples

files <- file.path(dir, "results", samples$run, "quant.sf")
names(files) <- paste0("sample", 1:10)
all(file.exists(files))


txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- select(txdb, k, "GENEID", "TXNAME")

tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))
head(tx2gene)

txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion = TRUE)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreAfterBar = TRUE)


names(txi)

head(txi)

head(txi$counts)


                     sample1   sample2  sample3  sample4  sample5   sample6   sample7
ENSG00000000003.14 1013.0136 655.57232 538.1663 321.2297 422.5202 244.36843 229.24551
ENSG00000000005.5     0.0000   0.00000   0.0000   0.0000   0.0000   0.00000   0.00000
ENSG00000000419.12  953.9996 758.00019 599.0003 690.0004 462.0001 369.00012 823.99990
ENSG00000000457.13  209.3692 202.10540 108.3376 210.0156 140.6283 182.69919  68.18909
ENSG00000000460.16  132.8974  75.81567 100.3391 176.4574 125.2161  89.09295  62.00237
ENSG00000000938.12  119.6111 129.81950 131.9999 131.4528 268.7965 135.39350  66.88793
                     sample8  sample9  sample10
ENSG00000000003.14 305.81635 630.4115 388.70197
ENSG00000000005.5    0.00000   0.0000   6.93319
ENSG00000000419.12 628.00010 582.9998 686.99977
ENSG00000000457.13 176.91460 298.7727 196.90257
ENSG00000000460.16  68.84412 150.1030 130.63998
ENSG00000000938.12  45.17055 161.8891 244.72190

After this, the next step in the tutorial is the code below, and I guess with this step, we can avoid gene-level summarization by setting txOut=TRUE, giving the original transcript level estimates as a list of matrices.

txi.tx <- tximport(files, type = "salmon", txOut = TRUE)

txi.sum <- summarizeToGene(txi.t tx, tx2gene, ignoreAfterBar = TRUE)
all.equal(txi$counts, txi.sum$counts)

I wonder what's the gene-level summarization mean? I see there are 'gene id; ENSG00000000938.12' and should I run this step? I would like to use just TPM values as an expression data to compare this with the GDSC expression data.

RNA-seq Salmon tximport • 2.0k views

ADD COMMENT • link updated 4.7 years ago by James W. MacDonald 65k • written 4.7 years ago by woojoy14 • 0

score 2 · Accepted Answer · 2019-08-27

2

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

You could always just check this sort of stuff for yourself. Using the example for tximport:


> txi2 <- tximport(files, "salmon", txOut = TRUE)
reading in files with read_tsv
1 2 3 4 5 6 
> head(txi2$counts)
                    sample1 sample2   sample3   sample4   sample5   sample6
ENST00000456328.2   1.73216   0.000   2.79658  0.230058   7.02211   2.46885
ENST00000450305.2   0.00000   0.000   0.00000  0.000000   0.00000   0.00000
ENST00000488147.1 120.11300 201.888 133.74200 99.882200 166.09100 225.07300
ENST00000619216.1   0.00000   0.000   0.00000  0.000000   0.00000   0.00000
ENST00000473358.1   0.64566   0.000   0.00000  0.000000   0.00000   0.00000
ENST00000469289.1   0.00000   0.000   0.00000  0.000000   0.00000   0.00000
> head(txi2$abundance)
                    sample1 sample2   sample3    sample4  sample5   sample6
ENST00000456328.2 0.0591243  0.0000 0.0631446 0.00681297 0.199672  0.083908
ENST00000450305.2 0.0000000  0.0000 0.0000000 0.00000000 0.000000  0.000000
ENST00000488147.1 5.4642800  8.1588 3.9969600 3.94843000 6.208390 10.417400
ENST00000619216.1 0.0000000  0.0000 0.0000000 0.00000000 0.000000  0.000000
ENST00000473358.1 0.0679640  0.0000 0.0000000 0.00000000 0.000000  0.000000
ENST00000469289.1 0.0000000  0.0000 0.0000000 0.00000000 0.000000  0.000000
> head(as.data.frame(readr::read_tsv(files[1])))
Parsed with column specification:
cols(
  Name = col_character(),
  Length = col_double(),
  EffectiveLength = col_double(),
  TPM = col_double(),
  NumReads = col_double()
)
|=================================================================| 100%    7 MB
               Name Length EffectiveLength       TPM  NumReads
1 ENST00000456328.2   1657        1459.420 0.0591243   1.73216
2 ENST00000450305.2    632         433.851 0.0000000   0.00000
3 ENST00000488147.1   1351        1095.010 5.4642800 120.11300
4 ENST00000619216.1     68          22.000 0.0000000   0.00000
5 ENST00000473358.1    712         473.242 0.0679640   0.64566
6 ENST00000469289.1    535         345.627 0.0000000   0.00000

As you can see, the TPM values are (in tximport lingo) the abundances, and the NumReads are the counts.

As for your second question, salmon aligns to the transcriptome, so all the counts and abundances are estimates of the transcript abundances. Each gene can have one or more transcripts, and if you care to do things at the gene level, you need to add up all the transcript counts or abundances from all of the transcripts for each gene.

ADD COMMENT • link 4.7 years ago James W. MacDonald 65k

0

Entering edit mode

Agree with James, these simple checks are always easy to try on your data.

It’s also listed in ?tximport:

“abundanceCol = name of column with abundances (e.g. TPM or FPKM)”

ADD REPLY • link 4.7 years ago Michael Love 42k

0

Entering edit mode

Thank you for the help! I tried the code to parse with column specification and didn't get any results. How can I get it?

ADD REPLY • link 4.7 years ago woojoy14 • 0

0

Entering edit mode

I ran the code below

txi <- tximport(files, type = "salmon", txOut = TRUE, txIdCol = "Name", abundanceCol = "TPM", 
countsCol = "NumReads", lengthCol = "Length", importer = function(x) read_tsv(x, skip = 8), ignoreAfterBar = TRUE)

and got the error as

Error in tximport(files, type = "none", txOut = TRUE, txIdCol = "Name",  : 
  all(file.exists(files)) is not TRUE

How can I fix the code?

ADD REPLY • link 4.7 years ago woojoy14 • 0

0

Entering edit mode

A couple of things:

You don't need to specify the columns, you can just use the code in the vignette. I.e. type="salmon" means you don't have to specify these columns.

Before posting to the support site, please try a little harder to figure out what is going wrong. It will save you time, and save others time as well.

The error says that the statement "all files exist" does not evaluate to TRUE.

This means that the files in files are not where you say they are. R can't find them there.

ADD REPLY • link 4.7 years ago Michael Love 42k

0

Entering edit mode

I wonder why I can only see Gene ID like ENSG00000000003.14 instead of transcript ID like ENST00000456328.2?

ADD REPLY • link 4.7 years ago woojoy14 • 0

0

Entering edit mode

If you summarize at the gene level you have summed up data for all the underlying transcripts. Why would you then expect to see transcript data?

ADD REPLY • link 4.7 years ago James W. MacDonald 65k