Hi!
I'm trying to get the TPM matrix with the output of salmon. I've been following this tutorial: https://bioconductor.org/packages/release/bioc/vignettes/tximport/inst/doc/tximport.html#salmon-sailfish.
I ran the code below and got the results of 'txi' which includes '"abundance" "counts" "length" "countsFromAbundance". Does the 'counts' mean 'TPM values'?
setwd('C:/Users/wj/Documents/Projects/RNA pipeline/salmon')
library(tximport)
library(GenomicFeatures)
library(readr)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
dir <- getwd()
samples <- read.table(file.path(dir, "samples.txt"), header = TRUE)
samples
files <- file.path(dir, "results", samples$run, "quant.sf")
names(files) <- paste0("sample", 1:10)
all(file.exists(files))
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
k <- keys(txdb, keytype = "TXNAME")
tx2gene <- select(txdb, k, "GENEID", "TXNAME")
tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))
head(tx2gene)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreTxVersion = TRUE)
txi <- tximport(files, type = "salmon", tx2gene = tx2gene, ignoreAfterBar = TRUE)
names(txi)
head(txi)
head(txi$counts)
sample1 sample2 sample3 sample4 sample5 sample6 sample7
ENSG00000000003.14 1013.0136 655.57232 538.1663 321.2297 422.5202 244.36843 229.24551
ENSG00000000005.5 0.0000 0.00000 0.0000 0.0000 0.0000 0.00000 0.00000
ENSG00000000419.12 953.9996 758.00019 599.0003 690.0004 462.0001 369.00012 823.99990
ENSG00000000457.13 209.3692 202.10540 108.3376 210.0156 140.6283 182.69919 68.18909
ENSG00000000460.16 132.8974 75.81567 100.3391 176.4574 125.2161 89.09295 62.00237
ENSG00000000938.12 119.6111 129.81950 131.9999 131.4528 268.7965 135.39350 66.88793
sample8 sample9 sample10
ENSG00000000003.14 305.81635 630.4115 388.70197
ENSG00000000005.5 0.00000 0.0000 6.93319
ENSG00000000419.12 628.00010 582.9998 686.99977
ENSG00000000457.13 176.91460 298.7727 196.90257
ENSG00000000460.16 68.84412 150.1030 130.63998
ENSG00000000938.12 45.17055 161.8891 244.72190
After this, the next step in the tutorial is the code below, and I guess with this step, we can avoid gene-level summarization by setting txOut=TRUE, giving the original transcript level estimates as a list of matrices.
txi.tx <- tximport(files, type = "salmon", txOut = TRUE)
txi.sum <- summarizeToGene(txi.t tx, tx2gene, ignoreAfterBar = TRUE)
all.equal(txi$counts, txi.sum$counts)
I wonder what's the gene-level summarization mean? I see there are 'gene id; ENSG00000000938.12' and should I run this step? I would like to use just TPM values as an expression data to compare this with the GDSC expression data.
Agree with James, these simple checks are always easy to try on your data.
It’s also listed in ?tximport:
“abundanceCol = name of column with abundances (e.g. TPM or FPKM)”
Thank you for the help! I tried the code to parse with column specification and didn't get any results. How can I get it?
I ran the code below
and got the error as
How can I fix the code?
A couple of things:
You don't need to specify the columns, you can just use the code in the vignette. I.e.
type="salmon"
means you don't have to specify these columns.Before posting to the support site, please try a little harder to figure out what is going wrong. It will save you time, and save others time as well.
The error says that the statement "all files exist" does not evaluate to TRUE.
This means that the files in
files
are not where you say they are. R can't find them there.I wonder why I can only see Gene ID like ENSG00000000003.14 instead of transcript ID like ENST00000456328.2?
If you summarize at the gene level you have summed up data for all the underlying transcripts. Why would you then expect to see transcript data?