Hello everyone. I have been trying to follow a number of walkthroughs on converting a Salmon transcript TPM data frame into gene counts. I think the issue is in the following lines:
samples <- read.table("samples.txt", header = TRUE)
files <- file.path("quant", samples$sample, "quant.sf")
names(files) <- paste0(samples$sample)
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene)
I have successfully created the tx2gene file just using the EnsDb.Hsapiens.v86 package and the transcripts function seen below:
head(tx2gene)
DataFrame with 6 rows and 2 columns
             txid          geneid
      <character>     <character>
1 ENST00000000233 ENSG00000004059
2 ENST00000000412 ENSG00000003056
3 ENST00000000442 ENSG00000173153
4 ENST00000001008 ENSG00000004478
5 ENST00000001146 ENSG00000003137
6 ENST00000002125 ENSG00000003509
However, the Salmon file that I have looks like the one pasted below: a single merged table, as opposed to multiple separate files of transcript TPM values.
head(salmon_output)
                X  sample1   sample2  sample3   sample4   sample5   sample6   sample7   sample8
1 ENST00000000233 16.28690 20.910300  6.85988  4.889860  8.908700  0.000000  2.223280  8.952680
2 ENST00000000412  9.96427 12.695700 31.22860 37.437700 36.617700 16.729400 34.906800 30.086900
3 ENST00000000442  1.23997  0.847996  0.00000  0.000000  0.889783  0.321261  0.451316  0.000000
4 ENST00000001008  9.23394 11.012300 18.04590 20.538000 17.430500 10.035800 17.035700 17.519500
5 ENST00000001146  1.04069  1.508500  0.00000  0.165007  0.201487  0.000000  0.000000  0.390072
6 ENST00000002125  1.09016  2.310980  3.41563  8.428720  5.931020  4.875550  4.959320  5.771440
As you can see, the sample names are already the column headers in the file I was given, and the transcript Ensembl IDs make up the first column, so the "files <- file.path" and tximport(files) commands all fail for me. I am not sure how to work around this to make a txi object that maps the transcript IDs to gene IDs and summarises the counts properly.
Appreciate any help! Thank you!
In that case, how does one go about summarizing to gene counts if that is the only output file I currently have? I was told by our bioinformatics core that they simply merged the TPM values from Salmon into one file with the samples and transcript IDs.
The pipeline that was used was
FastQC --> Trimmomatic --> salmon index using the GENCODE GRCh38.p13 transcriptome --> salmon quant to get TPM
and then I was given a merged file.
You need more data. Rather than my describing in words how the software works, you could ask the core for the original Salmon output, or have them run this step for you.
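To make the limitation concrete, here is a minimal sketch of what a TPM-only table does allow. Summing transcript TPM within each gene gives gene-level TPM, but it does not give gene counts: for those, tximport reads the per-sample quant.sf files, which also carry the NumReads and EffectiveLength columns. The toy values below are copied from the tables in the question, except for the last row, which is a hypothetical second transcript added only so the summing is visible. The plain data.frame is an assumption too; the tx2gene in the question is a DataFrame, which should be indexable the same way with `$`.

```r
# Toy stand-in for the merged table: transcript IDs in the first column,
# one TPM column per sample. The last row is hypothetical.
salmon_output <- data.frame(
  X = c("ENST00000000233", "ENST00000000412",
        "ENST00000000442", "ENST_HYPOTHETICAL_1"),
  sample1 = c(16.28690, 9.96427, 1.23997, 2.5),
  sample2 = c(20.910300, 12.695700, 0.847996, 1.5),
  stringsAsFactors = FALSE
)

# Toy transcript-to-gene map. The hypothetical transcript is assigned to
# ENSG00000004059 so that one gene has two transcripts.
tx2gene <- data.frame(
  txid = c("ENST00000000233", "ENST00000000412",
           "ENST00000000442", "ENST_HYPOTHETICAL_1"),
  geneid = c("ENSG00000004059", "ENSG00000003056",
             "ENSG00000173153", "ENSG00000004059"),
  stringsAsFactors = FALSE
)

# Look up the gene for every transcript row; unmapped transcripts become NA.
gene_of_tx <- tx2gene$geneid[match(salmon_output$X, tx2gene$txid)]
keep <- !is.na(gene_of_tx)

# Sum TPM over the transcripts of each gene. The result is gene-level TPM
# (a genes x samples matrix), NOT estimated counts.
tpm <- as.matrix(salmon_output[keep, -1, drop = FALSE])
gene_tpm <- rowsum(tpm, group = gene_of_tx[keep])

print(gene_tpm)
```

Gene-level TPM from this kind of sum is fine for plotting or rough comparisons, but count-based tools such as DESeq2 or edgeR expect counts, which is why the original per-sample Salmon output is still worth requesting.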
Yes, I did try asking them how they convert their files but they said that was the salmon output file. Will try asking again then. Thanks