Hi,
This question is in reference to the recently updated TxImport vignette: https://bioconductor.org/packages/3.7/bioc/vignettes/tximport/inst/doc/tximport.html
I saw a related question from a few months back but wasn't able to follow the solution and am posting again, thanks in advance.
I ran Salmon on my fastq files from a recent mouse RNAseq experiment using the Transcript Sequences CHR Fasta file from the current Gencode Mouse release (https://www.gencodegenes.org/releases/current.html / ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.transcripts.fa.gz). Everything seemed to go smoothly and so now I am in the process of making a tx2gene table to do my feature counting with tximport.
I note in your example that you import a constructed tx2gene .csv table from a recent gencode release.
library(readr)
tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))
head(tx2gene)
I'm trying to get to this point with the gencode.vM16.annotation.gtf file I downloaded from the most recent release, but am not completely following how to go about this. Would you mind spelling this out for me a bit more? This is what I've done and want to make sure I haven't missed anything:
> TxDb <- makeTxDbFromGFF(file = "/Path/to/file/gencode.vM16.annotation.gtf")
> k <- keys(TxDb, keytype = "TXNAME")
> tx2gene <- select(TxDb, k, "GENEID", "TXNAME")
> head(tx2gene)
TXNAME GENEID
1 ENSMUST00000193812.1 ENSMUSG00000102693.1
2 ENSMUST00000082908.1 ENSMUSG00000064842.1
3 ENSMUST00000192857.1 ENSMUSG00000102851.1
4 ENSMUST00000161581.1 ENSMUSG00000089699.1
5 ENSMUST00000192183.1 ENSMUSG00000103147.1
6 ENSMUST00000193244.1 ENSMUSG00000102348.
I wasn't sure if this accomplished the same thing as indexing the GTF to a CSV and then reading that in, because of the error I get when running tximport:
> sample_list <- read.delim("/Path/to/sample/list/20180214_fastq_ID_list copy.txt", sep = "\t", header = F)
> head(sample_list)
V1
1 1_S18_Quants
2 2_S19_Quants
3 3_S20_Quants
4 4_S21_Quants
> files <- file.path("/Path/to/Quantsoutputfolder/SalmonQuants",sample_list[,"V1"], "quant.sf")
When I next tried to run tximport I got an error saying:
> txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :
None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.
So since I know I'm using the same annotation, I figure that somewhere along the way in making the tx2gene table I've done something wrong. I'm not sure the exact steps from the vignette and if I should be trying to use the ensembldb package. I were to use EnsemblDB would loading the most recent EnsDb.MMusculus.v79 accomplish the same goal after making a data frame including the Gene IDs?
Thanks a lot, really appreciate the help!
Paul
___________________________________________________________________-
As an aside, when i first tried to run tximport I got the following error. I installed biocLite("rjson") and then ran the samples again. I only have biological replicates in my data files, so I am a little confused by the expectation that I have inferential replicates.
reading in files with read_tsv
1 Error in readInfRepFish(x, type) :
importing inferential replicates for Salmon or Sailfish requires package `rjson`.
to skip this step, set dropInfReps=TRUE
______________________________________
Also, as another aside, from my reading it sounds like Gencode and Ensembl are supposed to be the same annotation, but I tried to run Salmon on my data in parallel using the most recent Ensembl mouse transcriptome and got a much lower number of mappings that way. Anyone have an idea why that might be?
I just changed the development branch to print out some of the IDs so users can more easily see the descrepancy.
Got it, I think this is definitely where my issue is coming from. Below is a head of the first few transcript quantifications. The first part, the transcript name, is identical to the TXNAME in tx2gene. It's just that it also seems to have a lot of additional gene level inforamtion delimited by |. I can try using a regular expression in R to remove everything in the name after the transcript name and give that a go.
Pauls-MacBook-Pro:1_S18_Quants pdellorusso$ head quant.sf
Name Length EffectiveLength TPM NumReads
ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935|OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC| 1070 821 0 0
ENSMUST00000082908.1|ENSMUSG00000064842.1|-|-|Gm26206-201|Gm26206|110|snRNA| 110 4.74936 0 0
ENSMUST00000162897.1|ENSMUSG00000051951.5|OTTMUSG00000026353|OTTMUST00000086625.1|Xkr4-203|Xkr4|4153|processed_transcript| 4153 3904 0 0
ENSMUST00000159265.1|ENSMUSG00000051951.5|OTTMUSG00000026353|OTTMUST00000086624.1|Xkr4-202|Xkr4|2989|processed_transcript| 2989 2740 0.0113983 2
ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353|OTTMUST00000065166.1|Xkr4-201|Xkr4|3634|protein_coding| 3634 3385 0 0
ENSMUST00000192857.1|ENSMUSG00000102851.1|OTTMUSG00000049958|OTTMUST00000127143.1|Gm18956-201|Gm18956|480|processed_pseudogene| 480 231 0 0
ENSMUST00000195335.1|ENSMUSG00000103377.1|OTTMUSG00000049960|OTTMUST00000127145.1|Gm37180-201|Gm37180|2819|TEC| 2819 2570 0 0
ENSMUST00000192336.1|ENSMUSG00000104017.1|OTTMUSG00000049961|OTTMUST00000127146.1|Gm37363-201|Gm37363|2233|TEC| 2233 1984 0 0
ENSMUST00000194099.1|ENSMUSG00000103025.1|OTTMUSG00000049930|OTTMUST00000127101.1|Gm37686-201|Gm37686|2309|TEC| 2309 2060 0 0
tx2gene:
> head(tx2gene)
TXNAME GENEID
1 ENSMUST00000193812.1 ENSMUSG00000102693.1
2 ENSMUST00000082908.1 ENSMUSG00000064842.1
3 ENSMUST00000192857.1 ENSMUSG00000102851.1
4 ENSMUST00000161581.1 ENSMUSG00000089699.1
5 ENSMUST00000192183.1 ENSMUSG00000103147.1
6 ENSMUST00000193244.1 ENSMUSG00000102348.1
Also thanks for the clarification on Gencode vs Ensembl, that was confusing to me for a while. Sounds like Gencode > Ensembl for transcript quantification purposes. Do you have a personal preference of Gencode vs RefSeq for running Salmon? I don't have any apriori preference for my dataset. Really appreciate the help and quick reply.
You can use the development branch (release in April) of tximport, and use the argument ignoreAfterBar. For Salmon, you would use the “gencode” flag to clean up the txp names.
I use Gencode, because I usually work with Ensembl txps downstream. Also they have made it easy to download the various files for different versions and organisms. I find it hard to navigate RefSeq.
Hi Michael,
I would also really like to use the "ignoreAfterBar", but I receive an error saying it is an used argument... I do not know how long this parameter has been around, so this may be partially due to the fact that my current tximport version is set up as 1.6.0 but I can't seem to update the package successfully.
Here is my session info:
It's only in version >= 1.8
See here how to install the latest version of Bioconductor:
http://bioconductor.org/install
Thanks! Is this argument using strsplit? I want to use a wrapper function that utilizes tximport with gencode annotations, but it is not updated for the "ignoreAfterBar" argument..
Update: Your function uses
sub("\\|.*", "",)
If you cannot get the current version of R and Bioconductor, another solution would be to use the --gencode argument when you build the Salmon index, and then re-quantify the files. It only takes a few minutes per sample.
Yes, after using the
sub(".*", "",
\\|)
workaround I ended up re-quantifying my files, this time with the--gencode
argument as well as--keepDuplicates
since I am interested in transcript analysis. Thanks!