Search
Question: Constructing tx2gene for Salmon txImport Quantification using Gencode Mouse Transcript Annotation
0
7 months ago by
pvd21070
pvd21070 wrote:

Hi,

This question is in reference to the recently updated TxImport vignette: https://bioconductor.org/packages/3.7/bioc/vignettes/tximport/inst/doc/tximport.html

I saw a related question from a few months back but wasn't able to follow the solution and am posting again, thanks in advance.

I ran Salmon on my fastq files from a recent mouse RNAseq experiment using the Transcript Sequences CHR Fasta file from the current Gencode Mouse release (https://www.gencodegenes.org/releases/current.html / ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.transcripts.fa.gz). Everything seemed to go smoothly and so now I am in the process of making a tx2gene table to do my feature counting with tximport.

I note in your example that you import a constructed tx2gene .csv table from a recent gencode release.

library(readr)
tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))
head(tx2gene)

I'm trying to get to this point with the gencode.vM16.annotation.gtf file I downloaded from the most recent release, but am not completely following how to go about this. Would you mind spelling this out for me a bit more? This is what I've done and want to make sure I haven't missed anything:

> TxDb <- makeTxDbFromGFF(file = "/Path/to/file/gencode.vM16.annotation.gtf")
> k <- keys(TxDb, keytype = "TXNAME")
> tx2gene <- select(TxDb, k, "GENEID", "TXNAME")

TXNAME               GENEID
1 ENSMUST00000193812.1 ENSMUSG00000102693.1
2 ENSMUST00000082908.1 ENSMUSG00000064842.1
3 ENSMUST00000192857.1 ENSMUSG00000102851.1
4 ENSMUST00000161581.1 ENSMUSG00000089699.1
5 ENSMUST00000192183.1 ENSMUSG00000103147.1
6 ENSMUST00000193244.1 ENSMUSG00000102348.

I wasn't sure if this accomplished the same thing as indexing the GTF to a CSV and then reading that in, because of the error I get when running tximport:

> sample_list <- read.delim("/Path/to/sample/list/20180214_fastq_ID_list copy.txt", sep = "\t", header = F)
V1
1   1_S18_Quants
2   2_S19_Quants
3   3_S20_Quants
4   4_S21_Quants

> files <- file.path("/Path/to/Quantsoutputfolder/SalmonQuants",sample_list[,"V1"], "quant.sf")

When I next tried to run tximport I got an error saying:

> txi <- tximport(files, type = "salmon", tx2gene = tx2gene)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
Error in summarizeToGene(txi, tx2gene, ignoreTxVersion, countsFromAbundance) :

None of the transcripts in the quantification files are present
in the first column of tx2gene. Check to see that you are using
the same annotation for both.

So since I know I'm using the same annotation, I figure that somewhere along the way in making the tx2gene table I've done something wrong. I'm not sure the exact steps from the vignette and if I should be trying to use the ensembldb package. I were to use EnsemblDB would loading the most recent EnsDb.MMusculus.v79 accomplish the same goal after making a data frame including the Gene IDs?

Thanks a lot, really appreciate the help!

Paul

___________________________________________________________________-

As an aside, when i first tried to run tximport I got the following error. I installed biocLite("rjson") and then ran the samples again. I only have biological replicates in my data files, so I am a little confused by the expectation that I have inferential replicates.

1 Error in readInfRepFish(x, type) :
importing inferential replicates for Salmon or Sailfish requires package rjson.
to skip this step, set dropInfReps=TRUE

______________________________________

Also, as another aside, from my reading it sounds like Gencode and Ensembl are supposed to be the same annotation, but I tried to run Salmon on my data in parallel using the most recent Ensembl mouse transcriptome and got a much lower number of mappings that way. Anyone have an idea why that might be?

modified 7 months ago by Michael Love19k • written 7 months ago by pvd21070
1
7 months ago by
Michael Love19k
United States
Michael Love19k wrote:

The best way to start diagnosing this is to look at the names of the transcripts in the quant.sf files. Can you show some of these? You should then compare to the transcript names in tx2gene. These need to be the same. If there is an extra version digit(s) in the quant files, you should use the ignoreTxVersion argument.

Gencode and Ensembl use the same base set of transcripts but are definitely not the same. Just a small example is that Ensembl has identical transcripts on haplotype chromosomes, while Gencode does not include these duplicate transcripts (the duplicate txps cause problems for quantification actually).

ADD COMMENTlink modified 7 months ago • written 7 months ago by Michael Love19k

I just changed the development branch to print out some of the IDs so users can more easily see the descrepancy.

Got it, I think this is definitely where my issue is coming from. Below is a head of the first few transcript quantifications. The first part, the transcript name, is identical to the TXNAME in tx2gene. It's just that it also seems to have a lot of additional gene level inforamtion delimited by |. I can try using a regular expression in R to remove everything in the name after the transcript name and give that a go.

Pauls-MacBook-Pro:1_S18_Quants pdellorusso\$ head quant.sf

Name    Length    EffectiveLength    TPM    NumReads

ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935|OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC|    1070    821    0    0

ENSMUST00000082908.1|ENSMUSG00000064842.1|-|-|Gm26206-201|Gm26206|110|snRNA|    110    4.74936    0    0

ENSMUST00000162897.1|ENSMUSG00000051951.5|OTTMUSG00000026353|OTTMUST00000086625.1|Xkr4-203|Xkr4|4153|processed_transcript|    4153    3904    0    0

ENSMUST00000159265.1|ENSMUSG00000051951.5|OTTMUSG00000026353|OTTMUST00000086624.1|Xkr4-202|Xkr4|2989|processed_transcript|    2989    2740    0.0113983    2

ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353|OTTMUST00000065166.1|Xkr4-201|Xkr4|3634|protein_coding|    3634    3385    0    0

ENSMUST00000192857.1|ENSMUSG00000102851.1|OTTMUSG00000049958|OTTMUST00000127143.1|Gm18956-201|Gm18956|480|processed_pseudogene|    480    231    0    0

ENSMUST00000195335.1|ENSMUSG00000103377.1|OTTMUSG00000049960|OTTMUST00000127145.1|Gm37180-201|Gm37180|2819|TEC|    2819    2570    0    0

ENSMUST00000192336.1|ENSMUSG00000104017.1|OTTMUSG00000049961|OTTMUST00000127146.1|Gm37363-201|Gm37363|2233|TEC|    2233    1984    0    0

ENSMUST00000194099.1|ENSMUSG00000103025.1|OTTMUSG00000049930|OTTMUST00000127101.1|Gm37686-201|Gm37686|2309|TEC|    2309    2060    0    0

tx2gene:

TXNAME               GENEID

1 ENSMUST00000193812.1 ENSMUSG00000102693.1

2 ENSMUST00000082908.1 ENSMUSG00000064842.1

3 ENSMUST00000192857.1 ENSMUSG00000102851.1

4 ENSMUST00000161581.1 ENSMUSG00000089699.1

5 ENSMUST00000192183.1 ENSMUSG00000103147.1

6 ENSMUST00000193244.1 ENSMUSG00000102348.1

Also thanks for the clarification on Gencode vs Ensembl, that was confusing to me for a while. Sounds like Gencode > Ensembl for transcript quantification purposes. Do you have a personal preference of Gencode vs RefSeq for running Salmon? I don't have any apriori preference for my dataset. Really appreciate the help and quick reply.

ADD REPLYlink modified 7 months ago • written 7 months ago by pvd21070
2

You can use the development branch (release in April) of tximport, and use the argument ignoreAfterBar. For Salmon, you would use the “gencode” flag to clean up the txp names.

1

I use Gencode, because I usually work with Ensembl txps downstream. Also they have made it easy to download the various files for different versions and organisms. I find it hard to navigate RefSeq.

Hi Michael,

I would also really like to use the "ignoreAfterBar", but I receive an error saying it is an used argument... I do not know how long this parameter has been around, so this may be partially due to the fact that my current tximport version is set up as 1.6.0 but I can't seem to update the package successfully.

Here is my session info:

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04 LTS

...
other attached packages:
[1] BiocInstaller_1.28.0 tximport_1.6.0       cummeRbund_2.20.0    Gviz_1.22.3          rtracklayer_1.38.3
...
ADD REPLYlink modified 7 weeks ago by Michael Love19k • written 7 weeks ago by rbenel0
1

It's only in version >= 1.8

See here how to install the latest version of Bioconductor:

http://bioconductor.org/install

Thanks! Is this argument using strsplit? I want to use a wrapper function that utilizes tximport with gencode annotations, but it is not updated for the "ignoreAfterBar" argument..

Update: Your function uses sub("\\|.*", "",)

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by rbenel0

If you cannot get the current version of R and Bioconductor, another solution would be to use the --gencode argument when you build the Salmon index, and then re-quantify the files. It only takes a few minutes per sample.

Yes, after using the sub(".*", "",\\|) workaround I ended up re-quantifying my files, this time with the --gencode argument as well as --keepDuplicates since I am interested in transcript analysis. Thanks!