Question: Using 'tximport' library for downstream DGE after quantifying with Kallisto
1
gravatar for cguzman.bioinformatics
3.1 years ago by
cguzman.bioinformatics10 wrote:

Cross-posted on Biostars: https://www.biostars.org/p/187335/

Sorry for horrible formatting. I am not used to the markdown on this site.

I'm quite new to RNA-sequencing and am playing around with data to get a handle on it. I have quantified with `Kallisto` and am using `tximport` to summarize transcript counts for differential gene expression analysis.

I am running into a problem associating gene ID's with my transcripts for the summarization portion. I believe that the likely cause is the actual TxDb library I am using and that it may be different from the transcriptome file I used, but I am not sure and my attempts at solving this haven't been successful.

I am working with human samples. I quantified my transcripts using [this transcriptome file for homo sapiens][1]. I have 6 samples, 3 WT replicates, and 3 KO replicates.

1. I created a vector pointing to my kallisto files as detailed in the [tximport manual.][2] 

    `files <- file.path(dir, "kallisto", samples$run, "abundance.tsv")`

2. I created a data.frame from a TxDb object to construct the tx2gene table.

    `library(TxDb.Hsapiens.UCSC.hg38.knownGene)`

    `txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene`

    `k <- keys(txdb, keytype = "GENEID")`

    `df <- select(txdb, keys = k, keytype = "GENEID", columns = "TXNAME")`

    `tx2gene <- df[, 2:1]  # tx ID, then gene ID`

But `head(tx2gene)` produces:

    TXNAME GENEID
    1 uc002qsd.4      1
    2 uc002qsf.2      1
    3 uc003wyw.1     10
    4 uc002xmj.3    100
    5 uc010xbn.1   1000
    6 uc002kwg.2   1000

This obviously isn't right.

3. Using tximport's `tximport` function.

    `library(tximport)`

    `library(readr)`

    `txi <- tximport(files, type = "kallisto", tx2gene = tx2gene, reader = read_tsv)`

    `names(txi)`

Does the following:

> txi $abundance  
>
sample 1 sample 2 sample 3 sample 4 sample 5 sample 6

> $counts  
>
sample 1 sample 2 sample 3 sample 4 sample 5 sample 6

> $length
>
 sample 1 sample 2 sample 3 sample 4 sample 5 sample 6

> $countsFromAbundance 
>
> [1] "no"

and `head(txi$counts)`:

> head(txi$counts)
>
> sample 1 sample 2 sample 3 sample 4 sample 5 sample 6

I'm not completely sure what i'm doing incorrectly. I'll give it another shot after lunch, it might just be the frustration at this point but any help is appreciated.


  [1]: http://bio.math.berkeley.edu/kallisto/transcriptomes/
  [2]: https://github.com/mikelove/tximport/blob/master/vignettes/tximport.md

rna-seq tximport kallisto • 3.0k views
ADD COMMENTlink modified 2.2 years ago by pablo_garcia0520 • written 3.1 years ago by cguzman.bioinformatics10

Hi, I don't wnat to open a new discussion so maybe here I can get help. 

When you are working with no identified transcripts from a de novo assembly (Trinity) how do you cover this step? I mean, I don't have a data source for this tx2gene requirement.

 

Thank you.

ADD REPLYlink written 2.2 years ago by pablo_garcia0520

I moved this post from an 'Answer' to a 'Comment'

You need to come up with a way to group transcripts into genes if you want to summarize your transcripts into genes.

I believe that new software from Alicia Oshlack's group can help with this: 

https://github.com/Oshlack/Lace/wiki

https://github.com/Oshlack/Lace/wiki/Algorithm

It groups transcripts by pairwise alignments.

ADD REPLYlink written 2.2 years ago by Michael Love23k

Thank you M. Love and one question, what about to use the output file with the suffix ".genes" produced by Trinity quantification pipeline? for example when I run :

$TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
         --left AA_1.fastq  \
         --right AA_2.fastq  \
         --transcripts Trinity.fasta  \
         --output_prefix AA \
         --est_method kallisto \
         --trinity_mode  \
         --output_dir AA_kallisto

It generates abundance.tsv and abundance.tsv.genes. Could I continue from this ".genes" > tx2gene > DESeq2??

Thank you for your att.

ADD REPLYlink written 2.2 years ago by pablo_garcia0520

You should be able to, as long as you can specify the names of the columns in which to find the appropriate information. For example, when you tell tximport, type="salmon" or "sailfish" etc, it just populates the arguments which specify the column names. For gene-level quantification, you would follow the example of RSEM, which inside tximport populates the following arguments for the user:

  if (type == "rsem") {
    txIn <- FALSE
    geneIdCol <- "gene_id"
    abundanceCol <- "FPKM"
    countsCol <- "expected_count"
    lengthCol <- "effective_length"
  }
ADD REPLYlink written 2.2 years ago by Michael Love23k

Hello Mr Love, thank you for your suggestions to my last questions. I come again with new doubts.

I'm doing this DGE with a de novo assembly so I tried to apply your suggestion (in this case I have used EvidentialGene to cluster my transcript to reduce the redundancy). Finally I could manage the construction of this tx2genes file just blasting my transcriptome and generating this 2 columns file with: TranscriptID | blast hit. I think that's correct for those genes identified but I have a lot of transcript unidentified and now they're out of the DGE analysis because they can't be summarize to gene level.

What do you think about this approach? I have this doubt because report differences about "contigs" deferentially expressed when I don't know what they are or if they come from the same gene or different genes seems a bad choice. In this case my actual use of tximport>DESeq2 could be fine?

Thanks a lot,

(please excuse me if my English is not good enough or polite I try my best...)

Pablo GF

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by pablo_garcia0520

hi Pablo,

I don't follow the exact details but if you can come up with a reasonable clustering of de novo transcripts into likely genes, the rest of the tximport => DESeq2 pipeline should give you useful results. 

I think Corset from Alicia Oshlack's group is another software you might take a look at:

https://github.com/Oshlack/Corset/wiki

ADD REPLYlink written 2.1 years ago by Michael Love23k
Answer: Using 'tximport' library for downstream DGE after quantifying with Kallisto
6
gravatar for Michael Love
3.1 years ago by
Michael Love23k
United States
Michael Love23k wrote:

Hi, 

So that's a Ensembl transcriptome you used with kallisto. You can see the genome and Ensembl release info here: ...GRCh38.rel79...

Meanwhile, TxDb.Hsapiens.UCSC.hg38.knownGene is UCSC based and doesn't tell you anything about the Ensembl genes and transcripts.

Here is an annotation package which will give you access to the Ensembl Release 79 transcripts and genes:

http://bioconductor.org/packages/release/data/annotation/html/EnsDb.Hsapiens.v79.html

How to use the objects in this Ensembl annotation package is shown in this "parent" package:

http://bioconductor.org/packages/release/bioc/html/ensembldb.html

I just scanned the vignette for ensembldb and found some useful code. For example, this gives you a DataFrame linking transcripts and genes:

txdf <- transcripts(EnsDb.Hsapiens.v79, return.type="DataFrame")

Then:

tx2gene <- as.data.frame(txdf[,c("tx_id","gene_id")])

 

ADD COMMENTlink written 3.1 years ago by Michael Love23k

This is what I had thought was the problem, I just could not for the life of me find the TxDb of Ensembl.


Thank you!

ADD REPLYlink written 3.1 years ago by cguzman.bioinformatics10

in v0.99.9 I've added a more informative error if no transcripts in tx2gene are present in the quantification files, and I've added some information to the vignette regarding Ensembl transcriptomes and the use of transcripts() with the ensembldb packages.

ADD REPLYlink written 3.1 years ago by Michael Love23k
Answer: Using 'tximport' library for downstream DGE after quantifying with Kallisto
2
gravatar for James W. MacDonald
3.1 years ago by
United States
James W. MacDonald50k wrote:

The TxDb package you are using is based on UCSC's transcript IDs, not Ensembl, which is apparently the source of the transcripts you used. If you are using Ensembl-centric transcripts, then you should probably be using one of the EnsDbs that Johannes Rainer generates, for instance this one.

But what you are doing is sort of weird, using Ensembl transcript IDs, and mapping to NCBI's Gene IDs. There will always be issues mapping between the two, so you might want to use either Ensembl transcript and gene IDs, or NCBI's RefSeq and Gene IDs.

If you want to do the cross-annotation source mapping, you can use the org.Hs.eg.db package.

> d.f <- select(org.Hs.eg.db, keys(org.Hs.eg.db), "ENSEMBLTRANS")
'select()' returned 1:many mapping between keys and columns
> head(d.f[,2:1])
     ENSEMBLTRANS ENTREZID
1 ENST00000596924        1
2 ENST00000263100        1
3 ENST00000595014        1
4 ENST00000598345        1
5 ENST00000600966        1
6 ENST00000543436        2

 

ADD COMMENTlink written 3.1 years ago by James W. MacDonald50k

I completely agree, I was aware that I was using different annotations so that's why I mentioned that this was probably my problem. I could not find the TxDb for Ensemble, thank you so much!

ADD REPLYlink written 3.1 years ago by cguzman.bioinformatics10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 144 users visited in the last hour