tximport: kallisto + GENCODE transcript sanitization
2
0
Entering edit mode
@mjsteinbaugh
Last seen 13 months ago
Cambridge, MA

Hi all,

I ran into an edge case situation of kallisto not processing GENCODE transcript identifiers correctly, and this currently propagates into tximport. Ideally this should be fixed upstream in kallisto, but we should harden tximport against this situation.

Here's an example kallisto run aligned against GENCODE that is problematic: https://share.steinbaugh.com/kallisto-gencode-dmso.tar.gz

Can we add a step to check for identifiers still incorrectly pipe (|) delimited?

target_id   length  eff_length  est_counts  tpm
ENST00000456328.2|ENSG00000290825.1|-|OTTHUMT00000362751.1|DDX11L2-202|DDX11L2|1657|lncRNA| 1657    1431.94 16.5617 0.57248

Here's a rough draft:

library(tximport)
files <- sort(list.files(
    path = "kallisto-gencode",
    pattern = "abundance.h5",
    full.names = TRUE,
    recursive = TRUE
))
names(files) <- make.names(basename(dirname(files)))
txi <- tximport(
    files = files,
    type = "kallisto",
    txIn = TRUE,
    txOut = TRUE,
    ignoreTxVersion = FALSE
)
extractGencodeIds <- function(x) {
    if (!all(grepl(pattern = "|", x = x, fixed = TRUE))) {
        return(x)
    }
    mat <- do.call(
        what = rbind,
        args = strsplit(x = x, split = "|", fixed = TRUE)
    )
    mat[, 1L]
}
rownames(txi[["abundance"]]) <- extractGencodeIds(rownames(txi[["abundance"]]))
rownames(txi[["counts"]]) <- extractGencodeIds(rownames(txi[["counts"]]))
rownames(txi[["length"]]) <- extractGencodeIds(rownames(txi[["length"]]))

See related:

Paging Michael Love

Best, Mike

tximport • 1.6k views
ADD COMMENT
1
Entering edit mode
@mikelove
Last seen 20 hours ago
United States

While tximport adapts to different upstream tools based on different column names, I don't want to also have the package responsible for guessing about processing the transcript names. We accommodate differences in transcript names currently to deal with tx2gene mapping (e.g. dropping version numbers), but for txOut=TRUE I don't want to be handling processing of the names of the transcripts were in the transcriptome reference FASTA. There could be important information that was added manually by the user, and then tximport may strip that out.

Why not use sub by the way?

sub("\\|.*", "", ...)
ADD COMMENT
0
Entering edit mode

Thanks for the update Mike, I'll see if we can fix this upstream in kallisto. salmon has a similar mode with the --gencode flag to handle this situation with the FASTA file.

ADD REPLY
0
Entering edit mode

PS here's a modified version of the code above that works using sub instead:

extractGencodeIds <- function(x) {
    if (!all(grepl(pattern = "|", x = x, fixed = TRUE))) {
        return(x)
    }
    x <- sub(pattern = "^([^\\|]+)\\|.+$", replacement = "\\1", x = x)
    x
}
ADD REPLY
0
Entering edit mode
@mjsteinbaugh
Last seen 13 months ago
Cambridge, MA

Safe to close this issue. In case anybody else comes across this problem in the future, see related fix in my bcbioRNASeq package.

ADD COMMENT
0
Entering edit mode

Awesome thanks.

ADD REPLY

Login before adding your answer.

Traffic: 489 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6