Question

tximport: kallisto + GENCODE transcript sanitization

0

Entering edit mode

Michael Steinbaugh ▴ 90

@mjsteinbaugh

Last seen 5 months ago

Cambridge, MA

Hi all,

I ran into an edge case situation of kallisto not processing GENCODE transcript identifiers correctly, and this currently propagates into tximport. Ideally this should be fixed upstream in kallisto, but we should harden tximport against this situation.

Here's an example kallisto run aligned against GENCODE that is problematic: https://share.steinbaugh.com/kallisto-gencode-dmso.tar.gz

Can we add a step to check for identifiers still incorrectly pipe (|) delimited?

target_id   length  eff_length  est_counts  tpm
ENST00000456328.2|ENSG00000290825.1|-|OTTHUMT00000362751.1|DDX11L2-202|DDX11L2|1657|lncRNA| 1657    1431.94 16.5617 0.57248

Here's a rough draft:

library(tximport)
files <- sort(list.files(
    path = "kallisto-gencode",
    pattern = "abundance.h5",
    full.names = TRUE,
    recursive = TRUE
))
names(files) <- make.names(basename(dirname(files)))
txi <- tximport(
    files = files,
    type = "kallisto",
    txIn = TRUE,
    txOut = TRUE,
    ignoreTxVersion = FALSE
)
extractGencodeIds <- function(x) {
    if (!all(grepl(pattern = "|", x = x, fixed = TRUE))) {
        return(x)
    }
    mat <- do.call(
        what = rbind,
        args = strsplit(x = x, split = "|", fixed = TRUE)
    )
    mat[, 1L]
}
rownames(txi[["abundance"]]) <- extractGencodeIds(rownames(txi[["abundance"]]))
rownames(txi[["counts"]]) <- extractGencodeIds(rownames(txi[["counts"]]))
rownames(txi[["length"]]) <- extractGencodeIds(rownames(txi[["length"]]))

See related:

Paging Michael Love

Best, Mike

tximport • 1.1k views

ADD COMMENT • link updated 14 months ago by Michael Love 41k • written 14 months ago by Michael Steinbaugh ▴ 90

score 1 · Answer 1 · 2023-02-14

1

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 hour ago

United States

While tximport adapts to different upstream tools based on different column names, I don't want to also have the package responsible for guessing about processing the transcript names. We accommodate differences in transcript names currently to deal with tx2gene mapping (e.g. dropping version numbers), but for txOut=TRUE I don't want to be handling processing of the names of the transcripts were in the transcriptome reference FASTA. There could be important information that was added manually by the user, and then tximport may strip that out.

Why not use sub by the way?

sub("\\|.*", "", ...)

ADD COMMENT • link 14 months ago Michael Love 41k

0

Entering edit mode

Thanks for the update Mike, I'll see if we can fix this upstream in kallisto. salmon has a similar mode with the --gencode flag to handle this situation with the FASTA file.

ADD REPLY • link 14 months ago Michael Steinbaugh ▴ 90

0

Entering edit mode

PS here's a modified version of the code above that works using sub instead:

extractGencodeIds <- function(x) {
    if (!all(grepl(pattern = "|", x = x, fixed = TRUE))) {
        return(x)
    }
    x <- sub(pattern = "^([^\\|]+)\\|.+$", replacement = "\\1", x = x)
    x
}

ADD REPLY • link 14 months ago Michael Steinbaugh ▴ 90

score 0 · Answer 2 · 2023-02-15

0

Entering edit mode

Michael Steinbaugh ▴ 90

@mjsteinbaugh

Last seen 5 months ago

Cambridge, MA

Safe to close this issue. In case anybody else comes across this problem in the future, see related fix in my bcbioRNASeq package.