I analyzed some RNA-seq samples using kallisto, to quantify the expression of protein-coding genes, and also Telescope, to quantify expression of human endogenous retroviruses (HERVs). Ultimately, I would like to concatenate HERV and gene counts per sample, so that I can apply a TMM normalization, followed by an inverse normal transformation to these values collectively, as suggested by GTEx for an eQTL analysis.
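For clarity, this is what I mean by the inverse normal transformation step: a minimal, language-agnostic sketch in plain Python rather than the GTEx pipeline code. It would be applied to one feature (gene or HERV) at a time, across samples, after TMM normalization. The function name and the Blom-style offset of 0.375 are my own assumptions; the exact offset used by GTEx may differ.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=0.375):
    """Rank-based inverse normal transform of one feature across samples.

    `offset` is an assumed Blom-style constant; other conventions exist.
    """
    n = len(values)
    # Indices of the samples, sorted by expression value.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Tied samples share the average of their 1-based ranks.
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    # Map each rank to a quantile strictly inside (0, 1), then to a z-score.
    return [
        std_normal.inv_cdf((r - offset) / (n - 2 * offset + 1))
        for r in ranks
    ]
```

Because only the ranks within each feature matter here, any monotone per-feature rescaling would not change the output of this step; what matters is how the normalization before it changes values between samples.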
The problem is that, whereas there are programs like tximport to import kallisto files into an edgeR object, the same is not true for the output of Telescope, so I am unsure how to concatenate these data uniformly before applying the TMM normalization and the inverse normal transformation. My main concern is that tximport produces gene-level estimated counts and an associated edgeR offset matrix, for example, which I don't know how to construct for the Telescope output (and different HERVs will have different lengths!). Do you have any advice on the best practice for merging information from these two quantification tools?
From what I understand, Telescope outputs raw read counts (integers), corresponding to how many transcripts per specific HERV copy were detected in the RNA-seq data. Therefore, to import these counts into a DESeq2 object, for example, I would import a table with the raw HERV counts using the DESeqDataSetFromMatrix function. For protein-coding genes, however, I have the impression that the counts in the tximport object corresponding to my gene expression quantification from kallisto, txi.tx$counts, will be normalized to average gene length (since I summarized transcripts to genes when importing the kallisto files, so it creates an "offset matrix"?). Thus, I don't know if I could simply concatenate txi.tx$counts with my table containing HERV raw counts before applying the TMM normalization and inverse normal transformation. In terms of HERVs, I've got their putative lengths and I could correct those counts for their lengths (count divided by length; is that what the offset table is doing?), but I don't understand if this would be equivalent to the transcript-length normalization applied by
Can anyone shed light on what the best practice would be here?
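To make the "count divided by length" idea concrete, here is a minimal sketch in plain Python of the correction I had in mind for the HERV counts. This is not tximport or edgeR code, and whether it matches what the offset matrix does is exactly my question. The function names, the feature IDs herv_a and herv_b, and all numbers are hypothetical.

```python
def counts_per_kilobase(counts, lengths_bp):
    """Divide each raw count by the feature length in kilobases."""
    return {f: counts[f] / (lengths_bp[f] / 1000.0) for f in counts}


def tpm_like(counts, lengths_bp):
    """Rescale the per-kilobase rates of one sample to sum to one million."""
    rates = counts_per_kilobase(counts, lengths_bp)
    total = sum(rates.values())
    return {f: rate / total * 1e6 for f, rate in rates.items()}


# Hypothetical sample: two HERV loci with equal raw counts, different lengths.
herv_counts = {"herv_a": 100, "herv_b": 100}
herv_lengths_bp = {"herv_a": 1000, "herv_b": 2000}

per_kb = counts_per_kilobase(herv_counts, herv_lengths_bp)
scaled = tpm_like(herv_counts, herv_lengths_bp)
```

In this toy sample the shorter locus ends up with twice the length-corrected value of the longer one, even though the raw counts are equal.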