Question

Which GENCODE GFF file should I use to calculate exon length for TPM?

gernophil • 0

@8acce5d5

Germany

Hey everyone, I am extracting the total length of all exons per gene to calculate TPM from counts using this script:

library(data.table)

# Read the GFF3 without its comment lines (gzcat is the macOS name; on Linux use zcat)
gencode_file_gff3 <- fread(cmd = "gzcat gencode.v43.basic.annotation.gff3.gz | grep -v '^#'")
# Keep exon features only; copy() so that := below works on a full table, not a subset
gencode_file_gff3 <- copy(gencode_file_gff3[V3 == "exon"])

# per gene_id
gencode_file_gff3[, gene_id := gsub(".*gene_id=(.*);transcript_id.*", "\\1", V9)]
# Genes on chrY whose ID also occurs on another chromosome (PAR genes) get their own ID
gencode_file_gff3[, gene_id := ifelse((V1 == "chrY") &
                                           is.element(gene_id,
                                                      gencode_file_gff3[V1 != "chrY"]$gene_id),
                                         paste0(gene_id, "_PAR_Y"), gene_id)]
# Expand every exon into its base positions, then count the unique positions per gene
gencode_file_gff3[, bases := Map(seq, V4, V5)]
gencode_file_gff3[, count_unique_bases := uniqueN(unlist(bases)), by = gene_id]
# alternatively use: gencode_file_gff3[, count_unique_bases := length(unique(unlist(bases))), by = gene_id]
gencode_file_gff3 <- unique(gencode_file_gff3[, c("gene_id", "count_unique_bases")])
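
Expanding every exon into its individual base positions can need a lot of memory on a full annotation. As a minimal sketch of an alternative, the same per-gene number can be obtained by merging overlapping exon intervals and summing the widths of the merged blocks. The helper name `union_length` and the toy coordinates are made up for illustration and are not part of the original script; coordinates are assumed to be 1-based and inclusive, as in GFF3.

```r
# Hypothetical helper: number of bases covered by a set of closed intervals
# [start, end], counting overlapping bases only once.
union_length <- function(starts, ends) {
  stopifnot(length(starts) == length(ends), all(ends >= starts))
  if (length(starts) == 0L) return(0)
  ord <- order(starts, ends)
  starts <- starts[ord]
  ends <- ends[ord]
  # Largest end coordinate seen up to and including each interval
  run_end <- cummax(ends)
  # An interval opens a new merged block if it starts after all earlier ones ended
  new_block <- c(TRUE, starts[-1] > run_end[-length(run_end)])
  block_id <- cumsum(new_block)
  block_start <- tapply(starts, block_id, min)
  block_end <- tapply(ends, block_id, max)
  sum(block_end - block_start + 1)
}

# Toy exons (made-up coordinates): geneA has two overlapping exons
exons <- data.frame(
  gene_id = c("geneA", "geneA", "geneA", "geneB"),
  start   = c(100, 150, 400, 10),
  end     = c(200, 250, 450, 19)
)
lens <- sapply(split(exons, exons$gene_id),
               function(g) union_length(g$start, g$end))
print(lens)
```

With the table from the script above, this could presumably be applied as `gencode_file_gff3[, .(count_unique_bases = union_length(V4, V5)), by = gene_id]`, assuming V4 and V5 hold the exon start and end as in the original code.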

It is analogous to this script using the R package GenomicFeatures, but written with data.table (both produce exactly the same results):

rm(list=ls())

library(GenomicFeatures)
library(data.table)

txdb <- makeTxDbFromGFF("gencode.v43.basic.annotation.gff3.gz",
                        format = "gff3")

list_exons_per_gene <- exonsBy(txdb, by = "gene")
size_exons_per_gene <- as.data.table(sum(width(reduce(list_exons_per_gene))),
                                     keep.rownames = TRUE)

My question now is: which annotation file should I use? Since v42, GENCODE says the basic annotation file is the main file for all users. Before that, it was the comprehensive annotation file. The total exon lengths differ between these two annotation types. Which one would be better to use for my purpose (counts -> TPM)?
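
For context, the counts -> TPM step that the gene lengths feed into can be sketched as follows. This is only an illustration: the function name `counts_to_tpm`, the toy counts and the toy lengths are made up, and the sketch assumes a genes x samples count matrix whose rows are in the same order as the length vector.

```r
# Hypothetical helper: convert a genes x samples count matrix to TPM,
# given one length (in bp) per gene, in the same order as the matrix rows.
counts_to_tpm <- function(counts, gene_length_bp) {
  stopifnot(is.matrix(counts),
            nrow(counts) == length(gene_length_bp),
            all(gene_length_bp > 0))
  # Reads per kilobase: each row is divided by its own gene length
  rate <- counts / (gene_length_bp / 1000)
  # Scale every sample (column) so that its values sum to one million
  # (a sample with only zero counts would give NaN here)
  sweep(rate, 2, colSums(rate), "/") * 1e6
}

# Toy data (made up): 3 genes, 2 samples
counts <- matrix(c(10, 20, 5,
                   0, 40, 10),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length_bp <- c(g1 = 1000, g2 = 2000, g3 = 500)

tpm <- counts_to_tpm(counts, gene_length_bp)
print(tpm)
print(colSums(tpm))
```

Because the length enters per gene before the per-sample scaling, switching between lengths derived from the basic and the comprehensive annotation changes the TPM of every gene in a sample, not only of the genes whose length differs.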

GenomicFeatures • 1.5k views

ADD COMMENT • link 20 months ago gernophil • 0

score 0 · Answer 1 · 2023-06-27

ATpoint ★ 4.7k

@atpoint-13662

Germany

Note that, in my opinion, any length correction without knowledge of the expressed transcript composition is basically meaningless. If you want accurate length estimates, use salmon to quantify your RNA-seq data. It will produce length estimates per gene and sample based on the transcripts of each gene that are actually expressed, hence respecting the transcript expression profile of each sample. By taking all exons you might overestimate the length, as some or many isoforms (which might be long) may not even be expressed in your data.
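
To illustrate the point about transcript composition, here is a minimal sketch with made-up numbers. It computes a per-sample gene length as the abundance-weighted mean of the transcript lengths, which is one common way to summarise transcript-level quantification at the gene level. The helper `weighted_gene_length`, the three isoforms and their abundances are all hypothetical, and this is not claimed to be what salmon does internally.

```r
# Hypothetical helper: gene length as the abundance-weighted mean of the
# lengths of its transcripts in one sample.
weighted_gene_length <- function(tx_length, tx_abundance) {
  stopifnot(length(tx_length) == length(tx_abundance),
            sum(tx_abundance) > 0)
  sum(tx_length * tx_abundance) / sum(tx_abundance)
}

# Toy gene with three isoforms (lengths in bp, made up)
tx_length <- c(tx1 = 1500, tx2 = 4000, tx3 = 900)

# Made-up transcript abundances in two samples
tx_abundance_s1 <- c(tx1 = 90, tx2 = 0,  tx3 = 10)  # long isoform not expressed
tx_abundance_s2 <- c(tx1 = 10, tx2 = 80, tx3 = 10)  # long isoform dominates

len_s1 <- weighted_gene_length(tx_length, tx_abundance_s1)
len_s2 <- weighted_gene_length(tx_length, tx_abundance_s2)
print(c(s1 = len_s1, s2 = len_s2))  # 1440 and 3440
```

In this toy case the same gene gets a length of 1440 bp in one sample and 3440 bp in the other, whereas a single union-of-exons length would be at least 4000 bp (the longest isoform) for both samples.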

ADD COMMENT • link 20 months ago ATpoint ★ 4.7k

Totally agree, if this were my own data. However, these are publicly available datasets that exist only as count matrices. It might be possible to get the original RNA-seq data, but it would be much easier if I could work directly from the count data.

ADD REPLY • link 20 months ago gernophil • 0