What GENCODE GFF file for calculating exon length for TPM
1
0
Entering edit mode
gernophil • 0
@8acce5d5
Last seen 9 months ago
Germany

Hey everyone, I am extracting the total length of all exons to calculcate TPM from counts using this script:

library(data.table)

gencode_file_gff3 <- fread("gzcat gencode.v43.basic.annotation.gff3.gz | grep -v '^#'")
gencode_file_gff3 <- gencode_file_gff3[V3 == "exon"]

gencode_file_gff3 <- copy(gencode_file_gff3)

# per gene_id
gencode_file_gff3[, gene_id := gsub(".*gene_id=(.*);transcript_id.*", "\\1", V9)]
gencode_file_gff3[, gene_id := ifelse((V1 == "chrY") &
                                           is.element(gene_id,
                                                      gencode_file_gff3[V1 != "chrY"]$gene_id),
                                         paste0(gene_id, "_PAR_Y"), gene_id)]
gencode_file_gff3[, bases := Map(seq, V4, V5)]
gencode_file_gff3[, count_unique_bases := uniqueN(unlist(bases)), by = gene_id] # alternatively use: gencode_file_gff3[, count_unique_bases := length(unique(unlist(bases))), by = gene_id]
gencode_file_gff3 <- gencode_file_gff3[, c("gene_id", "count_unique_bases")]
gencode_file_gff3 <- unique(gencode_file_gff3)

It's analogue to this script using the R package GenomicFeatures, but using data.table (it's producing exact same results):

rm(list=ls())

library(GenomicFeatures)
library(data.table)

txdb <- makeTxDbFromGFF("gencode.v43.basic.annotation.gff3.gz",
                        format = "gff3")

list_exons_per_gene <- exonsBy(txdb, by = "gene")
size_exons_per_gene <- as.data.table(sum(width(reduce(list_exons_per_gene))),
                                     keep.rownames = TRUE)

My question now is, what annotation file should I use? Since GENCODE v42 GENCODE says the basic annotation file is the main file for all users. Before that it was the comprehensive annotation file. The results in total exon length differe using these two annotation types. Which one would be better to use for my purpose (counts -> TPM)?

GenomicFeatures • 1.4k views
ADD COMMENT
0
Entering edit mode
ATpoint ★ 4.6k
@atpoint-13662
Last seen 4 minutes ago
Germany

Note that any length correction without knowledge of expressed transcript composition is basically meaningless imo. If you want accurate length estimates then use salmon to quantify your RNA-seq data. If will produce length estimates per gene and sample based on the transcripts per gene that are actually expressed, hence respecting the transcript expression profile per sample. By taking all exons you might overestimate length as some or many isoforms (that might be long) maybe are not even expressed in your data.

ADD COMMENT
0
Entering edit mode

Totally agree, if this would be my data. However, these are publicly available datasets, that exist as a count data matrix only. Maybe it would be possible to get the original RNAseq, but it would be much easier, if I could just go from this count data.

ADD REPLY

Login before adding your answer.

Traffic: 535 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6