Hey everyone, I am extracting the total length of all exons to calculcate TPM from counts using this script:
library(data.table)
gencode_file_gff3 <- fread("gzcat gencode.v43.basic.annotation.gff3.gz | grep -v '^#'")
gencode_file_gff3 <- gencode_file_gff3[V3 == "exon"]
gencode_file_gff3 <- copy(gencode_file_gff3)
# per gene_id
gencode_file_gff3[, gene_id := gsub(".*gene_id=(.*);transcript_id.*", "\\1", V9)]
gencode_file_gff3[, gene_id := ifelse((V1 == "chrY") &
is.element(gene_id,
gencode_file_gff3[V1 != "chrY"]$gene_id),
paste0(gene_id, "_PAR_Y"), gene_id)]
gencode_file_gff3[, bases := Map(seq, V4, V5)]
gencode_file_gff3[, count_unique_bases := uniqueN(unlist(bases)), by = gene_id] # alternatively use: gencode_file_gff3[, count_unique_bases := length(unique(unlist(bases))), by = gene_id]
gencode_file_gff3 <- gencode_file_gff3[, c("gene_id", "count_unique_bases")]
gencode_file_gff3 <- unique(gencode_file_gff3)
It's analogue to this script using the R package GenomicFeatures, but using data.table (it's producing exact same results):
rm(list=ls())
library(GenomicFeatures)
library(data.table)
txdb <- makeTxDbFromGFF("gencode.v43.basic.annotation.gff3.gz",
format = "gff3")
list_exons_per_gene <- exonsBy(txdb, by = "gene")
size_exons_per_gene <- as.data.table(sum(width(reduce(list_exons_per_gene))),
keep.rownames = TRUE)
My question now is, what annotation file should I use? Since GENCODE v42 GENCODE says the basic
annotation file is the main file for all users. Before that it was the comprehensive
annotation file. The results in total exon length differe using these two annotation types. Which one would be better to use for my purpose (counts -> TPM)?
Totally agree, if this would be my data. However, these are publicly available datasets, that exist as a count data matrix only. Maybe it would be possible to get the original RNAseq, but it would be much easier, if I could just go from this count data.