Hi all,
I am trying to find TPM gene expression using RNA-Seq data mapped by STAR with --quantMode GeneCounts to the genome, so I am starting with a counts table with samples in columns and gene counts in rows. I do not want to use this dataset for DEG analysis, rather I want to find expression levels by TPM. My code is actually working and I'm wondering if it is fine to start with reads mapped to genes instead of transcripts? Also, I'm looking for feedback on my data processing (especially exon lengths method) since I am unable to find a straightforward tutorial for calculating gene expression from RNA-Seq data as most go to DEG analysis. Thanks for your input! PS if there is any issue, if you can advise resources to help I would really appreciate it!
#read in the annotation
anno1 <- read.delim("/Users/Documents/Analysis/ensembl-ChlSab1-1.tsv",as.is=T)
#read in the counts dataset from STAR output
Verocounts <- read.delim("~/Documents/Analysis/VeroAnalysis1-AnnoChrOnly/Vero_rnaseqcounts_col4.txt", row.names = 1)
#load the gtf file
gtf <- "Chlorocebus_sabaeus.ChlSab1.1.109.chr.gtf"
txdb.filename <- "Chlorocebus_sabaeus.ChlSab1.1.109.chr.gtf.sqlite"
#Make the dge object
Verocounts <- read.delim("~/Documents/Analysis/Vero_rnaseqcounts_col4.txt", row.names = 1)
head(Verocounts)
d1 <- DGEList(Verocounts)
head(d1)
dim(d1) #27556 x 6
#get the exon lengths from the annotation to input as vector for TPM calculation
txdb <- makeTxDbFromGFF(gtf, format="gtf")
exonic <- exonsBy(txdb, by="gene")
red.exonic <- reduce(exonic)
exon.lengths <- vapply(width(red.exonic), sum, numeric(1))
# TPM
VeroTPM <- convertCounts(d1$counts,
unit = "tpm",
geneLength = exon.lengths,
log = FALSE,
prior.count=0)
VeroTPM <- data.frame(VeroTPM)
#Add in the gene names from annotation
# Find matching rows based on row names
matching_rows <- match(rownames(VeroTPM), anno1$Gene.stable.ID)
# Subset the relevant columns from `anno1`
additional_info <- anno1[matching_rows, c("Gene.stable.ID", "Gene.name")]
# Append the additional column to `Log2FPKM`
VeroTPM$Gene.name <- additional_info$Gene.name
#output is TPM counts output with samples in columns and counts in rows
sessionInfo( )