How to get gene length for RPKM directly from DGEList object in latest edgeR?
Entering edit mode
anikng ▴ 10
Last seen 23 months ago

Starting from featureCounts generated raw counts file, I used edgeR to estimate the DE analysis and it went well. Now I use CPM normalized files to explore some specific genes expression in multiple pathways. I am aware that CPM are corrected for library size without considering gene length. Is that OK to use this file for individual gene analysis and generate plots for publication OR do I need another normalized file? Keeping it in mind, I was trying to get RPKM normalized file. But even after reading similar posts, I am not sure how can I get input gene length to rpkm() function. This discussion tells that recent version of edgeR can directly find gene length from DGEList object. I am using edgeR_3.28.1 and can anyone direct me how to get the gene length so that I can export RPKM? Related info: I downloaded rice genome from MSU and reference assembly was done with Hisat2. Currently, I have only raw counts files with me(ie, no .bam files available).

Here is the code I used to generate CPM. normalization,

raw_counts<-read.delim("rawcounts.txt",row.names="Locus",check.names = TRUE)
y<-DGEList(counts = raw_counts, group = group)

#Filterout low count genes

keep <-rowSums(cpm(y)>=2) >=2
y <- y[keep, , keep.lib.sizes=FALSE]


#How can I incorporate gene length in rpkm()?
Normalization edgeR RPKM • 2.8k views
Entering edit mode
Last seen 25 minutes ago
WEHI, Melbourne, Australia

featureCounts returns the length of each gene. You should use the gene lengths returned by featureCounts because they correspond exactly to the gene annotation used to create the counts.

My previous answer on this topic (which you link to in your question) linked to a complete worked example showing how to get gene lengths from featureCounts, how to store the gene lengths in the DGEList and how to use them to compute rpkm

If for some reason you've lost the gene lengths returned by featureCounts, you can compute them again from the GTF file:

SAF <- Rsubread::flattenGTF("rice_annotation.gtf")
GeneLength <- rowsum(SAF$End-SAF$Start+1, SAF$GeneID)
Entering edit mode

Thanks@Gordon Symth. But featureCounts requires bam/sam files to estimate gene length (unfortunately, I don't have those mapped files with me). So do you think it is not possible to get gene length with featureCounts OR am I misinterpreting the document?

Entering edit mode

Gene lengths are computed from the gene annotation, not from the BAM files.

Your question says that the counts were obtained from featureCounts, so featureCounts must have been run and hence the gene lengths must be available, unless you deleted them.

Even if you have discarded the gene lengths for some reason, you can easily compute them again from the same GTF annotation that you used to get the counts.

You cannot get gene lengths from transcript lengths. Gene length is defined as the total bases covered by exons for that gene. This is as least as long as the length of the longest transcript length but may be longer.

If all you have is transcript lengths, then use the longest transcript length for each gene.

Entering edit mode

It's actually pretty simple to get the gene lengths from a TxDb package (or object):

> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> ex <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, "gene")
## reduce to longest exon extents
> ex <- reduce(ex)
## compute exon lengths, by gene
> exlen <- relist(width(unlist(ex)), ex)
## add them up
> exlens <- sapply(exlen, sum)
> head(exlens)
        1        10       100      1000     10000 100009613 
     6528      1285      2809      4634     18587      1004

And something very similar could be done using the TxDb that the OP generated.

Entering edit mode

Ok, I think I got it. Could you please confirm it?

I ran featureCounts with a single bam file (also used the same gtf file which was used to estimate raw counts)

featureCounts -t exon -g gene_id -a rice_annotation.gtf -o OUTPUT.txt 13.bam 

Then from the OUTPUT.txt, extracted the gene length from column 'Length' and input into rpm() function.

Entering edit mode

Yes, you can do it that way.

Or you could run featureCounts at the R prompt.

Or you could use the TxDb code that James MacDonald has provided.

Or you can compute gene lengths directly from the GTF file using code I have added to my answer above.

Entering edit mode
Last seen 17 hours ago
United States

In order to generate counts using featureCounts you had to have some information about the genes, from which you could compute the gene lengths, because rice isn't one of the inbuilt annotations. So you could presumably use those data to compute the gene lengths. The problem with using MSU's annotation is they have their own locus IDs, so you need to use their data in order to do anything. Hypothetically they might have a GTF or GFF file (I can't get to their download site right now), which you could use to generate a TxDb package.

But without knowing what you have (and MSU's download page seems unreachable right not) the only answer I can give is that you need to use the data you got from MSU to get the gene lengths.

Entering edit mode

Thanks @James W. MacDonald for your reply. MSU provided a gtf file and as you suggested, I generated gene length using TxDb from GenomicFeatures package. Can I use the longest transcript length from 'gene_lens' to feed rpkm() function?

 gff_path <- 'rice_annotation.gtf'
 x<-makeTxDbFromGFF(file = gff_path, format = 'gtf')
 tx_by_gene <- transcriptsBy(x, by="gene")
 gene_lens <- max(width(tx_by_gene))

I used the same gtf file and genome build from MSU for mapping and counts estimation. Code for above gene length identificationis here


Login before adding your answer.

Traffic: 434 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6