Calculating rpkm from counts data?
2
0
Entering edit mode
Biologist ▴ 90
@biologist-9801
Last seen 2.4 years ago

Hello Everyone,

I got counts data for some samples using "STAR". I need to convert the counts into rpkm values and then normalize them for further analysis. I'm not aware how to do this. Can anyone help me in this? Thank you

Example counts data:

                 V1 V2 V3 V4 V5 V6 V7 V8
ENSG00000000003   0  0  0  0  1  0  0  0
ENSG00000000005   0  0  0  0  0  0  0  0
ENSG00000000419  10 24 19 20 19  8 14  6
ENSG00000000457  17 15 13 18 21 18 21 15
ENSG00000000460   2  3  5  2  4  6  8  2
ENSG00000000938  20  4 35 16 10 17 19  9
r featurecounts rpkm tpm edger • 4.9k views
0
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 3 hours ago
The city by the bay

Perhaps the rpkm function in edgeR would help?

You'll have to dig out the lengths from somewhere though. If all you have is the Ensembl gene IDs, you could probably use biomaRt to get the transcript length for each gene.

0
Entering edit mode

Hi Aaron,

Thanks for the reply. I know this biomaRt. I have 57,000 Ensembl gene IDs. Is it possible to get gene lengths for all those?

0
Entering edit mode

I got the transcript length for Ensembl Ids. But Each ensembl ids having multiple transcript lengths. Which one should I use?

0
Entering edit mode

Ensembl Gene ID     Symbol   transcript_length

ENSG00000083642  PDS5B  797

ENSG00000083642  PDS5B  906

ENSG00000083642  PDS5B  684

ENSG00000083642  PDS5B  1889

ENSG00000083642  PDS5B  972

ENSG00000083642  PDS5B  5246

ENSG00000083642  PDS5B  4238

ENSG00000083642  PDS5B  581

ENSG00000083642  PDS5B  7497

Which one should I take as gene_length for calculating rpkm?

1
Entering edit mode

Personally I would do something like:

library(GenomicFeatures)
hg19.ens <- makeTxDbFromUCSC(genome="hg19", tablename="ensGene")
exonic <- exonsBy(hg19.ens, by="gene")
red.exonic <- reduce(exonic)
exon.lengths <- sum(width(red.exonic))


... which gives you the total exonic length of each gene, indexed by the Ensembl ID.

0
Entering edit mode

If not this, can I take the gene which has highest transcript_length from the above mentioned multiple IDs? I mean

ENSG00000083642  PDS5B  7497

1
Entering edit mode

Well, not really, because the longest transcript may not contain all exons of a gene. See my edited answer above for full instructions (you can also use makeTxDbFromBiomart, if you like biomaRt better). However, the best solution would be to dig out the annotation you used for feature counting, and calculate the sum of reduced exon lengths from that. For example, you could run makeTxDbFromGFF on the GTF file.

0
Entering edit mode

As you said I took the total exonic length into consideration and did the further analysis.

myDGEList <- DGEList(counts= expressionMatrix , genes= geneDataFrame) Here geneDataFrame is having a column length corresponding to each row in the expression matrix.

To calculate RPKM values I did like following:

myDGEList <- calcNormFactors(myDGEList)

rpkmMatrix <- rpkm(myDGEList)

My question is I want to look at the percentile of a particular gene in a specific sample before and after normalisation. Could you please tell me how to do that?

Thank you

0
Entering edit mode

I don't really understand what you want to do, but you can probably use rank to do it.

0
Entering edit mode
@kristoffervittingseerup-7310
Last seen 4.2 years ago
European Union

The easiest thing is probably to use edgeR. To calculate RPKM you need two things:

1) The count data (which you have)

2) The length of the genes.

You can then construct a DGElist with edgeR as follows:

myDGEList <- DGEList( counts= expressionMatrix , genes= geneDataFrame )

where the "geneDataFrame" is a data.frame with a collum called "lenght" which contains the gene lengths corresponding to each row in the expression matrix.

You can then do the inter-library normalization and calculate RPKM values as follows:

myDGEList <- calcNormFactors(myDGEList)
rpkmMatrix <- rpkm(myDGEList)

0
Entering edit mode

Thanks for the reply. I need the length of the genes first. Do you know the easiest way to get?

0
Entering edit mode

How did you quantify the genes (which tool)? And which excat database of gene models did you use for the quantification?

0
Entering edit mode

Ok. I dont have any problem with this now. I have transformed the counts data into rpkmMatrix as you said. I want to make a density plot with the rpkm values of the samples. How can I do this?