Question

Calculating rpkm from counts data?

0

Entering edit mode

Biologist ▴ 120

@biologist-9801

Last seen 4.8 years ago

Hello Everyone,

I got counts data for some samples using "STAR". I need to convert the counts into rpkm values and then normalize them for further analysis. I'm not aware how to do this. Can anyone help me in this? Thank you

Example counts data:

                 V1 V2 V3 V4 V5 V6 V7 V8
ENSG00000000003   0  0  0  0  1  0  0  0
ENSG00000000005   0  0  0  0  0  0  0  0
ENSG00000000419  10 24 19 20 19  8 14  6
ENSG00000000457  17 15 13 18 21 18 21 15
ENSG00000000460   2  3  5  2  4  6  8  2
ENSG00000000938  20  4 35 16 10 17 19  9

r featurecounts rpkm tpm edger • 8.7k views

ADD COMMENT • link updated 22 months ago by Chris ▴ 20 • written 7.4 years ago by Biologist ▴ 120

score 0 · Answer 1 · 2017-08-11

0

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 5 hours ago

The city by the bay

Perhaps the rpkm function in edgeR would help?

You'll have to dig out the lengths from somewhere though. If all you have is the Ensembl gene IDs, you could probably use biomaRt to get the transcript length for each gene.

ADD COMMENT • link 7.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

Thanks for the reply. I know this biomaRt. I have 57,000 Ensembl gene IDs. Is it possible to get gene lengths for all those?

ADD REPLY • link 7.4 years ago Biologist ▴ 120

0

Entering edit mode

I got the transcript length for Ensembl Ids. But Each ensembl ids having multiple transcript lengths. Which one should I use?

ADD REPLY • link 7.4 years ago Biologist ▴ 120

0

Entering edit mode

Ensembl Gene ID Symbol transcript_length

ENSG00000083642 PDS5B 797

ENSG00000083642 PDS5B 906

ENSG00000083642 PDS5B 684

ENSG00000083642 PDS5B 1889

ENSG00000083642 PDS5B 972

ENSG00000083642 PDS5B 5246

ENSG00000083642 PDS5B 4238

ENSG00000083642 PDS5B 581

ENSG00000083642 PDS5B 7497

Which one should I take as gene_length for calculating rpkm?

ADD REPLY • link 7.4 years ago Biologist ▴ 120

1

Entering edit mode

Personally I would do something like:

library(GenomicFeatures)
hg19.ens <- makeTxDbFromUCSC(genome="hg19", tablename="ensGene")
exonic <- exonsBy(hg19.ens, by="gene")
red.exonic <- reduce(exonic)
exon.lengths <- sum(width(red.exonic))

... which gives you the total exonic length of each gene, indexed by the Ensembl ID.

ADD REPLY • link 7.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

If not this, can I take the gene which has highest transcript_length from the above mentioned multiple IDs? I mean

ENSG00000083642 PDS5B 7497

ADD REPLY • link 7.4 years ago Biologist ▴ 120

1

Entering edit mode

Well, not really, because the longest transcript may not contain all exons of a gene. See my edited answer above for full instructions (you can also use makeTxDbFromBiomart, if you like biomaRt better). However, the best solution would be to dig out the annotation you used for feature counting, and calculate the sum of reduced exon lengths from that. For example, you could run makeTxDbFromGFF on the GTF file.

ADD REPLY • link 7.4 years ago Aaron Lun ★ 28k

0

Entering edit mode

As you said I took the total exonic length into consideration and did the further analysis.

myDGEList <- DGEList(counts= expressionMatrix , genes= geneDataFrame) Here geneDataFrame is having a column length corresponding to each row in the expression matrix.

To calculate RPKM values I did like following:

myDGEList <- calcNormFactors(myDGEList)

rpkmMatrix <- rpkm(myDGEList)

My question is I want to look at the percentile of a particular gene in a specific sample before and after normalisation. Could you please tell me how to do that?

Looking forward to your response.

Thank you

ADD REPLY • link 7.4 years ago Biologist ▴ 120

0

Entering edit mode

I don't really understand what you want to do, but you can probably use rank to do it.

ADD REPLY • link 7.4 years ago Aaron Lun ★ 28k

score 0 · Answer 2 · 2017-08-11

0

Entering edit mode

kristoffer.vittingseerup ▴ 20

@kristoffervittingseerup-7310

Last seen 6.6 years ago

European Union

The easiest thing is probably to use edgeR. To calculate RPKM you need two things:

1) The count data (which you have)

2) The length of the genes.

You can then construct a DGElist with edgeR as follows:

myDGEList <- DGEList( counts= expressionMatrix , genes= geneDataFrame )

where the "geneDataFrame" is a data.frame with a collum called "lenght" which contains the gene lengths corresponding to each row in the expression matrix.

You can then do the inter-library normalization and calculate RPKM values as follows:

myDGEList <- calcNormFactors(myDGEList)
rpkmMatrix <- rpkm(myDGEList)

ADD COMMENT • link 7.4 years ago kristoffer.vittingseerup ▴ 20

0

Entering edit mode

Thanks for the reply. I need the length of the genes first. Do you know the easiest way to get?

ADD REPLY • link 7.4 years ago Biologist ▴ 120

0

Entering edit mode

How did you quantify the genes (which tool)? And which excat database of gene models did you use for the quantification?

ADD REPLY • link 7.4 years ago kristoffer.vittingseerup ▴ 20

0

Entering edit mode

Ok. I dont have any problem with this now. I have transformed the counts data into rpkmMatrix as you said. I want to make a density plot with the rpkm values of the samples. How can I do this?

ADD REPLY • link 7.4 years ago Biologist ▴ 120

0

Entering edit mode

Would you tell me how you get the length of the genes? I also want to transform the count matrix to rpkm.

ADD REPLY • link 22 months ago Chris ▴ 20