Search
Question: Calculating rpkm from counts data?
0
11 months ago by
Biologist70
Biologist70 wrote:

Hello Everyone,

I got counts data for some samples using "STAR". I need to convert the counts into rpkm values and then normalize them for further analysis. I'm not aware how to do this. Can anyone help me in this? Thank you

Example counts data:

                 V1 V2 V3 V4 V5 V6 V7 V8
ENSG00000000003   0  0  0  0  1  0  0  0
ENSG00000000005   0  0  0  0  0  0  0  0
ENSG00000000419  10 24 19 20 19  8 14  6
ENSG00000000457  17 15 13 18 21 18 21 15
ENSG00000000460   2  3  5  2  4  6  8  2
ENSG00000000938  20  4 35 16 10 17 19  9
modified 11 months ago by kristoffer.vittingseerup20 • written 11 months ago by Biologist70
0
11 months ago by
Aaron Lun20k
Cambridge, United Kingdom
Aaron Lun20k wrote:

Perhaps the rpkm function in edgeR would help?

You'll have to dig out the lengths from somewhere though. If all you have is the Ensembl gene IDs, you could probably use biomaRt to get the transcript length for each gene.

ADD COMMENTlink modified 11 months ago • written 11 months ago by Aaron Lun20k

Hi Aaron,

Thanks for the reply. I know this biomaRt. I have 57,000 Ensembl gene IDs. Is it possible to get gene lengths for all those?

I got the transcript length for Ensembl Ids. But Each ensembl ids having multiple transcript lengths. Which one should I use?

Ensembl Gene ID     Symbol   transcript_length

ENSG00000083642  PDS5B  797

ENSG00000083642  PDS5B  906

ENSG00000083642  PDS5B  684

ENSG00000083642  PDS5B  1889

ENSG00000083642  PDS5B  972

ENSG00000083642  PDS5B  5246

ENSG00000083642  PDS5B  4238

ENSG00000083642  PDS5B  581

ENSG00000083642  PDS5B  7497

Which one should I take as gene_length for calculating rpkm?

1

Personally I would do something like:

library(GenomicFeatures)
hg19.ens <- makeTxDbFromUCSC(genome="hg19", tablename="ensGene")
exonic <- exonsBy(hg19.ens, by="gene")
red.exonic <- reduce(exonic)
exon.lengths <- sum(width(red.exonic))


... which gives you the total exonic length of each gene, indexed by the Ensembl ID.

ADD REPLYlink modified 11 months ago • written 11 months ago by Aaron Lun20k

If not this, can I take the gene which has highest transcript_length from the above mentioned multiple IDs? I mean

ENSG00000083642  PDS5B  7497

ADD REPLYlink modified 11 months ago • written 11 months ago by Biologist70
1

Well, not really, because the longest transcript may not contain all exons of a gene. See my edited answer above for full instructions (you can also use makeTxDbFromBiomart, if you like biomaRt better). However, the best solution would be to dig out the annotation you used for feature counting, and calculate the sum of reduced exon lengths from that. For example, you could run makeTxDbFromGFF on the GTF file.

ADD REPLYlink modified 11 months ago • written 11 months ago by Aaron Lun20k

As you said I took the total exonic length into consideration and did the further analysis.

myDGEList <- DGEList(counts= expressionMatrix , genes= geneDataFrame) Here geneDataFrame is having a column length corresponding to each row in the expression matrix.

To calculate RPKM values I did like following:

myDGEList <- calcNormFactors(myDGEList)

rpkmMatrix <- rpkm(myDGEList)

My question is I want to look at the percentile of a particular gene in a specific sample before and after normalisation. Could you please tell me how to do that?

Looking forward to your response.

Thank you

ADD REPLYlink modified 11 months ago • written 11 months ago by Biologist70

I don't really understand what you want to do, but you can probably use rank to do it.

0
11 months ago by
European Union

The easiest thing is probably to use edgeR. To calculate RPKM you need two things:

1) The count data (which you have)

2) The length of the genes.

You can then construct a DGElist with edgeR as follows:

myDGEList <- DGEList( counts= expressionMatrix , genes= geneDataFrame )

where the "geneDataFrame" is a data.frame with a collum called "lenght" which contains the gene lengths corresponding to each row in the expression matrix.

You can then do the inter-library normalization and calculate RPKM values as follows:

myDGEList <- calcNormFactors(myDGEList)
rpkmMatrix <- rpkm(myDGEList)

Thanks for the reply. I need the length of the genes first. Do you know the easiest way to get?

How did you quantify the genes (which tool)? And which excat database of gene models did you use for the quantification?