Search
Question: Calculating rpkm from counts data?
0
gravatar for ghk
7 days ago by
ghk0
ghk0 wrote:

Hello Everyone,

I got counts data for some samples using "STAR". I need to convert the counts into rpkm values and then normalize them for further analysis. I'm not aware how to do this. Can anyone help me in this? Thank you

Example counts data:

                 V1 V2 V3 V4 V5 V6 V7 V8
ENSG00000000003   0  0  0  0  1  0  0  0
ENSG00000000005   0  0  0  0  0  0  0  0
ENSG00000000419  10 24 19 20 19  8 14  6
ENSG00000000457  17 15 13 18 21 18 21 15
ENSG00000000460   2  3  5  2  4  6  8  2
ENSG00000000938  20  4 35 16 10 17 19  9
ADD COMMENTlink modified 7 days ago by kristoffer.vittingseerup20 • written 7 days ago by ghk0
0
gravatar for Aaron Lun
7 days ago by
Aaron Lun15k
Cambridge, United Kingdom
Aaron Lun15k wrote:

Perhaps the rpkm function in edgeR would help?

You'll have to dig out the lengths from somewhere though. If all you have is the Ensembl gene IDs, you could probably use biomaRt to get the transcript length for each gene.

ADD COMMENTlink modified 7 days ago • written 7 days ago by Aaron Lun15k

Hi Aaron,

Thanks for the reply. I know this biomaRt. I have 57,000 Ensembl gene IDs. Is it possible to get gene lengths for all those? 

ADD REPLYlink written 7 days ago by ghk0

I got the transcript length for Ensembl Ids. But Each ensembl ids having multiple transcript lengths. Which one should I use?

ADD REPLYlink written 6 days ago by ghk0

Ensembl Gene ID     Symbol   transcript_length

ENSG00000083642  PDS5B  797

ENSG00000083642  PDS5B  906

ENSG00000083642  PDS5B  684

ENSG00000083642  PDS5B  1889

ENSG00000083642  PDS5B  972

ENSG00000083642  PDS5B  5246

ENSG00000083642  PDS5B  4238

ENSG00000083642  PDS5B  581

ENSG00000083642  PDS5B  7497

Which one should I take as gene_length for calculating rpkm?

ADD REPLYlink written 6 days ago by ghk0
1

Personally I would do something like:

library(GenomicFeatures)
hg19.ens <- makeTxDbFromUCSC(genome="hg19", tablename="ensGene")
exonic <- exonsBy(hg19.ens, by="gene")
red.exonic <- reduce(exonic)
exon.lengths <- sum(width(red.exonic))

... which gives you the total exonic length of each gene, indexed by the Ensembl ID.

ADD REPLYlink modified 3 days ago • written 6 days ago by Aaron Lun15k

If not this, can I take the gene which has highest transcript_length from the above mentioned multiple IDs? I mean 

ENSG00000083642  PDS5B  7497

ADD REPLYlink modified 5 days ago • written 5 days ago by ghk0
1

Well, not really, because the longest transcript may not contain all exons of a gene. See my edited answer above for full instructions (you can also use makeTxDbFromBiomart, if you like biomaRt better). However, the best solution would be to dig out the annotation you used for feature counting, and calculate the sum of reduced exon lengths from that. For example, you could run makeTxDbFromGFF on the GTF file.

ADD REPLYlink modified 3 days ago • written 3 days ago by Aaron Lun15k

As you said I took the total exonic length into consideration and did the further analysis.

myDGEList <- DGEList(counts= expressionMatrix , genes= geneDataFrame) Here geneDataFrame is having a column length corresponding to each row in the expression matrix.

To calculate RPKM values I did like following:

myDGEList <- calcNormFactors(myDGEList)

rpkmMatrix <- rpkm(myDGEList)

My question is I want to look at the percentile of a particular gene in a specific sample before and after normalisation. Could you please tell me how to do that?

Looking forward to your response.

Thank you

ADD REPLYlink modified 2 days ago • written 2 days ago by ghk0

I don't really understand what you want to do, but you can probably use rank to do it.

ADD REPLYlink written 2 days ago by Aaron Lun15k
0
gravatar for kristoffer.vittingseerup
7 days ago by
European Union

The easiest thing is probably to use edgeR. To calculate RPKM you need two things:

1) The count data (which you have)

2) The length of the genes.

You can then construct a DGElist with edgeR as follows:

 

    myDGEList <- DGEList( counts= expressionMatrix , genes= geneDataFrame )

 

where the "geneDataFrame" is a data.frame with a collum called "lenght" which contains the gene lengths corresponding to each row in the expression matrix.

You can then do the inter-library normalization and calculate RPKM values as follows:

 

    myDGEList <- calcNormFactors(myDGEList)
    rpkmMatrix <- rpkm(myDGEList)

 

 

 

ADD COMMENTlink written 7 days ago by kristoffer.vittingseerup20

Thanks for the reply. I need the length of the genes first. Do you know the easiest way to get?

ADD REPLYlink written 7 days ago by ghk0

How did you quantify the genes (which tool)? And which excat database of gene models did you use for the quantification?

ADD REPLYlink written 4 days ago by kristoffer.vittingseerup20

Ok. I dont have any problem with this now. I have transformed the counts data into rpkmMatrix as you said. I want to make a density plot with the rpkm values of the samples. How can I do this?

ADD REPLYlink written 3 days ago by ghk0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 228 users visited in the last hour