Hello Everyone,
I obtained count data for some samples using STAR. I need to convert the counts into RPKM values and then normalize them for further analysis, but I am not sure how to do this. Can anyone help me with this? Thank you.
Example counts data:
V1 V2 V3 V4 V5 V6 V7 V8
ENSG00000000003 0 0 0 0 1 0 0 0
ENSG00000000005 0 0 0 0 0 0 0 0
ENSG00000000419 10 24 19 20 19 8 14 6
ENSG00000000457 17 15 13 18 21 18 21 15
ENSG00000000460 2 3 5 2 4 6 8 2
ENSG00000000938 20 4 35 16 10 17 19 9
Hi Aaron,
Thanks for the reply. I know about biomaRt. I have 57,000 Ensembl gene IDs. Is it possible to get gene lengths for all of them?
I got the transcript lengths for the Ensembl IDs, but each Ensembl gene ID has multiple transcript lengths. Which one should I use?
Ensembl Gene ID Symbol transcript_length
ENSG00000083642 PDS5B 797
ENSG00000083642 PDS5B 906
ENSG00000083642 PDS5B 684
ENSG00000083642 PDS5B 1889
ENSG00000083642 PDS5B 972
ENSG00000083642 PDS5B 5246
ENSG00000083642 PDS5B 4238
ENSG00000083642 PDS5B 581
ENSG00000083642 PDS5B 7497
Which one should I take as gene_length for calculating rpkm?
Personally I would do something like:
... which gives you the total exonic length of each gene, indexed by the Ensembl ID.
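To illustrate what "total exonic length" means here, the following is a minimal base R sketch, not the code from the answer above. It merges overlapping exons of each gene and sums the lengths of the merged blocks, so that bases shared between transcripts are counted once. The exon table, its coordinates, and the helper `union_length` are all made up for illustration. With a TxDb object, the same idea is usually expressed with `exonsBy(txdb, by = "gene")`, followed by `reduce` and `sum(width(...))`.

```r
# Toy exon table: gene names and coordinates are invented for illustration.
exons <- data.frame(
  gene  = c("geneA", "geneA", "geneA", "geneB"),
  start = c(100, 150, 400, 10),
  end   = c(200, 300, 500, 60)
)

# Hypothetical helper: length of the union of closed intervals [start, end]
# (1-based, inclusive coordinates, as in a GTF file).
union_length <- function(start, end) {
  o <- order(start)
  start <- start[o]
  end <- end[o]
  total <- 0
  cur_start <- start[1]
  cur_end <- end[1]
  for (i in seq_along(start)[-1]) {
    if (start[i] <= cur_end + 1) {
      # Overlapping or adjacent exon: extend the current merged block.
      cur_end <- max(cur_end, end[i])
    } else {
      # Gap: close the current block and start a new one.
      total <- total + (cur_end - cur_start + 1)
      cur_start <- start[i]
      cur_end <- end[i]
    }
  }
  total + (cur_end - cur_start + 1)
}

# One total exonic length per gene, indexed by gene name.
exonic_length <- sapply(
  split(exons, exons$gene),
  function(g) union_length(g$start, g$end)
)
print(exonic_length)
```

In this toy example the first two exons of geneA overlap, so they are merged into one block before the lengths are summed.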
If not this, can I use the transcript with the highest transcript_length among the multiple entries listed above for this gene? I mean:
ENSG00000083642 PDS5B 7497
Well, not really, because the longest transcript may not contain all exons of a gene. See my edited answer above for full instructions (you can also use makeTxDbFromBiomart, if you like biomaRt better). However, the best solution would be to dig out the annotation you used for feature counting, and calculate the sum of reduced exon lengths from that. For example, you could run makeTxDbFromGFF on the GTF file.
As you said, I took the total exonic length into consideration and did the further analysis.
myDGEList <- DGEList(counts = expressionMatrix, genes = geneDataFrame)
Here geneDataFrame has a column named length, with one entry for each row of the expression matrix.
To calculate RPKM values, I did the following:
myDGEList <- calcNormFactors(myDGEList)
rpkmMatrix <- rpkm(myDGEList)
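For reference, here is a small base R sketch of the arithmetic behind RPKM, using invented counts and gene lengths rather than the data in this thread. It uses the raw column sums as library sizes. As far as I know, edgeR's rpkm() on a DGEList that has been through calcNormFactors uses the effective library sizes (library size times normalization factor) by default, so its values will generally differ slightly from this plain version.

```r
# Toy inputs (hypothetical values, not taken from the thread).
counts <- matrix(
  c(10,  20,
    30,  60,
    60, 120),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))
)
gene_length <- c(g1 = 1000, g2 = 2000, g3 = 500)  # total exonic length in bases

# Raw library size per sample (no TMM scaling in this sketch).
lib_size <- colSums(counts)

# Step 1: counts per million, dividing each column by its library size.
cpm_manual <- sweep(counts, 2, lib_size / 1e6, "/")

# Step 2: divide each row by the gene length in kilobases.
rpkm_manual <- sweep(cpm_manual, 1, gene_length / 1e3, "/")
print(rpkm_manual)
```

Sample s2 has exactly twice the counts of s1 for every gene, so after dividing by library size the two columns come out identical.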
My question is this: I want to look at the percentile of a particular gene within a specific sample, before and after normalisation. Could you please tell me how to do that?
Looking forward to your response.
Thank you
I don't really understand what you want to do, but you can probably use rank to do it.
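One possible reading of the question is "where does this gene fall among all genes in one sample?". Under that assumption, a minimal sketch with rank could look like the following. The matrix values and the helper name `percentile_of` are made up for illustration.

```r
# Toy expression matrix (hypothetical values): genes in rows, samples in columns.
expr_toy <- matrix(
  c(5, 0, 12, 3, 8,
    7, 1,  2, 9, 4),
  ncol = 2,
  dimnames = list(paste0("g", 1:5), c("s1", "s2"))
)

# Hypothetical helper: percentile of one gene among all genes in one sample,
# defined here as 100 * rank / number of genes (ties get their average rank).
percentile_of <- function(mat, gene, sample) {
  x <- mat[, sample]
  unname(100 * rank(x, ties.method = "average")[gene] / length(x))
}

print(percentile_of(expr_toy, "g1", "s1"))
print(percentile_of(expr_toy, "g1", "s2"))
```

The same function can be applied to the raw count matrix and to the RPKM matrix to compare the two. Note that within a single sample, scaling by library size alone does not change the ranks; the division by gene length is what can move a gene up or down.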