Search
Question: Finding gene lengths of mm10 to calculate rpkm from count matrix of rna-seq dataset
0
2.8 years ago by
AB10
United States
AB10 wrote:

Hi everyone,

I am trying to generate the RPKM values using the rpkm() function. But it is asking for gene lengths. How can I get the gene lengths of mm10 ?

Apoorva

modified 2.8 years ago • written 2.8 years ago by AB10

Please see the posting guide. Your question is too vague for anybody to be able to help. There are just over 1100 packages in Bioconductor, any number of which may have an rpkm() function. If you want help, you should make it as easy as possible for people to know exactly what you are trying to do, what packages, etc.

My apologies. I'm a newbie and  I'm just trying to figure things out. I need to compute the Rpkm values from the count matrix generated from 22 samples. I tried using the rpkm() function which is present in the edgeR package but that requires gene lengths. The reference genome I am using is mm10. I am using Bioconductor in a Windows machine. So i am unable to load the Rsubread package to use the featureCounts function. I was just wondering if there was any other way to get the gene lengths or compute the rpkm from the count matrix.

If you didn't use featureCounts(), then how did you compute the count matrix?

Most read count programs will return the gene lengths that correspond to the same gene annotation used to count the reads.

I used the summarizeOverlaps function from the GenomicAlignmnets package to get the count table

You can install a Linux-like environment such as Cygwin (https://www.cygwin.com/) on your Windows computer, and then you can install R on Cygwin and run Rsubread and other R packages.

Is that true / easy? For instance, R doesn't support compilation on cygwin, using instead it's own MinGW-derived tool chain. So is it 'easy' to compile packages with this toolschain but link to the necessary cygwin libraries? This seems like a lot of work for something like RPKM. A docker image might be easier and more modern if one is interested in a linux environment on Windows.

Hi Martin, I am not saying this is easy. It needs computer science knowledge to install these software. But I do know people who successfully installed R on cygwin and ran Rsubread on it.

3
2.8 years ago by
United States
James W. MacDonald47k wrote:

OK, that makes more sense. Probably the easiest way to do this is to use the TxDb.Mmusculus.UCSC.mm10.knownGene package.

> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
> exns <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene")
## all the exon ranges, by gene
> exns
GRangesList object of length 23725:
$100009600 GRanges object with 7 ranges and 2 metadata columns: seqnames ranges strand | exon_id exon_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chr9 [21062393, 21062717] - | 133241 <NA> [2] chr9 [21062894, 21062987] - | 133242 <NA> [3] chr9 [21063314, 21063396] - | 133243 <NA> [4] chr9 [21066024, 21066377] - | 133244 <NA> [5] chr9 [21066940, 21067925] - | 133245 <NA> [6] chr9 [21068030, 21068117] - | 133246 <NA> [7] chr9 [21073075, 21075496] - | 133248 <NA>$100009609
GRanges object with 6 ranges and 2 metadata columns:
seqnames               ranges strand | exon_id exon_name
[1]     chr7 [84940169, 84941088]      - |  108923      <NA>
[2]     chr7 [84943141, 84943264]      - |  108924      <NA>
[3]     chr7 [84943504, 84943722]      - |  108925      <NA>
[4]     chr7 [84946200, 84947000]      - |  108926      <NA>
[5]     chr7 [84947372, 84947651]      - |  108927      <NA>
[6]     chr7 [84963816, 84964009]      - |  108928      <NA>

$100009614 GRanges object with 1 range and 2 metadata columns: seqnames ranges strand | exon_id exon_name [1] chr10 [77711446, 77712009] + | 142611 <NA> ... <23722 more elements> ------- seqinfo: 66 sequences (1 circular) from mm10 genome ##get the widths of each exon > z <- width(exns) ## this is what we get - it's basically a list > z IntegerList of length 23725 [["100009600"]] 325 94 83 354 986 88 2422 [["100009609"]] 920 124 219 801 280 194 [["100009614"]] 564 [["100009664"]] 242 275 98 1781 [["100012"]] 883 582 314 75 [["100017"]] 1788 35 131 84 149 73 115 113 143 189 [["100019"]] 234 227 1326 225 108 193 243 132 ... 175 636 74 115 64 143 1267 [["100033459"]] 287 285 2655 107 954 68 298 988 [["100034251"]] 118 149 202 [["100034361"]] 639 90 160 161 109 538 190 ... <23715 more elements> ## sum the exon widths, over each gene > zz <- sapply(z, sum) ## and now we have a named vector, where the names are the gene IDs > head(zz) 100009600 100009609 100009614 100009664 100012 100017 4352 2538 564 2396 1854 2820  Now the trick is to get the gene widths in the same order as your data. Assuming you have gene IDs for your data, you can just use the "[" function. So if you have a character vector of gene IDs, you can reorder by doing this: zz <- zz[<character vector of gene IDs goes here>] And then you can feed that directly to rpkm(). ADD COMMENTlink written 2.8 years ago by James W. MacDonald47k That worked. Thank you ADD REPLYlink written 2.8 years ago by AB10 1 2.8 years ago by Hervé Pagès ♦♦ 13k United States Hervé Pagès ♦♦ 13k wrote: Hi Apoorva, See ?transcriptLengths in the GenomicFeatures package. Then to go from transcript lengths to gene lengths: max(splitAsList(dm3_txlens$tx_len, dm3_txlens\$gene_id))

The man page should probably show this.

H.