Search
Question: Finding gene lengths of mm10 to calculate rpkm from count matrix of rna-seq dataset
0
gravatar for AB
3.0 years ago by
AB10
United States
AB10 wrote:

Hi everyone,

I am trying to generate the RPKM values using the rpkm() function. But it is asking for gene lengths. How can I get the gene lengths of mm10 ?

Apoorva

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by AB10

Please see the posting guide. Your question is too vague for anybody to be able to help. There are just over 1100 packages in Bioconductor, any number of which may have an rpkm() function. If you want help, you should make it as easy as possible for people to know exactly what you are trying to do, what packages, etc.

ADD REPLYlink written 3.0 years ago by James W. MacDonald48k

My apologies. I'm a newbie and  I'm just trying to figure things out. I need to compute the Rpkm values from the count matrix generated from 22 samples. I tried using the rpkm() function which is present in the edgeR package but that requires gene lengths. The reference genome I am using is mm10. I am using Bioconductor in a Windows machine. So i am unable to load the Rsubread package to use the featureCounts function. I was just wondering if there was any other way to get the gene lengths or compute the rpkm from the count matrix.

ADD REPLYlink written 3.0 years ago by AB10

If you didn't use featureCounts(), then how did you compute the count matrix?

Most read count programs will return the gene lengths that correspond to the same gene annotation used to count the reads.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by Gordon Smyth35k

I used the summarizeOverlaps function from the GenomicAlignmnets package to get the count table

ADD REPLYlink written 3.0 years ago by AB10

You can install a Linux-like environment such as Cygwin (https://www.cygwin.com/) on your Windows computer, and then you can install R on Cygwin and run Rsubread and other R packages.

ADD REPLYlink written 3.0 years ago by Wei Shi2.9k

Is that true / easy? For instance, R doesn't support compilation on cygwin, using instead it's own MinGW-derived tool chain. So is it 'easy' to compile packages with this toolschain but link to the necessary cygwin libraries? This seems like a lot of work for something like RPKM. A docker image might be easier and more modern if one is interested in a linux environment on Windows.

ADD REPLYlink written 3.0 years ago by Martin Morgan ♦♦ 22k

Hi Martin, I am not saying this is easy. It needs computer science knowledge to install these software. But I do know people who successfully installed R on cygwin and ran Rsubread on it.

ADD REPLYlink written 3.0 years ago by Wei Shi2.9k
3
gravatar for James W. MacDonald
3.0 years ago by
United States
James W. MacDonald48k wrote:

OK, that makes more sense. Probably the easiest way to do this is to use the TxDb.Mmusculus.UCSC.mm10.knownGene package.

> library(TxDb.Mmusculus.UCSC.mm10.knownGene)
> exns <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene")
## all the exon ranges, by gene
> exns
GRangesList object of length 23725:
$100009600
GRanges object with 7 ranges and 2 metadata columns:
      seqnames               ranges strand |   exon_id   exon_name
         <Rle>            <IRanges>  <Rle> | <integer> <character>
  [1]     chr9 [21062393, 21062717]      - |    133241        <NA>
  [2]     chr9 [21062894, 21062987]      - |    133242        <NA>
  [3]     chr9 [21063314, 21063396]      - |    133243        <NA>
  [4]     chr9 [21066024, 21066377]      - |    133244        <NA>
  [5]     chr9 [21066940, 21067925]      - |    133245        <NA>
  [6]     chr9 [21068030, 21068117]      - |    133246        <NA>
  [7]     chr9 [21073075, 21075496]      - |    133248        <NA>

$100009609
GRanges object with 6 ranges and 2 metadata columns:
      seqnames               ranges strand | exon_id exon_name
  [1]     chr7 [84940169, 84941088]      - |  108923      <NA>
  [2]     chr7 [84943141, 84943264]      - |  108924      <NA>
  [3]     chr7 [84943504, 84943722]      - |  108925      <NA>
  [4]     chr7 [84946200, 84947000]      - |  108926      <NA>
  [5]     chr7 [84947372, 84947651]      - |  108927      <NA>
  [6]     chr7 [84963816, 84964009]      - |  108928      <NA>

$100009614
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand | exon_id exon_name
  [1]    chr10 [77711446, 77712009]      + |  142611      <NA>

...
<23722 more elements>
-------
seqinfo: 66 sequences (1 circular) from mm10 genome

##get the widths of each exon
> z <- width(exns)
## this is what we get - it's basically a list
> z
IntegerList of length 23725
[["100009600"]] 325 94 83 354 986 88 2422
[["100009609"]] 920 124 219 801 280 194
[["100009614"]] 564
[["100009664"]] 242 275 98 1781
[["100012"]] 883 582 314 75
[["100017"]] 1788 35 131 84 149 73 115 113 143 189
[["100019"]] 234 227 1326 225 108 193 243 132 ... 175 636 74 115 64 143 1267
[["100033459"]] 287 285 2655 107 954 68 298 988
[["100034251"]] 118 149 202
[["100034361"]] 639 90 160 161 109 538 190
...
<23715 more elements>
## sum the exon widths, over each gene
> zz <- sapply(z, sum)
## and now we have a named vector, where the names are the gene IDs
> head(zz)
100009600 100009609 100009614 100009664    100012    100017
     4352      2538       564      2396      1854      2820

 

Now the trick is to get the gene widths in the same order as your data. Assuming you have gene IDs for your data, you can just use the "[" function. So if you have a character vector of gene IDs, you can reorder by doing this:

zz <- zz[<character vector of gene IDs goes here>]

And then you can feed that directly to rpkm().

ADD COMMENTlink written 3.0 years ago by James W. MacDonald48k

That worked. Thank you

ADD REPLYlink written 3.0 years ago by AB10
1
gravatar for Hervé Pagès
3.0 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi Apoorva,

See ?transcriptLengths in the GenomicFeatures package. Then to go from transcript lengths to gene lengths:

max(splitAsList(dm3_txlens$tx_len, dm3_txlens$gene_id))

The man page should probably show this.

H.

ADD COMMENTlink written 3.0 years ago by Hervé Pagès ♦♦ 13k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 391 users visited in the last hour