Gene Size using GenomicFeatures
3
0
Entering edit mode
ehimelbl • 0
@ehimelbl-14023
Last seen 7.0 years ago

I want to calculate the total length of genes (including introns) from the start codon to the stop codon.  I have loaded a GFF file into R using the GenomicFeatures package.  Using genes(txdb) I get the output below that shows the positions of the first nucleotide of the start codon and the last nucleotide of the stop codon.  Is there a function in GenomicFeatures that will give the gene size/length using this information?  Or is there a way to get the absolute value of subtracting the last from the first nucleotide?

> genes(txdb)
GRanges object with 3 ranges and 1 metadata column:
             seqnames             ranges strand |     gene_id
                <Rle>          <IRanges>  <Rle> | <character>
  CG32006-RA     Chr4 [1703212, 1705370]      + |  CG32006-RA
  CG33978-RE     Chr4 [1640671, 1653615]      + |  CG33978-RE
      mav-RA     Chr4 [1708086, 1710381]      - |      mav-RA
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

genomicfeatures gene size gene length • 4.8k views
ADD COMMENT
1
Entering edit mode
@michael-lawrence-3846
Last seen 2.9 years ago
United States

No, there is no function in GenomicFeatures for that. You have a GRanges object, defined by the GenomicRanges package. There is a GRanges method on the generic function width() that will return what you want. You can check out ?GRanges for more on what you can do with a GRanges.

ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States

Hi,

Note that because a gene can have more than 1 transcript, there is no one definitive way to look at the length of a gene. A more sensible way to look at this though might be to consider the length of the longest transcript:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
tx_by_gene <- transcriptsBy(txdb, by="gene")
gene_lens <- max(width(tx_by_gene))

gene_lens is a named integer vector where the names are the gene ids and the values the lengths of the longest transcript for each gene:

head(gene_lens)
#        1        10       100      1000     10000 100008586 
#    14383      9969     32214    226516    355050     15729 

Also please note that there are a couple long-standing issues with genes() that get in the way here. One of them is that by default genes() discards genes that have transcripts on more than 1 chromosome. Another one is that some organizations like UCSC can link transcripts that are very far apart from each other to the same gene. This makes genes() return a range that is artificially big and has no biological meaning for these genes. The transcriptsBy()-based solution above avoids these problems.

Cheers,

H.

ADD COMMENT
0
Entering edit mode

Looks good.  Thank you!

ADD REPLY
0
Entering edit mode
ehimelbl • 0
@ehimelbl-14023
Last seen 7.0 years ago

Thank you! This worked:

gr <- genes(txdb)
genelengths <- width (gr)

 

ADD COMMENT

Login before adding your answer.

Traffic: 679 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6