Question

Extract Exon Features from GenomicFeatures Leads to Inconsistent Pairing of Gene and Exon IDs

0

Entering edit mode

Fong Chun Chan ▴ 320

@fong-chun-chan-5706

Last seen 11.4 years ago

Hi, I was hoping someone could explain this weird situation to me when I am trying to extract exon information using the GenomicFeatures Bioconductor package. Here is what I've done: txdb <- loadFeatures(txdbFile) allExons <- exons(txdb, columns = c('gene_id', 'exon_id')) annotDf <- values(allExons) dim(annotDf) [1] 541825 2 According to this code, I've extracted a total of 541825 exons which are associated with x number of genes. Now when I do the following: length(unlist(annotDf[, 'gene_id'])) [1] 546664 How is it possible that there are suddenly an additional 4839 genes added to the 'gene_id' column? The reason this is a problem is that I would like to create a 3 column output where I use the command: data.frame( geneID = unlist(annotDf[, 'gene_id']), exonID = annotDf[, 'exon_id'], exonCounts = exonCounts) Where exonCounts is generated with the countOverlaps() function using allExons and reads from a bam file. This command is failing because it seems that the number of gene_ids is not pairing up with the exon_ids properly. Is there something I am missing here? I am using 1.6.9 of GenomicFeatures, R is 2.14.2 and my txdb was built using the command: txdb <- makeTranscriptDbFromUCSC(genome = 'hg19', tablename = 'ensGene') Has anyone encountered this? Any help would be great. Thanks, Fong [[alternative HTML version deleted]]

GenomicFeatures GenomicFeatures • 1.3k views

ADD COMMENT • link updated 13.1 years ago by Steve Lianoglou ★ 13k • written 13.1 years ago by Fong Chun Chan ▴ 320

score 0 · Answer 1 · 2013-01-13

Hi Fong, On Sun, Jan 13, 2013 at 3:09 AM, Fong Chun Chan <fongchun at="" alumni.ubc.ca=""> wrote: > Hi, > > I was hoping someone could explain this weird situation to me when I am > trying to extract exon information using the GenomicFeatures Bioconductor > package. Here is what I've done: > > txdb <- loadFeatures(txdbFile) > allExons <- exons(txdb, columns = c('gene_id', 'exon_id')) > annotDf <- values(allExons) > dim(annotDf) > > [1] 541825 2 > > According to this code, I've extracted a total of 541825 exons which are > associated with x number of genes. Now when I do the following: > > length(unlist(annotDf[, 'gene_id'])) > > [1] 546664 > > How is it possible that there are suddenly an additional 4839 genes added > to the 'gene_id' column? The first clue is that the gene_id column is stored as a *List type, which you correctly realized because you are `unlisting` it it to get its length. The reason that it's a list is so that one could store more than one element per index, otherwise they would have just used a character (or factor) vector. So, what you can do is to see what's going on here -- I am using the latest version of R and the "TxDb.Hsapiens.UCSC.hg19.knownGene" package, so things might not be exactly the same for you, but I said up my vars to mimc yours, so: R> gene.ids <- annotDf$gene_id R> counts <- elementLengths(gene.ids) ## number of elements per bin R> table(counts) counts 1 2 3 4 5 6 9 15 22 282939 3788 101 7 6 1 4 3 3 So we see that several gene ids can be assigned to one exon. You might as why, and I reckon the answer is that the exons are defined as "unique" simply by their chr,strand,start,end combo and several genes (or transcripts) might share the same exon (in "genomic space"). Let's see: R> allExons[elementLengths(gene.ids) > 1] GRanges with 3913 ranges and 2 metadata columns: seqnames ranges strand | gene_id exon_id <rle> <iranges> <rle> | <compressedcharacterlist> <integer> [1] chr1 [ 324288, 324345] + | 100132287,100133331 12 [2] chr1 [10490159, 10490625] + | 100526739,378708 897 [3] chr1 [10490804, 10491458] + | 100526739,378708 898 [4] chr1 [10493899, 10494022] + | 100526739,378708 899 [5] chr1 [10494714, 10494747] + | 100526739,378708 900 [6] chr1 [10500404, 10500470] + | 100526739,378708 901 [7] chr1 [10511434, 10512060] + | 100526739,1325 904 [8] chr1 [16355621, 16355794] + | 1187,1188 1504 [9] chr1 [19923471, 19923603] + | 100532736,440574 1788 ... ... ... ... ... ... ... [3905] chrY [26944570, 26944641] - | 57054,57135 274737 [3906] chrY [26946960, 26947031] - | 57054,57135 274738 [3907] chrY [26949403, 26949474] - | 57054,57135 274739 [3908] chrY [26950875, 26951014] - | 57054,57135 274740 [3909] chrY [26951104, 26951167] - | 57054,57135 274741 [3910] chrY [26951604, 26951655] - | 57054,57135 274742 [3911] chrY [26952216, 26952307] - | 57054,57135 274743 [3912] chrY [26952582, 26952728] - | 57054,57135 274744 [3913] chrY [26959330, 26959639] - | 57054,57135 274745 If you hop to any of these regions in your favorite genome browser, you should find overlapping gene annotations there that share exact exon boundaries across transcripts. > The reason this is a problem is that I would like > to create a 3 column output where I use the command: > > data.frame( geneID = unlist(annotDf[, 'gene_id']), exonID = annotDf[, > 'exon_id'], exonCounts = exonCounts) > > Where exonCounts is generated with the countOverlaps() function using > allExons and reads from a bam file. This command is failing because it > seems that the number of gene_ids is not pairing up with the exon_ids > properly. Is there something I am missing here? Hopefully this clarification helps? And also: > I am using 1.6.9 of GenomicFeatures, R is 2.14.2 and my txdb was built > using the command: You should upgrade to the latest version of R and bioconductor packages to get the best functionality (and help). Hope that helps, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact