Extract Exon Features from GenomicFeatures Leads to Inconsistent Pairing of Gene and Exon IDs
1
0
Entering edit mode
@fong-chun-chan-5706
Last seen 9.6 years ago
Hi, I was hoping someone could explain this weird situation to me when I am trying to extract exon information using the GenomicFeatures Bioconductor package. Here is what I've done: txdb <- loadFeatures(txdbFile) allExons <- exons(txdb, columns = c('gene_id', 'exon_id')) annotDf <- values(allExons) dim(annotDf) [1] 541825 2 According to this code, I've extracted a total of 541825 exons which are associated with x number of genes. Now when I do the following: length(unlist(annotDf[, 'gene_id'])) [1] 546664 How is it possible that there are suddenly an additional 4839 genes added to the 'gene_id' column? The reason this is a problem is that I would like to create a 3 column output where I use the command: data.frame( geneID = unlist(annotDf[, 'gene_id']), exonID = annotDf[, 'exon_id'], exonCounts = exonCounts) Where exonCounts is generated with the countOverlaps() function using allExons and reads from a bam file. This command is failing because it seems that the number of gene_ids is not pairing up with the exon_ids properly. Is there something I am missing here? I am using 1.6.9 of GenomicFeatures, R is 2.14.2 and my txdb was built using the command: txdb <- makeTranscriptDbFromUCSC(genome = 'hg19', tablename = 'ensGene') Has anyone encountered this? Any help would be great. Thanks, Fong [[alternative HTML version deleted]]
GenomicFeatures GenomicFeatures • 973 views
ADD COMMENT
0
Entering edit mode
@steve-lianoglou-2771
Last seen 14 months ago
United States
Hi Fong, On Sun, Jan 13, 2013 at 3:09 AM, Fong Chun Chan <fongchun at="" alumni.ubc.ca=""> wrote: > Hi, > > I was hoping someone could explain this weird situation to me when I am > trying to extract exon information using the GenomicFeatures Bioconductor > package. Here is what I've done: > > txdb <- loadFeatures(txdbFile) > allExons <- exons(txdb, columns = c('gene_id', 'exon_id')) > annotDf <- values(allExons) > dim(annotDf) > > [1] 541825 2 > > According to this code, I've extracted a total of 541825 exons which are > associated with x number of genes. Now when I do the following: > > length(unlist(annotDf[, 'gene_id'])) > > [1] 546664 > > How is it possible that there are suddenly an additional 4839 genes added > to the 'gene_id' column? The first clue is that the gene_id column is stored as a *List type, which you correctly realized because you are `unlisting` it it to get its length. The reason that it's a list is so that one could store more than one element per index, otherwise they would have just used a character (or factor) vector. So, what you can do is to see what's going on here -- I am using the latest version of R and the "TxDb.Hsapiens.UCSC.hg19.knownGene" package, so things might not be exactly the same for you, but I said up my vars to mimc yours, so: R> gene.ids <- annotDf$gene_id R> counts <- elementLengths(gene.ids) ## number of elements per bin R> table(counts) counts 1 2 3 4 5 6 9 15 22 282939 3788 101 7 6 1 4 3 3 So we see that several gene ids can be assigned to one exon. You might as why, and I reckon the answer is that the exons are defined as "unique" simply by their chr,strand,start,end combo and several genes (or transcripts) might share the same exon (in "genomic space"). Let's see: R> allExons[elementLengths(gene.ids) > 1] GRanges with 3913 ranges and 2 metadata columns: seqnames ranges strand | gene_id exon_id <rle> <iranges> <rle> | <compressedcharacterlist> <integer> [1] chr1 [ 324288, 324345] + | 100132287,100133331 12 [2] chr1 [10490159, 10490625] + | 100526739,378708 897 [3] chr1 [10490804, 10491458] + | 100526739,378708 898 [4] chr1 [10493899, 10494022] + | 100526739,378708 899 [5] chr1 [10494714, 10494747] + | 100526739,378708 900 [6] chr1 [10500404, 10500470] + | 100526739,378708 901 [7] chr1 [10511434, 10512060] + | 100526739,1325 904 [8] chr1 [16355621, 16355794] + | 1187,1188 1504 [9] chr1 [19923471, 19923603] + | 100532736,440574 1788 ... ... ... ... ... ... ... [3905] chrY [26944570, 26944641] - | 57054,57135 274737 [3906] chrY [26946960, 26947031] - | 57054,57135 274738 [3907] chrY [26949403, 26949474] - | 57054,57135 274739 [3908] chrY [26950875, 26951014] - | 57054,57135 274740 [3909] chrY [26951104, 26951167] - | 57054,57135 274741 [3910] chrY [26951604, 26951655] - | 57054,57135 274742 [3911] chrY [26952216, 26952307] - | 57054,57135 274743 [3912] chrY [26952582, 26952728] - | 57054,57135 274744 [3913] chrY [26959330, 26959639] - | 57054,57135 274745 If you hop to any of these regions in your favorite genome browser, you should find overlapping gene annotations there that share exact exon boundaries across transcripts. > The reason this is a problem is that I would like > to create a 3 column output where I use the command: > > data.frame( geneID = unlist(annotDf[, 'gene_id']), exonID = annotDf[, > 'exon_id'], exonCounts = exonCounts) > > Where exonCounts is generated with the countOverlaps() function using > allExons and reads from a bam file. This command is failing because it > seems that the number of gene_ids is not pairing up with the exon_ids > properly. Is there something I am missing here? Hopefully this clarification helps? And also: > I am using 1.6.9 of GenomicFeatures, R is 2.14.2 and my txdb was built > using the command: You should upgrade to the latest version of R and bioconductor packages to get the best functionality (and help). Hope that helps, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact
ADD COMMENT
0
Entering edit mode
Thanks Steve. That clarifies what is going on to me. I suppose the only way to solve this to take the first gene id from an index that contains > 1 gene ids. Fong On Sun, Jan 13, 2013 at 2:36 AM, Steve Lianoglou < mailinglist.honeypot@gmail.com> wrote: > Hi Fong, > > On Sun, Jan 13, 2013 at 3:09 AM, Fong Chun Chan <fongchun@alumni.ubc.ca> > wrote: > > Hi, > > > > I was hoping someone could explain this weird situation to me when I am > > trying to extract exon information using the GenomicFeatures Bioconductor > > package. Here is what I've done: > > > > txdb <- loadFeatures(txdbFile) > > allExons <- exons(txdb, columns = c('gene_id', 'exon_id')) > > annotDf <- values(allExons) > > dim(annotDf) > > > > [1] 541825 2 > > > > According to this code, I've extracted a total of 541825 exons which are > > associated with x number of genes. Now when I do the following: > > > > length(unlist(annotDf[, 'gene_id'])) > > > > [1] 546664 > > > > How is it possible that there are suddenly an additional 4839 genes added > > to the 'gene_id' column? > > The first clue is that the gene_id column is stored as a *List type, > which you correctly realized because you are `unlisting` it it to get > its length. The reason that it's a list is so that one could store > more than one element per index, otherwise they would have just used a > character (or factor) vector. > > So, what you can do is to see what's going on here -- I am using the > latest version of R and the "TxDb.Hsapiens.UCSC.hg19.knownGene" > package, so things might not be exactly the same for you, but I said > up my vars to mimc yours, so: > > R> gene.ids <- annotDf$gene_id > R> counts <- elementLengths(gene.ids) ## number of elements per bin > R> table(counts) > counts > 1 2 3 4 5 6 9 15 22 > 282939 3788 101 7 6 1 4 3 3 > > So we see that several gene ids can be assigned to one exon. You might > as why, and I reckon the answer is that the exons are defined as > "unique" simply by their chr,strand,start,end combo and several genes > (or transcripts) might share the same exon (in "genomic space"). Let's > see: > > R> allExons[elementLengths(gene.ids) > 1] > GRanges with 3913 ranges and 2 metadata columns: > seqnames ranges strand | > gene_id exon_id > <rle> <iranges> <rle> | > <compressedcharacterlist> <integer> > [1] chr1 [ 324288, 324345] + | > 100132287,100133331 12 > [2] chr1 [10490159, 10490625] + | > 100526739,378708 897 > [3] chr1 [10490804, 10491458] + | > 100526739,378708 898 > [4] chr1 [10493899, 10494022] + | > 100526739,378708 899 > [5] chr1 [10494714, 10494747] + | > 100526739,378708 900 > [6] chr1 [10500404, 10500470] + | > 100526739,378708 901 > [7] chr1 [10511434, 10512060] + | > 100526739,1325 904 > [8] chr1 [16355621, 16355794] + | > 1187,1188 1504 > [9] chr1 [19923471, 19923603] + | > 100532736,440574 1788 > ... ... ... ... ... > ... ... > [3905] chrY [26944570, 26944641] - | > 57054,57135 274737 > [3906] chrY [26946960, 26947031] - | > 57054,57135 274738 > [3907] chrY [26949403, 26949474] - | > 57054,57135 274739 > [3908] chrY [26950875, 26951014] - | > 57054,57135 274740 > [3909] chrY [26951104, 26951167] - | > 57054,57135 274741 > [3910] chrY [26951604, 26951655] - | > 57054,57135 274742 > [3911] chrY [26952216, 26952307] - | > 57054,57135 274743 > [3912] chrY [26952582, 26952728] - | > 57054,57135 274744 > [3913] chrY [26959330, 26959639] - | > 57054,57135 274745 > > If you hop to any of these regions in your favorite genome browser, > you should find overlapping gene annotations there that share exact > exon boundaries across transcripts. > > > The reason this is a problem is that I would like > > to create a 3 column output where I use the command: > > > > data.frame( geneID = unlist(annotDf[, 'gene_id']), exonID = annotDf[, > > 'exon_id'], exonCounts = exonCounts) > > > > Where exonCounts is generated with the countOverlaps() function using > > allExons and reads from a bam file. This command is failing because it > > seems that the number of gene_ids is not pairing up with the exon_ids > > properly. Is there something I am missing here? > > Hopefully this clarification helps? > > And also: > > > I am using 1.6.9 of GenomicFeatures, R is 2.14.2 and my txdb was built > > using the command: > > You should upgrade to the latest version of R and bioconductor > packages to get the best functionality (and help). > > Hope that helps, > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > | Memorial Sloan-Kettering Cancer Center > | Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 566 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6