Extract Exon Features from GenomicFeatures Leads to Inconsistent Pairing of Gene and Exon IDs

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 13 hours ago

Seattle, WA, United States

Hi Fong, On 01/13/2013 10:36 AM, Fong Chun Chan wrote: > Thanks Steve. That clarifies what is going on to me. I suppose the only way > to solve this to take the first gene id from an index that contains > 1 > gene ids. Problem with this solution is that if exon1 belongs to gene1 and gene2, then gene2 will have a missing exon in you final data.frame. A better solution is to create 2 rows for exon1 in the data.frame: gene1 exon1 gene2 exon1 If 'annotDf' is your DataFrame with the gene_id and exon_id cols (of types *List and integer, respectively), you can obtain the data.frame with: gene_id <- annotDf[, 'gene_id'] exon_id <- annotDf[, 'exon_id'] df <- data.frame(geneID=unlist(gene_id), exonID=rep(exon_id, elementLengths(gene_id))) Note that there is no loss of information with this approach i.e. it's possible (and very easy) to reconstruct 'annotDf' from 'df'. Hope this helps, H. > > Fong > > > On Sun, Jan 13, 2013 at 2:36 AM, Steve Lianoglou < > mailinglist.honeypot at gmail.com> wrote: > >> Hi Fong, >> >> On Sun, Jan 13, 2013 at 3:09 AM, Fong Chun Chan <fongchun at="" alumni.ubc.ca=""> >> wrote: >>> Hi, >>> >>> I was hoping someone could explain this weird situation to me when I am >>> trying to extract exon information using the GenomicFeatures Bioconductor >>> package. Here is what I've done: >>> >>> txdb <- loadFeatures(txdbFile) >>> allExons <- exons(txdb, columns = c('gene_id', 'exon_id')) >>> annotDf <- values(allExons) >>> dim(annotDf) >>> >>> [1] 541825 2 >>> >>> According to this code, I've extracted a total of 541825 exons which are >>> associated with x number of genes. Now when I do the following: >>> >>> length(unlist(annotDf[, 'gene_id'])) >>> >>> [1] 546664 >>> >>> How is it possible that there are suddenly an additional 4839 genes added >>> to the 'gene_id' column? >> >> The first clue is that the gene_id column is stored as a *List type, >> which you correctly realized because you are `unlisting` it it to get >> its length. The reason that it's a list is so that one could store >> more than one element per index, otherwise they would have just used a >> character (or factor) vector. >> >> So, what you can do is to see what's going on here -- I am using the >> latest version of R and the "TxDb.Hsapiens.UCSC.hg19.knownGene" >> package, so things might not be exactly the same for you, but I said >> up my vars to mimc yours, so: >> >> R> gene.ids <- annotDf$gene_id >> R> counts <- elementLengths(gene.ids) ## number of elements per bin >> R> table(counts) >> counts >> 1 2 3 4 5 6 9 15 22 >> 282939 3788 101 7 6 1 4 3 3 >> >> So we see that several gene ids can be assigned to one exon. You might >> as why, and I reckon the answer is that the exons are defined as >> "unique" simply by their chr,strand,start,end combo and several genes >> (or transcripts) might share the same exon (in "genomic space"). Let's >> see: >> >> R> allExons[elementLengths(gene.ids) > 1] >> GRanges with 3913 ranges and 2 metadata columns: >> seqnames ranges strand | >> gene_id exon_id >> <rle> <iranges> <rle> | >> <compressedcharacterlist> <integer> >> [1] chr1 [ 324288, 324345] + | >> 100132287,100133331 12 >> [2] chr1 [10490159, 10490625] + | >> 100526739,378708 897 >> [3] chr1 [10490804, 10491458] + | >> 100526739,378708 898 >> [4] chr1 [10493899, 10494022] + | >> 100526739,378708 899 >> [5] chr1 [10494714, 10494747] + | >> 100526739,378708 900 >> [6] chr1 [10500404, 10500470] + | >> 100526739,378708 901 >> [7] chr1 [10511434, 10512060] + | >> 100526739,1325 904 >> [8] chr1 [16355621, 16355794] + | >> 1187,1188 1504 >> [9] chr1 [19923471, 19923603] + | >> 100532736,440574 1788 >> ... ... ... ... ... >> ... ... >> [3905] chrY [26944570, 26944641] - | >> 57054,57135 274737 >> [3906] chrY [26946960, 26947031] - | >> 57054,57135 274738 >> [3907] chrY [26949403, 26949474] - | >> 57054,57135 274739 >> [3908] chrY [26950875, 26951014] - | >> 57054,57135 274740 >> [3909] chrY [26951104, 26951167] - | >> 57054,57135 274741 >> [3910] chrY [26951604, 26951655] - | >> 57054,57135 274742 >> [3911] chrY [26952216, 26952307] - | >> 57054,57135 274743 >> [3912] chrY [26952582, 26952728] - | >> 57054,57135 274744 >> [3913] chrY [26959330, 26959639] - | >> 57054,57135 274745 >> >> If you hop to any of these regions in your favorite genome browser, >> you should find overlapping gene annotations there that share exact >> exon boundaries across transcripts. >> >>> The reason this is a problem is that I would like >>> to create a 3 column output where I use the command: >>> >>> data.frame( geneID = unlist(annotDf[, 'gene_id']), exonID = annotDf[, >>> 'exon_id'], exonCounts = exonCounts) >>> >>> Where exonCounts is generated with the countOverlaps() function using >>> allExons and reads from a bam file. This command is failing because it >>> seems that the number of gene_ids is not pairing up with the exon_ids >>> properly. Is there something I am missing here? >> >> Hopefully this clarification helps? >> >> And also: >> >>> I am using 1.6.9 of GenomicFeatures, R is 2.14.2 and my txdb was built >>> using the command: >> >> You should upgrade to the latest version of R and bioconductor >> packages to get the best functionality (and help). >> >> Hope that helps, >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

Cancer GenomicFeatures Cancer GenomicFeatures • 1.0k views

ADD COMMENT • link 12.9 years ago Hervé Pagès 16k

Login before adding your answer.