Extract Exon Features from GenomicFeatures Leads to Inconsistent Pairing of Gene and Exon IDs
0
0
Entering edit mode
@herve-pages-1542
Last seen 21 hours ago
Seattle, WA, United States
Hi Fong, On 01/13/2013 10:36 AM, Fong Chun Chan wrote: > Thanks Steve. That clarifies what is going on to me. I suppose the only way > to solve this to take the first gene id from an index that contains > 1 > gene ids. Problem with this solution is that if exon1 belongs to gene1 and gene2, then gene2 will have a missing exon in you final data.frame. A better solution is to create 2 rows for exon1 in the data.frame: gene1 exon1 gene2 exon1 If 'annotDf' is your DataFrame with the gene_id and exon_id cols (of types *List and integer, respectively), you can obtain the data.frame with: gene_id <- annotDf[, 'gene_id'] exon_id <- annotDf[, 'exon_id'] df <- data.frame(geneID=unlist(gene_id), exonID=rep(exon_id, elementLengths(gene_id))) Note that there is no loss of information with this approach i.e. it's possible (and very easy) to reconstruct 'annotDf' from 'df'. Hope this helps, H. > > Fong > > > On Sun, Jan 13, 2013 at 2:36 AM, Steve Lianoglou < > mailinglist.honeypot at gmail.com> wrote: > >> Hi Fong, >> >> On Sun, Jan 13, 2013 at 3:09 AM, Fong Chun Chan <fongchun at="" alumni.ubc.ca=""> >> wrote: >>> Hi, >>> >>> I was hoping someone could explain this weird situation to me when I am >>> trying to extract exon information using the GenomicFeatures Bioconductor >>> package. Here is what I've done: >>> >>> txdb <- loadFeatures(txdbFile) >>> allExons <- exons(txdb, columns = c('gene_id', 'exon_id')) >>> annotDf <- values(allExons) >>> dim(annotDf) >>> >>> [1] 541825 2 >>> >>> According to this code, I've extracted a total of 541825 exons which are >>> associated with x number of genes. Now when I do the following: >>> >>> length(unlist(annotDf[, 'gene_id'])) >>> >>> [1] 546664 >>> >>> How is it possible that there are suddenly an additional 4839 genes added >>> to the 'gene_id' column? >> >> The first clue is that the gene_id column is stored as a *List type, >> which you correctly realized because you are `unlisting` it it to get >> its length. The reason that it's a list is so that one could store >> more than one element per index, otherwise they would have just used a >> character (or factor) vector. >> >> So, what you can do is to see what's going on here -- I am using the >> latest version of R and the "TxDb.Hsapiens.UCSC.hg19.knownGene" >> package, so things might not be exactly the same for you, but I said >> up my vars to mimc yours, so: >> >> R> gene.ids <- annotDf$gene_id >> R> counts <- elementLengths(gene.ids) ## number of elements per bin >> R> table(counts) >> counts >> 1 2 3 4 5 6 9 15 22 >> 282939 3788 101 7 6 1 4 3 3 >> >> So we see that several gene ids can be assigned to one exon. You might >> as why, and I reckon the answer is that the exons are defined as >> "unique" simply by their chr,strand,start,end combo and several genes >> (or transcripts) might share the same exon (in "genomic space"). Let's >> see: >> >> R> allExons[elementLengths(gene.ids) > 1] >> GRanges with 3913 ranges and 2 metadata columns: >> seqnames ranges strand | >> gene_id exon_id >> <rle> <iranges> <rle> | >> <compressedcharacterlist> <integer> >> [1] chr1 [ 324288, 324345] + | >> 100132287,100133331 12 >> [2] chr1 [10490159, 10490625] + | >> 100526739,378708 897 >> [3] chr1 [10490804, 10491458] + | >> 100526739,378708 898 >> [4] chr1 [10493899, 10494022] + | >> 100526739,378708 899 >> [5] chr1 [10494714, 10494747] + | >> 100526739,378708 900 >> [6] chr1 [10500404, 10500470] + | >> 100526739,378708 901 >> [7] chr1 [10511434, 10512060] + | >> 100526739,1325 904 >> [8] chr1 [16355621, 16355794] + | >> 1187,1188 1504 >> [9] chr1 [19923471, 19923603] + | >> 100532736,440574 1788 >> ... ... ... ... ... >> ... ... >> [3905] chrY [26944570, 26944641] - | >> 57054,57135 274737 >> [3906] chrY [26946960, 26947031] - | >> 57054,57135 274738 >> [3907] chrY [26949403, 26949474] - | >> 57054,57135 274739 >> [3908] chrY [26950875, 26951014] - | >> 57054,57135 274740 >> [3909] chrY [26951104, 26951167] - | >> 57054,57135 274741 >> [3910] chrY [26951604, 26951655] - | >> 57054,57135 274742 >> [3911] chrY [26952216, 26952307] - | >> 57054,57135 274743 >> [3912] chrY [26952582, 26952728] - | >> 57054,57135 274744 >> [3913] chrY [26959330, 26959639] - | >> 57054,57135 274745 >> >> If you hop to any of these regions in your favorite genome browser, >> you should find overlapping gene annotations there that share exact >> exon boundaries across transcripts. >> >>> The reason this is a problem is that I would like >>> to create a 3 column output where I use the command: >>> >>> data.frame( geneID = unlist(annotDf[, 'gene_id']), exonID = annotDf[, >>> 'exon_id'], exonCounts = exonCounts) >>> >>> Where exonCounts is generated with the countOverlaps() function using >>> allExons and reads from a bam file. This command is failing because it >>> seems that the number of gene_ids is not pairing up with the exon_ids >>> properly. Is there something I am missing here? >> >> Hopefully this clarification helps? >> >> And also: >> >>> I am using 1.6.9 of GenomicFeatures, R is 2.14.2 and my txdb was built >>> using the command: >> >> You should upgrade to the latest version of R and bioconductor >> packages to get the best functionality (and help). >> >> Hope that helps, >> -steve >> >> -- >> Steve Lianoglou >> Graduate Student: Computational Systems Biology >> | Memorial Sloan-Kettering Cancer Center >> | Weill Medical College of Cornell University >> Contact Info: http://cbio.mskcc.org/~lianos/contact >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
Cancer GenomicFeatures Cancer GenomicFeatures • 977 views
ADD COMMENT

Login before adding your answer.

Traffic: 1215 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6