GENEID is missing when LOCATION is non-intergenic in VariantAnnotation package

0

Entering edit mode

Adaikalavan Ramasamy ▴ 220

@adaikalavan-ramasamy-5765

Last seen 11.2 years ago

United Kingdom

Dear all, I am finding some unexpected results (to me anyway) with the VariantAnnotation package. Basically, there are situations where the GENEID is missing when LOCATION is either coding, promoter, intron, threeUTR or fiveUTR. Here is an example with five SNPs (among many more). I have marked the unexpected results with "##". library(VariantAnnotation); library(TxDb.Hsapiens.UCSC.hg19.knownGene) tmp <- rbind.data.frame(c("rs10917388", "chr1", 23803138), c("rs1063412", "chr1", 172410967), c("rs78291220", "chr2", 60890373), c("rs116917239", "chr17", 44061025), c("rs11593", "chrX", 153627145) ) colnames(tmp) <- c("rsid", "chr", "pos") tmp$pos <- as.numeric( as.character(tmp$pos) ) target <- with(tmp, GRanges(seqnames = Rle(chr), ranges = IRanges(pos, end=pos, names=rsid), strand = Rle(strand("*")) ) ) loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants()) names(loc) <- NULL out <- as.data.frame(loc) out$rsid <- names(target)[ out$QUERYID ] out <- out[ , c("rsid", "seqnames", "start", "LOCATION", "GENEID", "PRECEDEID", "FOLLOWID")] out <- unique(out) rownames(out) <- NULL out rsid seqnames start LOCATION GENEID PRECEDEID FOLLOWID 1 rs10917388 chr1 23803138 intron 55616 <na> <na> 2 rs10917388 chr1 23803138 promoter <na> <na> <na> ## 3 rs1063412 chr1 172410967 intron 92346 <na> <na> 4 rs1063412 chr1 172410967 intron 5279 <na> <na> 5 rs1063412 chr1 172410967 coding 5279 <na> <na> 6 rs1063412 chr1 172410967 coding <na> <na> <na> ## 7 rs78291220 chr2 60890373 promoter <na> <na> <na> ## 8 rs78291220 chr2 60890373 intergenic <na> 64895 400957 9 rs116917239 chr17 44061025 coding 4137 <na> <na> 10 rs116917239 chr17 44061025 intron 4137 <na> <na> 11 rs116917239 chr17 44061025 coding <na> <na> <na> ## 12 rs11593 chrX 153627145 intron 6134 <na> <na> 13 rs11593 chrX 153627145 promoter 6134 <na> <na> 14 rs11593 chrX 153627145 promoter 26778 <na> <na> 15 rs11593 chrX 153627145 promoter <na> <na> <na> ## 16 rs11593 chrX 153627145 fiveUTR <na> <na> <na> ## 17 rs11593 chrX 153627145 threeUTR <na> <na> <na> ## Can anyone help explain what is happening please? Is this to be expected? Thank you. Regards, Adai

• 2.0k views

ADD COMMENT • link updated 12.9 years ago by Valerie Obenchain ★ 6.8k • written 12.9 years ago by Adaikalavan Ramasamy ▴ 220

0

Entering edit mode

Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 4.0 years ago

United States

Hello, All values in the output (GENEID, TXID, etc.) are taken from the annotation you are using. If the annotaion does not have a GENEID, TXID, etc. for a particular range, then none will be reported. To take a closer look at the annotation we can extract the transcripts and ask for gene_id, cds_id and tx_id as columns. txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene tx <- transcripts(txdb, columns=c("tx_id", "gene_id", "cds_id")) Looking at the first 3 rows, we see none of these have a gene_id, >> tx[1:3] > GRanges with 3 ranges and 3 metadata columns: > seqnames ranges strand | tx_id gene_id > <rle> <iranges> <rle> | <integer> <compressedcharacterlist> > [1] chr1 [11874, 14409] + | 1 > [2] chr1 [11874, 14409] + | 2 > [3] chr1 [11874, 14409] + | 3 > cds_id > <compressedintegerlist> > [1] NA > [2] 1,2,3 > [3] NA To isolate the portion of the annotation that overlaps with your ranges you can use subsetByOverlaps(), gr <- GRanges(c("chr1", "chr1", "chr2", "chr17", "chrX"), IRanges(c(23803138, 172410967, 60890373, 44061025, 153627145), width=1)) res <- subsetByOverlaps(tx, gr) Here is an example of a transcript with no gene_id >> res[8:10] > GRanges with 3 ranges and 3 metadata columns: > seqnames ranges strand | tx_id > <rle> <iranges> <rle> | <integer> > [1] chr1 [172410597, 172413230] - | 6937 > [2] chr1 [172410869, 172411762] - | 6938 > [3] chr17 [ 43971748, 44105699] + | 59914 > gene_id cds_id > <compressedcharacterlist> <compressedintegerlist> > [1] 5279 NA,20697 > [2] 20697 > [3] 4137 NA,178155,178156,... Also remember that if a range is 'intergenic' that it will not have a GENEID. It will have a PRECEDEID and FOLLOWID, but no GENEID. Valerie On 02/19/2013 06:41 AM, Adaikalavan Ramasamy wrote: > Dear all, > > I am finding some unexpected results (to me anyway) with the > VariantAnnotation package. Basically, there are situations where the > GENEID is missing when LOCATION is either coding, promoter, intron, > threeUTR or fiveUTR. Here is an example with five SNPs (among many > more). I have marked the unexpected results with "##". > > > library(VariantAnnotation); library(TxDb.Hsapiens.UCSC.hg19.knownGene) > > tmp <- rbind.data.frame(c("rs10917388", "chr1", 23803138), > c("rs1063412", "chr1", 172410967), > c("rs78291220", "chr2", 60890373), > c("rs116917239", "chr17", 44061025), > c("rs11593", "chrX", 153627145) ) > colnames(tmp) <- c("rsid", "chr", "pos") > tmp$pos <- as.numeric( as.character(tmp$pos) ) > > target <- with(tmp, GRanges(seqnames = Rle(chr), > ranges = IRanges(pos, > end=pos, names=rsid), > strand = Rle(strand("*")) ) ) > > loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants()) > names(loc) <- NULL > out <- as.data.frame(loc) > out$rsid <- names(target)[ out$QUERYID ] > out <- out[ , c("rsid", "seqnames", "start", "LOCATION", "GENEID", > "PRECEDEID", "FOLLOWID")] > out <- unique(out) > rownames(out) <- NULL > out > > rsid seqnames start LOCATION GENEID PRECEDEID FOLLOWID > 1 rs10917388 chr1 23803138 intron 55616 <na> <na> > 2 rs10917388 chr1 23803138 promoter <na> <na> <na> ## > > 3 rs1063412 chr1 172410967 intron 92346 <na> <na> > 4 rs1063412 chr1 172410967 intron 5279 <na> <na> > 5 rs1063412 chr1 172410967 coding 5279 <na> <na> > 6 rs1063412 chr1 172410967 coding <na> <na> <na> ## > > 7 rs78291220 chr2 60890373 promoter <na> <na> <na> ## > 8 rs78291220 chr2 60890373 intergenic <na> 64895 400957 > > 9 rs116917239 chr17 44061025 coding 4137 <na> <na> > 10 rs116917239 chr17 44061025 intron 4137 <na> <na> > 11 rs116917239 chr17 44061025 coding <na> <na> <na> ## > > 12 rs11593 chrX 153627145 intron 6134 <na> <na> > 13 rs11593 chrX 153627145 promoter 6134 <na> <na> > 14 rs11593 chrX 153627145 promoter 26778 <na> <na> > 15 rs11593 chrX 153627145 promoter <na> <na> <na> ## > 16 rs11593 chrX 153627145 fiveUTR <na> <na> <na> ## > 17 rs11593 chrX 153627145 threeUTR <na> <na> <na> ## > > Can anyone help explain what is happening please? Is this to be > expected? Thank you. > > Regards, Adai > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD COMMENT • link 12.9 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Dear Valerie, Thanks once again for the quick reply. I understand there the different the annotation databases are not consistent and I am trying to understand the reason for some of the noise. If I look up chr1:172410869-172411762 (i.e. res[9, ]) on ucsc and also ensembl, I see that it overlaps with C1orf105 and PIGC http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr1:172410869-172411 762&hgsid=326905233&pubs=pack http://www.ensembl.org/Homo_sapiens/Location/View?r=1%3A172410869-1724 11762 so why does the GeneID is blank in this case? On the other hand res[8, ] calls it PIGC only. Is it because C1orf105 is not a "known gene"? BTW, the examples I sent were all non-intergenics. Thank you. Regards, Adai On Tue, Feb 19, 2013 at 5:21 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> wrote: > Hello, > > All values in the output (GENEID, TXID, etc.) are taken from the annotation > you are using. If the annotaion does not have a GENEID, TXID, etc. for a > particular range, then none will be reported. > > To take a closer look at the annotation we can extract the transcripts and > ask for gene_id, cds_id and tx_id as columns. > > txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene > tx <- transcripts(txdb, columns=c("tx_id", "gene_id", "cds_id")) > > > Looking at the first 3 rows, we see none of these have a gene_id, > >>> tx[1:3] >> >> GRanges with 3 ranges and 3 metadata columns: >> seqnames ranges strand | tx_id gene_id >> <rle> <iranges> <rle> | <integer> <compressedcharacterlist> >> [1] chr1 [11874, 14409] + | 1 >> [2] chr1 [11874, 14409] + | 2 >> [3] chr1 [11874, 14409] + | 3 >> cds_id >> <compressedintegerlist> >> [1] NA >> [2] 1,2,3 >> [3] NA > > > To isolate the portion of the annotation that overlaps with your ranges you > can use subsetByOverlaps(), > > gr <- GRanges(c("chr1", "chr1", "chr2", "chr17", "chrX"), > IRanges(c(23803138, 172410967, 60890373, > 44061025, 153627145), width=1)) > res <- subsetByOverlaps(tx, gr) > > Here is an example of a transcript with no gene_id > >>> res[8:10] >> >> GRanges with 3 ranges and 3 metadata columns: >> seqnames ranges strand | tx_id >> <rle> <iranges> <rle> | <integer> >> [1] chr1 [172410597, 172413230] - | 6937 >> [2] chr1 [172410869, 172411762] - | 6938 >> [3] chr17 [ 43971748, 44105699] + | 59914 >> gene_id cds_id >> <compressedcharacterlist> <compressedintegerlist> >> [1] 5279 NA,20697 >> [2] 20697 >> [3] 4137 NA,178155,178156,... > > > > Also remember that if a range is 'intergenic' that it will not have a > GENEID. It will have a PRECEDEID and FOLLOWID, but no GENEID. > > Valerie > > > > > On 02/19/2013 06:41 AM, Adaikalavan Ramasamy wrote: >> >> Dear all, >> >> I am finding some unexpected results (to me anyway) with the >> VariantAnnotation package. Basically, there are situations where the >> GENEID is missing when LOCATION is either coding, promoter, intron, >> threeUTR or fiveUTR. Here is an example with five SNPs (among many >> more). I have marked the unexpected results with "##". >> >> >> library(VariantAnnotation); library(TxDb.Hsapiens.UCSC.hg19.knownGene) >> >> tmp <- rbind.data.frame(c("rs10917388", "chr1", 23803138), >> c("rs1063412", "chr1", 172410967), >> c("rs78291220", "chr2", 60890373), >> c("rs116917239", "chr17", 44061025), >> c("rs11593", "chrX", 153627145) ) >> colnames(tmp) <- c("rsid", "chr", "pos") >> tmp$pos <- as.numeric( as.character(tmp$pos) ) >> >> target <- with(tmp, GRanges(seqnames = Rle(chr), >> ranges = IRanges(pos, >> end=pos, names=rsid), >> strand = Rle(strand("*")) ) ) >> >> loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene, >> AllVariants()) >> names(loc) <- NULL >> out <- as.data.frame(loc) >> out$rsid <- names(target)[ out$QUERYID ] >> out <- out[ , c("rsid", "seqnames", "start", "LOCATION", "GENEID", >> "PRECEDEID", "FOLLOWID")] >> out <- unique(out) >> rownames(out) <- NULL >> out >> >> rsid seqnames start LOCATION GENEID PRECEDEID FOLLOWID >> 1 rs10917388 chr1 23803138 intron 55616 <na> <na> >> 2 rs10917388 chr1 23803138 promoter <na> <na> <na> ## >> >> 3 rs1063412 chr1 172410967 intron 92346 <na> <na> >> 4 rs1063412 chr1 172410967 intron 5279 <na> <na> >> 5 rs1063412 chr1 172410967 coding 5279 <na> <na> >> 6 rs1063412 chr1 172410967 coding <na> <na> <na> ## >> >> 7 rs78291220 chr2 60890373 promoter <na> <na> <na> ## >> 8 rs78291220 chr2 60890373 intergenic <na> 64895 400957 >> >> 9 rs116917239 chr17 44061025 coding 4137 <na> <na> >> 10 rs116917239 chr17 44061025 intron 4137 <na> <na> >> 11 rs116917239 chr17 44061025 coding <na> <na> <na> ## >> >> 12 rs11593 chrX 153627145 intron 6134 <na> <na> >> 13 rs11593 chrX 153627145 promoter 6134 <na> <na> >> 14 rs11593 chrX 153627145 promoter 26778 <na> <na> >> 15 rs11593 chrX 153627145 promoter <na> <na> <na> ## >> 16 rs11593 chrX 153627145 fiveUTR <na> <na> <na> ## >> 17 rs11593 chrX 153627145 threeUTR <na> <na> <na> ## >> >> Can anyone help explain what is happening please? Is this to be >> expected? Thank you. >> >> Regards, Adai >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> >

ADD REPLY • link 12.9 years ago Adaikalavan Ramasamy ▴ 220

0

Entering edit mode

Yes, evidently C1orf105 is not a 'known gene'. All Bioconductor annotations can be found here, http://bioconductor.org/packages/devel/BiocViews.html#___AnnotationDat a The TxDb packages found on this page were made from specific tracks at UCSC or elsewhere. The details of how a package was made can be found on the package man page, ?TxDb.Hsapiens.UCSC.hg19.knownGene If none of the TxDb's meet your need you can create your own. In the GenomicFeatures package there are several functions available for creating custom annotations. ?makeTranscriptDbFromBiomart ?makeTranscriptDbFromGFF ?makeTranscriptDbFromUCSC I've also cc'd Marc, who handles our annotations, in case he has something more to add. Valerie On 02/19/2013 10:46 AM, Adaikalavan Ramasamy wrote: > Dear Valerie, > > Thanks once again for the quick reply. I understand there the > different the annotation databases are not consistent and I am trying > to understand the reason for some of the noise. If I look up > chr1:172410869-172411762 (i.e. res[9, ]) on ucsc and also ensembl, I > see that it overlaps with C1orf105 and PIGC > > http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr1:172410869-1724 11762&hgsid=326905233&pubs=pack > http://www.ensembl.org/Homo_sapiens/Location/View?r=1%3A172410869-17 2411762 > > so why does the GeneID is blank in this case? On the other hand res[8, > ] calls it PIGC only. Is it because C1orf105 is not a "known gene"? > BTW, the examples I sent were all non-intergenics. Thank you. > > Regards, Adai > > > > On Tue, Feb 19, 2013 at 5:21 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> wrote: >> Hello, >> >> All values in the output (GENEID, TXID, etc.) are taken from the annotation >> you are using. If the annotaion does not have a GENEID, TXID, etc. for a >> particular range, then none will be reported. >> >> To take a closer look at the annotation we can extract the transcripts and >> ask for gene_id, cds_id and tx_id as columns. >> >> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >> tx <- transcripts(txdb, columns=c("tx_id", "gene_id", "cds_id")) >> >> >> Looking at the first 3 rows, we see none of these have a gene_id, >> >>>> tx[1:3] >>> >>> GRanges with 3 ranges and 3 metadata columns: >>> seqnames ranges strand | tx_id gene_id >>> <rle> <iranges> <rle> | <integer> <compressedcharacterlist> >>> [1] chr1 [11874, 14409] + | 1 >>> [2] chr1 [11874, 14409] + | 2 >>> [3] chr1 [11874, 14409] + | 3 >>> cds_id >>> <compressedintegerlist> >>> [1] NA >>> [2] 1,2,3 >>> [3] NA >> >> >> To isolate the portion of the annotation that overlaps with your ranges you >> can use subsetByOverlaps(), >> >> gr <- GRanges(c("chr1", "chr1", "chr2", "chr17", "chrX"), >> IRanges(c(23803138, 172410967, 60890373, >> 44061025, 153627145), width=1)) >> res <- subsetByOverlaps(tx, gr) >> >> Here is an example of a transcript with no gene_id >> >>>> res[8:10] >>> >>> GRanges with 3 ranges and 3 metadata columns: >>> seqnames ranges strand | tx_id >>> <rle> <iranges> <rle> | <integer> >>> [1] chr1 [172410597, 172413230] - | 6937 >>> [2] chr1 [172410869, 172411762] - | 6938 >>> [3] chr17 [ 43971748, 44105699] + | 59914 >>> gene_id cds_id >>> <compressedcharacterlist> <compressedintegerlist> >>> [1] 5279 NA,20697 >>> [2] 20697 >>> [3] 4137 NA,178155,178156,... >> >> >> >> Also remember that if a range is 'intergenic' that it will not have a >> GENEID. It will have a PRECEDEID and FOLLOWID, but no GENEID. >> >> Valerie >> >> >> >> >> On 02/19/2013 06:41 AM, Adaikalavan Ramasamy wrote: >>> >>> Dear all, >>> >>> I am finding some unexpected results (to me anyway) with the >>> VariantAnnotation package. Basically, there are situations where the >>> GENEID is missing when LOCATION is either coding, promoter, intron, >>> threeUTR or fiveUTR. Here is an example with five SNPs (among many >>> more). I have marked the unexpected results with "##". >>> >>> >>> library(VariantAnnotation); library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>> >>> tmp <- rbind.data.frame(c("rs10917388", "chr1", 23803138), >>> c("rs1063412", "chr1", 172410967), >>> c("rs78291220", "chr2", 60890373), >>> c("rs116917239", "chr17", 44061025), >>> c("rs11593", "chrX", 153627145) ) >>> colnames(tmp) <- c("rsid", "chr", "pos") >>> tmp$pos <- as.numeric( as.character(tmp$pos) ) >>> >>> target <- with(tmp, GRanges(seqnames = Rle(chr), >>> ranges = IRanges(pos, >>> end=pos, names=rsid), >>> strand = Rle(strand("*")) ) ) >>> >>> loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene, >>> AllVariants()) >>> names(loc) <- NULL >>> out <- as.data.frame(loc) >>> out$rsid <- names(target)[ out$QUERYID ] >>> out <- out[ , c("rsid", "seqnames", "start", "LOCATION", "GENEID", >>> "PRECEDEID", "FOLLOWID")] >>> out <- unique(out) >>> rownames(out) <- NULL >>> out >>> >>> rsid seqnames start LOCATION GENEID PRECEDEID FOLLOWID >>> 1 rs10917388 chr1 23803138 intron 55616 <na> <na> >>> 2 rs10917388 chr1 23803138 promoter <na> <na> <na> ## >>> >>> 3 rs1063412 chr1 172410967 intron 92346 <na> <na> >>> 4 rs1063412 chr1 172410967 intron 5279 <na> <na> >>> 5 rs1063412 chr1 172410967 coding 5279 <na> <na> >>> 6 rs1063412 chr1 172410967 coding <na> <na> <na> ## >>> >>> 7 rs78291220 chr2 60890373 promoter <na> <na> <na> ## >>> 8 rs78291220 chr2 60890373 intergenic <na> 64895 400957 >>> >>> 9 rs116917239 chr17 44061025 coding 4137 <na> <na> >>> 10 rs116917239 chr17 44061025 intron 4137 <na> <na> >>> 11 rs116917239 chr17 44061025 coding <na> <na> <na> ## >>> >>> 12 rs11593 chrX 153627145 intron 6134 <na> <na> >>> 13 rs11593 chrX 153627145 promoter 6134 <na> <na> >>> 14 rs11593 chrX 153627145 promoter 26778 <na> <na> >>> 15 rs11593 chrX 153627145 promoter <na> <na> <na> ## >>> 16 rs11593 chrX 153627145 fiveUTR <na> <na> <na> ## >>> 17 rs11593 chrX 153627145 threeUTR <na> <na> <na> ## >>> >>> Can anyone help explain what is happening please? Is this to be >>> expected? Thank you. >>> >>> Regards, Adai >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: >>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>> >>

ADD REPLY • link 12.9 years ago Valerie Obenchain ★ 6.8k

0

Entering edit mode

Dear Valerie, thank you again for your patience and the explanations. Regards, Adai On Tue, Feb 19, 2013 at 8:55 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> wrote: > Yes, evidently C1orf105 is not a 'known gene'. All Bioconductor annotations > can be found here, > > http://bioconductor.org/packages/devel/BiocViews.html#___AnnotationD ata > > The TxDb packages found on this page were made from specific tracks at UCSC > or elsewhere. The details of how a package was made can be found on the > package man page, > > ?TxDb.Hsapiens.UCSC.hg19.knownGene > > If none of the TxDb's meet your need you can create your own. In the > GenomicFeatures package there are several functions available for creating > custom annotations. > > ?makeTranscriptDbFromBiomart > ?makeTranscriptDbFromGFF > ?makeTranscriptDbFromUCSC > > I've also cc'd Marc, who handles our annotations, in case he has something > more to add. > > Valerie > > > > On 02/19/2013 10:46 AM, Adaikalavan Ramasamy wrote: >> >> Dear Valerie, >> >> Thanks once again for the quick reply. I understand there the >> different the annotation databases are not consistent and I am trying >> to understand the reason for some of the noise. If I look up >> chr1:172410869-172411762 (i.e. res[9, ]) on ucsc and also ensembl, I >> see that it overlaps with C1orf105 and PIGC >> >> >> http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr1:172410869-172 411762&hgsid=326905233&pubs=pack >> >> http://www.ensembl.org/Homo_sapiens/Location/View?r=1%3A172410869-1 72411762 >> >> so why does the GeneID is blank in this case? On the other hand res[8, >> ] calls it PIGC only. Is it because C1orf105 is not a "known gene"? >> BTW, the examples I sent were all non-intergenics. Thank you. >> >> Regards, Adai >> >> >> >> On Tue, Feb 19, 2013 at 5:21 PM, Valerie Obenchain <vobencha at="" fhcrc.org=""> >> wrote: >>> >>> Hello, >>> >>> All values in the output (GENEID, TXID, etc.) are taken from the >>> annotation >>> you are using. If the annotaion does not have a GENEID, TXID, etc. for a >>> particular range, then none will be reported. >>> >>> To take a closer look at the annotation we can extract the transcripts >>> and >>> ask for gene_id, cds_id and tx_id as columns. >>> >>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene >>> tx <- transcripts(txdb, columns=c("tx_id", "gene_id", "cds_id")) >>> >>> >>> Looking at the first 3 rows, we see none of these have a gene_id, >>> >>>>> tx[1:3] >>>> >>>> >>>> GRanges with 3 ranges and 3 metadata columns: >>>> seqnames ranges strand | tx_id >>>> gene_id >>>> <rle> <iranges> <rle> | <integer> >>>> <compressedcharacterlist> >>>> [1] chr1 [11874, 14409] + | 1 >>>> [2] chr1 [11874, 14409] + | 2 >>>> [3] chr1 [11874, 14409] + | 3 >>>> cds_id >>>> <compressedintegerlist> >>>> [1] NA >>>> [2] 1,2,3 >>>> [3] NA >>> >>> >>> >>> To isolate the portion of the annotation that overlaps with your ranges >>> you >>> can use subsetByOverlaps(), >>> >>> gr <- GRanges(c("chr1", "chr1", "chr2", "chr17", "chrX"), >>> IRanges(c(23803138, 172410967, 60890373, >>> 44061025, 153627145), width=1)) >>> res <- subsetByOverlaps(tx, gr) >>> >>> Here is an example of a transcript with no gene_id >>> >>>>> res[8:10] >>>> >>>> >>>> GRanges with 3 ranges and 3 metadata columns: >>>> seqnames ranges strand | tx_id >>>> <rle> <iranges> <rle> | <integer> >>>> [1] chr1 [172410597, 172413230] - | 6937 >>>> [2] chr1 [172410869, 172411762] - | 6938 >>>> [3] chr17 [ 43971748, 44105699] + | 59914 >>>> gene_id cds_id >>>> <compressedcharacterlist> <compressedintegerlist> >>>> [1] 5279 NA,20697 >>>> [2] 20697 >>>> [3] 4137 NA,178155,178156,... >>> >>> >>> >>> >>> Also remember that if a range is 'intergenic' that it will not have a >>> GENEID. It will have a PRECEDEID and FOLLOWID, but no GENEID. >>> >>> Valerie >>> >>> >>> >>> >>> On 02/19/2013 06:41 AM, Adaikalavan Ramasamy wrote: >>>> >>>> >>>> Dear all, >>>> >>>> I am finding some unexpected results (to me anyway) with the >>>> VariantAnnotation package. Basically, there are situations where the >>>> GENEID is missing when LOCATION is either coding, promoter, intron, >>>> threeUTR or fiveUTR. Here is an example with five SNPs (among many >>>> more). I have marked the unexpected results with "##". >>>> >>>> >>>> library(VariantAnnotation); library(TxDb.Hsapiens.UCSC.hg19.knownGene) >>>> >>>> tmp <- rbind.data.frame(c("rs10917388", "chr1", 23803138), >>>> c("rs1063412", "chr1", 172410967), >>>> c("rs78291220", "chr2", 60890373), >>>> c("rs116917239", "chr17", 44061025), >>>> c("rs11593", "chrX", 153627145) >>>> ) >>>> colnames(tmp) <- c("rsid", "chr", "pos") >>>> tmp$pos <- as.numeric( as.character(tmp$pos) ) >>>> >>>> target <- with(tmp, GRanges(seqnames = Rle(chr), >>>> ranges = IRanges(pos, >>>> end=pos, names=rsid), >>>> strand = Rle(strand("*")) >>>> ) ) >>>> >>>> loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene, >>>> AllVariants()) >>>> names(loc) <- NULL >>>> out <- as.data.frame(loc) >>>> out$rsid <- names(target)[ out$QUERYID ] >>>> out <- out[ , c("rsid", "seqnames", "start", "LOCATION", "GENEID", >>>> "PRECEDEID", "FOLLOWID")] >>>> out <- unique(out) >>>> rownames(out) <- NULL >>>> out >>>> >>>> rsid seqnames start LOCATION GENEID PRECEDEID >>>> FOLLOWID >>>> 1 rs10917388 chr1 23803138 intron 55616 <na> <na> >>>> 2 rs10917388 chr1 23803138 promoter <na> <na> <na> >>>> ## >>>> >>>> 3 rs1063412 chr1 172410967 intron 92346 <na> <na> >>>> 4 rs1063412 chr1 172410967 intron 5279 <na> <na> >>>> 5 rs1063412 chr1 172410967 coding 5279 <na> <na> >>>> 6 rs1063412 chr1 172410967 coding <na> <na> <na> >>>> ## >>>> >>>> 7 rs78291220 chr2 60890373 promoter <na> <na> <na> >>>> ## >>>> 8 rs78291220 chr2 60890373 intergenic <na> 64895 400957 >>>> >>>> 9 rs116917239 chr17 44061025 coding 4137 <na> <na> >>>> 10 rs116917239 chr17 44061025 intron 4137 <na> <na> >>>> 11 rs116917239 chr17 44061025 coding <na> <na> <na> >>>> ## >>>> >>>> 12 rs11593 chrX 153627145 intron 6134 <na> <na> >>>> 13 rs11593 chrX 153627145 promoter 6134 <na> <na> >>>> 14 rs11593 chrX 153627145 promoter 26778 <na> <na> >>>> 15 rs11593 chrX 153627145 promoter <na> <na> <na> >>>> ## >>>> 16 rs11593 chrX 153627145 fiveUTR <na> <na> <na> >>>> ## >>>> 17 rs11593 chrX 153627145 threeUTR <na> <na> <na> >>>> ## >>>> >>>> Can anyone help explain what is happening please? Is this to be >>>> expected? Thank you. >>>> >>>> Regards, Adai >>>> >>>> _______________________________________________ >>>> Bioconductor mailing list >>>> Bioconductor at r-project.org >>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>> Search the archives: >>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> >>> >

ADD REPLY • link 12.9 years ago Adaikalavan Ramasamy ▴ 220

Login before adding your answer.