GENEID is missing when LOCATION is non-intergenic in VariantAnnotation package


Adaikalavan Ramasamy ▴ 220

@adaikalavan-ramasamy-5765

Last seen 10.4 years ago

United Kingdom

Dear all,

I am finding some unexpected results (to me anyway) with the VariantAnnotation package. Basically, there are situations where the GENEID is missing when LOCATION is either coding, promoter, intron, threeUTR or fiveUTR. Here is an example with five SNPs (among many more). I have marked the unexpected results with "##".

    library(VariantAnnotation); library(TxDb.Hsapiens.UCSC.hg19.knownGene)

    tmp <- rbind.data.frame(c("rs10917388", "chr1", 23803138),
                            c("rs1063412", "chr1", 172410967),
                            c("rs78291220", "chr2", 60890373),
                            c("rs116917239", "chr17", 44061025),
                            c("rs11593", "chrX", 153627145) )
    colnames(tmp) <- c("rsid", "chr", "pos")
    tmp$pos <- as.numeric( as.character(tmp$pos) )

    target <- with(tmp, GRanges(seqnames = Rle(chr),
                                ranges = IRanges(pos, end=pos, names=rsid),
                                strand = Rle(strand("*")) ) )

    loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants())
    names(loc) <- NULL
    out <- as.data.frame(loc)
    out$rsid <- names(target)[ out$QUERYID ]
    out <- out[ , c("rsid", "seqnames", "start", "LOCATION", "GENEID",
                    "PRECEDEID", "FOLLOWID")]
    out <- unique(out)
    rownames(out) <- NULL
    out

              rsid seqnames     start   LOCATION GENEID PRECEDEID FOLLOWID
     1  rs10917388     chr1  23803138     intron  55616      <NA>     <NA>
     2  rs10917388     chr1  23803138   promoter   <NA>      <NA>     <NA>  ##

     3   rs1063412     chr1 172410967     intron  92346      <NA>     <NA>
     4   rs1063412     chr1 172410967     intron   5279      <NA>     <NA>
     5   rs1063412     chr1 172410967     coding   5279      <NA>     <NA>
     6   rs1063412     chr1 172410967     coding   <NA>      <NA>     <NA>  ##

     7  rs78291220     chr2  60890373   promoter   <NA>      <NA>     <NA>  ##
     8  rs78291220     chr2  60890373 intergenic   <NA>     64895   400957

     9 rs116917239    chr17  44061025     coding   4137      <NA>     <NA>
    10 rs116917239    chr17  44061025     intron   4137      <NA>     <NA>
    11 rs116917239    chr17  44061025     coding   <NA>      <NA>     <NA>  ##

    12     rs11593     chrX 153627145     intron   6134      <NA>     <NA>
    13     rs11593     chrX 153627145   promoter   6134      <NA>     <NA>
    14     rs11593     chrX 153627145   promoter  26778      <NA>     <NA>
    15     rs11593     chrX 153627145   promoter   <NA>      <NA>     <NA>  ##
    16     rs11593     chrX 153627145    fiveUTR   <NA>      <NA>     <NA>  ##
    17     rs11593     chrX 153627145   threeUTR   <NA>      <NA>     <NA>  ##

Can anyone help explain what is happening please? Is this to be expected? Thank you.

Regards, Adai
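The rows marked "##" can also be picked out programmatically rather than by eye. The following is only a minimal sketch in base R: the small `out` data frame re-types five rows of the table above (it is not new output), and the name `flagged` is illustrative.

```r
# Toy copy of five rows from the table above (rows 1, 2, 7, 8 and 13).
out <- data.frame(
  rsid     = c("rs10917388", "rs10917388", "rs78291220", "rs78291220", "rs11593"),
  LOCATION = c("intron", "promoter", "promoter", "intergenic", "promoter"),
  GENEID   = c("55616", NA, NA, NA, "6134"),
  stringsAsFactors = FALSE
)

# A row is unexpected when GENEID is missing although the variant
# is not intergenic (intergenic rows legitimately have no GENEID).
flagged <- is.na(out$GENEID) & out$LOCATION != "intergenic"

out[flagged, ]
```

The same two-condition filter should work unchanged on the full `out` built from `locateVariants()`, since comparing a factor LOCATION column against a string behaves the same way.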


ADD COMMENT • link updated 12.1 years ago by Valerie Obenchain ★ 6.8k • written 12.1 years ago by Adaikalavan Ramasamy ▴ 220


Valerie Obenchain ★ 6.8k

@valerie-obenchain-4275

Last seen 3.2 years ago

United States

Hello,

All values in the output (GENEID, TXID, etc.) are taken from the annotation you are using. If the annotation does not have a GENEID, TXID, etc. for a particular range, then none will be reported.

To take a closer look at the annotation we can extract the transcripts and ask for gene_id, cds_id and tx_id as columns.

    txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
    tx <- transcripts(txdb, columns=c("tx_id", "gene_id", "cds_id"))

Looking at the first 3 rows, we see none of these have a gene_id,

    > tx[1:3]
    GRanges with 3 ranges and 3 metadata columns:
          seqnames         ranges strand |     tx_id                   gene_id
             <Rle>      <IRanges>  <Rle> | <integer> <CompressedCharacterList>
      [1]     chr1 [11874, 14409]      + |         1
      [2]     chr1 [11874, 14409]      + |         2
      [3]     chr1 [11874, 14409]      + |         3
                           cds_id
          <CompressedIntegerList>
      [1]                      NA
      [2]                   1,2,3
      [3]                      NA

To isolate the portion of the annotation that overlaps with your ranges you can use subsetByOverlaps(),

    gr <- GRanges(c("chr1", "chr1", "chr2", "chr17", "chrX"),
                  IRanges(c(23803138, 172410967, 60890373,
                            44061025, 153627145), width=1))
    res <- subsetByOverlaps(tx, gr)

Here is an example of a transcript with no gene_id (the second row):

    > res[8:10]
    GRanges with 3 ranges and 3 metadata columns:
          seqnames                 ranges strand |     tx_id
             <Rle>              <IRanges>  <Rle> | <integer>
      [1]     chr1 [172410597, 172413230]      - |      6937
      [2]     chr1 [172410869, 172411762]      - |      6938
      [3]    chr17 [ 43971748,  44105699]      + |     59914
                            gene_id                  cds_id
          <CompressedCharacterList> <CompressedIntegerList>
      [1]                      5279                NA,20697
      [2]                                             20697
      [3]                      4137    NA,178155,178156,...

Also remember that an 'intergenic' range will not have a GENEID. It will have a PRECEDEID and FOLLOWID, but no GENEID.

Valerie
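As a rough sketch of the check described above, the overlapping transcripts that carry no gene_id can be listed directly instead of being spotted in the printout. This assumes a reasonably recent R/Bioconductor installation in which `lengths()` works on list-like metadata columns (older releases used `elementLengths()`); the name `no_gene` is illustrative and not part of any package.

```r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
tx <- transcripts(txdb, columns = c("tx_id", "gene_id"))

# The five query positions from the original post, as width-1 ranges.
gr <- GRanges(c("chr1", "chr1", "chr2", "chr17", "chrX"),
              IRanges(c(23803138, 172410967, 60890373,
                        44061025, 153627145), width = 1))
res <- subsetByOverlaps(tx, gr)

# gene_id is a list column: an element of length zero means the
# annotation assigns no gene to that transcript.
no_gene <- lengths(res$gene_id) == 0

res[no_gene]
```

Note that this only inspects transcripts that overlap the positions themselves, so promoter hits, which lie upstream of a transcript, would need the query ranges widened before they show up here.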

ADD COMMENT • link 12.1 years ago Valerie Obenchain ★ 6.8k


Dear Valerie,

Thanks once again for the quick reply. I understand that the different annotation databases are not consistent, and I am trying to understand the reason for some of the noise. If I look up chr1:172410869-172411762 (i.e. res[9, ]) on UCSC and also Ensembl, I see that it overlaps with C1orf105 and PIGC:

http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr1:172410869-172411762&hgsid=326905233&pubs=pack
http://www.ensembl.org/Homo_sapiens/Location/View?r=1%3A172410869-172411762

So why is the GENEID blank in this case? On the other hand, res[8, ] calls it PIGC only. Is it because C1orf105 is not a "known gene"? BTW, the examples I sent were all non-intergenic. Thank you.

Regards, Adai

ADD REPLY • link 12.1 years ago Adaikalavan Ramasamy ▴ 220


Yes, evidently C1orf105 is not a 'known gene'. All Bioconductor annotations can be found here:

http://bioconductor.org/packages/devel/BiocViews.html#___AnnotationData

The TxDb packages found on this page were made from specific tracks at UCSC or elsewhere. The details of how a package was made can be found on the package man page:

    ?TxDb.Hsapiens.UCSC.hg19.knownGene

If none of the TxDbs meets your needs, you can create your own. In the GenomicFeatures package there are several functions available for creating custom annotations:

    ?makeTranscriptDbFromBiomart
    ?makeTranscriptDbFromGFF
    ?makeTranscriptDbFromUCSC

I've also cc'd Marc, who handles our annotations, in case he has something more to add.

Valerie
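A hedged sketch of what building and using a custom annotation might look like today. Several things here are assumptions to check against your installed version: the makeTranscriptDbFrom*() helpers appear in later Bioconductor releases under the names makeTxDbFrom*(), recent releases ship them in the txdbmaker package rather than GenomicFeatures, and the "refGene" track is used purely as an example of a table other than knownGene (`supportedUCSCtables()` lists what is available). The object name `txdb_refgene` is illustrative, and the call needs internet access.

```r
library(VariantAnnotation)
library(txdbmaker)  # assumption: recent Bioconductor; older releases export
                    # makeTxDbFromUCSC() from GenomicFeatures instead

# Build a TxDb from a different UCSC track (downloads from UCSC).
txdb_refgene <- makeTxDbFromUCSC(genome = "hg19", tablename = "refGene")

# Re-annotate one of the SNPs from the original post against the new TxDb.
target <- GRanges("chr1", IRanges(172410967, width = 1, names = "rs1063412"))
loc <- locateVariants(target, txdb_refgene, AllVariants())

names(loc) <- NULL  # drop duplicated names before converting
unique(as.data.frame(loc)[, c("LOCATION", "GENEID")])
```

Which genes are reported, and which identifiers fill GENEID, depends entirely on the chosen track, so results from two TxDb objects are not expected to agree row for row.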

ADD REPLY • link 12.1 years ago Valerie Obenchain ★ 6.8k


Dear Valerie, thank you again for your patience and the explanations.

Regards, Adai

ADD REPLY • link 12.1 years ago Adaikalavan Ramasamy ▴ 220
