Search
Question: nearest genes using TxDb.Hsapiens.UCSC.hg19
0
2.2 years ago by
C Lin60
United States
C Lin60 wrote:

Hi everyone,

I am trying to use TxDb.Hsapiens.UCSC.hg19.knownGene to get the nearest genes.

I tried to put in the range for PLCXD1 (chrX:198,061-220,022) but interestingly the nearest gene found is LINC00102 which is 2Mbp away at chrX: 2,531,032-2,533,388. Any idea why is this? Was my query wrong? Please see below on how I queried it. Thank you in advance for your help.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
library(ChIPpeakAnno)
genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
genes[nearest(GRanges("chrX",IRanges(198061,220022)),genes)]

This is the output:

GRanges object with 1 range and 1 metadata column:
seqnames             ranges strand |     gene_id
<Rle>          <IRanges>  <Rle> | <character>
100359394     chrX [2531032, 2533388]      - |   100359394
-------
seqinfo: 93 sequences (1 circular) from hg19 genome

modified 2.2 years ago by Julie Zhu3.8k • written 2.2 years ago by C Lin60
2
2.2 years ago by
United States
James W. MacDonald47k wrote:

PLCXD1 is on both chrX and chrY. Given that there is just one Gene ID for a thing that is in two places at once, the genes function doesn't return anything for that gene. So the nearest gene in that situation is in fact the LINC gene. It's a different story if we talk about transcripts, however.

> gns <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
> gns["55344",]
Error: subscript contains invalid names
> tx <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
> tx["55344"]
GRangesList object of length 1:
$55344 GRanges object with 4 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] chrX [192991, 220022] + | 75334 uc011mgx.2 [2] chrX [198061, 220022] + | 75335 uc004cpc.3 [3] chrY [142991, 170022] + | 78356 uc011nae.2 [4] chrY [148061, 170022] + | 78357 uc004fop.3 ------- seqinfo: 93 sequences (1 circular) from hg19 genome > txun <- unlist(tx) > txun[nearest(GRanges("chrX", IRanges(198061,220022)),txun),] GRanges object with 1 range and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> 55344 chrX [198061, 220022] + | 75335 uc004cpc.3 ------- seqinfo: 93 sequences (1 circular) from hg19 genome ADD COMMENTlink written 2.2 years ago by James W. MacDonald47k 1 2.2 years ago by Julie Zhu3.8k United States Julie Zhu3.8k wrote: Hi Cenny, You might want to try using EnsDb.Hsapiens.v75 to annotate your data when you found that there are missing annotations in the knownGene of UCSC. Here is the code you may want to try: library(BiocInstaller) biocLite("EnsDb.Hsapiens.v75") # v75 is release version of 75, library(EnsDb.Hsapiens.v75) target <- GRanges("chrX", IRanges(60743,60753)) genes <- genes(EnsDb.Hsapiens.v75) seqlevelsStyle(genes) <- "UCSC" # you may want to filter the pseudogene genes <- genes[genes$gene_biotype %in% c("protein_coding", "miRNA", "lincRNA", "misc_RNA", "rRNA", "antisense", "snRNA", "snoRNA", "Mt_tRNA", "Mt_rRNA")]

ms.anno <- annotatePeakInBatch(target, AnnotationData=genes,ignore.strand = TRUE)

ms.anno <- ms.anno[!is.na(ms.anno$feature)] mcols(ms.anno) <- cbind(mcols(ms.anno), mcols(genes[ms.anno$feature] ))

If you have any questions, please let me know. Thank you.

Best,

Posted for Jianhong Ou

Hi Julie and Jianhong,

Thank you very much. The above code is very useful. I will use Ensembl for the annotation.

Quick question, what is seqlevelsStyle(genes) <- "UCSC" for ?

For those who was wondering like I did, although the genome assembly is the same hg19, you may find some genes have different locations. That is because of the way Ensembl and UCSC locate the genes. See more here.

Ensembl annotates many more genes than UCSC. Here is an interesting paper comparing them. One other thing to note: you need to add 1 to the start coordinates from UCSC to get the equivalent Ensembl coordinate (one-based vs. zero-based coordinate system).

The description of the different gene biotypes is here.

Thank you, Julie!

Hi Julie and Jianhong,

Is there an easy way to return only 1 gene for each peak region? Sometimes, there are genes that have exactly the same nearest distance that both of them got returned. Maybe return the longest gene? or collapse the multiple nearest genes together, so that each row still refers to one peak region? For example this one:

library(EnsDb.Hsapiens.v75)
target <- GRanges("chr3", IRanges(52010210,52010234))
genes <- genes(EnsDb.Hsapiens.v75)
seqlevelsStyle(genes) <- "UCSC"
genes <- genes[!grepl('pseudogene',genes$gene_biotype)] ms.anno <- annotatePeakInBatch(target, AnnotationData=genes,PeakLocForDistance = "middle") ms.anno <- ms.anno[!is.na(ms.anno$feature)]
mcols(ms.anno) <- cbind(mcols(ms.anno), mcols(genes[ms.anno$feature] )) Thanks! ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by C Lin60 1 Here is my code: ms.anno$anno.width <- abs(ms.anno$start_position - ms.anno$end_position)
ms.anno <- ms.anno[order(ms.anno$peak, -ms.anno$anno.width)]
ms.anno <- ms.anno[!duplicated(ms.anno$peak)] ms.anno$anno.width <- NULL
ms.anno


Thank you so much, Jianhong!! That is very efficient and clever. Not to mention very helpful.

I still need to get used to GRanges and other genomics R objects. Your codes help a lot!

1
2.2 years ago by
Julie Zhu3.8k
United States
Julie Zhu3.8k wrote:

Cenny,

You wonder about why the gene PLCXD1 can be found as transcripts but not genes in UCSC.

My understanding is that you picked a very special gene. This gene has two locations: one in chrX, and one in chrY.

http://www.genecards.org/cgi-bin/carddisp.pl?gene=PLCXD1&keywords=PLCXD1

When you use genes() function with default parameter in the GenomicFeatures package, the returned gene locations are all unique.

However, the transcript_ids are unique for the different transcripts of this gene. So you have no problem to use transcripts(txdb) to get the transcripts.

If you read the help of genes function, there is a parameter to deal with multiple location genes:

 single.strand.genes.only TRUE or FALSE. If TRUE (the default), then genes that have exons located on both strands of the same chromosome or on two different chromosomes are dropped. In that case, the genes are returned in a GRanges object. Otherwise, all genes are returned in aGRangesList object with the columns specified thru the columns argument set as top level metadata columns. (Please keep in mind that the top level metadata columns of a  GRangesList object are not displayed by the show method.)

The following code are very educational:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)

txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

cols <- c("tx_id", "tx_chrom", "tx_strand",

"exon_id", "exon_chrom", "exon_strand")

single_strand_genes <- genes(txdb, columns=cols)

all_genes <- genes(txdb, columns=cols, single.strand.genes.only=FALSE)

all_genes  # a GRangesList object

multiple_strand_genes <- all_genes[elementNROWS(all_genes) >= 2]

multiple_strand_genes[["55344"]]

55344 is the gene id of PLCXD1.

You can run it yourself.

As to the gene_bio_type, here is the explanation:

http://www.gencodegenes.org/gencode_biotypes.html

Thank you for using ChIPpeakAnno, and thanks for the suggestion. Let us know if you have more questions.

Thanks,

Posted for Jun Yu

Thank you. That's very useful.

0
2.2 years ago by
C Lin60
United States
C Lin60 wrote:

Thank very much for your reply, James! Didn't realize I should use transcripts. I was thinking that was because the gene was found in 2010 when I was using 2009 assembly.

I probably should have asked a different question. Because now, I cannot figure out how to get the official symbol if it is a transcript.

I was using the annotatePeakInBatch function to annotate my peaks. If I used genes then I will get the LINC gene. How can I used transcript to get the PLCXD1 gene? Thank you again in advance for your help.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(org.Hs.eg.db)
library(ChIPpeakAnno)
target <- GRanges("chrX", IRanges(198061,220022))
genes <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
ms.anno <- annotatePeakInBatch(target, AnnotationData=genes,ignore.strand = TRUE)
orgAnn="org.Hs.eg.db",
feature_id_type="entrez_id",
IDs2Add=c("symbol","ensembl","unigene","entrez_id"))
ms.anno1
GRanges object with 1 range and 12 metadata columns:
seqnames           ranges strand |        peak     feature start_position end_position feature_strand
<Rle>        <IRanges>  <Rle> | <character> <character>      <integer>    <integer>    <character>
X1.100359394     chrX [198061, 220022]      * |           1   100359394        2531032      2533388              -
insideFeature distancetoFeature shortestDistance fromOverlappingOrNearest    symbol         ensembl
<factor>         <numeric>        <integer>              <character>  <factor>        <factor>
X1.100359394    downstream           2335327          2311010          NearestLocation LINC00102 ENSG00000230542
unigene
<factor>
X1.100359394 Hs.571720
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths

In addition, you should read the vignette for the package you are trying to use, as there are any number of functions that are intended to make your life simpler. As an example, annoGR, which I found by reading the help page for annotatePeakInBatch, and toGRanges, which I found by reading the vignette. Learning how to find answers yourself is an invaluable skill that you should cultivate.

> agr <- annoGR(TxDb.Hsapiens.UCSC.hg19.knownGene, "transcript")
> agr[nearest(GRanges("chrX",IRanges(198061,220022)),agr),]
annoGR object with 1 range and 2 metadata columns:
seqnames           ranges strand |     tx_name         gene_id
<Rle>        <IRanges>  <Rle> | <character> <CharacterList>
75335     chrX [198061, 220022]      + |  uc004cpc.3           55344
-------
seqinfo: 93 sequences (1 circular) from hg19 genome

Thank you very much for letting me know where you found the annoGR and toGRanges. I completely agree with you learning how to find answers is extremely valuable. That is what I am trying to do, but reading the vignettes and manuals can be overwhelming for me that I need more guidance from people like you. So, Thank You !!

Forgive me for my ignorance. But, now I am not sure whether I should use transcripts or genes to annotate my peaks. As I mentioned before, I wanted to get the nearest genes. Is LINC00102 the nearest gene? or is PLCXD1 the nearest gene?

Thanks again for you help

Now you are asking questions that are well beyond the scope of this support site. What is and isn't a gene with respect to your experiment is up to you to decide. But I would point out that region you are looking at is within PLCXD1 (at least one of the locations for that gene, at least), so I can't think of any definition of 'nearest' that would say any other gene is nearer to the region of interest than the gene it is within!

Like, if you have two houses, and they are identical, and one is on this block and the other is two blocks over, and I am visiting you in one of your houses, am I nearer to your house, or your neighbors?

Also LINC00102 is a lincRNA, and hence not known to be translated to protein. So it's a gene in that it is transcribed to mRNA, but it isn't of interest if you only care about that subset of genes that make proteins.