Hi,
I want to determine the three genes which are closest to my annotated intergenic variants in both directions. After annotating the variant list, I used commands which are part of the examples of locateVariants in the variantannotation package.
This is what I tried:
loc <- locateVariants(target, TxDb.Hsapiens.UCSC.hg19.knownGene, AllVariants(promoter=PromoterVariants(downstream=500))) names(loc) <- NULL out <- as.data.frame(loc) out$names <- names(target)[ out$QUERYID ] annotTable <- out[ , c("names", "seqnames","QUERYID", "start", "end", "LOCATION", "GENEID", "PRECEDEID", "FOLLOWID")] for (i in 1:nrow(annotTable)) { annotTable$distance[[i]] <- integer(0) } for (i in 1:length(loc)){ p_ids <- unlist(loc$FOLLOWID[i,],use.names=FALSE) if (length(p_ids) > 0){ exp_ranges <- rep(loc[i,], elementLengths(loc$FOLLOWID[i,])) p_dist <- distance(exp_ranges, TxDb.Hsapiens.UCSC.hg19.knownGene, id=p_ids, type="gene") print(loc$FOLLOWID[i,]) print(p_dist) dist.order <- order(p_dist) annotTable$distance[[i]] <- p_dist[dist.order[1:3]] } }
Now I wonder why the output of the two print statements looks like this, having "NAs" in the list of distances:
FOLLOWIDs:
[[1]] 100130417 100132062 100133331 100133331 100288069 148398 26155 284593 339451 ... 643837 729737 729759 729759 79501 79854 84069 84808 9636
Distances:
[1] 209983 736219 NA NA 737862 385064 NA 350732 184839 170121 252618 163701 73301 55113 13064 129248 129248 269974 924234 696203
[21] 442766 994792
> sessionInfo() R version 3.2.2 (2015-08-14) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 15.10 locale: [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8 LC_MONETARY=de_DE.UTF-8 [6] LC_MESSAGES=de_DE.UTF-8 LC_PAPER=de_DE.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] BiocInstaller_1.20.3 org.Hs.eg.db_3.2.3 RSQLite_1.0.0 [4] DBI_0.3.1 TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.22.13 [7] AnnotationDbi_1.32.3 VariantAnnotation_1.16.4 Rsamtools_1.22.0 [10] SummarizedExperiment_1.0.2 GenomicRanges_1.22.4 GenomeInfoDb_1.6.3 [13] Biostrings_2.38.4 XVector_0.10.0 IRanges_2.4.8 [16] S4Vectors_0.8.11 GWASTools_1.16.1 Biobase_2.30.0 [19] BiocGenerics_0.16.1 loaded via a namespace (and not attached): [1] Rcpp_0.12.4 futile.logger_1.4.1 plyr_1.8.3 futile.options_1.0.0 bitops_1.0-6 [6] tools_3.2.2 zlibbioc_1.16.0 biomaRt_2.26.1 BSgenome_1.38.0 gtable_0.2.0 [11] lattice_0.20-33 ncdf_1.6.9 logistf_1.21 Matrix_1.2-4 SparseM_1.7 [16] rtracklayer_1.30.4 MatrixModels_0.4-1 lmtest_0.9-34 grid_3.2.2 GWASExactHW_1.01 [21] quantsmooth_1.36.0 DNAcopy_1.44.0 XML_3.98-1.4 BiocParallel_1.4.3 survival_2.38-3 [26] lambda.r_1.1.7 ggplot2_2.1.0 GenomicAlignments_1.6.3 scales_0.4.0 splines_3.2.2 [31] gdsfmt_1.6.2 colorspace_1.2-6 quantreg_5.21 sandwich_2.3-4 RCurl_1.95-4.8 [36] munsell_0.4.3 zoo_1.7-12
How can the distance not be known if the "PRECEDEIDs" and "FOLLOWIDs" are determined based on distance parameters of locateVariants and the same package is used (TxDb.Hsapiens.UCSC.hg19.knownGene) for both commands (locateVariants and distance)? Did I do anything wrong? Is there another way to obtain the genes which are closest to each variant in my list?
Thank you for your help!
I found it:
GenomicRanges:::findKNN()
. It's not exported, because I never tested it, and maybe it should be generalized to IRanges. It would be great if someone on the core team got it into working order. Then we could support it inlocateVariants()
.locateVariants can find the closest one? As far as I understood it, it just provides a seemingly unsorted list of all genes within the user defined region (?). Which is not much of a problem if I can obtain the respective distances with the "distance()" function and order by distance. The problem is that some of the distances are "NA", so I cannot tell whether these genes are among the closest 3 genes.
Yea, my bad, I made an assumption. Please see the edited answer.
You are right that I have to take care of the out of bounds case for results with fewer than three elements, but the NA entries are surely not coming from that, since my print statement comes before cutting the list to three elements. I am printing the complete list here, input list as well as output.
In the meanwhile I was checking the specific geneIDs in loc$FOLLOWID that lead to NAs on the UCSC webpage. The thing is: For all "following" genes of a specific gene in "target" on chr1 within a given distance, I obtain a list of geneID on chr1 with "locateVariants". That is ok, so far. But when I check on these IDs on the UCSC webpage I get several hits for some IDs, including hits on chr5. Could this be the problem? Maybe the distance function only checks on this hit on chr5 and obviously cannot determine the distance of a gene on chr1 to chr5?
If this is the problem... what can I do about it?
Maybe you could make available a subset of the "target" object that reproduces the problem.
Is this ok, or what did you have in mind?
I can't reproduce the NAs, and I just noticed that your Bioconductor version is over a year old. Please update.
Ok, thank you, I'll try that. So you get the same list of distances (printed above), but "normal" distances except of the NA entries?
Everything should be uptodate now. I deleted R and all Bioconductor packages but the problem remains.
When look for "following" genes and use "distance()" to obtain the respective distances, I get this:
[1] 209983 736219 NA NA 737862 385064 NA 350732 184839 170121 252618 163701 73301 55113 13064 129248 129248 269974 924234 696203 442766 994792
I think this is some kind of bug. Using a distance range to obtain geneids and using these geneids to obtain distances leads to the correct distances in 19 of 22 cases und fails three times...
I removed R and Bioconductor completely and installed the new versions.
I was also thinking that there should be no loop, but somehow it doesn't work for me.
p_ids <- unlist(loc_int$FOLLOWID, use.names=FALSE)
exp_ranges <- rep(loc_int, elementNROWS(loc_int$FOLLOWID))
p_dist <- distance(exp_ranges, TxDb.Hsapiens.UCSC.hg19.knownGene, id=p_ids, type="gene")
loc_int$FOLLOW_DIST <- relist(p_dist, loc_int$FOLLOWID)
which should work exactly like the code you posted, except I removed the last part with sorting and selecting three elements (for testing), leads to this (corresponds to the line of the query I posted above):
When I check other queries, most of the values are like this and there are some numbers in between the NAs, which are far too big to be the distances I was looking for. I really don't know what this means, but when I made the loop and made a query for each line in my table, it worked fine (I checked the distances), except of the original problem with single NAs at some positions.
Ok, I see what is happening. There are 403 genes that have transcripts on both strands. Those genes are excluded from gene-level queries by default. The distance() method should probably retrieve all of the genes and reduce over the strands. But for this analysis, I think you might want the distance to the transcript with the promoter containing the variant. It would be nice if locateVariants() had the option to return transcript IDs, not just gene Ids.
Sorry, but I don't really get it... When I obtain all distances at once (like you proposed) OR when I make a loop over my variants (so I obtain a list of distances for one variant) OR when I make a loop over my variants and a nested loop over each geneid of one variant, shouldn't this give the same result in each of the three cases??? If not... I really don't understand why. I get three different results and it seems only the last option (two loops) leads to the correct distances (even without any NAs). It would really be helpful if you clarified this.
Not sure. When I run the code above, I get the correct results (or at least they appear to be correct), just with some NAs.
So you get NAs? Didn't you say you don't get them? Ok... let's make a simple test including the second and third option I described (Call distance() for all geneids close to my variant at once AND call distance() for each of the geneids independently):
This leads to the following output:
The list p_dist2 seems to include all distances of the geneids in p_ids, no NAs, the distances I checked are correct. Everything fine. The first two entries of p_dist are the same as in p_dist2. The remaining distances which are the same in both lists are at different positions and there are three NAs.
I would really like to know:
1. Can anyone reproduce the fact that p_dist and p_dist2 differ?
2. Is there any other explanation for a difference between p_dist and p_dist2 than a bug in distance()?
If I have to get the distances using two loops I think it will take days...
The issue is as you initially suspected: there are genes with transcripts that are non-overlapping (even on different chromosomes), so there is no sane way to reduce those to a range for the entire gene. There was a bug that prevented an informative error message in this case, and I've fixed it. The real solution here is to query by transcript ID, and it looks like Val is working on that.
Ok, but does that explain why distance() gives different results depending on whether a single geneid is submitted or the geneid is part of a list of geneids? If it does not work for the genes you mentioned, shouldn't it work at all for these genes?
Yes, it does explain it, because everything got mixed up (a technical term) when the appropriate assertion was not in place.
I can add an option for locationVariants() to return transcript ids. I'll post back here when it's done.
Valerie
This is done in VariantAnnotation 1.21.6. I've added a 'distanceIDAsGeneID' argument to the IntergenicVariants() constructor:
When TRUE (default) the values in PRECEDEID and FOLLOWID are geneids; when FALSE transcript ids are returned. I've modified the example to show that the call to distance() should specify type="tx" when distanceIDAsGeneID=TRUE.
Let me know if you run into problems or if this was not the implementation you were after.
Valerie
The name of the argument seems strange since the locateVariants() function does not (explicitly) compute any distances. Maybe something like idType=c("gene", "tx")? Or maybe it could just return both the gene and tx ID? Maybe the return value could somehow keep track of the type of ID in each column, so that distance() would not require the type argument?
I agree, idType is a better name. I've made this change in 1.21.7 and made it a character with default "gene".
Returning both id types and tracking id type in the columns would be more involved. I can't get to this right away - could put it on the TODO or accept a patch.
Val
I've re-worked distance,GenomicRanges,TxDb-method in GenomicFeatures so it's more well behaved. It now returns NA when (1) 'id' is not in 'y' or (2) when 'id' defines a range that can't be reduced to a single range (e.g., gene id spans multiple chromosomes). Previously the method would just throw an error when either of these conditions were met.
Changes are in GenomicFeatures 1.27.4.
Valerie