I am trying to annotate a list of SNPs using the hg38 genome (knownGene) and locateVariants(). The program is able to successfully run and provide "GeneIDs" for several of the loci. However, some GeneIDs are applied to SNPs in completely different regions and on completely different chromosomes. When I cross reference these GeneIDs (I like to use ABO, geneID=28, as an example) with the Homo_sapiens.gene_info file from NCBI and check the original knownGene.txt.gz file on the UCSC website, I don't see that any of the reading frames associated with these genes as being in the ranges that locateVariant seems to indicate. I don't know if the problem is with the knownGene.sqlite file, the locateVariants() function, or my R code. I have updated my R software and all used packages within the past two weeks.
The vcf file that I am using looks like:
##fileformat=VCFv4.0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
1 248883847
1 248883842
1 248883838
1 248883825
The gene_info file looks like:
tax_id,GeneID,Symbol,LocusTag,Synonyms,dbXrefs,chromosome,map_location,description,type_of_gene,Symbol_from_nomenclature_authority,Full_name_from_nomenclature_authority,Nomenclature_status,Other_designations,Modification_date,Feature_type
9606,1,A1BG,-,A1B|ABG|GAB|HYST2477,MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410|AllianceGenome:HGNC:5,19,19q13.43,alpha-1-B glycoprotein,protein-coding,A1BG,alpha-1-B glycoprotein,O,alpha-1B-glycoprotein|HEL-S-163pA|epididymis secretory sperm binding protein Li 163pA,20221209,-
9606,2,A2M,-,A2MD|CPAMD5|FWP007|S863-7,MIM:103950|HGNC:HGNC:7|Ensembl:ENSG00000175899|AllianceGenome:HGNC:7,12,12p13.31,alpha-2-macroglobulin,protein-coding,A2M,alpha-2-macroglobulin,O,alpha-2-macroglobulin|C3 and PZP-like alpha-2-macroglobulin domain-containing protein 5|alpha-2-M,20230312,-
When the code is done test contains variants associated with gene #28 (ABO), with an intron on chr1 (wrong), a promoter on chr9 (correct), and an intron on chr11 (wrong). Subset:
seqnames start end width strand LOCATION GENEID
157 chr1 246336469 246336469 1 - intron 28
158 chr1 246336469 246336469 1 - intron 28
159 chr1 246336469 246336469 1 - intron 28
227014 chr9 133277952 133277952 1 - promoter 28
227016 chr9 133277979 133277979 1 - promoter 28
227018 chr9 133277980 133277980 1 - promoter 28
254294 chr11 69305850 69305850 1 - intron 28
254295 chr11 69305850 69305850 1 - intron 28
254296 chr11 69305850 69305850 1 - intron 28
254297 chr11 69305850 69305850 1 - intron 28
library("GenomicFeatures")
library("VariantAnnotation")
library("TxDb.Hsapiens.UCSC.hg38.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
setwd("C:\\Users\\david\\Desktop\\Annotations")
filename <- "CP_E92_vs_E76"
dir <- "C:\\Users\\david\\Desktop\\Annotations\\D_Human_knownGene\\CP_E92_vs_E76"
setwd(dir)
vcf <- readVcf(paste(filename, "_DP4_13min_nona_rheMac10_liftover.vcf.txt", sep = ""), genome = "hg38", verbose = TRUE)
seqlevels(vcf) = paste0("chr", seqlevels(vcf))
rd <- rowRanges(vcf)
variants <- locateVariants(rd, txdb, AllVariants())
write.csv(variants[,c(1,7)], paste(filename, "_annotations.csv", sep = ""), row.names = FALSE)
n <- max(count.fields(paste(filename, "_annotations.csv", sep = ""), sep = ','))
SNPs1 <- read.table(paste(filename, "_annotations.csv", sep = ""), sep=",", skip=1, header = FALSE, col.names = paste0("V",seq_len(n)), fill = TRUE)
SNPs2 <- SNPs1[SNPs1[,4]==1,]
colnames(SNPs2) <- c("seqnames", "start", "end", "width", "strand", "LOCATION", "GENEID")
SNPs3 <- SNPs2[!SNPs2[,6]=="intergenic",]
symbol <- read.csv("C:\\Users\\david\\Desktop\\Annotations\\Z_Alias\\human\\Homo_sapiens.gene_info_031523.csv", header=TRUE)
symbol <- symbol[,2:3]
colnames(symbol) <- c("GENEID", "Symbol")
snps_symbol <- merge(SNPs3, as.data.frame(symbol), by = "GENEID")
test <- SNPs2[which(SNPs2[,7]==28),]
sessionInfo()
Warning: The working directory was changed to C:/Users/david/Desktop/Annotations inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER
found header lines for 0 ‘info’ fields:
found header lines for 0 ‘geno’ fields:
Warning: GRanges object contains 37482 out-of-bound ranges located on sequences 22281, 22282, 22155, 22156, 22157, 22159,...
'select()' returned many:1 mapping between keys and columns
'select()' returned many:1 mapping between keys and columns
Warning: GRanges object contains 222 out-of-bound ranges located on sequences chr1_GL383518v1_alt, chr1_KI270762v1_alt, chr1_KQ458384v1_alt,...
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8 LC_NUMERIC=C LC_TIME=English_United States.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.16.0 VariantAnnotation_1.44.1 Rsamtools_2.14.0 Biostrings_2.66.0
[5] XVector_0.38.0 SummarizedExperiment_1.28.0 MatrixGenerics_1.10.0 matrixStats_0.63.0
[9] GenomicFeatures_1.50.4 AnnotationDbi_1.60.2 Biobase_2.58.0 GenomicRanges_1.50.2
[13] GenomeInfoDb_1.34.9 IRanges_2.32.0 S4Vectors_0.36.2 BiocGenerics_0.44.0
loaded via a namespace (and not attached):
[1] httr_1.4.5 bit64_4.0.5 BiocManager_1.30.20 BiocFileCache_2.6.1 blob_1.2.3 BSgenome_1.66.3 GenomeInfoDbData_1.2.9 yaml_2.3.7
[9] progress_1.2.2 pillar_1.8.1 RSQLite_2.3.0 lattice_0.20-45 glue_1.6.2 digest_0.6.31 htmltools_0.5.4 Matrix_1.5-1
[17] XML_3.99-0.13 pkgconfig_2.0.3 biomaRt_2.54.0 zlibbioc_1.44.0 BiocParallel_1.32.5 tibble_3.2.0 KEGGREST_1.38.0 generics_0.1.3
[25] ellipsis_0.3.2 cachem_1.0.7 cli_3.6.0 magrittr_2.0.3 crayon_1.5.2 memoise_2.0.1 evaluate_0.20 fansi_1.0.4
[33] xml2_1.3.3 tools_4.2.2 prettyunits_1.1.1 hms_1.1.2 BiocIO_1.8.0 lifecycle_1.0.3 stringr_1.5.0 DelayedArray_0.23.2
[41] compiler_4.2.2 rlang_1.0.6 grid_4.2.2 RCurl_1.98-1.10 rjson_0.2.21 rappdirs_0.3.3 bitops_1.0-7 rmarkdown_2.20
[49] restfulr_0.0.15 codetools_0.2-18 DBI_1.1.3 curl_5.0.0 R6_2.5.1 GenomicAlignments_1.34.1 knitr_1.42 dplyr_1.1.0
[57] rtracklayer_1.58.0 fastmap_1.1.1 bit_4.0.5 utf8_1.2.3 filelock_1.0.2 stringi_1.7.12 parallel_4.2.2 Rcpp_1.0.10
[65] vctrs_0.5.2 png_0.1-8 dbplyr_2.3.1 tidyselect_1.2.0 xfun_0.37
Thank you very much for any help you can give me in identifying what is going wrong here.
This is already a known issue.
Thank you very much!
I apologize for my ignorance, however I would like to clarify the problem/solution. The problem specifically occurs with how locateVariants() handles introns when the txdb has missing values.
It seems to me that the most straightforward solution might be to remove the entries with missing values from the txdb. I have tried making a GRanges object to use with:
However, this simply generates a new error message:
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘locateVariants’ for signature ‘"GRanges", "CompressedGRangesList", "AllVariants"’
You cannot use
AllVariants
for aGRangesList
. TheGRangesList
that you are providing contains only the intronic ranges, whereasAllVariants
is meant to locate variants across all possible genomic locations (e.g., coding regions, intergenic regions, promoters, etc). But you can do whatAllVariants
does, by hand.Thank you so much for your help!
When I run:
"intron" has 0 results. I have tried this again specifically adding loci to the vcf that should be recognized as an intron. My best guess is that grl_no_zero needs to identify included genomic ranges as "introns" for IntronVariants() to acknowledge the SNPs in the vcf as introns. However, I am unsure if this is true or how I would go about doing adding these identifications, even if it were true.
This works for me.
Thank you!