Question

TxDB.Hsapiens.UCSC.hg38.knownGene with locateVariants() identifying SNPs from various chromosome being part of the same gene

0

Entering edit mode

davidhillis • 0

@c9419ef9

Last seen 13 months ago

United States

I am trying to annotate a list of SNPs using the hg38 genome (knownGene) and locateVariants(). The program is able to successfully run and provide "GeneIDs" for several of the loci. However, some GeneIDs are applied to SNPs in completely different regions and on completely different chromosomes. When I cross reference these GeneIDs (I like to use ABO, geneID=28, as an example) with the Homo_sapiens.gene_info file from NCBI and check the original knownGene.txt.gz file on the UCSC website, I don't see that any of the reading frames associated with these genes as being in the ranges that locateVariant seems to indicate. I don't know if the problem is with the knownGene.sqlite file, the locateVariants() function, or my R code. I have updated my R software and all used packages within the past two weeks.

The vcf file that I am using looks like:
##fileformat=VCFv4.0    
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
1   248883847
1   248883842
1   248883838
1   248883825

The gene_info file looks like:
tax_id,GeneID,Symbol,LocusTag,Synonyms,dbXrefs,chromosome,map_location,description,type_of_gene,Symbol_from_nomenclature_authority,Full_name_from_nomenclature_authority,Nomenclature_status,Other_designations,Modification_date,Feature_type
9606,1,A1BG,-,A1B|ABG|GAB|HYST2477,MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410|AllianceGenome:HGNC:5,19,19q13.43,alpha-1-B glycoprotein,protein-coding,A1BG,alpha-1-B glycoprotein,O,alpha-1B-glycoprotein|HEL-S-163pA|epididymis secretory sperm binding protein Li 163pA,20221209,-
9606,2,A2M,-,A2MD|CPAMD5|FWP007|S863-7,MIM:103950|HGNC:HGNC:7|Ensembl:ENSG00000175899|AllianceGenome:HGNC:7,12,12p13.31,alpha-2-macroglobulin,protein-coding,A2M,alpha-2-macroglobulin,O,alpha-2-macroglobulin|C3 and PZP-like alpha-2-macroglobulin domain-containing protein 5|alpha-2-M,20230312,-

When the code is done test contains variants associated with gene #28 (ABO), with an intron on chr1 (wrong), a promoter on chr9 (correct), and an intron on chr11 (wrong). Subset:

       seqnames     start       end width strand LOCATION GENEID
157        chr1 246336469 246336469     1      -   intron     28
158        chr1 246336469 246336469     1      -   intron     28
159        chr1 246336469 246336469     1      -   intron     28
227014     chr9 133277952 133277952     1      - promoter     28
227016     chr9 133277979 133277979     1      - promoter     28
227018     chr9 133277980 133277980     1      - promoter     28
254294    chr11  69305850  69305850     1      -   intron     28
254295    chr11  69305850  69305850     1      -   intron     28
254296    chr11  69305850  69305850     1      -   intron     28
254297    chr11  69305850  69305850     1      -   intron     28

library("GenomicFeatures")
library("VariantAnnotation")
library("TxDb.Hsapiens.UCSC.hg38.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene

setwd("C:\\Users\\david\\Desktop\\Annotations")
filename <- "CP_E92_vs_E76"

dir <- "C:\\Users\\david\\Desktop\\Annotations\\D_Human_knownGene\\CP_E92_vs_E76"

setwd(dir)
vcf <- readVcf(paste(filename, "_DP4_13min_nona_rheMac10_liftover.vcf.txt", sep = ""), genome = "hg38", verbose = TRUE)
seqlevels(vcf) = paste0("chr", seqlevels(vcf))
rd <- rowRanges(vcf)
variants <- locateVariants(rd, txdb, AllVariants()) 

write.csv(variants[,c(1,7)], paste(filename, "_annotations.csv", sep = ""), row.names = FALSE) 
n <- max(count.fields(paste(filename, "_annotations.csv", sep = ""), sep = ','))
SNPs1 <- read.table(paste(filename, "_annotations.csv", sep = ""), sep=",", skip=1, header = FALSE, col.names = paste0("V",seq_len(n)), fill = TRUE)
SNPs2 <- SNPs1[SNPs1[,4]==1,]
colnames(SNPs2) <- c("seqnames", "start", "end", "width", "strand", "LOCATION", "GENEID")

SNPs3 <- SNPs2[!SNPs2[,6]=="intergenic",]

symbol <- read.csv("C:\\Users\\david\\Desktop\\Annotations\\Z_Alias\\human\\Homo_sapiens.gene_info_031523.csv", header=TRUE)
symbol <- symbol[,2:3]
colnames(symbol) <- c("GENEID", "Symbol")
snps_symbol <- merge(SNPs3, as.data.frame(symbol), by = "GENEID")

test <- SNPs2[which(SNPs2[,7]==28),]

sessionInfo()

Warning: The working directory was changed to C:/Users/david/Desktop/Annotations inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.found header lines for 3 ‘fixed’ fields: ALT, QUAL, FILTER 
found header lines for 0 ‘info’ fields:  
found header lines for 0 ‘geno’ fields:  
Warning: GRanges object contains 37482 out-of-bound ranges located on sequences 22281, 22282, 22155, 22156, 22157, 22159,...
'select()' returned many:1 mapping between keys and columns
'select()' returned many:1 mapping between keys and columns
Warning: GRanges object contains 222 out-of-bound ranges located on sequences chr1_GL383518v1_alt, chr1_KI270762v1_alt, chr1_KQ458384v1_alt,...

R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] TxDb.Hsapiens.UCSC.hg38.knownGene_3.16.0 VariantAnnotation_1.44.1                 Rsamtools_2.14.0                         Biostrings_2.66.0                       
 [5] XVector_0.38.0                           SummarizedExperiment_1.28.0              MatrixGenerics_1.10.0                    matrixStats_0.63.0                      
 [9] GenomicFeatures_1.50.4                   AnnotationDbi_1.60.2                     Biobase_2.58.0                           GenomicRanges_1.50.2                    
[13] GenomeInfoDb_1.34.9                      IRanges_2.32.0                           S4Vectors_0.36.2                         BiocGenerics_0.44.0                     

loaded via a namespace (and not attached):
 [1] httr_1.4.5               bit64_4.0.5              BiocManager_1.30.20      BiocFileCache_2.6.1      blob_1.2.3               BSgenome_1.66.3          GenomeInfoDbData_1.2.9   yaml_2.3.7              
 [9] progress_1.2.2           pillar_1.8.1             RSQLite_2.3.0            lattice_0.20-45          glue_1.6.2               digest_0.6.31            htmltools_0.5.4          Matrix_1.5-1            
[17] XML_3.99-0.13            pkgconfig_2.0.3          biomaRt_2.54.0           zlibbioc_1.44.0          BiocParallel_1.32.5      tibble_3.2.0             KEGGREST_1.38.0          generics_0.1.3          
[25] ellipsis_0.3.2           cachem_1.0.7             cli_3.6.0                magrittr_2.0.3           crayon_1.5.2             memoise_2.0.1            evaluate_0.20            fansi_1.0.4             
[33] xml2_1.3.3               tools_4.2.2              prettyunits_1.1.1        hms_1.1.2                BiocIO_1.8.0             lifecycle_1.0.3          stringr_1.5.0            DelayedArray_0.23.2     
[41] compiler_4.2.2           rlang_1.0.6              grid_4.2.2               RCurl_1.98-1.10          rjson_0.2.21             rappdirs_0.3.3           bitops_1.0-7             rmarkdown_2.20          
[49] restfulr_0.0.15          codetools_0.2-18         DBI_1.1.3                curl_5.0.0               R6_2.5.1                 GenomicAlignments_1.34.1 knitr_1.42               dplyr_1.1.0             
[57] rtracklayer_1.58.0       fastmap_1.1.1            bit_4.0.5                utf8_1.2.3               filelock_1.0.2           stringi_1.7.12           parallel_4.2.2           Rcpp_1.0.10             
[65] vctrs_0.5.2              png_0.1-8                dbplyr_2.3.1             tidyselect_1.2.0         xfun_0.37

Thank you very much for any help you can give me in identifying what is going wrong here.

MetID TxDB.Hsapiens.UCSC.hg38.knownGene locateVariants GenomicFeatures VariantAnnotation • 1.4k views

ADD COMMENT • link 13 months ago davidhillis • 0

score 1 · Answer 1 · 2023-03-17

1

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 17 hours ago

United States

Here is a simplified version that shows what I think is the problem.

> library(VariantAnnotation)
> library(GenomicFeatures)
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
> gr <- GRanges("chr1:246336469-246336469")
> intx <- intronsByTranscript(TxDb.Hsapiens.UCSC.hg38.knownGene)
> z <- mapToTranscripts(gr, intx)
> z
GRanges object with 14 ranges and 2 metadata columns:
       seqnames    ranges strand |     xHits transcriptsHits
          <Rle> <IRanges>  <Rle> | <integer>       <integer>
   [1]    20048     80763      - |         1           20048
   [2]    20049    170521      - |         1           20049
   [3]    20050    279352      - |         1           20050
   [4]    20058     18648      - |         1           20058
   [5]    20059    170521      - |         1           20059
   ...      ...       ...    ... .       ...             ...
  [10]   263822     18648      - |         1          263822
  [11]   263823    170521      - |         1          263823
  [12]   263823    472265      - |         1          263823
  [13]   263824    170936      - |         1          263824
  [14]   263827    146902      - |         1          263827

Now the transcriptsHits column of the GRanges is supposed to indicate which of the GRangesList items map to the input positions, but they don't!

> unlist(intx[unique(z$transcriptsHits)])
GRanges object with 65 ranges and 0 metadata columns:
                    seqnames              ranges strand
                       <Rle>           <IRanges>  <Rle>
   20048                chr1 182646918-182648136      -
   20048                chr1 182648304-182665968      -
   20048                chr1 182666034-182666871      -
   20048                chr1 182666974-182671656      -
   20048                chr1 182671731-182672818      -
     ...                 ...                 ...    ...
  263824 chr9_KN196479v1_fix       160696-162132      -
  263827 chr9_KN196479v1_fix       167789-170218      -
  263827 chr9_KN196479v1_fix       170428-171265      -
  263827 chr9_KN196479v1_fix       171347-172633      -
  263827 chr9_KN196479v1_fix       172795-174015      -
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

This has to do with the method that dispatches on a GRanges and a GRangesList, which does this:

transcripts <- .orderElementsByTranscription(transcripts)
hits <- findOverlaps(x, unlist(transcripts, use.names = FALSE), 
            minoverlap = 1L, type = "within", ignore.strand = ignore.strand)
map <- .mapToTranscripts(x, transcripts, hits, ignore.strand, 
            intronJunctions)

In other words, the transcripts GRangesList gets reordered inside the function, and the transcriptsHits then apply to that reordered GRangesList rather than the one we have in hand.

> intx.ord <- GenomicFeatures:::.orderElementsByTranscription(intx)
> unlist(intx.ord[unique(z$transcriptsHits)])
GRanges object with 65 ranges and 1 metadata column:
         seqnames              ranges strand | unordered
            <Rle>           <IRanges>  <Rle> | <integer>
   20048     chr1 246355095-246417295      - |        11
   20048     chr1 246335475-246355030      - |        10
   20048     chr1 246330538-246335366      - |         9
   20048     chr1 246327338-246330479      - |         8
   20048     chr1 245929938-246327200      - |         7
     ...      ...                 ...    ... .       ...
  263824     chr1 246330538-246335366      - |         3
  263827     chr1 246355095-246483434      - |         4
  263827     chr1 246335475-246355030      - |         3
  263827     chr1 246330538-246335366      - |         2
  263827     chr1 246327338-246330479      - |         1
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

ADD COMMENT • link 13 months ago James W. MacDonald 65k

1

Entering edit mode

This is already a known issue.

ADD REPLY • link 13 months ago James W. MacDonald 65k

0

Entering edit mode

Thank you very much!

I apologize for my ignorance, however I would like to clarify the problem/solution. The problem specifically occurs with how locateVariants() handles introns when the txdb has missing values.

It seems to me that the most straightforward solution might be to remove the entries with missing values from the txdb. I have tried making a GRanges object to use with:

txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
grl <- intronsByTranscript(txdb) # taken from the linked posting
grl_no_zero  <- grl[elementNROWS(grl) > 0] # taken from the linked posting
vcf <- readVcf("CP_E92_vs_E76_DP4_13min_nona_rheMac10_liftover.vcf.txt", genome = "hg38", verbose = TRUE)
seqlevels(vcf) = paste0("chr", seqlevels(vcf))
variants <- locateVariants(vcf, grl_no_zero, AllVariants())

However, this simply generates a new error message:

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘locateVariants’ for signature ‘"GRanges", "CompressedGRangesList", "AllVariants"’

ADD REPLY • link 13 months ago davidhillis • 0

1

Entering edit mode

You cannot use AllVariants for a GRangesList. The GRangesList that you are providing contains only the intronic ranges, whereas AllVariants is meant to locate variants across all possible genomic locations (e.g., coding regions, intergenic regions, promoters, etc). But you can do what AllVariants does, by hand.

coding <- locateVariants(vcf, txdb, CodingVariants())
intron <-  locateVariants(vcf, grl_no_zero, IntronVariants())  ## Note that this one is different!
splice <- locateVariants(vcf, txdb, SpliceSiteVariants())
promoter <- locateVariants(vcf, txdb, PromoterVariants(upstream(promoter(AllVariants())), downstream(promoter(AllVariants()))))
fiveUTR <- locateVariants(vcf, txdb, FiveUTRVariants())
threeUTR <- locateVariants(vcf, txdb, ThreeUTRVariants())
intergenic <- locateVariants(vcf, txdb, IntergenicVariants(upstream(intergenic(AllVariants())), downstream(intergenic(AllVariants()))))
rslt <- c(coding, intron, fiveUTR, threeUTR, splice, promoter, intergenic)
mc <- values(rslt)
rslt <- rslt[order(mc$QUERYID, mc$TXID, mc$GENEID),]

ADD REPLY • link 13 months ago James W. MacDonald 65k

0

Entering edit mode

Thank you so much for your help!

ADD REPLY • link 13 months ago davidhillis • 0

0

Entering edit mode

When I run:

intron <-  locateVariants(vcf, grl_no_zero, IntronVariants())

"intron" has 0 results. I have tried this again specifically adding loci to the vcf that should be recognized as an intron. My best guess is that grl_no_zero needs to identify included genomic ranges as "introns" for IntronVariants() to acknowledge the SNPs in the vcf as introns. However, I am unsure if this is true or how I would go about doing adding these identifications, even if it were true.

ADD REPLY • link 13 months ago davidhillis • 0

1

Entering edit mode

This works for me.

> library(VariantAnnotation)
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> intx <- intronsByTranscript(txdb)
> intx <- intx[elementNROWS(intx) > 0L]
> gr <- GRanges(paste0("chr", rep(20, 5)), IRanges(c(14370, 17330, 1110696, 1230237,1234567), width = c(1,1,1,1,2)), seqinfo = seqinfo(txdb)["chr20"])
> introns <- locateVariants(gr, intx, IntronVariants())
> introns
GRanges object with 6 ranges and 9 metadata columns:
      seqnames          ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID
         <Rle>       <IRanges>  <Rle> | <factor> <integer> <integer> <integer>
  [1]    chr20         1110696      + |   intron     16408     16408         3
  [2]    chr20         1110696      + |   intron     16258     16258         3
  [3]    chr20         1110696      + |   intron     10915     10915         3
  [4]    chr20         1110696      + |   intron     10915     10915         3
  [5]    chr20         1230237      + |   intron     21899     21899         4
  [6]    chr20 1234567-1234568      + |   intron     26229     26230         5
             TXID         CDSID      GENEID       PRECEDEID        FOLLOWID
      <character> <IntegerList> <character> <CharacterList> <CharacterList>
  [1]       70497                      <NA>                                
  [2]       70498                      <NA>                                
  [3]       70499                      <NA>                                
  [4]       70500                      <NA>                                
  [5]       70503                      <NA>                                
  [6]       70503                      <NA>                                
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

## Oops, need to annotate gene IDs too!
> introns$GENEID <- mapIds(txdb, introns$TXID, "GENEID","TXID")
'select()' returned 1:1 mapping between keys and columns
> introns
GRanges object with 6 ranges and 9 metadata columns:
      seqnames          ranges strand | LOCATION  LOCSTART    LOCEND   QUERYID
         <Rle>       <IRanges>  <Rle> | <factor> <integer> <integer> <integer>
  [1]    chr20         1110696      + |   intron     16408     16408         3
  [2]    chr20         1110696      + |   intron     16258     16258         3
  [3]    chr20         1110696      + |   intron     10915     10915         3
  [4]    chr20         1110696      + |   intron     10915     10915         3
  [5]    chr20         1230237      + |   intron     21899     21899         4
  [6]    chr20 1234567-1234568      + |   intron     26229     26230         5
             TXID         CDSID      GENEID       PRECEDEID        FOLLOWID
      <character> <IntegerList> <character> <CharacterList> <CharacterList>
  [1]       70497                      9491                                
  [2]       70498                      9491                                
  [3]       70499                      9491                                
  [4]       70500                      9491                                
  [5]       70503                    642636                                
  [6]       70503                    642636                                
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths