Question

Homo.sapiens package: Some genes are not found

1

Entering edit mode

Vang Le ▴ 80

@vang-le-6690

Last seen 4.0 years ago

Denmark

I have a long list of about 700 genes. Some of them here:

ABCF1
ABHD16A
ACYP1
ADAMTSL4-AS1
AGER
AGPAT1
AIF1
AKAP17A
ALPI
AMY1A
ANKRD20A1
ANKRD20A3
ANKRD42
ANKRD60
APOM
ARHGAP27
ARL17A
ARPP19
ASMT
ASMTL
ATAT1
ATF6B
ATP6V1G2-DDX39B
AUNIP
AZIN1-AS1
B3GALT4
BAG6
BECN2
BLACE
BMS1P18
BPY2
BRD2
BTNL10
BTNL2
C10orf11
C10orf126
C11orf72
C12orf43
C12orf66
C16orf47
C17orf100
C19orf68
C1orf137
C1orf167
C1QTNF3-AMACR
C2
C20orf62
C22orf34
C3orf56
C4A

From the following snippet, I could see that at least IL9R and VAMP7 genes are not included.

library(Homo.sapiens)
library(GenomicRanges)
g = genes(Homo.sapiens, column = "SYMBOL")
s = mcols(g)[["SYMBOL"]]

> grep("VAM", s , value = T)
  10791    6843    6844    8673    8674    9341
"VAMP5" "VAMP1" "VAMP2" "VAMP8" "VAMP4" "VAMP3"
> grep("IL9R", s , value = T)
named character(0)

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8   
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] Homo.sapiens_1.3.1                      TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 org.Hs.eg.db_3.5.0                     
 [4] GO.db_3.5.0                             OrganismDbi_1.20.0                      GenomicFeatures_1.30.3                 
 [7] GenomicRanges_1.30.3                    GenomeInfoDb_1.14.0                     AnnotationDbi_1.40.0                   
[10] IRanges_2.12.0                          S4Vectors_0.16.0                        Biobase_2.38.0                         
[13] BiocGenerics_0.24.0                     GenABEL_1.8-0                           GenABEL.data_1.0.0                     
[16] MASS_7.3-50                            

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.8.1 progress_1.2.0             lattice_0.20-35            rtracklayer_1.38.3         blob_1.1.1                
 [6] XML_3.98-1.11              RBGL_1.54.0                rlang_0.2.1                DBI_1.0.0                  BiocParallel_1.12.0       
[11] bit64_0.9-7                matrixStats_0.53.1         GenomeInfoDbData_1.0.0     stringr_1.3.1              zlibbioc_1.24.0           
[16] Biostrings_2.46.0          memoise_1.1.0              biomaRt_2.34.2             BiocInstaller_1.28.0       Rcpp_0.12.17              
[21] DelayedArray_0.4.1         graph_1.56.0               XVector_0.18.0             bit_1.1-14                 Rsamtools_1.30.0          
[26] RMySQL_0.10.15             hms_0.4.2                  digest_0.6.15              stringi_1.2.3              grid_3.4.4                
[31] tools_3.4.4                bitops_1.0-6               magrittr_1.5               RCurl_1.95-4.10            RSQLite_2.1.1             
[36] crayon_1.3.4               pkgconfig_2.0.1            Matrix_1.2-14              prettyunits_1.0.2          assertthat_0.2.0          
[41] httr_1.3.1                 rstudioapi_0.7             R6_2.2.2                   GenomicAlignments_1.14.2   compiler_3.4.4

Update with additional information.

I am working with this file S30409818_Covered.bed from Agilent's Clinical Research Exome V2. According to their support team, the BED file was construct from several sources: RefSeq, Ensembl, CCDS, Genecode, VEGA, dbSNP, CytoBand. The "Name" column of the file is quite messy (see first 2000 lines here: https://pastebin.com/6B7GdE3D).

Expected final result:

What I want is to clean up the names keep only one gene name per row.

In the final results, If an interval can be mapped to RefSeq with a HGNC symbol, use that symbol for name, and retrieve strand information for it. Else if it can be mapped to Ensembl, use Ensembl's Name. Or else take anything else that can be mapped to.

If an interval is a part of multiple genes, make multiple rows in the BED file with corresponding names. For example, below two intervals are repeated twice because they belong to 4 different genes.

chr1    13115348        13115613        Gene1 0       -

chr1    13115348        13115613        Gene2 0       -

chr1    13123587        13123707        Gene3 0       -

chr1    13123587        13123707        Gene4 0       +

If an interval is an intronic one, leave as is can copy over the final result.

I already did mapping to TxDb.UCSC.hg19.knownGene, but still there are many invervals that can't be mapped. Therefore I take an intermediate solution to copy those "NoHits" intervals over and clean up their names. The final results looks like the following. But I still want to be able to map all the non-intronic reasons and give them proper names and strands. My current R code is here: . It is best if NoHit in mapBedtoGene function in the gist is empty (i.e. all interval are found).

chr1    13115348        13115613        PRAMEF5 0       -
chr1    13117509        13117772        PRAMEF5 0       -
chr1    13123587        13123707        ENST00000512015 0       .
chr1    13125298        13125455        ENST00000512015 0       .
chr1    13140477        13140687        ENST00000415919 0       .
chr1    13141134        13141791        ENST00000415919 0       .
chr1    13144412        13145014        ENST00000415919 0       .
chr1    13145171        13145389        -       0       .
chr1    13172307        13172427        ENST00000458709 0       .
chr1    13182988        13183108        HNRNPCL2        0       -
chr1    13183113        13183236        HNRNPCL2        0       -
chr1    13183403        13183567        HNRNPCL2        0       -
chr1    13183590        13183768        HNRNPCL2        0       -
chr1    13183793        13183943        HNRNPCL2        0       -
chr1    13183997        13184220        HNRNPCL2        0       -
chr1    13216298        13216516        ENST00000423177 0       .
chr1    13216673        13217275        ENST00000423177 0       .
chr1    13219028        13219665        ENST00000423177 0       .
chr1    13328215        13328346        .       0       -
chr1    13328812        13329471        .       0       -
chr1    13330329        13330986        .       0       -
chr1    13331433        13331643        .       0       -
chr1    13352112        13352306        LOC391003       0       .
chr1    13353831        13353981        LOC391003       0       .
chr1    13359797        13360060        PRAMEF5 0       +
chr1    13361957        13362222        PRAMEF5 0       +

homo.sapiens • 1.0k views

ADD COMMENT • link 5.7 years ago Vang Le ▴ 80

score 1 · Answer 1 · 2018-08-08

Most of these genes are likely to be either on multiple chromosomes or on different strands. For example

> select(Homo.sapiens, "VAMP7", c("TXCHROM","TXSTART","TXEND"), "SYMBOL")
'select()' returned 1:many mapping between keys and columns
  SYMBOL TXCHROM   TXSTART     TXEND
1  VAMP7    chrX 155110943 155173433
2  VAMP7    chrY  59213949  59276439

You can't actually get genes that are on multiple chromosomes (or different strands of the same chromosome) using the genes function on a MultiDb object. If you call genes on a TxDb object you can set the 'single.strand.genes.only' argument to FALSE, in which case you would get VAMP7 You could hypothetically cob a GRanges together from a couple of select calls.

> uh <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene, single.strand.genes.only = FALSE)
> uh <- unlist(uh)
> mcols(uh) <- mapIds(org.Hs.eg.db, names(uh), "SYMBOL", "ENTREZID")
'select()' returned 1:1 mapping between keys and columns
> uh
GRanges object with 24826 ranges and 1 metadata column:
        seqnames              ranges strand |           X
           <Rle>           <IRanges>  <Rle> | <character>
      1    chr19   58858172-58874214      - |        A1BG
     10     chr8   18248755-18258723      + |        NAT2
    100    chr20   43248163-43280376      - |         ADA
   1000    chr18   25530930-25757445      - |        CDH2
  10000     chr1 243651535-244006886      - |        AKT3
    ...      ...                 ...    ... .         ...
   9991     chr9 114979995-115095944      - |       PTBP3
   9992    chr21   35736323-35743440      + |       KCNE2
   9993    chr22   19023795-19109967      - |       DGCR2
   9994     chr6   90539619-90584155      + |    CASP8AP2
   9997    chr22   50961997-50964905      - |        SCO2
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

> uh[grep("VAMP7", mcols(uh)$X),]
GRanges object with 2 ranges and 1 metadata column:
       seqnames              ranges strand |           X
          <Rle>           <IRanges>  <Rle> | <character>
  6845     chrX 155110943-155173433      + |       VAMP7
  6845     chrY   59213949-59276439      + |       VAMP7
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome