Question

select With Regular Expression

Dario Strbenac ★ 1.5k

The protocadherin family of genes has gene symbols such as PCDHA1, PCDHA2, and PCDHB1. I'd like to get the chromosome, strand, start and end coordinates of every protocadherin gene. The select function has a keys parameter which requires a character vector. Instead of manually finding which symbols have the PCDH prefix

> library(org.Hs.eg.db)
> symbols <- keys(org.Hs.eg.db, "SYMBOL")
> pKeys <- grep("PCDH*", symbols, value = TRUE)
> head(select(org.Hs.eg.db, pKeys, "ENTREZID","SYMBOL"))
'select()' returned 1:1 mapping between keys and columns
   SYMBOL ENTREZID
1   PCDH1     5097
2 PCDHGC3     5098
3   PCDH7     5099
4   PCDH8     5100
5   PCDH9     5101
6 PCDHGB4     8641

is there a way to use regular expressions with select? Once the gene symbols are converted into Entrez IDs, I'll query org.Hs.eg.db for the locations.
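
A side note on the grep call above: grep treats its pattern as a regular expression, not a shell glob. In "PCDH*" the * makes the trailing H optional, so the pattern matches any symbol that contains "PCD" anywhere. A minimal base R sketch of the difference follows; toy_symbols is a made-up vector chosen only to show the effect, not output from org.Hs.eg.db.

```r
# Toy symbol vector for illustration only (not taken from org.Hs.eg.db).
toy_symbols <- c("PCDHA1", "PCDHB1", "PCDH9-AS2", "APCDD1", "XPCDX", "TP53")

# As a regex, "PCDH*" means "PCD" followed by zero or more "H", anywhere in the string.
loose <- grep("PCDH*", toy_symbols, value = TRUE)

# "^PCDH" anchors the match to the start of the symbol.
anchored <- grep("^PCDH", toy_symbols, value = TRUE)

print(loose)
print(anchored)
```

The anchored form keeps only the symbols that begin with PCDH, while the loose form also lets through the two entries that merely contain "PCD".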

annotationdbi Wildcard

updated 7.8 years ago by Johannes Rainer ★ 2.0k • written 7.8 years ago by Dario Strbenac ★ 1.5k

score 2 · Answer 1 · 2016-06-21

> library(Homo.sapiens)
> select(Homo.sapiens, keys(Homo.sapiens, "SYMBOL", pattern = "^PCDH"), c("CDSCHROM","CDSSTART","CDSEND"), "SYMBOL")

Or maybe more usefully, depending on what you are after:

> tx <- transcriptsBy(Homo.sapiens, columns = "SYMBOL")
'select()' returned 1:1 mapping between keys and columns

> z <- mapIds(Homo.sapiens, keys(Homo.sapiens, "SYMBOL", pattern = "^PCDH"), "ENTREZID","SYMBOL")
'select()' returned 1:1 mapping between keys and columns

> tx[names(tx) %in% z]
GRangesList object of length 71:
$100874064
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand |     tx_name          SYMBOL
         <Rle>            <IRanges>  <Rle> | <character> <CharacterList>
  [1]    chr13 [67399301, 67489163]      + |  uc031qmb.1       PCDH9-AS2

$100874086
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand |    tx_name    SYMBOL
  [1]    chr13 [67551521, 67559908]      + | uc031qmc.1 PCDH9-AS3

$26025
GRanges object with 2 ranges and 2 metadata columns:
      seqnames                 ranges strand |    tx_name   SYMBOL
  [1]     chr5 [140810158, 140812789]      + | uc011dba.2 PCDHGA12
  [2]     chr5 [140810158, 140892548]      + | uc003lkt.2 PCDHGA12

...
<68 more elements>
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
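
The pattern argument of keys() used in this answer should work the same way on org.Hs.eg.db, the package from the question, so the symbol-to-Entrez step can be done without a separate grep. A minimal sketch, assuming org.Hs.eg.db is installed; select_by_pattern is a hypothetical convenience wrapper written for this note, not a function from AnnotationDbi.

```r
# Hypothetical helper (not part of AnnotationDbi): choose keys by regular
# expression with keys(pattern = ), then pass them on to select().
select_by_pattern <- function(db, pattern, columns, keytype = "SYMBOL") {
  matched <- AnnotationDbi::keys(db, keytype = keytype, pattern = pattern)
  AnnotationDbi::select(db, keys = matched, columns = columns, keytype = keytype)
}

# Only run the query when the annotation package is available.
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  suppressPackageStartupMessages(library(org.Hs.eg.db))
  res <- select_by_pattern(org.Hs.eg.db, "^PCDH", "ENTREZID")
  print(head(res))
}
```

The wrapper only bundles the two calls shown in the answer; calling keys() and select() directly is equivalent.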

score 1 · Answer 2 · 2016-06-21

You can search by pattern using ensembldb filters on EnsDb objects:

## Load the human annotations for Ensembl 75
> library(EnsDb.Hsapiens.v75)
> edb <- EnsDb.Hsapiens.v75

## Use a GenenameFilter with a pattern (this has to be an SQL LIKE pattern, so use % rather than * as the wildcard)
> Res <- select(edb, keys=GenenameFilter("PCDH%", condition="like"))

> unique(Res$GENENAME)
 [1] "PCDHB4"    "PCDHA6"    "PCDHGA2"   "PCDH11Y"   "PCDH11X"   "PCDHB2"   
 [7] "PCDHB3"    "PCDHB5"    "PCDHB6"    "PCDHB7"    "PCDHB15"   "PCDH12"   
[13] "PCDH17"    "PCDHB8"    "PCDHB10"   "PCDHB14"   "PCDHB12"   "PCDH8"    
[19] "PCDH10"    "PCDHB18"   "PCDH15"    "PCDH1"     "PCDH19"    "PCDH7"    
[25] "PCDHB1"    "PCDHB9"    "PCDH9"     "PCDHB13"   "PCDH18"    "PCDHB16"  
[31] "PCDHB11"   "PCDH20"    "PCDHGA1"   "PCDHA9"    "PCDHA8"    "PCDHA7"   
[37] "PCDHA5"    "PCDHA4"    "PCDHA2"    "PCDHA1"    "PCDH9-AS3" "PCDH8P1"  
[43] "PCDH9-AS2" "PCDH9-AS4" "PCDH9-AS1" "PCDHA13"   "PCDHGC3"   "PCDHGC5"  
[49] "PCDHGC4"   "PCDHAC2"   "PCDHAC1"   "PCDHGB8P"  "PCDHA11"   "PCDHA14"  
[55] "PCDHA10"   "PCDHA12"   "PCDHGA12"  "PCDHGB6"   "PCDHGA5"   "PCDHGA7"  
[61] "PCDHGA6"   "PCDHGA8"   "PCDHGA10"  "PCDHGA11"  "PCDHGB2"   "PCDHGB4"  
[67] "PCDHGB7"   "PCDHGB1"   "PCDHGA3"   "PCDHA3"    "PCDHB17"   "PCDHGA9"  
[73] "PCDHB19P"  "PCDHGB3"   "PCDHGA4"  


## Alternatively, just use the genes method:
> genes(edb, filter=GenenameFilter("PCDH%", condition="like"))
GRanges object with 109 ranges and 5 metadata columns:
                  seqnames                 ranges strand |         gene_id
                     <Rle>              <IRanges>  <Rle> |     <character>
  ENSG00000099715        Y   [ 4868267,  5610265]      + | ENSG00000099715
  ENSG00000169851        4   [30722037, 31148422]      + | ENSG00000169851
  ENSG00000136099       13   [53418109, 53422775]      - | ENSG00000136099
              ...      ...                    ...    ... .             ...
  ENSG00000240764        5 [140868808, 140892546]      + | ENSG00000240764
  ENSG00000156453        5 [141232938, 141258811]      - | ENSG00000156453
  ENSG00000113555        5 [141323150, 141349304]      - | ENSG00000113555
                    gene_name    entrezid   gene_biotype seq_coord_system
                  <character> <character>    <character>      <character>
  ENSG00000099715     PCDH11Y 83259;27328 protein_coding       chromosome
  ENSG00000169851       PCDH7        5099 protein_coding       chromosome
  ENSG00000136099       PCDH8        5100 protein_coding       chromosome
              ...         ...         ...            ...              ...
  ENSG00000240764     PCDHGC5  5098;56097 protein_coding       chromosome
  ENSG00000156453       PCDH1        5097 protein_coding       chromosome
  ENSG00000113555      PCDH12       51294 protein_coding       chromosome
  -------
  seqinfo: 7 sequences from GRCh37 genome

As you can see above, Ensembl 75 is based on the "old" GRCh37 genome release; you might want to use a more recent annotation, which is easy to get, e.g. using AnnotationHub (check the ensembldb vignette).
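
The answers in this thread use different pattern dialects for the same idea: a regular expression ("^PCDH") for keys(), an SQL LIKE pattern ("PCDH%") for the filter, and the glob "PCDH*" that is often the first thing people type. A small base R sketch of converting between them follows. like_to_regex is a hypothetical helper written for this note; it handles only the % and _ wildcards, assumes the pattern contains no other regex metacharacters, and ignores any case-sensitivity differences between the database and grep.

```r
# utils::glob2rx() converts a shell-style glob into a regular expression.
glob_rx <- utils::glob2rx("PCDH*")

# Hypothetical helper: translate a simple SQL LIKE pattern into an anchored regex.
# Assumes no regex metacharacters other than the LIKE wildcards % and _.
like_to_regex <- function(like) {
  rx <- gsub("%", ".*", like, fixed = TRUE)  # % matches any run of characters
  rx <- gsub("_", ".", rx, fixed = TRUE)     # _ matches exactly one character
  paste0("^", rx, "$")
}

# Toy symbol vector for illustration only.
toy_symbols <- c("PCDHA1", "PCDH9-AS2", "APCDD1", "TP53")

from_glob <- grep(glob_rx, toy_symbols, value = TRUE)
from_like <- grep(like_to_regex("PCDH%"), toy_symbols, value = TRUE)

print(glob_rx)
print(from_glob)
print(from_like)
```

Both conversions select the same two PCDH symbols from the toy vector, which is the behaviour the LIKE filter and the anchored regex are meant to share.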

cheers, jo