Question

Genes for a given chromosome

0

Entering edit mode

Gamal • 0

@5adcc264

Last seen 14 months ago

United States

Dear all,

First of all. I am new to bioinformatics. I am trying to answer a question from an assignment regarding promoters from Chr19. I am using the following: txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene genes_txdb <- genes(txdb) however, the genes are listed by id. I want to add the gene names to my GRanges object. Is there a way to map the gene id to the gene name?

Thank you

GRangesobjectmodification • 870 views

ADD COMMENT • link updated 15 months ago by James W. MacDonald 65k • written 15 months ago by Gamal • 0

score 0 · Answer 1 · 2023-01-20

If you want to do that sort of thing, you should use the Homo.sapiens package.

> library(Homo.sapiens)
## update the TxDb, using the refGene version because knownGene is Ensembl based
> library(TxDb.Hsapiens.UCSC.hg38.refGene)
> TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.refGene
> genes(Homo.sapiens, columns = "SYMBOL")
  2136 genes were dropped because they have exons located on both strands
  of the same reference sequence or on more than one reference sequence,
  so cannot be represented by a single genomic range.
  Use 'single.strand.genes.only=FALSE' to get all the genes in a
  GRangesList object, or use suppressMessages() to suppress this message.
'select()' returned 1:1 mapping between keys and columns
GRanges object with 26171 ranges and 1 metadata column:
            seqnames              ranges strand |          SYMBOL
               <Rle>           <IRanges>  <Rle> | <CharacterList>
          1    chr19   58345183-58353492      - |            A1BG
         10     chr8   18391282-18401215      + |            NAT2
        100    chr20   44619519-44651758      - |             ADA
       1000    chr18   27950966-28177130      - |            CDH2
  100008586     chrX   49341114-49577757      + |         GAGE12F
        ...      ...                 ...    ... .             ...
       9990    chr15   34229784-34338057      - |         SLC12A6
       9991     chr9 112217715-112333664      - |           PTBP3
       9992    chr21   34364006-34371381      + |           KCNE2
       9993    chr22   19036282-19122454      - |           DGCR2
       9997    chr22   50523568-50526439      - |            SCO2
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

But do note that genes don't have promoters. In fact, genes aren't really a thing, as they are just a way to collapse all the transcripts (which do have promoters) down to a single item. You will note that there is a warning about 2136 genes getting dropped. That's because a gene (for Bioconductor at least) is the genomic distance between the farthest 3' and 5' positions for any transcript. And some transcripts (the PAR genes being notable examples) are found on more than one chromosome, so it's hard to give a single gene boundary for such a thing. But you can get transcripts!

>  z <- transcriptsBy(Homo.sapiens, columns = "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> z
GRangesList object of length 28307:
$`1`
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |     tx_name          SYMBOL
         <Rle>         <IRanges>  <Rle> | <character> <CharacterList>
  [1]    chr19 58345183-58353492      - |   NM_130786            A1BG
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

$`10`
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |     tx_name          SYMBOL
         <Rle>         <IRanges>  <Rle> | <character> <CharacterList>
  [1]     chr8 18391282-18401215      + |   NM_000015            NAT2
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

$`100`
GRanges object with 4 ranges and 2 metadata columns:
      seqnames            ranges strand |      tx_name          SYMBOL
         <Rle>         <IRanges>  <Rle> |  <character> <CharacterList>
  [1]    chr20 44619519-44651758      - | NM_001322050             ADA
  [2]    chr20 44619519-44651758      - | NM_001322051             ADA
  [3]    chr20 44619519-44651758      - |    NR_136160             ADA
  [4]    chr20 44619522-44651699      - |    NM_000022             ADA
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

...
<28304 more elements>

## we can collapse to a GRanges object
> unlist(z)
GRanges object with 88819 ranges and 2 metadata columns:
                  seqnames            ranges strand |      tx_name
                     <Rle>         <IRanges>  <Rle> |  <character>
     1               chr19 58345183-58353492      - |    NM_130786
    10                chr8 18391282-18401215      + |    NM_000015
   100               chr20 44619519-44651758      - | NM_001322050
   100               chr20 44619519-44651758      - | NM_001322051
   100               chr20 44619519-44651758      - |    NR_136160
   ...                 ...               ...    ... .          ...
  9994 chr6_KV766194v1_fix       25225-72124      + |    NM_012115
  9997               chr22 50523568-50525598      - |    NM_005138
  9997               chr22 50523568-50525604      - | NM_001169111
  9997               chr22 50523568-50526145      - | NM_001169110
  9997               chr22 50523568-50526439      - | NM_001169109
                SYMBOL
       <CharacterList>
     1            A1BG
    10            NAT2
   100             ADA
   100             ADA
   100             ADA
   ...             ...
  9994        CASP8AP2
  9997            SCO2
  9997            SCO2
  9997            SCO2
  9997            SCO2
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

## or just chr19
> zz <- keepSeqlevels(z, "chr19", pruning.mode = "coarse")
> unlist(zz)
GRanges object with 4333 ranges and 2 metadata columns:
            seqnames            ranges strand |      tx_name          SYMBOL
               <Rle>         <IRanges>  <Rle> |  <character> <CharacterList>
          1    chr19 58345183-58353492      - |    NM_130786            A1BG
  100049587    chr19 51642561-51646819      - | NM_001098612        SIGLEC14
  100073347    chr19 56840902-56848556      + |    NR_024059           MIMT1
  100101266    chr19 24162193-24163447      - |    NR_003603        HAVCR1P1
  100113382    chr19 10109749-10109840      + |    NR_003688       SNORD105B
        ...      ...               ...    ... .          ...             ...
        976    chr19 14381144-14408725      + |    NM_001784          ADGRE5
        976    chr19 14381444-14408723      + |    NM_078481          ADGRE5
       9817    chr19 10486125-10502771      - |    NM_012289           KEAP1
       9817    chr19 10486125-10503356      - |    NM_203500           KEAP1
        997    chr19     531760-542087      + |    NM_004359           CDC34
  -------
  seqinfo: 1 sequence from hg38 genome
>