Genes for a given chromosome
1
0
Entering edit mode
Gamal • 0
@5adcc264
Last seen 14 months ago
United States

Dear all,

First of all. I am new to bioinformatics. I am trying to answer a question from an assignment regarding promoters from Chr19. I am using the following: txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene genes_txdb <- genes(txdb) however, the genes are listed by id. I want to add the gene names to my GRanges object. Is there a way to map the gene id to the gene name?

Thank you

GRangesobjectmodification • 870 views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 10 hours ago
United States

If you want to do that sort of thing, you should use the Homo.sapiens package.

> library(Homo.sapiens)
## update the TxDb, using the refGene version because knownGene is Ensembl based
> library(TxDb.Hsapiens.UCSC.hg38.refGene)
> TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.refGene
> genes(Homo.sapiens, columns = "SYMBOL")
  2136 genes were dropped because they have exons located on both strands
  of the same reference sequence or on more than one reference sequence,
  so cannot be represented by a single genomic range.
  Use 'single.strand.genes.only=FALSE' to get all the genes in a
  GRangesList object, or use suppressMessages() to suppress this message.
'select()' returned 1:1 mapping between keys and columns
GRanges object with 26171 ranges and 1 metadata column:
            seqnames              ranges strand |          SYMBOL
               <Rle>           <IRanges>  <Rle> | <CharacterList>
          1    chr19   58345183-58353492      - |            A1BG
         10     chr8   18391282-18401215      + |            NAT2
        100    chr20   44619519-44651758      - |             ADA
       1000    chr18   27950966-28177130      - |            CDH2
  100008586     chrX   49341114-49577757      + |         GAGE12F
        ...      ...                 ...    ... .             ...
       9990    chr15   34229784-34338057      - |         SLC12A6
       9991     chr9 112217715-112333664      - |           PTBP3
       9992    chr21   34364006-34371381      + |           KCNE2
       9993    chr22   19036282-19122454      - |           DGCR2
       9997    chr22   50523568-50526439      - |            SCO2
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

But do note that genes don't have promoters. In fact, genes aren't really a thing, as they are just a way to collapse all the transcripts (which do have promoters) down to a single item. You will note that there is a warning about 2136 genes getting dropped. That's because a gene (for Bioconductor at least) is the genomic distance between the farthest 3' and 5' positions for any transcript. And some transcripts (the PAR genes being notable examples) are found on more than one chromosome, so it's hard to give a single gene boundary for such a thing. But you can get transcripts!

>  z <- transcriptsBy(Homo.sapiens, columns = "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> z
GRangesList object of length 28307:
$`1`
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |     tx_name          SYMBOL
         <Rle>         <IRanges>  <Rle> | <character> <CharacterList>
  [1]    chr19 58345183-58353492      - |   NM_130786            A1BG
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

$`10`
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |     tx_name          SYMBOL
         <Rle>         <IRanges>  <Rle> | <character> <CharacterList>
  [1]     chr8 18391282-18401215      + |   NM_000015            NAT2
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

$`100`
GRanges object with 4 ranges and 2 metadata columns:
      seqnames            ranges strand |      tx_name          SYMBOL
         <Rle>         <IRanges>  <Rle> |  <character> <CharacterList>
  [1]    chr20 44619519-44651758      - | NM_001322050             ADA
  [2]    chr20 44619519-44651758      - | NM_001322051             ADA
  [3]    chr20 44619519-44651758      - |    NR_136160             ADA
  [4]    chr20 44619522-44651699      - |    NM_000022             ADA
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

...
<28304 more elements>

## we can collapse to a GRanges object
> unlist(z)
GRanges object with 88819 ranges and 2 metadata columns:
                  seqnames            ranges strand |      tx_name
                     <Rle>         <IRanges>  <Rle> |  <character>
     1               chr19 58345183-58353492      - |    NM_130786
    10                chr8 18391282-18401215      + |    NM_000015
   100               chr20 44619519-44651758      - | NM_001322050
   100               chr20 44619519-44651758      - | NM_001322051
   100               chr20 44619519-44651758      - |    NR_136160
   ...                 ...               ...    ... .          ...
  9994 chr6_KV766194v1_fix       25225-72124      + |    NM_012115
  9997               chr22 50523568-50525598      - |    NM_005138
  9997               chr22 50523568-50525604      - | NM_001169111
  9997               chr22 50523568-50526145      - | NM_001169110
  9997               chr22 50523568-50526439      - | NM_001169109
                SYMBOL
       <CharacterList>
     1            A1BG
    10            NAT2
   100             ADA
   100             ADA
   100             ADA
   ...             ...
  9994        CASP8AP2
  9997            SCO2
  9997            SCO2
  9997            SCO2
  9997            SCO2
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

## or just chr19
> zz <- keepSeqlevels(z, "chr19", pruning.mode = "coarse")
> unlist(zz)
GRanges object with 4333 ranges and 2 metadata columns:
            seqnames            ranges strand |      tx_name          SYMBOL
               <Rle>         <IRanges>  <Rle> |  <character> <CharacterList>
          1    chr19 58345183-58353492      - |    NM_130786            A1BG
  100049587    chr19 51642561-51646819      - | NM_001098612        SIGLEC14
  100073347    chr19 56840902-56848556      + |    NR_024059           MIMT1
  100101266    chr19 24162193-24163447      - |    NR_003603        HAVCR1P1
  100113382    chr19 10109749-10109840      + |    NR_003688       SNORD105B
        ...      ...               ...    ... .          ...             ...
        976    chr19 14381144-14408725      + |    NM_001784          ADGRE5
        976    chr19 14381444-14408723      + |    NM_078481          ADGRE5
       9817    chr19 10486125-10502771      - |    NM_012289           KEAP1
       9817    chr19 10486125-10503356      - |    NM_203500           KEAP1
        997    chr19     531760-542087      + |    NM_004359           CDC34
  -------
  seqinfo: 1 sequence from hg38 genome
>
ADD COMMENT
0
Entering edit mode

Dear James,

Thank you very much for your help. It is very much appreciated it. Indeed, my previous concept of genes was very rudimentary. As you indicated, genes are not physical entities; they are functional transcriptional conceptual entities.

Let me present you with my code instructions following your advice. I have two questions. The first question is regarding options (1) and (2) from the code below. What is the difference between (1) tscripts_ch19 <- keepSeqlevels(tscripts, "chr19", pruning.mode = "coarse") # option 1, and (2) tscripts_ch19 <- tscripts[seqnames(tscripts) == "chr19"]. # option 2

Using option (1), I get a total of 4333 transcripts; however, when I use option (2), I get a total of 4664 transcripts.

library("Homo.sapiens")
library("TxDb.Hsapiens.UCSC.hg38.refGene")
library("seqinr")
TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.refGene
genes(Homo.sapiens, columns = "SYMBOL")
tscripts <- transcriptsBy(Homo.sapiens, columns = "SYMBOL")
unlist(tscripts)
tscripts_ch19 <- keepSeqlevels(tscripts, "chr19", pruning.mode = "coarse") # option 1
tscripts_ch19 <- tscripts[seqnames(tscripts) == "chr19"].                               # option 2
unlist(tscripts_ch19)
promoters_ch19 <- promoters(tscripts_ch19)
unlist(promoters_ch19)
promoters_ch19
promoter_ch19_seq <- getSeq(Hsapiens,promoters_ch19)
unlist(promoter_ch19_seq)

The second question is regarding the function of the routine "promoters." The sequence length of all promoters is 2200; I am confused and I may be missing the concept of a promoter. Are these sequences the promoters, or the promoters' sequences are embedded within the 2200 sequence?

Your help is again very much appreciated it.

GA

ADD REPLY
0
Entering edit mode

I think what you have done with your alternative subsetting is equivalent to pruning.mode = "fine", except it doesn't correctly set the seqlevels. If you use coarse, then it removes all list items that have non-chr19 seqlevels. And since some of the genes are on alternative haplotypes, you lose those. As an example

> tx <- transcriptsBy(Homo.sapiens, columns = "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> tx1 <- keepSeqlevels(tx, "chr19", pruning.mode = "coarse")
> tx2 <- tx[seqnames(tx) == "chr19"]
## note that by subsetting this way, you retain many zero-length ranges
> table(lengths(tx2))

    0     1     2     3     4     5     6     7     8     9    10    11    12 
26515   890   366   192   122    74    37    21    18    12    23     4     7 
   13    14    15    16    17    19    20    21    26    29    34    41    51 
    1     2     3     3     3     3     1     1     2     2     2     1     1 
   76 
    1 
## having 26,615 empty ranges laying around isn't great, so let's remove them
> tx3 <- tx2[lengths(tx2) > 0L]
## now get the GRanges that are in tx3, but not in tx1
> tx[names(tx3[!tx3 %over% tx1])]
GRangesList object of length 117:
$`100132285`
GRanges object with 260 ranges and 2 metadata columns:
                    seqnames            ranges strand |      tx_name
                       <Rle>         <IRanges>  <Rle> |  <character>
    [1]                chr19 54738511-54848566      + | NM_001291695
    [2]                chr19 54738511-54848566      + | NM_001291696
    [3]                chr19 54738511-54848566      + | NM_001291700
    [4]                chr19 54738511-54848566      + | NM_001291701
    [5]                chr19 54738511-54848566      + |    NM_012312
    ...                  ...               ...    ... .          ...
  [256] chr19_KV575260v1_alt       40116-54428      + | NM_001291695  <--- These are not on chr19, so pruning.mode = coarse 
  [257] chr19_KV575260v1_alt       40116-54428      + | NM_001291696       removes this entire GRangesList item
  [258] chr19_KV575260v1_alt       40116-54428      + | NM_001291700
  [259] chr19_KV575260v1_alt       40116-54428      + | NM_001291701
  [260] chr19_KV575260v1_alt       40116-54428      + |    NM_012312
                 SYMBOL
        <CharacterList>
    [1]         KIR2DS2
    [2]         KIR2DS2
    [3]         KIR2DS2
    [4]         KIR2DS2
    [5]         KIR2DS2
    ...             ...
  [256]         KIR2DS2
  [257]         KIR2DS2
  [258]         KIR2DS2
  [259]         KIR2DS2
  [260]         KIR2DS2
  -------
  seqinfo: 640 sequences (1 circular) from hg38 genome

...
<116 more elements>

As for the promoters, you are using the defaults, which are described in the help page.

 ## S4 method for signature 'GenomicRanges'
     promoters(x, upstream=2000, downstream=200, use.names=TRUE)

So you are by default saying the promoters are in the range that extends upstream 2000 bases from the transcript start site (TSS), and downstream 200 bases from the TSS.

ADD REPLY

Login before adding your answer.

Traffic: 494 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6