Question

Full Human gene TSS


karambe.a • 0

@karambea-18011


Hello,

I am trying to get a full list of human gene names with their transcriptional start sites.

Is there a ready-made list available anywhere, or is there a way to get it through R packages?

Thank you

HumanGene h19 • 786 views

updated 6.0 years ago by James W. MacDonald • written 6.0 years ago by karambe.a

score 0 · Answer 1 · 2018-10-25

You will have to define what you mean by 'human gene names' and 'transcriptional start site'. What is and isn't a gene depends on which annotation source you prefer (NCBI, GENCODE, EBI/EMBL). And a transcriptional start site isn't really gene-specific, it's transcript-specific: many genes have multiple transcripts, and the TSSs for those transcripts aren't necessarily the same.

If you like NCBI's genes, and you think of the HGNC symbols as the 'human gene names', then you could use the Homo.sapiens package:

> library(Homo.sapiens)
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
## update to use GRCh38, because it's like 2018 already
> TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.knownGene

> zz <- transcriptsBy(Homo.sapiens, "gene", columns = "SYMBOL")
'select()' returned many:many mapping between keys and columns

And then, for example, the first list element (A1BG) has 8 transcripts and 8 distinct TSSs:

> zz[[1]]
GRanges object with 8 ranges and 2 metadata columns:
      seqnames            ranges strand |     tx_name          SYMBOL
         <Rle>         <IRanges>  <Rle> | <character> <CharacterList>
  [1]    chr19 58345178-58347634      - |  uc061drj.1            A1BG
  [2]    chr19 58346850-58353499      - |  uc002qsd.5            A1BG
  [3]    chr19 58346854-58356225      - |  uc061drk.1            A1BG
  [4]    chr19 58346858-58353491      - |  uc061drl.1            A1BG
  [5]    chr19 58346860-58347657      - |  uc061drm.1            A1BG
  [6]    chr19 58348466-58362751      - |  uc061drs.1            A1BG
  [7]    chr19 58350594-58353129      - |  uc061drt.1            A1BG
  [8]    chr19 58353021-58356083      - |  uc061drv.1            A1BG
  -------
  seqinfo: 455 sequences (1 circular) from hg38 genome
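To get a feel for the per-gene structure without loading the full annotation, here is a minimal sketch on toy data. The coordinates, the `gene_id` values, and the `tx_name` values are made up for illustration; only the GenomicRanges functions are real. It mimics what `transcriptsBy()` returns (a list of transcript ranges split by gene) and counts transcripts per gene with `lengths()`.

```r
library(GenomicRanges)

# Hypothetical transcripts: two for "geneA", one for "geneB"
tx <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(100, 150, 500), end = c(300, 400, 900)),
  strand   = c("+", "+", "-"),
  tx_name  = c("txA1", "txA2", "txB1"),
  gene_id  = c("geneA", "geneA", "geneB")
)

# Split into a GRangesList with one element per gene,
# the same shape as the 'zz' object above
by_gene <- split(tx, tx$gene_id)

# Number of transcripts (and so candidate TSSs) per gene
n_tx <- lengths(by_gene)
n_tx
```

On the real object, `lengths(zz)` should give the same kind of per-gene transcript count.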

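If you just want the starts, you could resize every transcript to width 1. resize() is strand-aware, so for minus-strand transcripts it keeps the end coordinate, which is the 5' end: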

> resize(unlist(zz), width = 1)
GRanges object with 164238 ranges and 2 metadata columns:
       seqnames    ranges strand |     tx_name          SYMBOL
          <Rle> <IRanges>  <Rle> | <character> <CharacterList>
     1    chr19  58347634      - |  uc061drj.1            A1BG
     1    chr19  58353499      - |  uc002qsd.5            A1BG
     1    chr19  58356225      - |  uc061drk.1            A1BG
     1    chr19  58353491      - |  uc061drl.1            A1BG
     1    chr19  58347657      - |  uc061drm.1            A1BG
   ...      ...       ...    ... .         ...             ...
  9997    chr22  50526145      - |  uc021wrz.2            SCO2
  9997    chr22  50526439      - |  uc021wsa.2            SCO2
  9997    chr22  50525604      - |  uc003bma.4            SCO2
  9997    chr22  50526145      - |  uc062fms.1            SCO2
  9997    chr22  50526439      - |  uc062fmt.1            SCO2
  -------
  seqinfo: 455 sequences (1 circular) from hg38 genome
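The strand handling of `resize()` is easy to check on a toy example. This is a minimal sketch with made-up coordinates and hypothetical transcript names (`txPlus`, `txMinus`), not real annotation. It also shows one way to flatten the result into a plain table, which may be closer to the 'list' asked for in the question.

```r
library(GenomicRanges)

# Two hypothetical transcripts, one on each strand
gr <- GRanges(
  seqnames = c("chr1", "chr1"),
  ranges   = IRanges(start = c(100, 500), end = c(200, 600)),
  strand   = c("+", "-"),
  tx_name  = c("txPlus", "txMinus")
)

# The default fix = "start" anchors at the 5' end, taking strand into account:
# the plus-strand range keeps its start, the minus-strand range keeps its end
tss <- resize(gr, width = 1)

# Flatten to an ordinary data.frame with one row per transcript
tss_table <- data.frame(
  tx_name = tss$tx_name,
  chr     = as.character(seqnames(tss)),
  tss     = start(tss),
  strand  = as.character(strand(tss))
)
tss_table
```

Here the plus-strand transcript gets TSS 100 and the minus-strand transcript gets TSS 600. The same `data.frame()` call should work on `resize(unlist(zz), width = 1)`, though the SYMBOL column there is a CharacterList and would need converting (for example with `unlist()` or `as.character()`) before it fits in a plain table.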