I don't think the CDS start can be taken as TSS in general, only when the transcript has no 5' UTR.
Here is how you can extract the TSS for all transcripts in the UCSC knownGene track for hg38:
1. Extract all transcripts from the UCSC knownGene track for hg38:
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
tx <- transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene, columns=c("tx_name", "gene_id"))
tx
# GRanges object with 247541 ranges and 2 metadata columns:
# seqnames ranges strand | tx_name gene_id
# <Rle> <IRanges> <Rle> | <character> <CharacterList>
# [1] chr1 11869-14409 + | ENST00000456328.2 100287102
# [2] chr1 12010-13670 + | ENST00000450305.2 100287102
# [3] chr1 29554-31097 + | ENST00000473358.1
# [4] chr1 30267-31109 + | ENST00000469289.1
# [5] chr1 30366-30503 + | ENST00000607096.1 100302278
# ... ... ... ... . ... ...
# [247537] chrUn_GL000220v1 155997-156149 + | ENST00000619779.1 109864274
# [247538] chrUn_KI270442v1 380608-380726 + | ENST00000620265.1
# [247539] chrUn_KI270442v1 217250-217401 - | ENST00000611690.1
# [247540] chrUn_KI270744v1 51009-51114 - | ENST00000616830.1
# [247541] chrUn_KI270750v1 148668-148843 + | ENST00000612925.1
# -------
# seqinfo: 595 sequences (1 circular) from hg38 genome
The gene_id
metadata column is a CharacterList object that links each transcript to 0 or 1 gene:
mcols(tx)$gene_id
# CharacterList of length 247541
# [[1]] 100287102
# [[2]] 100287102
# [[3]] character(0)
# [[4]] character(0)
# [[5]] 100302278
# [[6]] character(0)
# [[7]] character(0)
# [[8]] character(0)
# [[9]] character(0)
# [[10]] 79501
# ...
# <247531 more elements>
table(lengths(mcols(tx)$gene_id))
# 0 1
# 50159 197382
It might actually be more convenient to represent this mapping as a character vector:
mcols(tx)$gene_id <- as.character(mcols(tx)$gene_id)
tx
# GRanges object with 247541 ranges and 2 metadata columns:
# seqnames ranges strand | tx_name gene_id
# <Rle> <IRanges> <Rle> | <character> <character>
# [1] chr1 11869 + | ENST00000456328.2 100287102
# [2] chr1 12010 + | ENST00000450305.2 100287102
# [3] chr1 29554 + | ENST00000473358.1 <NA>
# [4] chr1 30267 + | ENST00000469289.1 <NA>
# [5] chr1 30366 + | ENST00000607096.1 100302278
# ... ... ... ... . ... ...
# [247537] chrUn_GL000220v1 155997 + | ENST00000619779.1 109864274
# [247538] chrUn_KI270442v1 380608 + | ENST00000620265.1 <NA>
# [247539] chrUn_KI270442v1 217401 - | ENST00000611690.1 <NA>
# [247540] chrUn_KI270744v1 51114 - | ENST00000616830.1 <NA>
# [247541] chrUn_KI270750v1 148668 + | ENST00000612925.1 <NA>
# -------
# seqinfo: 595 sequences (1 circular) from hg38 genome
2. Add the gene names/symbols to 'tx' as another metadata column:
The gene names/symbols are not stored in the TxDb.Hsapiens.UCSC.hg38.knownGene package so we cannot do something like transcripts(TxDb.Hsapiens.UCSC.hg38.knownGene, columns=c("tx_name", "gene_id", "SYMBOL"))
to obtained this information. Instead we get it from the org.Hs.eg.db package:
library(org.Hs.eg.db)
ENTREZID2SYMBOL <- select(org.Hs.eg.db, mcols(tx)$gene_id, c("ENTREZID", "SYMBOL"))
# 'select()' returned many:1 mapping between keys and columns
stopifnot(identical(ENTREZID2SYMBOL$ENTREZID, mcols(tx)$gene_id))
mcols(tx)$SYMBOL <- ENTREZID2SYMBOL$SYMBOL
Note that an alternative to the above would be to use the Homo.sapiens package, which bundles TxDb.Hsapiens.UCSC.hg19.knownGene and org.Hs.eg.db together:
library(Homo.sapiens)
Homo.sapiens
# OrganismDb Object:
## Includes GODb Object: GO.db
## With data about: Gene Ontology
## Includes OrgDb Object: org.Hs.eg.db
## Gene data about: Homo sapiens
## Taxonomy Id: 9606
## Includes TxDb Object: TxDb.Hsapiens.UCSC.hg19.knownGene
## Transcriptome data about: Homo sapiens
## Based on genome: hg19
## The OrgDb gene id ENTREZID is mapped to the TxDb gene id GENEID .
transcripts(Homo.sapiens, columns=c("ENTREZID", "SYMBOL"))
# 'select()' returned 1:1 mapping between keys and columns
# GRanges object with 82960 ranges and 2 metadata columns:
# seqnames ranges strand | SYMBOL ENTREZID
# <Rle> <IRanges> <Rle> | <CharacterList> <CharacterList>
# [1] chr1 11874-14409 + | DDX11L1 100287102
# [2] chr1 11874-14409 + | DDX11L1 100287102
# [3] chr1 11874-14409 + | DDX11L1 100287102
# [4] chr1 69091-70008 + | OR4F5 79501
# [5] chr1 321084-321115 + | <NA> <NA>
# ... ... ... ... . ... ...
# [82956] chrUn_gl000237 1-2686 - | <NA> <NA>
# [82957] chrUn_gl000241 20433-36875 - | <NA> <NA>
# [82958] chrUn_gl000243 11501-11530 + | <NA> <NA>
# [82959] chrUn_gl000243 13608-13637 + | <NA> <NA>
# [82960] chrUn_gl000247 5787-5816 - | <NA> <NA>
# -------
# seqinfo: 93 sequences (1 circular) from hg19 genome
Note that the reason Homo.sapiens contains less transcripts than TxDb.Hsapiens.UCSC.hg38.knownGene is because by default Homo.sapiens is based on hg19 instead of hg38. This can be changed with:
TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.knownGene
3. Use promoters()
to get the promoters or TSS:
We use upstream=0
and downstream=1
to get the TSS (see ?promoters
in the GenomicFeatures package for more information):
tss <- promoters(tx, upstream=0, downstream=1)
tss
# GRanges object with 247541 ranges and 3 metadata columns:
# seqnames ranges strand | tx_name gene_id SYMBOL
# <Rle> <IRanges> <Rle> | <character> <character> <character>
# [1] chr1 11869 + | ENST00000456328.2 100287102 DDX11L1
# [2] chr1 12010 + | ENST00000450305.2 100287102 DDX11L1
# [3] chr1 29554 + | ENST00000473358.1 <NA> <NA>
# [4] chr1 30267 + | ENST00000469289.1 <NA> <NA>
# [5] chr1 30366 + | ENST00000607096.1 100302278 MIR1302-2
# ... ... ... ... . ... ... ...
# [247537] chrUn_GL000220v1 155997 + | ENST00000619779.1 109864274 RNA5-8SN4
# [247538] chrUn_KI270442v1 380608 + | ENST00000620265.1 <NA> <NA>
# [247539] chrUn_KI270442v1 217401 - | ENST00000611690.1 <NA> <NA>
# [247540] chrUn_KI270744v1 51114 - | ENST00000616830.1 <NA> <NA>
# [247541] chrUn_KI270750v1 148668 + | ENST00000612925.1 <NA> <NA>
# -------
# seqinfo: 595 sequences (1 circular) from hg38 genome
H.
Should point out that the
Homo.sapiens
package by default containsTxDb.Hsapiens.UCSC.hg19.knownGene
, so you have to substitute in the hg38 version first.Thx Jim. I added a note about this in my original answer. H.