First of all. I am new to bioinformatics.
I am trying to answer a question from an assignment regarding promoters from Chr19.
I am using the following:
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
genes_txdb <- genes(txdb)
however, the genes are listed by id. I want to add the gene names to my GRanges object.
Is there a way to map the gene id to the gene name?
If you want to do that sort of thing, you should use the Homo.sapiens package.
> library(Homo.sapiens)
## update the TxDb, using the refGene version because knownGene is Ensembl based
> library(TxDb.Hsapiens.UCSC.hg38.refGene)
> TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.refGene
> genes(Homo.sapiens, columns = "SYMBOL")
2136 genes were dropped because they have exons located on both strands
of the same reference sequence or on more than one reference sequence,
so cannot be represented by a single genomic range.
Use 'single.strand.genes.only=FALSE' to get all the genes in a
GRangesList object, or use suppressMessages() to suppress this message.
'select()' returned 1:1 mapping between keys and columns
GRanges object with 26171 ranges and 1 metadata column:
seqnames ranges strand | SYMBOL
<Rle> <IRanges> <Rle> | <CharacterList>
1 chr19 58345183-58353492 - | A1BG
10 chr8 18391282-18401215 + | NAT2
100 chr20 44619519-44651758 - | ADA
1000 chr18 27950966-28177130 - | CDH2
100008586 chrX 49341114-49577757 + | GAGE12F
... ... ... ... . ...
9990 chr15 34229784-34338057 - | SLC12A6
9991 chr9 112217715-112333664 - | PTBP3
9992 chr21 34364006-34371381 + | KCNE2
9993 chr22 19036282-19122454 - | DGCR2
9997 chr22 50523568-50526439 - | SCO2
-------
seqinfo: 640 sequences (1 circular) from hg38 genome
But do note that genes don't have promoters. In fact, genes aren't really a thing, as they are just a way to collapse all the transcripts (which do have promoters) down to a single item. You will note that there is a warning about 2136 genes getting dropped. That's because a gene (for Bioconductor at least) is the genomic distance between the farthest 3' and 5' positions for any transcript. And some transcripts (the PAR genes being notable examples) are found on more than one chromosome, so it's hard to give a single gene boundary for such a thing. But you can get transcripts!
> z <- transcriptsBy(Homo.sapiens, columns = "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> z
GRangesList object of length 28307:
$`1`
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_name SYMBOL
<Rle> <IRanges> <Rle> | <character> <CharacterList>
[1] chr19 58345183-58353492 - | NM_130786 A1BG
-------
seqinfo: 640 sequences (1 circular) from hg38 genome
$`10`
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_name SYMBOL
<Rle> <IRanges> <Rle> | <character> <CharacterList>
[1] chr8 18391282-18401215 + | NM_000015 NAT2
-------
seqinfo: 640 sequences (1 circular) from hg38 genome
$`100`
GRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | tx_name SYMBOL
<Rle> <IRanges> <Rle> | <character> <CharacterList>
[1] chr20 44619519-44651758 - | NM_001322050 ADA
[2] chr20 44619519-44651758 - | NM_001322051 ADA
[3] chr20 44619519-44651758 - | NR_136160 ADA
[4] chr20 44619522-44651699 - | NM_000022 ADA
-------
seqinfo: 640 sequences (1 circular) from hg38 genome
...
<28304 more elements>
## we can collapse to a GRanges object
> unlist(z)
GRanges object with 88819 ranges and 2 metadata columns:
seqnames ranges strand | tx_name
<Rle> <IRanges> <Rle> | <character>
1 chr19 58345183-58353492 - | NM_130786
10 chr8 18391282-18401215 + | NM_000015
100 chr20 44619519-44651758 - | NM_001322050
100 chr20 44619519-44651758 - | NM_001322051
100 chr20 44619519-44651758 - | NR_136160
... ... ... ... . ...
9994 chr6_KV766194v1_fix 25225-72124 + | NM_012115
9997 chr22 50523568-50525598 - | NM_005138
9997 chr22 50523568-50525604 - | NM_001169111
9997 chr22 50523568-50526145 - | NM_001169110
9997 chr22 50523568-50526439 - | NM_001169109
SYMBOL
<CharacterList>
1 A1BG
10 NAT2
100 ADA
100 ADA
100 ADA
... ...
9994 CASP8AP2
9997 SCO2
9997 SCO2
9997 SCO2
9997 SCO2
-------
seqinfo: 640 sequences (1 circular) from hg38 genome
## or just chr19
> zz <- keepSeqlevels(z, "chr19", pruning.mode = "coarse")
> unlist(zz)
GRanges object with 4333 ranges and 2 metadata columns:
seqnames ranges strand | tx_name SYMBOL
<Rle> <IRanges> <Rle> | <character> <CharacterList>
1 chr19 58345183-58353492 - | NM_130786 A1BG
100049587 chr19 51642561-51646819 - | NM_001098612 SIGLEC14
100073347 chr19 56840902-56848556 + | NR_024059 MIMT1
100101266 chr19 24162193-24163447 - | NR_003603 HAVCR1P1
100113382 chr19 10109749-10109840 + | NR_003688 SNORD105B
... ... ... ... . ... ...
976 chr19 14381144-14408725 + | NM_001784 ADGRE5
976 chr19 14381444-14408723 + | NM_078481 ADGRE5
9817 chr19 10486125-10502771 - | NM_012289 KEAP1
9817 chr19 10486125-10503356 - | NM_203500 KEAP1
997 chr19 531760-542087 + | NM_004359 CDC34
-------
seqinfo: 1 sequence from hg38 genome
>
Thank you very much for your help. It is very much appreciated it.
Indeed, my previous concept of genes was very rudimentary.
As you indicated, genes are not physical entities; they are functional transcriptional conceptual entities.
Let me present you with my code instructions following your advice.
I have two questions.
The first question is regarding options (1) and (2) from the code below.
What is the difference between
(1) tscripts_ch19 <- keepSeqlevels(tscripts, "chr19", pruning.mode = "coarse") # option 1, and
(2) tscripts_ch19 <- tscripts[seqnames(tscripts) == "chr19"]. # option 2
Using option (1), I get a total of 4333 transcripts; however, when I use option (2), I get a total of 4664 transcripts.
The second question is regarding the function of the routine "promoters."
The sequence length of all promoters is 2200; I am confused and I may be missing the concept of a promoter.
Are these sequences the promoters, or the promoters' sequences are embedded within the 2200 sequence?
I think what you have done with your alternative subsetting is equivalent to pruning.mode = "fine", except it doesn't correctly set the seqlevels. If you use coarse, then it removes all list items that have non-chr19 seqlevels. And since some of the genes are on alternative haplotypes, you lose those. As an example
> tx <- transcriptsBy(Homo.sapiens, columns = "SYMBOL")
'select()' returned 1:1 mapping between keys and columns
> tx1 <- keepSeqlevels(tx, "chr19", pruning.mode = "coarse")
> tx2 <- tx[seqnames(tx) == "chr19"]
## note that by subsetting this way, you retain many zero-length ranges
> table(lengths(tx2))
0 1 2 3 4 5 6 7 8 9 10 11 12
26515 890 366 192 122 74 37 21 18 12 23 4 7
13 14 15 16 17 19 20 21 26 29 34 41 51
1 2 3 3 3 3 1 1 2 2 2 1 1
76
1
## having 26,615 empty ranges laying around isn't great, so let's remove them
> tx3 <- tx2[lengths(tx2) > 0L]
## now get the GRanges that are in tx3, but not in tx1
> tx[names(tx3[!tx3 %over% tx1])]
GRangesList object of length 117:
$`100132285`
GRanges object with 260 ranges and 2 metadata columns:
seqnames ranges strand | tx_name
<Rle> <IRanges> <Rle> | <character>
[1] chr19 54738511-54848566 + | NM_001291695
[2] chr19 54738511-54848566 + | NM_001291696
[3] chr19 54738511-54848566 + | NM_001291700
[4] chr19 54738511-54848566 + | NM_001291701
[5] chr19 54738511-54848566 + | NM_012312
... ... ... ... . ...
[256] chr19_KV575260v1_alt 40116-54428 + | NM_001291695 <--- These are not on chr19, so pruning.mode = coarse
[257] chr19_KV575260v1_alt 40116-54428 + | NM_001291696 removes this entire GRangesList item
[258] chr19_KV575260v1_alt 40116-54428 + | NM_001291700
[259] chr19_KV575260v1_alt 40116-54428 + | NM_001291701
[260] chr19_KV575260v1_alt 40116-54428 + | NM_012312
SYMBOL
<CharacterList>
[1] KIR2DS2
[2] KIR2DS2
[3] KIR2DS2
[4] KIR2DS2
[5] KIR2DS2
... ...
[256] KIR2DS2
[257] KIR2DS2
[258] KIR2DS2
[259] KIR2DS2
[260] KIR2DS2
-------
seqinfo: 640 sequences (1 circular) from hg38 genome
...
<116 more elements>
As for the promoters, you are using the defaults, which are described in the help page.
## S4 method for signature 'GenomicRanges'
promoters(x, upstream=2000, downstream=200, use.names=TRUE)
So you are by default saying the promoters are in the range that extends upstream 2000 bases from the transcript start site (TSS), and downstream 200 bases from the TSS.
Dear James,
Thank you very much for your help. It is very much appreciated it. Indeed, my previous concept of genes was very rudimentary. As you indicated, genes are not physical entities; they are functional transcriptional conceptual entities.
Let me present you with my code instructions following your advice. I have two questions. The first question is regarding options (1) and (2) from the code below. What is the difference between (1) tscripts_ch19 <- keepSeqlevels(tscripts, "chr19", pruning.mode = "coarse") # option 1, and (2) tscripts_ch19 <- tscripts[seqnames(tscripts) == "chr19"]. # option 2
Using option (1), I get a total of 4333 transcripts; however, when I use option (2), I get a total of 4664 transcripts.
The second question is regarding the function of the routine "promoters." The sequence length of all promoters is 2200; I am confused and I may be missing the concept of a promoter. Are these sequences the promoters, or the promoters' sequences are embedded within the 2200 sequence?
Your help is again very much appreciated it.
GA
I think what you have done with your alternative subsetting is equivalent to pruning.mode = "fine", except it doesn't correctly set the seqlevels. If you use coarse, then it removes all list items that have non-chr19 seqlevels. And since some of the genes are on alternative haplotypes, you lose those. As an example
As for the promoters, you are using the defaults, which are described in the help page.
So you are by default saying the promoters are in the range that extends upstream 2000 bases from the transcript start site (TSS), and downstream 200 bases from the TSS.