EdgeR function nearestTSS() fails with org.Ss.eg.db
2
0
Entering edit mode
Jon • 0
@c5e76e4a
Last seen 9 weeks ago
France

Hi,

Interested in using the nearestTSS function from edgeR, I was not able to reproduce the provided example with Sus scrofa.

Here is the example with species="Hs" (default value), the output is as expected.

> nearestTSS(chr = c("1","1"), locus = c(1000000,2000000), species="Hs")
gene_id symbol width     tss strand distance
1   57801   HES4  1134 1000097      -      -97
2   85452 CFAP74 81830 2003786      -    -3786


However, it no longer works when switching to S. scrofa.

> nearestTSS(chr = c("1","1"), locus = c(1000000,2000000), species="Ss")
Error in nearestTSS(chr = c("1", "1"), locus = c(1e+06, 2e+06), species = "Ss") :
Can't find egCHRLOC gene location mappings in package org.Ss.eg.db


It seems that at least one R object is the problem here, but there might also be something else.

Thanks in advance for any suggestion that would allow me to test this function on pig data.

Kind regards, -Jon

Sus_scrofa edgeR • 479 views
0
Entering edit mode

From the man page of org.Ss.eg.db here it doesn't look like we report CHRLOC information for S. scrofa.

0
Entering edit mode
@james-w-macdonald-5106
Last seen 10 hours ago
United States

The positional location in the OrgDb packages are a relic of an earlier time, before we had the TxDb packages. I have been meaning to remove those data for a while now, but as you can see that would break some long-standing packages, so that might be a difficult thing to do at this point.

Anyway, you should really be using a TxDb package for that sort of thing, because there is a guarantee as to what genome build the positions came from. The OrgDb packages aren't tied to a build, and I don't even remember where the data come from, so there's a question of provenance. If I were to recapitulate what is done in the example for nearestTSS, I would do this:

> library(Homo.sapiens)
> library(TxDb.Hsapiens.UCSC.hg38.knownGene)
## put the newest TxDb in there...
> TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.knownGene
> tx <- transcripts(Homo.sapiens, columns = c("SYMBOL","GENEID"))
'select()' returned 1:1 mapping between keys and columns
> tss <- resize(tx, width = 1, fix = "start")
> gr <- GRanges(paste0("chr", c("1","1")), IRanges(c(1000000,2000000), width = 1))
> tss[nearest(gr, tss)]
GRanges object with 2 ranges and 2 metadata columns:
seqnames    ranges strand |          GENEID          SYMBOL
<Rle> <IRanges>  <Rle> | <CharacterList> <CharacterList>
[1]     chr1    999981      - |           57801            HES4
[2]     chr1   2003714      - |           85452          CFAP74
-------
seqinfo: 640 sequences (1 circular) from hg38 genome
>
> distanceToNearest(gr, tss)
Hits object with 2 hits and 1 metadata column:
queryHits subjectHits |  distance
<integer>   <integer> | <integer>
[1]         1       10974 |        18
[2]         2       11259 |      3713
-------
queryLength: 2 / subjectLength: 258145

## as compared to
>  nearestTSS(chr = c("1","1"), locus = c(1000000,2000000))
gene_id symbol width     tss strand distance
1   57801   HES4  1134 1000097      -      -97
2   85452 CFAP74 81830 2003786      -    -3786


But you have pig data, so it's a bit more complicated. Here's how I would do that.

> library(AnnotationHub)

> hub <- AnnotationHub()
|======================================================================| 100%

snapshotDate(): 2021-10-20
> query(hub, c("txdb","sus scrofa"))
AnnotationHub with 13 records
# snapshotDate(): 2021-10-20
# $dataprovider: UCSC #$species: Sus scrofa
# $rdataclass: TxDb # additional mcols(): taxonomyid, genome, description, # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, # rdatapath, sourceurl, sourcetype # retrieve records with, e.g., 'object[["AH52273"]]' title AH52273 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite AH61788 | TxDb.Sscrofa.UCSC.susScr11.refGene.sqlite AH61800 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite AH66180 | TxDb.Sscrofa.UCSC.susScr11.refGene.sqlite AH66181 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite ... ... AH75768 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite AH79598 | TxDb.Sscrofa.UCSC.susScr11.refGene.sqlite AH79599 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite AH84144 | TxDb.Sscrofa.UCSC.susScr11.refGene.sqlite <-------------------------- Probably the newest one right there AH84145 | TxDb.Sscrofa.UCSC.susScr3.refGene.sqlite > txdb <- hub[["AH84144"]] downloading 1 resources retrieving 1 resource |======================================================================| 100% loading from cache > query(hub, c("orgdb","sus scrofa")) AnnotationHub with 1 record # snapshotDate(): 2021-10-20 # names(): AH95961 #$dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Sus scrofa #$rdataclass: OrgDb
# $rdatadateadded: 2021-10-08 #$title: org.Ss.eg.db.sqlite
# $description: NCBI gene ID based annotations about Sus scrofa #$taxonomyid: 9823
# $genome: NCBI genomes #$sourcetype: NCBI/ensembl
# $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.ensembl.org/p... #$sourcesize: NA
# $tags: c("NCBI", "Gene", "Annotation") # retrieve record with 'object[["AH95961"]]' > orgdb <- hub[["AH95961"]] downloading 1 resources retrieving 1 resource |======================================================================| 100% loading from cache > pigtx <- transcripts(txdb) ## now add in symbol and NCBI Gene ID > mcols(pigtx)$SYMBOL <- mapIds(orgdb, pigtx$tx_name, "SYMBOL", "REFSEQ") 'select()' returned 1:1 mapping between keys and columns > mcols(pigtx)$GENEID <- mapIds(orgdb, pigtx$tx_name, "ENTREZID", "REFSEQ") 'select()' returned 1:1 mapping between keys and columns > pigtss <- resize(pigtx, 1) > pigtss[nearest(gr, pigtss)] GRanges object with 2 ranges and 4 metadata columns: seqnames ranges strand | tx_id tx_name SYMBOL GENEID <Rle> <IRanges> <Rle> | <integer> <character> <character> <character> [1] chr1 1013802 - | 178 NR_128500 MIR9815 104796093 [2] chr1 1665643 - | 180 NR_128509 MIR9824 104796099 ------- seqinfo: 613 sequences (1 circular) from susScr11 genome >  0 Entering edit mode Dear James, I went to a lot of time and effort to compare the organism packages to the TxDb packages before I implemented nearestTSS() in the way that I did. The results from nearestTSS() agree exactly with the latest version of NCBI RefSeq, which is the world's most respected source of gene annotation. My function is providing annotation specifically for NCBI genes, so I consider it important to provide official NCBI annotation and not annotation from some other source such as UCSC, Ensembl or Gencode. If I was annotating Ensembl genes, then I would use Gencode or Ensembl, but here we are using NCBI genes. You can see the RefSeq annotation for HES4 from the NCBI page for that gene: https://www.ncbi.nlm.nih.gov/gene/57801 The relevant information is given under Reference GRCh38.p14 Primary Assembly, Genomic and says: "Range: 998964..1000097 complement". This shows that the CFAP74 RefSeq sequence starts at location 1000097 and is 1134 bases long, exactly as reported by nearestTSS(). You can see the RefSeq annotation for CFAP74 at https://www.ncbi.nlm.nih.gov/gene/85452. The relevant information is given under Reference GRCh38.p14 Primary Assembly, Genomic and says: "Range 1921957..2003786 complement". This shows that the CFAP74 RefSeq sequence starts at location 2003786 and is 81830 bases long, exactly as reported by nearestTSS(). The same RefSeq sequence information is also available from the UCSC Browser, see UCSC RefSeq page for HES4 and UCSC RefSeq page for CFAP74. On the other hand, the alternative TSS results that you give using the TxDb package do not agree with any annotation source as far as I can see. The provenance of the TxDb results is entirely unclear to me. To be a replacement for nearestTSS, your code would also need to resolve multiple TSS for the same gene. For the purposes of the gene-level analysis for which nearestTSS was designed, it is important to resolve multiple TSS before finding distance to nearest rather than afterwards. One can't just throw all the transcripts TSS individually at the desired locus because the occasional short transcript causes unwanted hits. I agree with you that genome build is an important issue that is undocumented in the organism packages. Version 3.15 of org.Mm.eg.db continues to give genomic locations for mm10 rather than mm39, and therefore the same applies to nearestTSS(). That is an important limitation that I now explain in the documentation. Best wishes Gordon ADD REPLY 0 Entering edit mode The main issue has to do with the source of the annotation for the TxDb packages. We have historically used the knownGene table at UCSC, and in the past it was always NCBI annotations with perhaps a bit of extra annotation by UCSC. Unfortunately, in the intervening period UCSC has switched from using NCBI for the knownGene table, first to Ensembl, and currently to GENCODE. All while keeping the GRCh37 knownGene table based on NCBI. And depending on the species, the knownGene table may or may not be NCBI. It is IMO quite the mess. In response to that, I started building TxDbs using the refGene table, which is still based on NCBI. But the convention is to use the actual UCSC table name, so it's the TxDb.Hsapiens.UCSC.hg38.refGene package, which is suboptimal, given the longstanding use of knownGene to mean 'NCBI'. But my point remains the same - a given TxDb package is guaranteed to be based on the genome build and UCSC table in the package name. It's unfortunate that UCSC is changing things like that, and maybe we should consider using a different source. But I don't see that using data from an OrgDb is optimal either. People have to realize that the default genome build for an OrgDb package is the latest, unless of course you are using org.Mm.eg.db, in which case it is not. At least with a TxDb, what's in it is what it says on the tin. Anyway, > library(Homo.sapiens) Loading required package: OrganismDbi Loading required package: GO.db Loading required package: org.Hs.eg.db Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene > TxDb(Homo.sapiens) <- TxDb.Hsapiens.UCSC.hg38.refGene > tx <- transcripts(Homo.sapiens, columns = c("SYMBOL","GENEID")) 'select()' returned 1:1 mapping between keys and columns > tss <- resize(tx, width = 1, fix = "start") > gr <- GRanges(paste0("chr", c("1","1")), IRanges(c(1000000,2000000), width = 1)) > tss[nearest(gr, tss)] GRanges object with 2 ranges and 2 metadata columns: seqnames ranges strand | GENEID SYMBOL <Rle> <IRanges> <Rle> | <CharacterList> <CharacterList> [1] chr1 1000097 - | 57801 HES4 [2] chr1 2003786 - | 85452 CFAP74 ------- seqinfo: 595 sequences (1 circular) from hg38 genome >  The added complexity of using GENCODE is the fact that EBI tends to 'find' more transcripts than NCBI. I am unsure why you were unable to find the TSS results I got from using the TxDb.Hsapiens.UCSC.hg38.knownGene package, but I can query the knownGene table directly and get the same results: > library(RMariaDB) > con <- dbConnect(MariaDB(), username = "genome", host = "genome-mysql.soe.ucsc.edu", port = 3306, dbname = "hg38") > z <- dbGetQuery(con, "select geneSymbol, chrom, txStart, txEnd from knownGene inner join kgXref on kgID=name where geneSymbol='CFAP74';") > z$width  <- z$txEnd - z$txStart
> z$diss2TSS <- 2e6 - z$txEnd+1
> z
geneSymbol chrom txStart   txEnd width diss2TSS
1      CFAP74  chr1 1921950 2003837 81887    -3836
2      CFAP74  chr1 1921956 1927929  5973    72072
3      CFAP74  chr1 1921956 2003786 81830    -3785
4      CFAP74  chr1 1922082 1927403  5321    72598
5      CFAP74  chr1 1946347 1960106 13759    39895
6      CFAP74  chr1 1953312 2003837 50525    -3836
7      CFAP74  chr1 1955147 2003714 48567    -3713
8      CFAP74  chr1 1955710 1960108  4398    39893
9      CFAP74  chr1 1983670 2003837 20167    -3836
10     CFAP74  chr1 1987961 1990929  2968     9072


And the shortest distance to a TSS is the 3713 we already got. As for deciding a priori which TSS is the 'right' one, that's something that Bioconductor has never done. We are in the business of providing the existing annotation data as provided by a given source, without modification.

0
Entering edit mode

Thanks for the extra background information, I appreciate that. I agree that annotation is a very complex moving target.

I have found the TxDb results now, they are actually Ensembl transcripts ENST00000484667 and ENST00000682832. They do not correspond to any NCBI transcripts.

The TxDb packages may contain what is says on the tin, but there isn't a lot written on the tin. I don't know much about UCSC knownGene tables or how to get them, so it is not immediately self-explanatory to me.

I did not ask Bioconductor to make a priori judgements about which is the best TSS. But choosing the best TSS is part of my job and I cannot do so correctly unless I have a perfect understanding of where the annotation is from. TxDb.Hsapiens.UCSC.hg38.knownGene seems to contain Ensembl transcripts but with NCBI gene IDs. That isn't a workable combination for me.

0
Entering edit mode

The second Ensembl transcript you mention is a MANE select transcript, so by definition it does have a matching NCBI transcript The first Ensembl transcript is only GENCODE basic. The MANE select transcript is actually ENST00000304952.11, so if you are going to pick a canonical one to use, that's arguably the one to choose. As you note, UCSC doesn't have either in their knownGene table. I don't know why that is. We unfortunately cannot vouch for the data provided in a given annotation package, other than to say we think it matches with what you would get from the original source.

I agree with you about using Ensembl data with an NCBI key. I have argued against that sort of mapping many times in the past. If it were up to me we wouldn't provide the knownGene data for hg38 for that reason. That's why I started making the TxDb.Hsapiens.UCSC.hg38.refGene package, which is a pure NCBI mapping, so people have an alternative that matches up with what we provided in the past.

The name of the package says exactly what it is. TxDb.Hsapiens.UCSC.hg38.refGene is a TxDb package for Homo sapiens that was collected from the UCSC refGene table for hg38. Is there something else required to unambiguously state what is contained therein? Having no knowledge of the table from which the data are derived seems orthogonal to whether it is an accurate description or not. The show method provides a bit more detail

> TxDb.Hsapiens.UCSC.hg38.refGene
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: UCSC
# Genome: hg38
# Organism: Homo sapiens
# Taxonomy ID: 9606
# UCSC Table: refGene
# UCSC Track: NCBI RefSeq
# Resource URL: http://genome.ucsc.edu/
# Type of Gene ID: Entrez Gene ID
# Full dataset: yes
# miRBase build ID: NA
# Nb of transcripts: 88816
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2021-04-28 16:30:46 +0000 (Wed, 28 Apr 2021)
# GenomicFeatures version at creation time: 1.41.3
# RSQLite version at creation time: 2.2.6
# DBSCHEMAVERSION: 1.2


Is there something else that could be added to make the tin better reflect the contents?

0
Entering edit mode

Well, I'd like to have the URL of the data file used, rather than just the generic URL for the UCSC browser. And I'd like to know the date of the annotation rather than the date of the package build.

Just by the way, I had a debate recently with the other Rsubread authors about whether it was ok to use UCSC's version of RefSeq rather than the primary NCBI RefSeq files. The debate was in the context of mm39 rather than hg38. I was open to using UCSC but we decided to stick with NCBI. This is just another reflection of the complexities between the different annotation sources.

0
Entering edit mode

I spent some time this morning reviewing exactly how the positional data are inserted in the OrgDb' packages. If you wish to do so yourself, you can find all the scripts here. The short story is that we download the UCSC refGene table dump from here. Which hasn't been updated since 2019, because that table has been static since then. The README at the top of the page states that the table dumps are updated daily, which I infer to mean that any time a given table is updated, they produce a new table dump. In comparison, the knownGene table dump is from January 2022.

In other words, what you could get from the TxDb.Hsapiens.UCSC.hg38.refGene will be identical to what is in the org.Hs.eg.db package.

We could download the table dumps for the TxDb packages as well, in which case we could provide the URL of the data file used, but that's not a panacea. Those files are overwritten whenever the underlying database is updated (the same happens at NCBI, btw). So having a URL pointing to an FTP site where the data are updated at random intervals isn't particularly useful unless the data haven't been updated in the intervening period. If we say, 'we got these data here on this date' and you visit the URL and find the timestamp on the file is updated, all you know is that the data on that FTP site no longer matches what we used.

We don't provide a URL for the TxDb packages because there isn't one. We directly access the underlying database and pull data from the listed table. That is a safer option because there is no guarantee that the table dumps will continue to be updated. We got burned in the past when GO stopped updating the SQL dumps we were parsing and only updated their OWL files. So for a number of releases the GO data we provided were accurate (inasmuch as they reflected what we downloaded) but not really because the data kept getting more and more stale. In the intervening period they also updated their DAG and much of the terms, which is how I first figured out that there was a problem. It then took somewhere around 50 hours of my time and I don't know how much of Dan Van Twisk's time to convert the data parser from the SQL dumps to OWL files.

Relying on processed files is more dangerous than querying the DB directly, and if GO gave access to their DB directly we would do that as well. Unless an annotation service ceases to exist, the DB will be the gold standard because that's where they keep the data. The DB dumps are just something they do as a favor to the community and can change or stop whenever.

0
Entering edit mode

Thanks for investigating and thanks for the extra background information. Appreciate it.

0
Entering edit mode
@gordon-smyth
Last seen 25 minutes ago
WEHI, Melbourne, Australia

Hi Jon,

As you will know from the help page, edgeR::nearestTSS uses the corresponding Bioconductor organism package. Most organism packages, like those for human and mouse, include TSS genomic locations but a few do not. The S. scrofa organism package doesn't include genomic locations, so there is no information for nearestTSS to extract. So you have no choice but to follow the more complicated pipeline suggested by James MacDonald.

nearestTSS currently works for species` equal to "Hs", "Mm", "Rn", "Dm", "Dr", "Ce", "Bt", "Gg", "Mmu", "Cf" or "Pt" but not for "Ss", "EcK12", "Xl", "Ag" or "EcSakai".

Best wishes
Gordon

0
Entering edit mode

Hi,

Thank you to everyone who replied to this thread.

Thanks also for the ongoing discussion above which helps clarify to a wider audience some implicit things for experienced users.

Kind regards, -Jon