Search
Question: Generating a proper TxDb instance from NCBI GFF Annotations File
1
2.5 years ago by
Germany
gokcen.eraslan10 wrote:

I have some RefSeq transcript IDs e.g. NM_123 and NR_123 and some genomic positions from hg19 assembly. I'm trying to use the NCBI Genome Annotations in my analysis. Since I couldn't find a ready-to-use NCBI annotation TxDb file in Bioconductor, I tried to generate my own from the following GFF file:

ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE/ANNOTATION_RELEASE.105/GFF/ref_GRCh37.p13_top_level.gff3.gz

This is NCBI annotation release 105 file and it is different than UCSC refGenes table which can be accessed:

txdb <- makeTxDbFromUCSC(genome='hg19', tablename='refGene')

However, I would like to use the original NCBI one, not the UCSC refGene table, since there are some discrepancies between them. So I generate my own TxDb from GFF:

txdb <- makeTxDbFromGFF('ref_GRCh37.p13_top_level.gff3', organism='Homo sapiens')

But in this case transcripts have NC_ id instead of chromosome names:

> txdb.transcripts <- transcripts(txdb)
> txdb.transcripts
GRanges object with 94469 ranges and 2 metadata columns:
seqnames           ranges strand   |     tx_id        tx_name
<Rle>        <IRanges>  <Rle>   | <integer>    <character>
[1]   NC_000001.10 [ 11874,  14409]      +   |         1    NR_046018.2
[2]   NC_000001.10 [ 30366,  30503]      +   |         2    NR_036051.1
[3]   NC_000001.10 [ 30438,  30458]      +   |         3           <NA>
[4]   NC_000001.10 [ 69091,  70008]      +   |         4 NM_001005484.1
[5]   NC_000001.10 [320162, 326154]      +   |         5    XR_246673.1
...            ...              ...    ... ...       ...            ...
[94465] NW_004775435.1 [178851, 181488]      -   |     94465         IMMTP1
[94466] NW_004775435.1 [184897, 185484]      -   |     94466    NM_181686.1
[94467] NW_004775435.1 [271901, 305157]      -   |     94467 NM_001202489.1
[94468] NW_004775435.1 [271901, 305157]      -   |     94468    NM_003343.5
[94469] NW_004775435.1 [271901, 305157]      -   |     94469    NM_182688.2
-------
seqinfo: 258 sequences from an unspecified genome; no seqlengths


 There some things that I don't understand about the resulting GRanges: How can I convert these seqnames to chr1-chr22 etc. properly? I tried  GenomeInfoDb::fetchExtendedChromInfoFromUCSC('hg19') but it doesn't provide a mapping between these NC_ ids and chr1-chr22. Should I use GenomeInfoDb:::fetch_assembly_report('GCF_000001405.25') and somehow map RefSeqAccn and UCSCStyleName columns manually? How should I handle NW_ ids then? Should I ignore all transcripts with NW_ seqnames? That seems so hacky...    Some transcripts such as NM_001286239 do not exist it in TxDb or in GFF file, although they can be accessed on the web interface: https://www.ncbi.nlm.nih.gov/gene/?term=NM_001286239 does anyone know why?

modified 2.5 years ago by Hervé Pagès ♦♦ 13k • written 2.5 years ago by gokcen.eraslan10
1

I expect your 'missing' transcript isn't defined in the GRCh37 annotation, but was only added after the move to GRCh38.

Thanks for the reply. I also thought about that, but they also report hg19-based coordinates on the webpage. Is it the case that they also map new transcripts to older assemblies?

1

I don't think it does give transcript specific coordinates on that page.  I can only see coordinates for the start and end of the set of all transcripts relative to GRCh37 & GRCh38.  If you select GRCh37 from the Genomic regions, transcripts, and products section there is no entry for NM_001286239 in the diagram.  Maybe I'm not seeing something though.

3
2.5 years ago by
Hervé Pagès ♦♦ 13k
United States
Hervé Pagès ♦♦ 13k wrote:

Hi,

Unfortunately grabbing NCBI annotations for GRCh37.p13 and trying to use them with hg19 is painful and has some pitfalls. The GRCh37.p13 assembly contains 297 sequences (see ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.25.assembly.txt), and hg19 contains only 93. Only 92 sequences are shared between the 2 assemblies: the 83 sequences in the Primary Assembly + the 9 sequences in the ALT_REF_LOCI_* assembly units. Be aware that chromosome M is not the
same in the 2 assemblies (NC_012920.1 in GRCh37.p13 and NC_001807 in hg19).

As you found out, you can use GRCh37.p13 assembly report (that you can fetch with unexported/undocumented internal utility GenomeInfoDb:::fetch_assembly_report, congrats for finding this!) to translate from the RefSeq accessions used in the NCBI GFF file to hg19 chromosome names. Something like this:

## Import GFF file as a GRanges object.
library(rtracklayer)
gr <- import("ref_GRCh37.p13_top_level.gff3")

## Turn GRanges object from GRCh37.p13-based to hg19-based.
library(GenomeInfoDb)
assembly_report <- GenomeInfoDb:::fetch_assembly_report("GCF_000001405.25")
m <- match(seqlevels(gr), assembly_report$RefSeqAccn) new_seqlevels <- assembly_report$UCSCStyleName[m]
names(new_seqlevels) <- assembly_report\$RefSeqAccn[m]
new_seqlevels <- new_seqlevels[!(new_seqlevels %in% c("na", "chrM"))]
seqlevels(gr, force=TRUE) <- new_seqlevels
genome(gr) <- "hg19"

## Turn GRanges object into a TxDb object.
library(GenomicFeatures)
txdb <- makeTxDbFromGRanges(gr)

Yes, a little bit too hacky at the moment, unfortunately. Hopefully we'll be able to make this easier in the future.

Cheers,
H.

Thanks for the reply. Would it make sense to add resulting TxDb to Bioconductor as an official annotation package :) People do not suffer as much as I did, at least.

1
2.5 years ago by
United States
James W. MacDonald46k wrote:

You can't map all the data to chr 1-22 'properly' without dropping all the unplaced/alternate/unknown scaffolds. And AFAIK, all the getter/setter functions in GenomeInfoDb assume you just have genes on the 'regular' chromosomes (1-24+MT). So if you just want the placed scaffolds, you could remove all the 233 extra sequences from your GRanges and then map using seqlevels() and this. If you want to map everything, then that is going to get a bit tricky, as the nomenclature for the alternate and unplaced scaffolds seems a bit lacking. As an example, here are the first few lines from the GCRh37 alternate scaffold mapping:

1    6    NT_167249.1 224515582 GL000256.1 224183346
2    4    NT_167250.1 224515583 GL000257.1 224183347
3   17    NT_167251.1 224515584 GL000258.1 224183348
4    6    NT_167244.1 224515576 GL000250.1 224183340
5    6    NT_167245.1 224515578 GL000252.1 224183342
6    6    NT_167246.1 224515579 GL000253.1 224183343
7    6    NT_167247.1 224515580 GL000254.1 224183344
8    6    NT_167248.1 224515581 GL000255.1 224183345
9    6    NT_113891.2 224515577 GL000251.1 224183341

The first column is just the row name - the second column is the chromosome. I have no idea how you would map all those alternate chr6 scaffolds to say, UCSC style names:

> grep("chr6", seqlevels(tx2), value = TRUE)
[1] "chr6"           "chr6_apd_hap1"  "chr6_cox_hap2"  "chr6_dbb_hap3"
[5] "chr6_mann_hap4" "chr6_mcf_hap5"  "chr6_qbl_hap6"  "chr6_ssto_hap7"

Probably you can't, as there appear to be more unmapped and unplaced chr6 scaffolds from NCBI as compared to UCSC. You could just append something like chr6_alt1, chr6_alt2, etc.

Thanks for the reply. GenomeInfoDb:::fetch_assembly_report('GCF_000001405.25') retrieves and parses

ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000001405.25.assembly.txt

So RefSeqAccn and UCSCStyleName columns can be used to convert NC_ and NT_ ids to UCSC ones, but no luck for NW_ ids. But is this is easiest way to use NCBI GFF files? Seems quite tricky...