I have some RefSeq transcript IDs e.g. NM_123 and NR_123 and some genomic positions from hg19 assembly. I'm trying to use the NCBI Genome Annotations in my analysis. Since I couldn't find a ready-to-use NCBI annotation TxDb file in Bioconductor, I tried to generate my own from the following GFF file:
This is NCBI annotation release 105 file and it is different than UCSC refGenes table which can be accessed:
txdb <- makeTxDbFromUCSC(genome='hg19', tablename='refGene')
However, I would like to use the original NCBI one, not the UCSC refGene table, since there are some discrepancies between them. So I generate my own TxDb from GFF:
txdb <- makeTxDbFromGFF('ref_GRCh37.p13_top_level.gff3', organism='Homo sapiens')
But in this case transcripts have NC_ id instead of chromosome names:
> txdb.transcripts <- transcripts(txdb) > txdb.transcripts GRanges object with 94469 ranges and 2 metadata columns: seqnames ranges strand | tx_id tx_name <Rle> <IRanges> <Rle> | <integer> <character> [1] NC_000001.10 [ 11874, 14409] + | 1 NR_046018.2 [2] NC_000001.10 [ 30366, 30503] + | 2 NR_036051.1 [3] NC_000001.10 [ 30438, 30458] + | 3 <NA> [4] NC_000001.10 [ 69091, 70008] + | 4 NM_001005484.1 [5] NC_000001.10 [320162, 326154] + | 5 XR_246673.1 ... ... ... ... ... ... ... [94465] NW_004775435.1 [178851, 181488] - | 94465 IMMTP1 [94466] NW_004775435.1 [184897, 185484] - | 94466 NM_181686.1 [94467] NW_004775435.1 [271901, 305157] - | 94467 NM_001202489.1 [94468] NW_004775435.1 [271901, 305157] - | 94468 NM_003343.5 [94469] NW_004775435.1 [271901, 305157] - | 94469 NM_182688.2 ------- seqinfo: 258 sequences from an unspecified genome; no seqlengths |
|
|
I expect your 'missing' transcript isn't defined in the GRCh37 annotation, but was only added after the move to GRCh38.
Thanks for the reply. I also thought about that, but they also report hg19-based coordinates on the webpage. Is it the case that they also map new transcripts to older assemblies?
I don't think it does give transcript specific coordinates on that page. I can only see coordinates for the start and end of the set of all transcripts relative to GRCh37 & GRCh38. If you select GRCh37 from the Genomic regions, transcripts, and products section there is no entry for NM_001286239 in the diagram. Maybe I'm not seeing something though.