Question: error in makeTxDbFromUCSC
6 months ago by
alanchenslm0 wrote:

Hi GenomicFeatures support,

I am a PhD student at the University of Tokyo, using GenomicFeatures with ChIPseeker. After I run

txdb=makeTxDbFromUCSC(genome="hg19",tablename="refGene")


I got this error.

Download the refGene table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... Error in .check_foreign_key(transcripts_tx_chrom, NA, "transcripts$tx_chrom", :
  all the values in 'transcripts$tx_chrom' must be present in 'chrominfo$chrom'

I think the problem is that the refGene table was updated last November, but GenomicFeatures has not yet been changed to match the new refGene release. The sequence names from the two sources are listed below.

GenomicFeatures (93 sequence names):

chr1 chr1_gl000191_random chr1_gl000192_random chr10 chr11 chr11_gl000202_random
chr12 chr13 chr14 chr15 chr16 chr17
chr17_ctg5_hap1 chr17_gl000203_random chr17_gl000204_random chr17_gl000205_random chr17_gl000206_random chr18
chr18_gl000207_random chr19 chr19_gl000208_random chr19_gl000209_random chr2 chr20
chr21 chr21_gl000210_random chr22 chr3 chr4 chr4_ctg9_hap1
chr4_gl000193_random chr4_gl000194_random chr5 chr6 chr6_apd_hap1 chr6_cox_hap2
chr6_dbb_hap3 chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6 chr6_ssto_hap7 chr7
chr7_gl000195_random chr8 chr8_gl000196_random chr8_gl000197_random chr9 chr9_gl000198_random
chr9_gl000199_random chr9_gl000200_random chr9_gl000201_random chrM chrUn_gl000211 chrUn_gl000212
chrUn_gl000213 chrUn_gl000214 chrUn_gl000215 chrUn_gl000216 chrUn_gl000217 chrUn_gl000218
chrUn_gl000219 chrUn_gl000220 chrUn_gl000221 chrUn_gl000222 chrUn_gl000223 chrUn_gl000224
chrUn_gl000225 chrUn_gl000226 chrUn_gl000227 chrUn_gl000228 chrUn_gl000229 chrUn_gl000230
chrUn_gl000231 chrUn_gl000232 chrUn_gl000233 chrUn_gl000234 chrUn_gl000235 chrUn_gl000236
chrUn_gl000237 chrUn_gl000238 chrUn_gl000239 chrUn_gl000240 chrUn_gl000241 chrUn_gl000242
chrUn_gl000243 chrUn_gl000244 chrUn_gl000245 chrUn_gl000246 chrUn_gl000247 chrUn_gl000248
chrUn_gl000249 chrX chrY

UCSC refGene (111 sequence names):

chr1 chr1_gl000191_random chr1_gl000192_random chr1_gl383519_alt chr1_gl949741_fix chr1_jh636052_fix
chr1_jh636054_fix chr10 chr10_gl383543_fix chr10_jh591181_fix chr10_jh636060_fix chr11
chr11_gl949744_fix chr11_jh159138_fix chr11_jh159142_fix chr12 chr13 chr14
chr14_kb021645_fix chr15 chr16 chr17 chr17_ctg5_hap1 chr17_gl000205_random
chr17_gl383560_fix chr17_gl582976_fix chr17_jh159145_fix chr18 chr18_gl383571_alt chr19
chr19_gl000209_random chr19_gl383575_alt chr19_gl582977_fix chr19_gl949746_alt chr19_gl949747_alt chr19_gl949748_alt
chr19_gl949749_alt chr19_gl949750_alt chr19_gl949751_alt chr19_gl949752_alt chr19_gl949753_alt chr19_jh159149_fix
chr19_kb021647_fix chr2 chr2_kb663603_fix chr20 chr20_gl582979_fix chr21
chr21_ke332506_fix chr22 chr22_gl383582_alt chr22_jh720449_fix chr3 chr3_gl383523_fix
chr3_jh159132_fix chr4 chr4_ctg9_hap1 chr4_gl000193_random chr4_gl000194_random chr4_gl877872_fix
chr4_ke332496_fix chr5 chr5_gl339449_alt chr5_jh159133_fix chr5_ke332497_fix chr6
chr6_apd_hap1 chr6_cox_hap2 chr6_dbb_hap3 chr6_jh636056_fix chr6_kb663604_fix chr6_mann_hap4
chr6_mcf_hap5 chr6_qbl_hap6 chr6_ssto_hap7 chr7 chr7_gl000195_random chr7_gl582971_fix
chr7_jh159134_fix chr8 chr8_gl383535_fix chr8_gl383536_fix chr9 chr9_gl339450_fix
chrM chrUn_gl000211 chrUn_gl000212 chrUn_gl000213 chrUn_gl000215 chrUn_gl000218
chrUn_gl000219 chrUn_gl000220 chrUn_gl000222 chrUn_gl000223 chrUn_gl000224 chrUn_gl000227
chrUn_gl000228 chrUn_gl000241 chrX chrX_jh159150_fix chrX_jh806587_fix chrX_jh806590_fix
chrX_jh806593_fix chrX_jh806594_fix chrX_jh806595_fix chrX_jh806597_fix chrX_jh806599_fix chrX_jh806600_fix
chrX_jh806601_fix chrX_kb021648_fix chrY

If anyone has any clues, please let me know. Your help is much appreciated. Thank you so much!

sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] cowplot_0.9.4 reshape_0.8.8 ggplot2_3.1.0 clusterProfiler_3.10.1 GenomicFeatures_1.34.1
[6] GenomicRanges_1.34.0 GenomeInfoDb_1.18.1 org.Hs.eg.db_3.7.0 AnnotationDbi_1.44.0 IRanges_2.16.0
[11] S4Vectors_0.20.1 Biobase_2.42.0 BiocGenerics_0.28.0 ChIPseeker_1.18.0

Please resist the temptation to post in multiple locations (I think you got them all: here, bioc-devel, GitHub, and the maintainer email address)! This seems to have been reported before: https://support.bioconductor.org/p/114901/ and https://support.bioconductor.org/p/107839/ . We'll work on this over the next several days.

written 6 months ago by Martin Morgan ♦♦ 23k

Hi Martin, sorry for posting in multiple locations. I will cancel the posts in the other places. Let's keep the discussion here. Thank you for trying to help me out. If there is any progress, please let me know. Thank you again for your time and help.

written 6 months ago by chen.shihang0

Do NOT post in multiple places; all locations are monitored by the same people.

written 6 months ago by Martin Morgan ♦♦ 23k

I have had a look at this, and can certainly confirm the error, even with the devel branch. A similar error occurs with the request for refGene with hg38.

Browse[6]> where
where 1: .check_foreign_key(transcripts_tx_chrom, NA, "transcripts$tx_chrom",
chrominfo$chrom, NA, "chrominfo$chrom")
where 2: .makeTxDb_normarg_chrominfo(chrominfo, transcripts$tx_chrom, splicings$exon_chrom)
where 3: makeTxDb(transcripts, splicings, genes = genes, chrominfo = chrominfo,
where 4: .makeTxDbFromUCSCTxTable(ucsc_txtable, txname2geneid$genes, genome, tablename, track, txname2geneid$gene_id_type, full_dataset = is.null(transcript_ids),
circ_seqs = circ_seqs, goldenPath_url = goldenPath_url, taxonomyId = taxonomyId,
miRBaseBuild = miRBaseBuild)
where 5: makeTxDbFromUCSC(genome = "hg19", tablename = "refGene")

Browse[6]> length(setdiff(referring_vals, referred_vals))
[1] 57
Browse[6]> setdiff(referring_vals, referred_vals)
[1] "chr19_gl949749_alt" "chr19_gl949746_alt" "chr17_gl582976_fix"
[4] "chr17_jh159145_fix" "chr11_jh159138_fix" "chr5_jh159133_fix"
...


For hg19, .fetchUCSCtxtable returns a table with 111 unique values for chrom

BUT

Browse[4]> GenomeInfoDb:::fetch_ChromInfo_from_UCSC
{
url <- paste(goldenPath_url, genome, "database/chromInfo.txt.gz",
sep = "/")
destfile <- tempfile()


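The chromInfo.txt.gz file fetched here has entries for only 93 'chromosomes' when genome == "hg19". So the real problem seems to be a lack of synchronization upstream, between the refGene table and the chromInfo table. However, it should be possible to devise a soft landing for this event?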
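The mismatch described in this comment can be reproduced in miniature with base R. The sketch below uses small made-up data frames (the names `transcripts`, `chrominfo`, and the `tx_*` identifiers are illustrative assumptions, not the tables that makeTxDbFromUCSC downloads) and only shows the kind of set comparison behind the foreign-key error, not the actual GenomicFeatures internals.

```r
# Toy stand-ins for the two UCSC tables (hypothetical values, not downloaded data).
transcripts <- data.frame(
  tx_name  = c("tx_a", "tx_b", "tx_c"),
  tx_chrom = c("chr1", "chr1_jh636054_fix", "chr19_gl949749_alt"),
  stringsAsFactors = FALSE
)
chrominfo <- data.frame(
  chrom = c("chr1", "chr19"),
  stringsAsFactors = FALSE
)

# Sequence names referred to by the transcripts but absent from chrominfo.
# A non-empty result is the situation that triggers the error in this thread.
unknown_seqs <- setdiff(transcripts$tx_chrom, chrominfo$chrom)
print(unknown_seqs)
```

With real data, the same comparison would be made between the 111 sequence names in refGene and the 93 in chromInfo.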

Hi Vincent, thank you for taking a look! I still have no idea what I can do. I am actually a medical student and don't know much about programming. If you have any suggestions for dealing with this problem, please let me know! Thanks so much for your time and help, much appreciated!

6 months ago by
Hervé Pagès ♦♦ 14k
United States
Hervé Pagès ♦♦ 14k wrote:

Thanks for the report.

The refGene tables in the UCSC databases hg19 and hg38 were last updated in Nov 2018 and now contain transcripts located on sequences that do NOT belong to the corresponding genomes (GRCh37 and GRCh38, respectively). More precisely, some transcripts in these tables now belong to patched versions of these genomes: GRCh37.p13 for hg19 and GRCh38.p11 for hg38. Note that this also causes errors on the Genome Browser itself. For example, if you go to https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19 , enter transcript NM_001910 in the search box, click on GO, then click on the "NM_001910 at chr1_jh636054_fix:118-14749" link, you'll get the following error:

Sorry, couldn't locate chr1_jh636054_fix:118-14749 in Human Feb. 2009 (GRCh37/hg19)


I just committed a fix to GenomicFeatures. The fix is to drop these foreign transcripts with a warning. For example, calling makeTxDbFromUCSC(genome="hg38", tablename="refGene") now displays the following warning message:

  113 transcripts were dropped because they are on unknown sequences
(e.g. transcripts NM_024081, NM_001001437, NM_012101, NR_146066, ...)
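For readers who want to see what "drop with a warning" amounts to, here is a minimal base R sketch. It is not the code committed to GenomicFeatures: the helper name `drop_unknown_seqs` and the toy tables are assumptions made for illustration, and only the wording of the warning is modeled on the message quoted above.

```r
# Hypothetical helper: keep only transcripts whose sequence is listed in chrominfo.
drop_unknown_seqs <- function(transcripts, chrominfo) {
  keep <- transcripts$tx_chrom %in% chrominfo$chrom
  n_dropped <- sum(!keep)
  if (n_dropped > 0L) {
    # Show at most four dropped transcript names as examples.
    examples <- head(transcripts$tx_name[!keep], 4L)
    warning(n_dropped, " transcripts were dropped because they are on ",
            "unknown sequences (e.g. transcripts ",
            paste(examples, collapse = ", "), ")", call. = FALSE)
  }
  transcripts[keep, , drop = FALSE]
}

# Toy data (made-up identifiers, not real refGene rows).
transcripts <- data.frame(
  tx_name  = c("tx_a", "tx_b", "tx_c"),
  tx_chrom = c("chr1", "chr1_jh636054_fix", "chr19_gl949749_alt"),
  stringsAsFactors = FALSE
)
chrominfo <- data.frame(chrom = c("chr1", "chr19"), stringsAsFactors = FALSE)

# Emits a warning and returns only the rows on known sequences.
kept <- drop_unknown_seqs(transcripts, chrominfo)
print(kept)
```

The design choice is the same trade-off described in the answer: the TxDb can still be built, at the cost of silently losing nothing, since every dropped transcript is reported.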


The fix is in GenomicFeatures 1.35.6 (master branch, see fix here) and GenomicFeatures 1.34.3 (RELEASE_3_8 branch).

These 2 new versions of GenomicFeatures should become available via BiocManager::install() in the next 36 hours or so.
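To check whether an installed copy already carries the fix, the version numbers above can be compared in base R. The sketch below is only a convenience: the function name `has_unknown_seq_fix` is made up, and it assumes that every release after 1.35.6 keeps the fix, which the post does not state.

```r
# Hypothetical check based on the two version numbers quoted in the answer.
has_unknown_seq_fix <- function(version) {
  v <- package_version(version)
  # Release branch (RELEASE_3_8): fixed from 1.34.3 onward within 1.34.x.
  in_release <- v >= "1.34.3" && v < "1.35.0"
  # Devel branch: fixed from 1.35.6 onward (later versions assumed to keep it).
  in_devel_or_later <- v >= "1.35.6"
  in_release || in_devel_or_later
}

# The version shown in the original poster's sessionInfo() predates the fix.
print(has_unknown_seq_fix("1.34.1"))
print(has_unknown_seq_fix("1.34.3"))

# With GenomicFeatures installed, one could run:
#   has_unknown_seq_fix(as.character(packageVersion("GenomicFeatures")))
# and, if needed, update with:
#   BiocManager::install("GenomicFeatures")
```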

Cheers, H.