Issue creating a genomic state object in derfinder from an Ensembl GTF file

Entering edit mode

Leonardo Collado Torres ★ 1.1k

@lcolladotor

Last seen 8 weeks ago

United States

Hi,

I received the following question by Andrew Lamb which I'm posting here since it might be of interest to other users. I'll answer it shortly. He is using the following GTF file ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz.

Best,

Leonardo

-----

The only issue is that we're still having trouble annotating it using GRCH38 gtf file. Our RNASeq data was aligned using this version so we would like to be able to annotate with it as well. The issue comes when trying the makeGenomicState function, although it may be the issue was with the makeTxDbFromGFF function. Please see below:

library(GenomicFeatures)
library(derfinder)

gtf_file <-'/udd/reala/reference_files/DE/Homo_sapiens.GRCh38.81.gtf'
TXDB <- makeTxDbFromGFF(gtf_file, format = 'gtf')

Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: RSQLite::dbGetPreparedQuery() is deprecated, please switch to DBI::dbGetQuery(params = bind.data).
2: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular
3: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end
4: Named parameters not used in query: internal_id, name, chrom, strand, start, end
5: Named parameters not used in query: internal_id, name, chrom, strand, start, end
6: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id
7: Named parameters not used in query: gene_id, internal_tx_id 

genomic_state <- makeGenomicState(TXDB)

extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
Error in .testForValidKeys(x, keys, keytype, fks) :
  None of the keys entered are valid keys for 'TXID'. Please use the keys method to see a listing of valid arguments.

I also tried:

genomic_state <- makeGenomicState(TXDB, style = "chrsStyle")

extendedMapSeqlevels: the 'style' chrsStyle is currently not supported for the 'species' homo_sapiens in GenomeInfoDb. Check valid naming styles by running GenomeInfoDb::genomeStyles(species). If it's not present, consider adding your genome by following the information at http://www.bioconductor.org/packages/release/bioc/vignettes/GenomeInfoDb/inst/doc/Accept-organism-for-GenomeInfoDb.pdf
'select()' returned 1:1 mapping between keys and columns
Error: logical subscript contains NAs

Any thought on what is happening here?

derfinder genomicfeatures • 4.0k views

ADD COMMENT • link 8.1 years ago Leonardo Collado Torres ★ 1.1k

Entering edit mode

Leonardo Collado Torres ★ 1.1k

@lcolladotor

Last seen 8 weeks ago

United States

Hi Andrew,

It took a bit of effort, but I found what was the problem. The initial error message was not informative and in the end, the problem is that the chromosome lengths are missing when using GenomicFeatures::makeTxDbFromGFF(). This other thread A: Setting genome information on a TxDb object helped me figure out how to specify the chromosome lengths since you currently can't do so for a txdb object. In your particular case since you are using Ensembl's notation for chromosomes, you are better off using style = 'Ensembl', currentStyle = 'Ensembl' so that the chromosome names won't be modified (by default derfinder tries to use UCSC names). You also have many chromosomes in your GTF file and will need to figure out the correct lengths for them, or drop the non-canonical ones in case you don't need the extra chromosomes/contigs.

## Un-evaluated code

library('GenomicFeatures')
library('derfinder')
library('devtools')

txdb <- makeTxDbFromGFF('ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz')

## Check chr names
seqlevels(txdb)

## Keep just chrs 1 to 22, X, Y, and MT
txdb <- keepSeqlevels(txdb, c(1:22, 'X', 'Y', 'MT'))
seqlevels(txdb)

## For speed testing: just use a portion of the txdb data
test_txdb <- keepSeqlevels(txdb, c('Y', 'MT'))

## Attempt to create genomic state object
gs <- makeGenomicState(test_txdb, style = 'Ensembl', currentStyle = 'Ensembl')
traceback()

## Issue is with the seqlengths being missing
seqlengths(test_txdb)

## Adding the sequence lengths doesn't work
hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38")
new_info <- hg38_chrominfo$UCSC_seqlength[match(seqlevels(test_txdb),
    mapSeqlevels(hg38_chrominfo$UCSC_seqlevel, 'Ensembl'))]
names(new_info) <- names(seqlengths(test_txdb))
seqlengths(test_txdb) <- new_info

## Using the approach described in https://support.bioconductor.org/p/67680/#67766
## That is, loading the GFF as a GRanges object and setting the lengths there
## before using makeTxDbFromGRanges()
library('rtracklayer')
gr <- import('ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz')

## Make a small GRanges object for testing -- pruning.mode is not used on Bioc-release, just Bioc-devel
gr_small <- keepSeqlevels(gr, c('Y', 'MT'), pruning.mode = 'tidy')

## Set the chromosome lengths, genome 
seqlengths(gr_small) <- new_info
genome(gr_small) <- 'hg38'
txdb2 <- makeTxDbFromGRanges(gr_small)

## Now create the genomic state object
gs <- makeGenomicState(txdb2, style = 'Ensembl', currentStyle = 'Ensembl')
gs

## Reproducibility information
print('Reproducibility information:')
Sys.time()
options(width = 120)
session_info()

## Evaluated code

	> library('GenomicFeatures')
	Loading required package: BiocGenerics
	Loading required package: parallel

	Attaching package: ‘BiocGenerics’

	The following objects are masked from ‘package:parallel’:

	clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport, clusterMap, parApply, parCapply, parLapply,
	parLapplyLB, parRapply, parSapply, parSapplyLB

	The following objects are masked from ‘package:stats’:

	IQR, mad, sd, var, xtabs

	The following objects are masked from ‘package:base’:

	anyDuplicated, append, as.data.frame, cbind, colMeans, colnames, colSums, do.call, duplicated, eval, evalq, Filter,
	Find, get, grep, grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match, mget, order, paste, pmax,
	pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort, table,
	tapply, union, unique, unsplit, which, which.max, which.min

	Loading required package: S4Vectors
	Loading required package: stats4

	Attaching package: ‘S4Vectors’

	The following object is masked from ‘package:base’:

	expand.grid

	Loading required package: IRanges
	Loading required package: GenomeInfoDb
	Loading required package: GenomicRanges
	Loading required package: AnnotationDbi
	Loading required package: Biobase
	Welcome to Bioconductor

	Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see
	'citation("Biobase")', and for packages 'citation("pkgname")'.

	> library('derfinder')
	> library('devtools')
	>
	> txdb <- makeTxDbFromGFF('ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz')
	Import genomic features from the file as a GRanges object ... trying URL 'ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz'
	ftp data connection made, file length 49757610 bytes
	==================================================
	downloaded 47.5 MB

	OK
	Prepare the 'metadata' data frame ... OK
	Make the TxDb object ... OK
	Warning messages:
	1: RSQLite::dbGetPreparedQuery() is deprecated, please switch to DBI::dbGetQuery(params = bind.data).
	2: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular
	3: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end
	4: Named parameters not used in query: internal_id, name, chrom, strand, start, end
	5: Named parameters not used in query: internal_id, name, chrom, strand, start, end
	6: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id
	7: Named parameters not used in query: gene_id, internal_tx_id
	>
	> ## Check chr names
	> seqlevels(txdb)
	[1] "CHR_HG126_PATCH" "CHR_HG1342_HG2282_PATCH" "CHR_HG1362_PATCH"
	[4] "CHR_HG142_HG150_NOVEL_TEST" "CHR_HG151_NOVEL_TEST" "CHR_HG1651_PATCH"
	[7] "CHR_HG1832_PATCH" "CHR_HG2021_PATCH" "CHR_HG2030_PATCH"
	[10] "CHR_HG2058_PATCH" "CHR_HG2062_PATCH" "CHR_HG2066_PATCH"
	[13] "CHR_HG2095_PATCH" "CHR_HG2104_PATCH" "CHR_HG2128_PATCH"
	[16] "CHR_HG2191_PATCH" "CHR_HG2217_PATCH" "CHR_HG2232_PATCH"
	[19] "CHR_HG2233_PATCH" "CHR_HG2235_PATCH" "CHR_HG2247_PATCH"
	[22] "CHR_HG2249_PATCH" "CHR_HG2288_HG2289_PATCH" "CHR_HG2290_PATCH"
	[25] "CHR_HG2291_PATCH" "CHR_HG986_PATCH" "CHR_HSCHR10_1_CTG1"
	[28] "CHR_HSCHR10_1_CTG2" "CHR_HSCHR10_1_CTG3" "CHR_HSCHR10_1_CTG4"
	[31] "CHR_HSCHR11_1_CTG1_2" "CHR_HSCHR11_1_CTG5" "CHR_HSCHR11_1_CTG6"
	[34] "CHR_HSCHR11_1_CTG7" "CHR_HSCHR11_1_CTG8" "CHR_HSCHR11_2_CTG1"
	[37] "CHR_HSCHR11_2_CTG1_1" "CHR_HSCHR11_3_CTG1" "CHR_HSCHR12_1_CTG1"
	[40] "CHR_HSCHR12_1_CTG2_1" "CHR_HSCHR12_2_CTG2" "CHR_HSCHR12_2_CTG2_1"
	[43] "CHR_HSCHR12_3_CTG2" "CHR_HSCHR12_3_CTG2_1" "CHR_HSCHR12_4_CTG2"
	[46] "CHR_HSCHR12_4_CTG2_1" "CHR_HSCHR12_5_CTG2" "CHR_HSCHR12_5_CTG2_1"
	[49] "CHR_HSCHR12_6_CTG2_1" "CHR_HSCHR13_1_CTG1" "CHR_HSCHR13_1_CTG3"
	[52] "CHR_HSCHR13_1_CTG5" "CHR_HSCHR14_1_CTG1" "CHR_HSCHR14_2_CTG1"
	[55] "CHR_HSCHR14_3_CTG1" "CHR_HSCHR14_7_CTG1" "CHR_HSCHR15_1_CTG1"
	[58] "CHR_HSCHR15_1_CTG3" "CHR_HSCHR15_1_CTG8" "CHR_HSCHR15_2_CTG3"
	[61] "CHR_HSCHR15_2_CTG8" "CHR_HSCHR15_3_CTG3" "CHR_HSCHR15_3_CTG8"
	[64] "CHR_HSCHR15_4_CTG8" "CHR_HSCHR15_5_CTG8" "CHR_HSCHR15_6_CTG8"
	[67] "CHR_HSCHR16_1_CTG1" "CHR_HSCHR16_1_CTG3_1" "CHR_HSCHR16_2_CTG3_1"
	[70] "CHR_HSCHR16_3_CTG1" "CHR_HSCHR16_4_CTG1" "CHR_HSCHR16_CTG2"
	[73] "CHR_HSCHR17_10_CTG4" "CHR_HSCHR17_1_CTG1" "CHR_HSCHR17_1_CTG2"
	[76] "CHR_HSCHR17_1_CTG4" "CHR_HSCHR17_1_CTG5" "CHR_HSCHR17_1_CTG9"
	[79] "CHR_HSCHR17_2_CTG1" "CHR_HSCHR17_2_CTG2" "CHR_HSCHR17_2_CTG4"
	[82] "CHR_HSCHR17_2_CTG5" "CHR_HSCHR17_3_CTG2" "CHR_HSCHR17_3_CTG4"
	[85] "CHR_HSCHR17_4_CTG4" "CHR_HSCHR17_5_CTG4" "CHR_HSCHR17_6_CTG4"
	[88] "CHR_HSCHR17_7_CTG4" "CHR_HSCHR17_8_CTG4" "CHR_HSCHR17_9_CTG4"
	[91] "CHR_HSCHR18_1_CTG1_1" "CHR_HSCHR18_1_CTG2_1" "CHR_HSCHR18_2_CTG1_1"
	[94] "CHR_HSCHR18_2_CTG2" "CHR_HSCHR18_2_CTG2_1" "CHR_HSCHR18_3_CTG2_1"
	[97] "CHR_HSCHR18_ALT21_CTG2_1" "CHR_HSCHR18_ALT2_CTG2_1" "CHR_HSCHR19KIR_ABC08_A1_HAP_CTG3_1"
	[100] "CHR_HSCHR19KIR_ABC08_AB_HAP_C_P_CTG3_1" "CHR_HSCHR19KIR_ABC08_AB_HAP_T_P_CTG3_1" "CHR_HSCHR19KIR_FH05_A_HAP_CTG3_1"
	[103] "CHR_HSCHR19KIR_FH05_B_HAP_CTG3_1" "CHR_HSCHR19KIR_FH06_A_HAP_CTG3_1" "CHR_HSCHR19KIR_FH06_BA1_HAP_CTG3_1"
	[106] "CHR_HSCHR19KIR_FH08_A_HAP_CTG3_1" "CHR_HSCHR19KIR_FH08_BAX_HAP_CTG3_1" "CHR_HSCHR19KIR_FH13_A_HAP_CTG3_1"
	[109] "CHR_HSCHR19KIR_FH13_BA2_HAP_CTG3_1" "CHR_HSCHR19KIR_FH15_A_HAP_CTG3_1" "CHR_HSCHR19KIR_FH15_B_HAP_CTG3_1"
	[112] "CHR_HSCHR19KIR_G085_A_HAP_CTG3_1" "CHR_HSCHR19KIR_G085_BA1_HAP_CTG3_1" "CHR_HSCHR19KIR_G248_A_HAP_CTG3_1"
	[115] "CHR_HSCHR19KIR_G248_BA2_HAP_CTG3_1" "CHR_HSCHR19KIR_GRC212_AB_HAP_CTG3_1" "CHR_HSCHR19KIR_GRC212_BA1_HAP_CTG3_1"
	[118] "CHR_HSCHR19KIR_LUCE_A_HAP_CTG3_1" "CHR_HSCHR19KIR_LUCE_BDEL_HAP_CTG3_1" "CHR_HSCHR19KIR_RP5_B_HAP_CTG3_1"
	[121] "CHR_HSCHR19KIR_RSH_A_HAP_CTG3_1" "CHR_HSCHR19KIR_RSH_BA2_HAP_CTG3_1" "CHR_HSCHR19KIR_T7526_A_HAP_CTG3_1"
	[124] "CHR_HSCHR19KIR_T7526_BDEL_HAP_CTG3_1" "CHR_HSCHR19LRC_COX1_CTG3_1" "CHR_HSCHR19LRC_COX2_CTG3_1"
	[127] "CHR_HSCHR19LRC_LRC_I_CTG3_1" "CHR_HSCHR19LRC_LRC_J_CTG3_1" "CHR_HSCHR19LRC_LRC_S_CTG3_1"
	[130] "CHR_HSCHR19LRC_LRC_T_CTG3_1" "CHR_HSCHR19LRC_PGF1_CTG3_1" "CHR_HSCHR19LRC_PGF2_CTG3_1"
	[133] "CHR_HSCHR19_1_CTG2" "CHR_HSCHR19_1_CTG3_1" "CHR_HSCHR19_2_CTG2"
	[136] "CHR_HSCHR19_2_CTG3_1" "CHR_HSCHR19_3_CTG2" "CHR_HSCHR19_3_CTG3_1"
	[139] "CHR_HSCHR19_4_CTG2" "CHR_HSCHR19_4_CTG3_1" "CHR_HSCHR19_5_CTG2"
	[142] "CHR_HSCHR1_1_CTG11" "CHR_HSCHR1_1_CTG3" "CHR_HSCHR1_1_CTG31"
	[145] "CHR_HSCHR1_1_CTG32_1" "CHR_HSCHR1_2_CTG3" "CHR_HSCHR1_2_CTG31"
	[148] "CHR_HSCHR1_2_CTG32_1" "CHR_HSCHR1_3_CTG31" "CHR_HSCHR1_3_CTG32_1"
	[151] "CHR_HSCHR1_4_CTG31" "CHR_HSCHR1_ALT2_1_CTG32_1" "CHR_HSCHR20_1_CTG1"
	[154] "CHR_HSCHR20_1_CTG2" "CHR_HSCHR20_1_CTG3" "CHR_HSCHR20_1_CTG4"
	[157] "CHR_HSCHR21_2_CTG1_1" "CHR_HSCHR21_3_CTG1_1" "CHR_HSCHR21_4_CTG1_1"
	[160] "CHR_HSCHR21_5_CTG2" "CHR_HSCHR21_6_CTG1_1" "CHR_HSCHR21_8_CTG1_1"
	[163] "CHR_HSCHR22_1_CTG1" "CHR_HSCHR22_1_CTG2" "CHR_HSCHR22_1_CTG3"
	[166] "CHR_HSCHR22_1_CTG4" "CHR_HSCHR22_1_CTG5" "CHR_HSCHR22_1_CTG6"
	[169] "CHR_HSCHR22_1_CTG7" "CHR_HSCHR22_2_CTG1" "CHR_HSCHR22_3_CTG1"
	[172] "CHR_HSCHR22_4_CTG1" "CHR_HSCHR22_5_CTG1" "CHR_HSCHR2_1_CTG1"
	[175] "CHR_HSCHR2_1_CTG15" "CHR_HSCHR2_1_CTG5" "CHR_HSCHR2_1_CTG7"
	[178] "CHR_HSCHR2_1_CTG7_2" "CHR_HSCHR2_2_CTG1" "CHR_HSCHR2_2_CTG15"
	[181] "CHR_HSCHR2_2_CTG7" "CHR_HSCHR2_2_CTG7_2" "CHR_HSCHR2_3_CTG1"
	[184] "CHR_HSCHR2_3_CTG15" "CHR_HSCHR2_3_CTG7_2" "CHR_HSCHR2_4_CTG1"
	[187] "CHR_HSCHR3_1_CTG1" "CHR_HSCHR3_1_CTG2_1" "CHR_HSCHR3_1_CTG3"
	[190] "CHR_HSCHR3_2_CTG2_1" "CHR_HSCHR3_2_CTG3" "CHR_HSCHR3_3_CTG1"
	[193] "CHR_HSCHR3_3_CTG3" "CHR_HSCHR3_4_CTG2_1" "CHR_HSCHR3_4_CTG3"
	[196] "CHR_HSCHR3_5_CTG2_1" "CHR_HSCHR3_5_CTG3" "CHR_HSCHR3_6_CTG3"
	[199] "CHR_HSCHR3_7_CTG3" "CHR_HSCHR3_8_CTG3" "CHR_HSCHR3_9_CTG3"
	[202] "CHR_HSCHR4_1_CTG12" "CHR_HSCHR4_1_CTG4" "CHR_HSCHR4_1_CTG6"
	[205] "CHR_HSCHR4_1_CTG9" "CHR_HSCHR4_2_CTG12" "CHR_HSCHR4_3_CTG12"
	[208] "CHR_HSCHR4_4_CTG12" "CHR_HSCHR4_5_CTG12" "CHR_HSCHR4_6_CTG12"
	[211] "CHR_HSCHR4_7_CTG12" "CHR_HSCHR5_1_CTG1" "CHR_HSCHR5_1_CTG1_1"
	[214] "CHR_HSCHR5_1_CTG5" "CHR_HSCHR5_2_CTG1" "CHR_HSCHR5_2_CTG1_1"
	[217] "CHR_HSCHR5_2_CTG5" "CHR_HSCHR5_3_CTG1" "CHR_HSCHR5_3_CTG5"
	[220] "CHR_HSCHR5_4_CTG1" "CHR_HSCHR5_4_CTG1_1" "CHR_HSCHR5_5_CTG1"
	[223] "CHR_HSCHR5_6_CTG1" "CHR_HSCHR5_7_CTG1" "CHR_HSCHR6_1_CTG2"
	[226] "CHR_HSCHR6_1_CTG3" "CHR_HSCHR6_1_CTG4" "CHR_HSCHR6_1_CTG5"
	[229] "CHR_HSCHR6_1_CTG6" "CHR_HSCHR6_1_CTG7" "CHR_HSCHR6_1_CTG8"
	[232] "CHR_HSCHR6_1_CTG9" "CHR_HSCHR6_8_CTG1" "CHR_HSCHR6_MHC_APD_CTG1"
	[235] "CHR_HSCHR6_MHC_COX_CTG1" "CHR_HSCHR6_MHC_DBB_CTG1" "CHR_HSCHR6_MHC_MANN_CTG1"
	[238] "CHR_HSCHR6_MHC_MCF_CTG1" "CHR_HSCHR6_MHC_QBL_CTG1" "CHR_HSCHR6_MHC_SSTO_CTG1"
	[241] "CHR_HSCHR7_1_CTG1" "CHR_HSCHR7_1_CTG4_4" "CHR_HSCHR7_1_CTG6"
	[244] "CHR_HSCHR7_1_CTG7" "CHR_HSCHR7_2_CTG1" "CHR_HSCHR7_2_CTG4_4"
	[247] "CHR_HSCHR7_2_CTG6" "CHR_HSCHR7_2_CTG7" "CHR_HSCHR7_3_CTG6"
	[250] "CHR_HSCHR8_1_CTG1" "CHR_HSCHR8_1_CTG6" "CHR_HSCHR8_1_CTG7"
	[253] "CHR_HSCHR8_2_CTG1" "CHR_HSCHR8_2_CTG7" "CHR_HSCHR8_3_CTG1"
	[256] "CHR_HSCHR8_3_CTG7" "CHR_HSCHR8_4_CTG1" "CHR_HSCHR8_4_CTG7"
	[259] "CHR_HSCHR8_5_CTG1" "CHR_HSCHR8_5_CTG7" "CHR_HSCHR8_6_CTG1"
	[262] "CHR_HSCHR8_7_CTG1" "CHR_HSCHR8_8_CTG1" "CHR_HSCHR8_9_CTG1"
	[265] "CHR_HSCHR9_1_CTG1" "CHR_HSCHR9_1_CTG2" "CHR_HSCHR9_1_CTG3"
	[268] "CHR_HSCHR9_1_CTG4" "CHR_HSCHR9_1_CTG5" "CHR_HSCHRX_1_CTG3"
	[271] "CHR_HSCHRX_2_CTG12" "CHR_HSCHRX_2_CTG3" "1"
	[274] "2" "3" "4"
	[277] "5" "6" "7"
	[280] "8" "9" "10"
	[283] "11" "12" "13"
	[286] "14" "15" "16"
	[289] "17" "18" "19"
	[292] "20" "21" "22"
	[295] "X" "Y" "MT"
	[298] "GL000008.2" "GL000009.2" "GL000194.1"
	[301] "GL000195.1" "GL000205.2" "GL000213.1"
	[304] "GL000216.2" "GL000218.1" "GL000219.1"
	[307] "GL000220.1" "GL000224.1" "GL000225.1"
	[310] "KI270442.1" "KI270706.1" "KI270707.1"
	[313] "KI270708.1" "KI270711.1" "KI270713.1"
	[316] "KI270714.1" "KI270721.1" "KI270722.1"
	[319] "KI270723.1" "KI270724.1" "KI270726.1"
	[322] "KI270727.1" "KI270728.1" "KI270731.1"
	[325] "KI270733.1" "KI270734.1" "KI270741.1"
	[328] "KI270743.1" "KI270744.1" "KI270750.1"
	[331] "KI270752.1"
	>
	> ## Keep just chrs 1 to 22, X, Y, and MT
	> txdb <- keepSeqlevels(txdb, c(1:22, 'X', 'Y', 'MT'))
	> seqlevels(txdb)
	[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "X" "Y" "MT"
	>
	> ## For speed testing: just use a portion of the txdb data
	> test_txdb <- keepSeqlevels(txdb, c('Y', 'MT'))
	>
	> ## Attempt to create genomic state object
	> gs <- makeGenomicState(test_txdb, style = 'Ensembl', currentStyle = 'Ensembl')
	'select()' returned 1:1 mapping between keys and columns
	Error: logical subscript contains NAs
	> traceback()
	10: stop(wmsg(...), call. = FALSE)
	9: .subscript_error("logical subscript contains NAs")
	8: NSBS(i, x, exact = exact, strict.upper.bound = !allow.append,
	allow.NAs = allow.NAs)
	7: NSBS(i, x, exact = exact, strict.upper.bound = !allow.append,
	allow.NAs = allow.NAs)
	6: normalizeSingleBracketSubscript(i, x)
	5: extractROWS(x, i)
	4: extractROWS(x, i)
	3: x[width(x) != seqlengths(fullInterGR)[i]]
	2: x[width(x) != seqlengths(fullInterGR)[i]]
	1: makeGenomicState(test_txdb, style = "Ensembl", currentStyle = "Ensembl")
	>
	> ## Issue is with the seqlengths being missing
	> seqlengths(test_txdb)
	Y MT
	NA NA
	>
	> ## Adding the sequence lengths doesn't work
	> hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38")
	> new_info <- hg38_chrominfo$UCSC_seqlength[match(seqlevels(test_txdb),
	+ mapSeqlevels(hg38_chrominfo$UCSC_seqlevel, 'Ensembl'))]
	> names(new_info) <- names(seqlengths(test_txdb))
	> seqlengths(test_txdb) <- new_info
	Error in (function (classes, fdef, mtable) :
	unable to find an inherited method for function ‘seqinfo<-’ for signature ‘"TxDb"’
	>
	> ## Using the approach described in https://support.bioconductor.org/p/67680/#67766
	> ## That is, loading the GFF as a GRanges object and setting the lengths there
	> ## before using makeTxDbFromGRanges()
	> library('rtracklayer')
	> gr <- import('ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz')
	trying URL 'ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz'
	ftp data connection made, file length 49757610 bytes
	==================================================
	downloaded 47.5 MB

	>
	> ## Make a small GRanges object for testing
	> gr_small <- keepSeqlevels(gr, c('Y', 'MT'), pruning.mode = 'tidy')
	>
	> ## Set the chromosome lengths, genome
	> seqlengths(gr_small) <- new_info
	> genome(gr_small) <- 'hg38'
	> txdb2 <- makeTxDbFromGRanges(gr_small)
	Warning messages:
	1: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular
	2: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end
	3: Named parameters not used in query: internal_id, name, chrom, strand, start, end
	4: Named parameters not used in query: internal_id, name, chrom, strand, start, end
	5: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id
	6: Named parameters not used in query: gene_id, internal_tx_id
	>
	> ## Now create the genomic state object
	> gs <- makeGenomicState(txdb2, style = 'Ensembl', currentStyle = 'Ensembl')
	'select()' returned 1:1 mapping between keys and columns
	> gs
	GRangesList object of length 2:
	$fullGenome
	GRanges object with 4180 ranges and 4 metadata columns:
	seqnames ranges strand \| theRegion tx_id tx_name gene
	<Rle> <IRanges> <Rle> \| <character> <IntegerList> <CharacterList> <IntegerList>
	1 Y [2784749, 2784853] + \| exon 1 ENST00000516032 431
	2 Y [2789827, 2790328] + \| exon 2 ENST00000454281 377
	3 Y [2827982, 2828218] + \| exon 3 ENST00000430735 267
	4 Y [2841486, 2841627] + \| exon 4 ENST00000250784 10
	5 Y [2841628, 2841919] + \| intron 4,5 ENST00000250784,ENST00000430575 10
	... ... ... ... . ... ... ... ...
	4176 Y [26454093, 26508212] * \| intergenic
	4177 Y [26579691, 26586641] * \| intergenic
	4178 Y [26591602, 26594850] * \| intergenic
	4179 Y [26634653, 56855243] * \| intergenic
	4180 Y [56855489, 57227415] * \| intergenic

	...
	<1 more element>
	-------
	seqinfo: 1 sequence from hg38 genome
	>
	> ## Reproducibility information
	> print('Reproducibility information:')
	[1] "Reproducibility information:"
	> Sys.time()
	[1] "2017-03-01 12:12:14 EST"
	> options(width = 120)
	> session_info()
	Session info -----------------------------------------------------------------------------------------------------------
	setting value
	version R Under development (unstable) (2016-10-26 r71594)
	system x86_64, darwin13.4.0
	ui AQUA
	language (EN)
	collate en_US.UTF-8
	tz America/New_York
	date 2017-03-01

	Packages ---------------------------------------------------------------------------------------------------------------
	package * version date source
	acepack 1.4.1 2016-10-29 CRAN (R 3.4.0)
	AnnotationDbi * 1.37.3 2017-02-09 Bioconductor
	assertthat 0.1 2013-12-06 CRAN (R 3.4.0)
	backports 1.0.5 2017-01-18 CRAN (R 3.4.0)
	base64enc 0.1-3 2015-07-28 CRAN (R 3.4.0)
	Biobase * 2.35.1 2017-02-23 Bioconductor
	BiocGenerics * 0.21.3 2017-01-12 Bioconductor
	BiocParallel 1.9.5 2017-01-24 Bioconductor
	biomaRt 2.31.4 2017-01-13 Bioconductor
	Biostrings 2.43.4 2017-02-02 Bioconductor
	bitops 1.0-6 2013-08-17 CRAN (R 3.4.0)
	BSgenome 1.43.5 2017-02-02 Bioconductor
	bumphunter 1.15.0 2016-10-23 Bioconductor
	checkmate 1.8.2 2016-11-02 CRAN (R 3.4.0)
	cluster 2.0.5 2016-10-08 CRAN (R 3.4.0)
	codetools 0.2-15 2016-10-05 CRAN (R 3.4.0)
	colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
	data.table 1.10.4 2017-02-01 CRAN (R 3.4.0)
	DBI 0.5-1 2016-09-10 CRAN (R 3.4.0)
	DelayedArray 0.1.7 2017-02-17 Bioconductor
	derfinder * 1.9.6 2017-01-13 Bioconductor
	derfinderHelper 1.9.3 2016-11-29 Bioconductor
	devtools * 1.12.0 2016-12-05 CRAN (R 3.4.0)
	digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
	doRNG 1.6 2014-03-07 CRAN (R 3.4.0)
	foreach 1.4.3 2015-10-13 CRAN (R 3.4.0)
	foreign 0.8-67 2016-09-13 CRAN (R 3.4.0)
	Formula 1.2-1 2015-04-07 CRAN (R 3.4.0)
	GenomeInfoDb * 1.11.9 2017-02-08 Bioconductor
	GenomeInfoDbData 0.99.0 2017-02-14 Bioconductor
	GenomicAlignments 1.11.9 2017-02-02 Bioconductor
	GenomicFeatures * 1.27.8 2017-02-11 Bioconductor
	GenomicFiles 1.11.3 2016-11-29 Bioconductor
	GenomicRanges * 1.27.22 2017-02-02 Bioconductor
	ggplot2 2.2.1 2016-12-30 CRAN (R 3.4.0)
	gridExtra 2.2.1 2016-02-29 CRAN (R 3.4.0)
	gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
	Hmisc 4.0-2 2016-12-31 CRAN (R 3.4.0)
	htmlTable 1.9 2017-01-26 CRAN (R 3.4.0)
	htmltools 0.3.5 2016-03-21 CRAN (R 3.4.0)
	htmlwidgets 0.8 2016-11-09 CRAN (R 3.4.0)
	IRanges * 2.9.18 2017-02-02 Bioconductor
	iterators 1.0.8 2015-10-13 CRAN (R 3.4.0)
	knitr 1.15.1 2016-11-22 CRAN (R 3.4.0)
	lattice 0.20-34 2016-09-06 CRAN (R 3.4.0)
	latticeExtra 0.6-28 2016-02-09 CRAN (R 3.4.0)
	lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.0)
	locfit 1.5-9.1 2013-04-20 CRAN (R 3.4.0)
	magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
	Matrix 1.2-8 2017-01-20 CRAN (R 3.4.0)
	matrixStats 0.51.0 2016-10-09 CRAN (R 3.4.0)
	memoise 1.0.0 2016-01-29 CRAN (R 3.4.0)
	munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
	nnet 7.3-12 2016-02-02 CRAN (R 3.4.0)
	pkgmaker 0.22 2014-05-14 CRAN (R 3.4.0)
	plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
	qvalue 2.7.0 2016-10-23 Bioconductor
	RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.0)
	Rcpp 0.12.9 2017-01-14 CRAN (R 3.4.0)
	RCurl 1.95-4.8 2016-03-01 CRAN (R 3.4.0)
	registry 0.3 2015-07-08 CRAN (R 3.4.0)
	reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0)
	rngtools 1.2.4 2014-03-06 CRAN (R 3.4.0)
	rpart 4.1-10 2015-06-29 CRAN (R 3.4.0)
	Rsamtools 1.27.12 2017-01-24 Bioconductor
	RSQLite 1.1-2 2017-01-08 CRAN (R 3.4.0)
	rtracklayer * 1.35.6 2017-02-19 cran (@1.35.6)
	S4Vectors * 0.13.15 2017-02-14 cran (@0.13.15)
	scales 0.4.1 2016-11-09 CRAN (R 3.4.0)
	stringi 1.1.2 2016-10-01 CRAN (R 3.4.0)
	stringr 1.2.0 2017-02-18 CRAN (R 3.4.0)
	SummarizedExperiment 1.5.7 2017-02-23 Bioconductor
	survival 2.40-1 2016-10-30 CRAN (R 3.4.0)
	tibble 1.2 2016-08-26 CRAN (R 3.4.0)
	VariantAnnotation 1.21.17 2017-02-12 Bioconductor
	withr 1.0.2 2016-06-20 CRAN (R 3.4.0)
	XML 3.98-1.5 2016-11-10 CRAN (R 3.4.0)
	xtable 1.8-2 2016-02-05 CRAN (R 3.4.0)
	XVector 0.15.2 2017-02-02 Bioconductor
	zlibbioc 1.21.0 2016-10-23 Bioconductor

view raw make_genomicState_from_ensembl.R hosted with ❤ by GitHub

In derfinder 1.8.1 (release) and 1.9.7 (devel), the user will now get an error message if the chromosome lengths are missing. The help for the 'txdb' argument also points users to this thread. The two new versions will take 1-2 days to be available via biocLite(), but if you fix your txdb object you can continue right away since the new derfinder versions just show a more informative error and have more info on the docs.

Best,

Leonardo

ADD COMMENT • link 8.1 years ago Leonardo Collado Torres ★ 1.1k

Entering edit mode

Thanks for all your help Leo.

library(GenomicFeatures)
library(derfinder)
library(devtools)
library(rtracklayer)
gr <- import('ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz')
gr_small <- keepSeqlevels(gr, c('Y', 'MT'), pruning.mode = 'tidy')

Error in keepSeqlevels(gr, c("Y", "MT"), pruning.mode = "tidy") : 
  unused argument (pruning.mode = "tidy")

We don't have the devel version of bioC on our cluster(we're at 3.4), is the pruning mode available without using 3.5?

gr_small <- keepSeqlevels(gr, c('Y', 'MT'))
genome(gr_small) <- 'hg38'
txdb2 <- makeTxDbFromGRanges(gr_small)
gs <- makeGenomicState(txdb2, style = 'Ensembl', currentStyle = 'Ensembl')

> gs <- makeGenomicState(txdb2, style = 'Ensembl', currentStyle = 'Ensembl')
'select()' returned 1:1 mapping between keys and columns
Error: logical subscript contains NAs

I'm going to guess that not using the pruning.mode is why I'm getting an error here?

ADD REPLY • link 8.1 years ago andrewelamb • 0

Entering edit mode

Hi Andrew,

"pruning.mode" is a recent argument and has to be specified in Bioc-devel, but I meant to say previously that everything would work in Bioc-release if you dropped that part. You were close in your last attempt, but you forgot the crucial step of adding the chromosome lengths. I assume that you were using derfinder 1.8.0 because 1.8.1 would have printed an error related to the missing chromosome lengths. Like I said before, derfinder 1.8.1 will be available via biocLite() in like a day or two. In any case, even with derfinder 1.8.0, you can get everything to run if you specify the chromosome lengths.

Here's the code and output using Bioc-release (3.4).

	## This gist is related to https://support.bioconductor.org/p/93235/ and
	## https://gist.github.com/lcolladotor/f6b5285a5cd196fb28835789d98dded8
	> library('GenomicFeatures')
	Loading required package: BiocGenerics
	Loading required package: parallel

	Attaching package: ‘BiocGenerics’

	The following objects are masked from ‘package:parallel’:

	clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
	clusterExport, clusterMap, parApply, parCapply, parLapply,
	parLapplyLB, parRapply, parSapply, parSapplyLB

	The following objects are masked from ‘package:stats’:

	IQR, mad, xtabs

	The following objects are masked from ‘package:base’:

	anyDuplicated, append, as.data.frame, cbind, colnames, do.call,
	duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
	is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
	paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
	Reduce, rownames, sapply, setdiff, sort, table, tapply, union,
	unique, unsplit, which, which.max, which.min

	Loading required package: S4Vectors
	Loading required package: stats4

	Attaching package: ‘S4Vectors’

	The following objects are masked from ‘package:base’:

	colMeans, colSums, expand.grid, rowMeans, rowSums

	Loading required package: IRanges
	Loading required package: GenomeInfoDb
	Loading required package: GenomicRanges
	Loading required package: AnnotationDbi
	Loading required package: Biobase
	Welcome to Bioconductor

	Vignettes contain introductory material; view with
	'browseVignettes()'. To cite Bioconductor, see
	'citation("Biobase")', and for packages 'citation("pkgname")'.

	> library('derfinder')
	> library('devtools')
	> library('rtracklayer')
	> gr <- import('ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz')
	trying URL 'ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz'
	ftp data connection made, file length 49757610 bytes
	==================================================
	downloaded 47.5 MB

	> gr_small <- keepSeqlevels(gr, c('Y', 'MT'))
	> hg38_chrominfo <- fetchExtendedChromInfoFromUCSC("hg38")
	> new_info <- hg38_chrominfo$UCSC_seqlength[match(seqlevels(gr_small),
	+ mapSeqlevels(hg38_chrominfo$UCSC_seqlevel, 'Ensembl'))]
	>
	> ## Set the chromosome lengths, genome
	> seqlengths(gr_small) <- new_info
	> genome(gr_small) <- 'hg38'
	> txdb2 <- makeTxDbFromGRanges(gr_small)
	Warning messages:
	1: RSQLite::dbGetPreparedQuery() is deprecated, please switch to DBI::dbGetQuery(params = bind.data).
	2: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular
	3: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end
	4: Named parameters not used in query: internal_id, name, chrom, strand, start, end
	5: Named parameters not used in query: internal_id, name, chrom, strand, start, end
	6: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id
	7: Named parameters not used in query: gene_id, internal_tx_id
	> ## Now create the genomic state object
	> gs <- makeGenomicState(txdb2, style = 'Ensembl', currentStyle = 'Ensembl')
	'select()' returned 1:1 mapping between keys and columns
	> gs
	GRangesList object of length 2:
	$fullGenome
	GRanges object with 4180 ranges and 4 metadata columns:
	seqnames ranges strand \| theRegion tx_id
	<Rle> <IRanges> <Rle> \| <character> <IntegerList>
	1 Y [2784749, 2784853] + \| exon 1
	2 Y [2789827, 2790328] + \| exon 2
	3 Y [2827982, 2828218] + \| exon 3
	4 Y [2841486, 2841627] + \| exon 4
	5 Y [2841628, 2841919] + \| intron 4,5
	... ... ... ... . ... ...
	4176 Y [26454093, 26508212] * \| intergenic
	4177 Y [26579691, 26586641] * \| intergenic
	4178 Y [26591602, 26594850] * \| intergenic
	4179 Y [26634653, 56855243] * \| intergenic
	4180 Y [56855489, 57227415] * \| intergenic
	tx_name gene
	<CharacterList> <IntegerList>
	1 ENST00000516032 431
	2 ENST00000454281 377
	3 ENST00000430735 267
	4 ENST00000250784 10
	5 ENST00000250784,ENST00000430575 10
	... ... ...
	4176
	4177
	4178
	4179
	4180

	...
	<1 more element>
	-------
	seqinfo: 1 sequence from hg38 genome
	>
	> ## Reproducibility information
	> Sys.time()
	[1] "2017-03-02 10:46:42 EST"
	> options(width = 120)
	> session_info()
	Session info -----------------------------------------------------------------------------------------------------------
	setting value
	version R version 3.3.2 RC (2016-10-26 r71594)
	system x86_64, darwin13.4.0
	ui X11
	language (EN)
	collate en_US.UTF-8
	tz America/New_York
	date 2017-03-02

	Packages ---------------------------------------------------------------------------------------------------------------
	package * version date source
	acepack 1.4.1 2016-10-29 CRAN (R 3.3.0)
	AnnotationDbi * 1.36.2 2017-01-30 Bioconductor
	assertthat 0.1 2013-12-06 cran (@0.1)
	backports 1.0.5 2017-01-18 CRAN (R 3.3.2)
	base64enc 0.1-3 2015-07-28 cran (@0.1-3)
	Biobase * 2.34.0 2016-10-18 Bioconductor
	BiocGenerics * 0.20.0 2016-10-18 Bioconductor
	BiocParallel 1.8.1 2016-10-30 Bioconductor
	biomaRt 2.30.0 2016-10-18 Bioconductor
	Biostrings 2.42.1 2016-12-01 Bioconductor
	bitops 1.0-6 2013-08-17 cran (@1.0-6)
	BSgenome 1.42.0 2016-10-18 Bioconductor
	bumphunter 1.14.0 2016-10-18 Bioconductor
	checkmate 1.8.2 2016-11-02 CRAN (R 3.3.1)
	cluster 2.0.5 2016-10-08 CRAN (R 3.3.2)
	codetools 0.2-15 2016-10-05 CRAN (R 3.3.2)
	colorout * 1.1-2 2016-10-19 Github (jalvesaq/colorout@6d84420)
	colorspace 1.3-2 2016-12-14 CRAN (R 3.3.2)
	data.table 1.10.4 2017-02-01 CRAN (R 3.3.2)
	DBI 0.5-1 2016-09-10 cran (@0.5-1)
	derfinder * 1.8.1 2017-03-02 Github (Bioconductor-mirror/derfinder@2df57fd)
	derfinderHelper 1.8.0 2016-10-18 Bioconductor
	devtools * 1.12.0 2016-06-24 CRAN (R 3.3.0)
	digest 0.6.12 2017-01-27 CRAN (R 3.3.2)
	doRNG 1.6 2014-03-07 CRAN (R 3.3.0)
	foreach 1.4.3 2015-10-13 CRAN (R 3.3.0)
	foreign 0.8-67 2016-09-13 CRAN (R 3.3.2)
	Formula 1.2-1 2015-04-07 CRAN (R 3.3.0)
	GenomeInfoDb * 1.10.3 2017-02-07 Bioconductor
	GenomicAlignments 1.10.0 2016-10-18 Bioconductor
	GenomicFeatures * 1.26.3 2017-02-22 Bioconductor
	GenomicFiles 1.10.3 2016-10-21 Bioconductor
	GenomicRanges * 1.26.3 2017-02-25 Bioconductor
	ggplot2 2.2.1 2016-12-30 CRAN (R 3.3.2)
	gridExtra 2.2.1 2016-02-29 CRAN (R 3.3.0)
	gtable 0.2.0 2016-02-26 CRAN (R 3.3.0)
	Hmisc 4.0-2 2016-12-31 CRAN (R 3.3.2)
	htmlTable 1.9 2017-01-26 CRAN (R 3.3.2)
	htmltools 0.3.5 2016-03-21 cran (@0.3.5)
	htmlwidgets 0.8 2016-11-09 CRAN (R 3.3.2)
	IRanges * 2.8.1 2016-11-08 Bioconductor
	iterators 1.0.8 2015-10-13 CRAN (R 3.3.0)
	knitr 1.15.1 2016-11-22 CRAN (R 3.3.2)
	lattice 0.20-34 2016-09-06 CRAN (R 3.3.2)
	latticeExtra 0.6-28 2016-02-09 CRAN (R 3.3.0)
	lazyeval 0.2.0 2016-06-12 cran (@0.2.0)
	locfit 1.5-9.1 2013-04-20 CRAN (R 3.3.0)
	magrittr 1.5 2014-11-22 cran (@1.5)
	Matrix 1.2-8 2017-01-20 CRAN (R 3.3.2)
	matrixStats 0.51.0 2016-10-09 CRAN (R 3.3.0)
	memoise 1.0.0 2016-01-29 CRAN (R 3.3.0)
	munsell 0.4.3 2016-02-13 cran (@0.4.3)
	nnet 7.3-12 2016-02-02 CRAN (R 3.3.2)
	pkgmaker 0.22 2014-05-14 CRAN (R 3.3.0)
	plyr 1.8.4 2016-06-08 cran (@1.8.4)
	qvalue 2.6.0 2016-10-18 Bioconductor
	RColorBrewer 1.1-2 2014-12-07 cran (@1.1-2)
	Rcpp 0.12.9 2017-01-14 CRAN (R 3.3.2)
	RCurl 1.95-4.8 2016-03-01 cran (@1.95-4.)
	registry 0.3 2015-07-08 CRAN (R 3.3.0)
	reshape2 1.4.2 2016-10-22 CRAN (R 3.3.0)
	rngtools 1.2.4 2014-03-06 CRAN (R 3.3.0)
	rpart 4.1-10 2015-06-29 CRAN (R 3.3.2)
	Rsamtools 1.26.1 2016-10-22 Bioconductor
	RSQLite 1.1-2 2017-01-08 CRAN (R 3.3.2)
	rtracklayer * 1.34.2 2017-03-01 Bioconductor
	S4Vectors * 0.12.1 2016-12-01 Bioconductor
	scales 0.4.1 2016-11-09 CRAN (R 3.3.2)
	stringi 1.1.2 2016-10-01 cran (@1.1.2)
	stringr 1.2.0 2017-02-18 CRAN (R 3.3.2)
	SummarizedExperiment 1.4.0 2016-10-18 Bioconductor
	survival 2.40-1 2016-10-30 CRAN (R 3.3.0)
	tibble 1.2 2016-08-26 cran (@1.2)
	VariantAnnotation 1.20.2 2016-12-01 Bioconductor
	withr 1.0.2 2016-06-20 CRAN (R 3.3.0)
	XML 3.98-1.5 2016-11-10 CRAN (R 3.3.2)
	xtable 1.8-2 2016-02-05 CRAN (R 3.3.0)
	XVector 0.14.0 2016-10-18 Bioconductor
	zlibbioc 1.20.0 2016-10-18 Bioconductor
	>

view raw make_genomicState_from_ensembl_bioc3.4.R hosted with ❤ by GitHub

Best,

Leo

ADD REPLY • link 8.1 years ago Leonardo Collado Torres ★ 1.1k

Entering edit mode

Oh I see, somehow I missed that line.

Thanks!

ADD REPLY • link 8.1 years ago andrewelamb • 0

Entering edit mode

No problem ^^

ADD REPLY • link 8.1 years ago Leonardo Collado Torres ★ 1.1k

Entering edit mode

Hi Leo, everything above worked, but I'm having another issue now. Using the script that wa sworking before but with the new GRCH38 txdb, the out put of matchGenes isn't quite right. The, columns, name, annotation and geneL, are all NA's. Any Idea what this might be?

annotation_obj <- annotateTranscripts(txdb)
nearest_annotation_df <- matchGenes(der_grange_obj, subject = annotation_obj)

write_tsv(nearest_annotation_df, str_c(output_dir, '/nearest_annotation_df.tsv'))

head -n 2 nearest_annotation_df.tsv
name annotation description region distance subregion insideDistance exonnumber nexons UTR strand geneL codingL Entrez subjectHits
NA NA inside intron inside 14741.0 inside intron 175.0 10.0 11 inside transcription region - 15166.0 NA ENSG00000227232 13715

ADD REPLY • link 8.1 years ago andrewelamb • 0

Entering edit mode

Hi Andrew,

You should post this as a new question instead of a comment.

I'm not sure, but this would be a bumphunter issue, not a derfinder one. Most likely those functions assume that the TxDb object has a specific structure. For example, that the gene names are Entrez IDs instead of Gencode ids.

Best,

Leo

ADD REPLY • link 8.1 years ago Leonardo Collado Torres ★ 1.1k

Entering edit mode

Thanks! I'll see if the Bumphunter documentation has anything to say.

ADD REPLY • link 8.1 years ago andrewelamb • 0

Entering edit mode

Yeah, it's definitely a bumphunter issue as you can see at https://github.com/Bioconductor-mirror/bumphunter/blob/515c9c45a6a1b1fa03d1936fb946d8fbea184710/R/matchGenes.R#L182-L202. The current code assumes that you are working with Entrez gene ids. It should be possible to make the code more flexible to work with other types of gene ids.

ADD REPLY • link 8.1 years ago Leonardo Collado Torres ★ 1.1k

Entering edit mode

Check https://github.com/rafalab/bumphunter/issues/15

ADD REPLY • link 8.1 years ago Leonardo Collado Torres ★ 1.1k