possible conflict between renameSeqLevels (TxDb) and isActiveSeq (TxDb)
2
0
Entering edit mode
@robert-k-bradley-5997
Last seen 19 months ago
United States

I have encountered an unexpected interaction between renameSeqLevels and isActiveSeq when called on a TxDb. Specifically, when I use both renameSeqLevels (or seqlevels directly as a setter function) and isActiveSeq, I am then unable to access transcript information via select().

Here is code to reproduce the behavior:

  library (GenomicFeatures)

  refSeqDb = suppressWarnings (makeTranscriptDbFromUCSC ("hg19", tablename = "refGene"))
  isActiveSeq (refSeqDb)[seqlevels (refSeqDb)] = FALSE
  assembledchromosomes = grep ("^chr([0-9]+|[MXY])$", seqlevels (refSeqDb), value = TRUE)
  isActiveSeq (refSeqDb)[intersect (seqlevels (refSeqDb), assembledchromosomes)] = TRUE
  head (select (refSeqDb, keys = names (exonsBy (refSeqDb, by = "tx")), columns = "TXNAME", keytype = "TXID"))

This code block returns:

  TXID       TXNAME
1    1    NR_046018
2    2    NR_036051
3    3    NR_036266
4    4    NR_036267
5    5    NR_036268
6    6 NM_001005484

If I instead invoke renameSeqLevels after creating refSeqDb as follows:

  refSeqDb = suppressWarnings (makeTranscriptDbFromUCSC ("hg19", tablename = "refGene"))
  refSeqDb = renameSeqlevels (refSeqDb, gsub ("^chr", "", seqlevels (refSeqDb)))
  isActiveSeq (refSeqDb)[seqlevels (refSeqDb)] = FALSE
  assembledchromosomes = grep ("^([0-9]+|[MXY])$", seqlevels (refSeqDb), value = TRUE)
  isActiveSeq (refSeqDb)[intersect (seqlevels (refSeqDb), assembledchromosomes)] = TRUE
  head (select (refSeqDb, keys = names (exonsBy (refSeqDb, by = "tx")), columns = "TXNAME", keytype = "TXID"))

then I get:

[1] TXID   TXNAME
<0 rows> (or 0-length row.names)

I obtain the same result if I replace the call to renameSeqLevels with:

seqlevels (refSeqDb) = gsub ("^chr", "", seqlevels (refSeqDb))

My version information is:

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] C

attached base packages:
[1] stats4    parallel  grDevices utils     datasets  stats     graphics  methods   base     

other attached packages:
[1] GenomicFeatures_1.20.3 AnnotationDbi_1.30.1   Biobase_2.28.0         GenomicRanges_1.20.5   GenomeInfoDb_1.4.2     IRanges_2.2.7          S4Vectors_0.6.3       
[8] BiocGenerics_0.14.0   

loaded via a namespace (and not attached):
 [1] XVector_0.8.0           zlibbioc_1.14.0         GenomicAlignments_1.4.1 BiocParallel_1.2.20     tools_3.2.2             DBI_0.3.1               lambda.r_1.1.7         
 [8] futile.logger_1.4.1     rtracklayer_1.28.9      futile.options_1.0.0    bitops_1.0-6            RCurl_1.95-4.7          biomaRt_2.24.0          RSQLite_1.0.0          
[15] compiler_3.2.2          Biostrings_2.36.3       Rsamtools_1.20.4        XML_3.98-1.3           

 

bug txdb genomicfeatures • 1.8k views
ADD COMMENT
0
Entering edit mode
Sonali Arora ▴ 390
@sonali-arora-6563
Last seen 9.6 years ago
United States

Hi Robert, 

My Understanding is that you're trying to do the following

1) make a TranscriptDb from UCSC for the ref gene table
2) obtain exons for chr 1-25,X,Y,M 

Firstly, makeTranscriptDbFromUCSC is defunct - Please use makeTxDbFromUCSC instead.

> refSeqDb3 = suppressWarnings (makeTxDbFromUCSC ("hg19", tablename = "refGene"))
Download the refGene table ... OK
Download the refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK 

Secondly, please use 'GenomeInfoDb::keepStandardChromosomes()' to get only chr1-25,X,Y,M
like this -

> refSeqDb3 <- keepStandardChromosomes(refSeqDb3)
> seqlevels(refSeqDb3)
[1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"  "chr10"
[11] "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20"
[21] "chr21" "chr22" "chrX"  "chrY"  "chrM" 

Thirdly, to rename seqlevels from UCSC style to NCBI style - Please use 'seqlevelsStyle'

> seqlevelsStyle(refSeqDb3)   #check current style
[1] "UCSC"
> seqlevelsStyle(refSeqDb3) <- "NCBI"   #change style to NCBI
> seqlevelsStyle(refSeqDb3)   #check current style 
[1] "NCBI"


renameSeqlevels takes in 2 arguments where the second argument should be a named character vector.   From your code above, 

refSeqDb = renameSeqlevels (refSeqDb, gsub ("^chr", "", seqlevels (refSeqDb)))

the gsub does something unexpected. As you can see below 'chrUn_gl000211' becomes 'Un_gl000211'.

gsub ("^chr", "", seqlevels (refSeqDb2))
 [1] "1"                  "2"                  "3"                 
 [4] "4"                  "5"                  "6"                 
 [7] "7"                  "8"                  "9"                 
[10] "10"                 "11"                 "12"                
[13] "13"                 "14"                 "15"                
[16] "16"                 "17"                 "18"                
[19] "19"                 "20"                 "21"                
[22] "22"                 "X"                  "Y"                 
[25] "M"                  "1_gl000191_random"  "1_gl000192_random" 
[28] "4_ctg9_hap1"        "4_gl000193_random"  "4_gl000194_random" 
[31] "6_apd_hap1"         "6_cox_hap2"         "6_dbb_hap3"        
[34] "6_mann_hap4"        "6_mcf_hap5"         "6_qbl_hap6"        
[37] "6_ssto_hap7"        "7_gl000195_random"  "8_gl000196_random" 
[40] "8_gl000197_random"  "9_gl000198_random"  "9_gl000199_random" 
[43] "9_gl000200_random"  "9_gl000201_random"  "11_gl000202_random"
[46] "17_ctg5_hap1"       "17_gl000203_random" "17_gl000204_random"
[49] "17_gl000205_random" "17_gl000206_random" "18_gl000207_random"
[52] "19_gl000208_random" "19_gl000209_random" "21_gl000210_random"
[55] "Un_gl000211"        "Un_gl000212"        "Un_gl000213"       
[58] "Un_gl000214"        "Un_gl000215"        "Un_gl000216"       
[61] "Un_gl000217"        "Un_gl000218"        "Un_gl000219"       
[64] "Un_gl000220"        "Un_gl000221"        "Un_gl000222"       
[67] "Un_gl000223"        "Un_gl000224"        "Un_gl000225"       
[70] "Un_gl000226"        "Un_gl000227"        "Un_gl000228"       
[73] "Un_gl000229"        "Un_gl000230"        "Un_gl000231"       
[76] "Un_gl000232"        "Un_gl000233"        "Un_gl000234"       
[79] "Un_gl000235"        "Un_gl000236"        "Un_gl000237"       
[82] "Un_gl000238"        "Un_gl000239"        "Un_gl000240"       
[85] "Un_gl000241"        "Un_gl000242"        "Un_gl000243"       
[88] "Un_gl000244"        "Un_gl000245"        "Un_gl000246"       
[91] "Un_gl000247"        "Un_gl000248"        "Un_gl000249"

Thus, it is better to use keepStandardChromosomes and SeqlevelsStyle which do the right thing.


Lastly, with NCBI style, I get the correct result - 

> head (select (refSeqDb3, keys = names (exonsBy (refSeqDb3, by = "tx")), columns = "TXNAME", keytype = "TXID"))
'select()' returned 1:1 mapping between keys and columns
  TXID       TXNAME
1    1    NR_046018
2    2    NR_036051
3    3    NR_036266
4    4    NR_036267
5    5    NR_036268
6    6 NM_001005484

Even if I change style back to UCSC, I still get consistent results. 

> seqlevelsStyle(refSeqDb3) <- "UCSC"
> seqlevels(refSeqDb3)
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM" 
> head (select (refSeqDb3, keys = names (exonsBy (refSeqDb3, by = "tx")), columns = "TXNAME", keytype = "TXID"))
'select()' returned 1:1 mapping between keys and columns
  TXID       TXNAME
1    1    NR_046018
2    2    NR_036051
3    3    NR_036266
4    4    NR_036267
5    5    NR_036268
6    6 NM_001005484

Hope that helps!

Sonali. 

 

ADD COMMENT
0
Entering edit mode
@robert-k-bradley-5997
Last seen 19 months ago
United States

Hi Sonali,

Thank you for the great explanation. I wasn't aware of some of the functions that you mentioned and I've integrated them into my workflow. They work well.

I do have one follow-up question. Invoking

> seqlevelsStyle(refSeqDb3) <- "NCBI"   #change style to NCBI

removes the prefix "chr" and renames the mitochondrial chromosome. However, it does not appear to rename the unassembled contigs, e.g., Un_gl000247 -> GL000247.1 or the like. Do you have plans to integrate such functionality? (I generally don't use the unassembled contigs for completed assemblies, but need to for work with, e.g., genomes like zebrafish.)

Best,

Rob

ADD COMMENT
0
Entering edit mode

Hi Robert,

One limitation of seqlevelsStyle() at the moment is that it knows how to rename the chromosomes only, but not the scaffolds. This is something that we plan to work on in the future but that won't happen before the next release (scheduled in 1 month).

BTW, what you originally reported in this thread is buggy behavior of select() after renaming of the chromosomes. There are at least 2 problems, which can simply be reproduced with:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
select(txdb, keys="uc001aaa.3", columns=c("TXNAME", "TXCHROM"), keytype="TXNAME")
#       TXNAME TXCHROM
# 1 uc001aaa.3    chr1

First problem. After renaming of the chromosomes, select() still returns the original names.

seqlevelsStyle(txdb) <- "NCBI"
head(seqlevels(txdb))
# [1] "1" "2" "3" "4" "5" "6"
select(txdb, keys="uc001aaa.3", columns=c("TXNAME", "TXCHROM"), keytype="TXNAME")
#       TXNAME TXCHROM
# 1 uc001aaa.3    chr1

Second problem. After renaming of the chromosomes, select() breaks if we specify a set of active sequences.

isActiveSeq(txdb) <- seqlevels(txdb) == "1"
select(txdb, keys="uc001aaa.3", columns=c("TXNAME", "TXCHROM"), keytype="TXNAME")
# [1] TXNAME  TXCHROM
# <0 rows> (or 0-length row.names)

So we need to fix that.

Cheers,

H.

ADD REPLY

Login before adding your answer.

Traffic: 1122 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6