Search
Question: possible conflict between renameSeqLevels (TxDb) and isActiveSeq (TxDb)
0
gravatar for Robert K. Bradley
2.2 years ago by
United States
Robert K. Bradley40 wrote:

I have encountered an unexpected interaction between renameSeqLevels and isActiveSeq when called on a TxDb. Specifically, when I use both renameSeqLevels (or seqlevels directly as a setter function) and isActiveSeq, I am then unable to access transcript information via select().

Here is code to reproduce the behavior:

  library (GenomicFeatures)

  refSeqDb = suppressWarnings (makeTranscriptDbFromUCSC ("hg19", tablename = "refGene"))
  isActiveSeq (refSeqDb)[seqlevels (refSeqDb)] = FALSE
  assembledchromosomes = grep ("^chr([0-9]+|[MXY])$", seqlevels (refSeqDb), value = TRUE)
  isActiveSeq (refSeqDb)[intersect (seqlevels (refSeqDb), assembledchromosomes)] = TRUE
  head (select (refSeqDb, keys = names (exonsBy (refSeqDb, by = "tx")), columns = "TXNAME", keytype = "TXID"))

This code block returns:

  TXID       TXNAME
1    1    NR_046018
2    2    NR_036051
3    3    NR_036266
4    4    NR_036267
5    5    NR_036268
6    6 NM_001005484

If I instead invoke renameSeqLevels after creating refSeqDb as follows:

  refSeqDb = suppressWarnings (makeTranscriptDbFromUCSC ("hg19", tablename = "refGene"))
  refSeqDb = renameSeqlevels (refSeqDb, gsub ("^chr", "", seqlevels (refSeqDb)))
  isActiveSeq (refSeqDb)[seqlevels (refSeqDb)] = FALSE
  assembledchromosomes = grep ("^([0-9]+|[MXY])$", seqlevels (refSeqDb), value = TRUE)
  isActiveSeq (refSeqDb)[intersect (seqlevels (refSeqDb), assembledchromosomes)] = TRUE
  head (select (refSeqDb, keys = names (exonsBy (refSeqDb, by = "tx")), columns = "TXNAME", keytype = "TXID"))

then I get:

[1] TXID   TXNAME
<0 rows> (or 0-length row.names)

I obtain the same result if I replace the call to renameSeqLevels with:

seqlevels (refSeqDb) = gsub ("^chr", "", seqlevels (refSeqDb))

My version information is:

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] C

attached base packages:
[1] stats4    parallel  grDevices utils     datasets  stats     graphics  methods   base     

other attached packages:
[1] GenomicFeatures_1.20.3 AnnotationDbi_1.30.1   Biobase_2.28.0         GenomicRanges_1.20.5   GenomeInfoDb_1.4.2     IRanges_2.2.7          S4Vectors_0.6.3       
[8] BiocGenerics_0.14.0   

loaded via a namespace (and not attached):
 [1] XVector_0.8.0           zlibbioc_1.14.0         GenomicAlignments_1.4.1 BiocParallel_1.2.20     tools_3.2.2             DBI_0.3.1               lambda.r_1.1.7         
 [8] futile.logger_1.4.1     rtracklayer_1.28.9      futile.options_1.0.0    bitops_1.0-6            RCurl_1.95-4.7          biomaRt_2.24.0          RSQLite_1.0.0          
[15] compiler_3.2.2          Biostrings_2.36.3       Rsamtools_1.20.4        XML_3.98-1.3           

 

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Robert K. Bradley40
0
gravatar for Sonali Arora
2.2 years ago by
Sonali Arora360
United States
Sonali Arora360 wrote:

Hi Robert, 

My Understanding is that you're trying to do the following

1) make a TranscriptDb from UCSC for the ref gene table
2) obtain exons for chr 1-25,X,Y,M 

Firstly, makeTranscriptDbFromUCSC is defunct - Please use makeTxDbFromUCSC instead.

> refSeqDb3 = suppressWarnings (makeTxDbFromUCSC ("hg19", tablename = "refGene"))
Download the refGene table ... OK
Download the refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK 

Secondly, please use 'GenomeInfoDb::keepStandardChromosomes()' to get only chr1-25,X,Y,M
like this -

> refSeqDb3 <- keepStandardChromosomes(refSeqDb3)
> seqlevels(refSeqDb3)
[1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9"  "chr10"
[11] "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" "chr19" "chr20"
[21] "chr21" "chr22" "chrX"  "chrY"  "chrM" 

Thirdly, to rename seqlevels from UCSC style to NCBI style - Please use 'seqlevelsStyle'

> seqlevelsStyle(refSeqDb3)   #check current style
[1] "UCSC"
> seqlevelsStyle(refSeqDb3) <- "NCBI"   #change style to NCBI
> seqlevelsStyle(refSeqDb3)   #check current style 
[1] "NCBI"


renameSeqlevels takes in 2 arguments where the second argument should be a named character vector.   From your code above, 

refSeqDb = renameSeqlevels (refSeqDb, gsub ("^chr", "", seqlevels (refSeqDb)))

the gsub does something unexpected. As you can see below 'chrUn_gl000211' becomes 'Un_gl000211'.

gsub ("^chr", "", seqlevels (refSeqDb2))
 [1] "1"                  "2"                  "3"                 
 [4] "4"                  "5"                  "6"                 
 [7] "7"                  "8"                  "9"                 
[10] "10"                 "11"                 "12"                
[13] "13"                 "14"                 "15"                
[16] "16"                 "17"                 "18"                
[19] "19"                 "20"                 "21"                
[22] "22"                 "X"                  "Y"                 
[25] "M"                  "1_gl000191_random"  "1_gl000192_random" 
[28] "4_ctg9_hap1"        "4_gl000193_random"  "4_gl000194_random" 
[31] "6_apd_hap1"         "6_cox_hap2"         "6_dbb_hap3"        
[34] "6_mann_hap4"        "6_mcf_hap5"         "6_qbl_hap6"        
[37] "6_ssto_hap7"        "7_gl000195_random"  "8_gl000196_random" 
[40] "8_gl000197_random"  "9_gl000198_random"  "9_gl000199_random" 
[43] "9_gl000200_random"  "9_gl000201_random"  "11_gl000202_random"
[46] "17_ctg5_hap1"       "17_gl000203_random" "17_gl000204_random"
[49] "17_gl000205_random" "17_gl000206_random" "18_gl000207_random"
[52] "19_gl000208_random" "19_gl000209_random" "21_gl000210_random"
[55] "Un_gl000211"        "Un_gl000212"        "Un_gl000213"       
[58] "Un_gl000214"        "Un_gl000215"        "Un_gl000216"       
[61] "Un_gl000217"        "Un_gl000218"        "Un_gl000219"       
[64] "Un_gl000220"        "Un_gl000221"        "Un_gl000222"       
[67] "Un_gl000223"        "Un_gl000224"        "Un_gl000225"       
[70] "Un_gl000226"        "Un_gl000227"        "Un_gl000228"       
[73] "Un_gl000229"        "Un_gl000230"        "Un_gl000231"       
[76] "Un_gl000232"        "Un_gl000233"        "Un_gl000234"       
[79] "Un_gl000235"        "Un_gl000236"        "Un_gl000237"       
[82] "Un_gl000238"        "Un_gl000239"        "Un_gl000240"       
[85] "Un_gl000241"        "Un_gl000242"        "Un_gl000243"       
[88] "Un_gl000244"        "Un_gl000245"        "Un_gl000246"       
[91] "Un_gl000247"        "Un_gl000248"        "Un_gl000249"

Thus, it is better to use keepStandardChromosomes and SeqlevelsStyle which do the right thing.


Lastly, with NCBI style, I get the correct result - 

> head (select (refSeqDb3, keys = names (exonsBy (refSeqDb3, by = "tx")), columns = "TXNAME", keytype = "TXID"))
'select()' returned 1:1 mapping between keys and columns
  TXID       TXNAME
1    1    NR_046018
2    2    NR_036051
3    3    NR_036266
4    4    NR_036267
5    5    NR_036268
6    6 NM_001005484

Even if I change style back to UCSC, I still get consistent results. 

> seqlevelsStyle(refSeqDb3) <- "UCSC"
> seqlevels(refSeqDb3)
 [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
[10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
[19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM" 
> head (select (refSeqDb3, keys = names (exonsBy (refSeqDb3, by = "tx")), columns = "TXNAME", keytype = "TXID"))
'select()' returned 1:1 mapping between keys and columns
  TXID       TXNAME
1    1    NR_046018
2    2    NR_036051
3    3    NR_036266
4    4    NR_036267
5    5    NR_036268
6    6 NM_001005484

Hope that helps!

Sonali. 

 

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Sonali Arora360
0
gravatar for Robert K. Bradley
2.2 years ago by
United States
Robert K. Bradley40 wrote:

Hi Sonali,

Thank you for the great explanation. I wasn't aware of some of the functions that you mentioned and I've integrated them into my workflow. They work well.

I do have one follow-up question. Invoking

> seqlevelsStyle(refSeqDb3) <- "NCBI"   #change style to NCBI

removes the prefix "chr" and renames the mitochondrial chromosome. However, it does not appear to rename the unassembled contigs, e.g., Un_gl000247 -> GL000247.1 or the like. Do you have plans to integrate such functionality? (I generally don't use the unassembled contigs for completed assemblies, but need to for work with, e.g., genomes like zebrafish.)

Best,

Rob

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Robert K. Bradley40

Hi Robert,

One limitation of seqlevelsStyle() at the moment is that it knows how to rename the chromosomes only, but not the scaffolds. This is something that we plan to work on in the future but that won't happen before the next release (scheduled in 1 month).

BTW, what you originally reported in this thread is buggy behavior of select() after renaming of the chromosomes. There are at least 2 problems, which can simply be reproduced with:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
select(txdb, keys="uc001aaa.3", columns=c("TXNAME", "TXCHROM"), keytype="TXNAME")
#       TXNAME TXCHROM
# 1 uc001aaa.3    chr1

First problem. After renaming of the chromosomes, select() still returns the original names.

seqlevelsStyle(txdb) <- "NCBI"
head(seqlevels(txdb))
# [1] "1" "2" "3" "4" "5" "6"
select(txdb, keys="uc001aaa.3", columns=c("TXNAME", "TXCHROM"), keytype="TXNAME")
#       TXNAME TXCHROM
# 1 uc001aaa.3    chr1

Second problem. After renaming of the chromosomes, select() breaks if we specify a set of active sequences.

isActiveSeq(txdb) <- seqlevels(txdb) == "1"
select(txdb, keys="uc001aaa.3", columns=c("TXNAME", "TXCHROM"), keytype="TXNAME")
# [1] TXNAME  TXCHROM
# <0 rows> (or 0-length row.names)

So we need to fix that.

Cheers,

H.

ADD REPLYlink written 2.2 years ago by Hervé Pagès ♦♦ 13k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 213 users visited in the last hour