problem with characters in chromosome names in DiffBind dba
1
0
Entering edit mode
Georg Otto ▴ 120
@georg-otto-6333
Last seen 5.4 years ago
United Kingdom

Dear all,

I have bed files with ChIP-seq data, with sequence (chromosome) names like this:

chr1, chr10, chr11, chr12, chr13, chr13_random, chr14, . . ., chrX

I  import the bed file data into a DiffBind DBA object using the command

> dbaObj <-  dba(sampleSheet = "samples.csv")

The problem is that the chromosome names in the DBA object apparently get stripped of their characters, so "chr1" becomes "1" and the chromosomes "X" and "Y" are missing altogether. More specifically, in

> dbaObj$chrmap
 [1] "chr1"         "chr10"        "chr11"        "chr12"        "chr13"       
 [6] "chr14"        "chr15"        "chr16"        "chr17"        "chr18"       
[11] "chr19"        "chr2"         "chr3"         "chr4"         "chr5"        
[16] "chr6"         "chr7"         "chr8"         "chr9"         "chrUn_random"
[21] "chrX"         "chrY"

the chromosome names are intact, while in

> head(dbaObj$allvectors)
  CHR   START     END     sample1     sample2     sample3     sample4
1   1 3435940 3436366 -1.00000000  0.01958029  0.04800027  0.07549102
2   1 3441807 3442009 -1.00000000 -1.00000000  0.01884463 -1.00000000
3   1 4408039 4408343 -1.00000000 -1.00000000  0.02658240 -1.00000000
4   1 4486341 4486691 -1.00000000  0.03415241 -1.00000000 -1.00000000
5   1 4561545 4562084  0.02217287  0.07207616 -1.00000000  0.02224536
6   1 4592472 4592914  0.03316482  0.06758506 -1.00000000  0.03507404

the chromosome names have changed, and chromosome X is missing.

Maybe I miss something here? I would be happy if someone could give me a hint how to fix this.
Best wishes,
Georg


> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] DiffBind_1.12.0         limma_3.22.1            ChIPpeakAnno_2.16.2    
 [4] AnnotationDbi_1.28.1    Biobase_2.26.0          RSQLite_1.0.0          
 [7] DBI_0.3.1               biomaRt_2.22.0          VennDiagram_1.6.9      
[10] chipseq_1.16.0          ShortRead_1.24.0        GenomicAlignments_1.2.1
[13] Rsamtools_1.18.2        BiocParallel_1.0.0      BSgenome_1.34.0        
[16] rtracklayer_1.26.2      Biostrings_2.34.0       XVector_0.6.0          
[19] GenomicRanges_1.18.3    GenomeInfoDb_1.2.3      IRanges_2.0.0          
[22] S4Vectors_0.4.0         BiocGenerics_0.12.1    

loaded via a namespace (and not attached):
 [1] amap_0.8-12            base64enc_0.1-2        BatchJobs_1.5         
 [4] BBmisc_1.8             bitops_1.0-6           brew_1.0-6            
 [7] caTools_1.17.1         checkmate_1.5.0        codetools_0.2-9       
[10] compiler_3.1.1         digest_0.6.4           edgeR_3.8.3           
[13] fail_1.2               foreach_1.4.2          gdata_2.13.3          
[16] GenomicFeatures_1.18.2 GO.db_3.0.0            gplots_2.14.2         
[19] gtools_3.4.1           hwriter_1.3.2          iterators_1.0.7       
[22] KernSmooth_2.23-13     lattice_0.20-29        latticeExtra_0.6-26   
[25] MASS_7.3-35            multtest_2.22.0        RColorBrewer_1.0-5    
[28] RCurl_1.95-4.3         sendmailR_1.2-1        splines_3.1.1         
[31] stringr_0.6.2          survival_2.37-7        tools_3.1.1           
[34] XML_3.98-1.1           zlibbioc_1.12.0       

 

DiffBind ChIP-Seq • 1.2k views
ADD COMMENT
2
Entering edit mode
Rory Stark ★ 5.1k
@rory-stark-5741
Last seen 8 weeks ago
Cambridge, UK

Hi Georg-

Directly accessing the internal fields, like dbaObj$allvectors, isn't really supported, but I can tell you what is going on and how to do what you want to do.

The officially supported way of getting the data in dbaObj$vectors is via the dba.peakset() function, with bRetrieve=TRUE. This will map the chromosome names back to what you expect to see.

The internal structures use factor numbers which make certain merging and sorting operations more efficient . As you have sees, the "real" chromosome labels are stored in the $chrmap field.You can map them as follows:

> chromosomes = dbaObj$chrmap[dbaObj$allvectors[,1]]

This is what dba.peakset() does internally when bRetrieve=TRUE.

Hope this helps-

Rory

ADD COMMENT

Login before adding your answer.

Traffic: 703 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6