Hi,
I am trying to use derfinder to analyze samples from Saccharomyces cerevisiae. Something about the chromosome names is causing me issues when I get to building a regionMatrix.
However, in trying to bypass that issue to see if I could even get some expressed regions data, I think I uncovered an issue people working with organisms not presently in the "GenomeInfoDbData" may encounter.
In trying to troubleshoot I saw that A: derfinder: fullCoverage() problems that Leonardo had: "Introduced function extendedMapSeqlevels() for using GenomeInfoDb when there is information regarding the species and naming style of interest. Otherwise sequence names are left unchanged." - Source.
Reading, "otherwise sequence names are left unchanged", I thought rather than take the time to figure out how to make my chromosome names be in the same order as GenomeInfoDb needs them, if I simply insure my source organism is not in the database, I'll be able to at leat build a regionMatrix because it won't touch my sequence names. So I thought I could just get `derfinder` to not try and touch my chromosome names by putting a species that it doesn't know and then have it build the regionMatrix.
The unevaluated code I used is:
## Load libraries
library('derfinder')
library('GenomicRanges')
file_designations <- c("WT1","WT2","WT3", "MT1", "MT2", "MT3")
chrs_all = c('I', 'II','III','IV', 'IX', 'V', 'VI', 'VII', 'VIII', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI')
## Load the data from disk
files <- file.path("/data/alignments", file_designations , paste0(file_designations,"Aligned.sorted.bam"))
system.time(fullCov <- fullCoverage(files = files, chrs = chrs_all,
totalMapped = sapply(files,getTotalMapped, chrs = chrs_all), targetSize = 8e+07, species = "ReallySaccharomyces_cerevisiaeBUTiWANTtoSEEifCANgetREGIONMATRIXtoWORK"))
system.time(regionMat <- regionMatrix(fullCov, cutoff = 30, L = 75))
However, after working for many of them on the tenth chromosome data it ends with this error below.
...left off many previous lines... 2017-05-03 17:22:52 filterData: originally there were 562643 rows, now there are 523276 rows. Meaning that 7 percent was filtered. 2017-05-03 17:22:52 findRegions: identifying potential segments 2017-05-03 17:22:52 findRegions: segmenting information 2017-05-03 17:22:52 findRegions: identifying candidate regions extendedMapSeqlevels: the current naming style could not be guessed. Consider supplying 'currentStyle'. 2017-05-03 17:22:52 findRegions: identifying region clusters extendedMapSeqlevels: the current naming style could not be guessed. Consider supplying 'currentStyle'. extendedMapSeqlevels: the current naming style could not be guessed. Consider supplying 'currentStyle'. 2017-05-03 17:22:53 getRegionCoverage: processing VIII 2017-05-03 17:22:53 getRegionCoverage: done processing VIII 2017-05-03 17:22:53 regionMatrix: calculating coverageMatrix 2017-05-03 17:22:53 regionMatrix: adjusting coverageMatrix for 'L' 2017-05-03 17:22:53 regionMatrix: processing X 2017-05-03 17:22:54 filterData: originally there were 745751 rows, now there are 696110 rows. Meaning that 6.66 percent was filtered. 2017-05-03 17:22:54 findRegions: identifying potential segments 2017-05-03 17:22:54 findRegions: segmenting information 2017-05-03 17:22:54 findRegions: identifying candidate regions extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens 2017-05-03 17:22:54 findRegions: identifying region clusters Error in .normargSeqlengths(value, seqnames(x)) : when the supplied 'seqlengths' vector is named, the names must match the seqnames Timing stopped at: 18.168 0.468 18.64
I'd guess it might still crash if it is given an unknown species with a tenth chromosome represented 'X' chromsome because it seems it is not leaving the chromosome names unchanged.
I have posted the full details of the evaluated code, traceback, and R session information in this gist. Below is the most pertinent parts for reproducibility:
Session info ----------------------------------------------------------------------------- setting value version R version 3.3.1 (2016-06-21) system x86_64, linux-gnu ui X11 language (EN) collate en_US.UTF-8 tz <NA> date 2017-05-03 derfinder * 1.8.5 2017-05-01 Bioconductor derfinderHelper 1.8.1 2017-05-01 Bioconductor
It shouldn't relate, but since it may come up as it will be listed as a yeast chromosome. For this test, I have left off the mitochondrial DNA chromosome, 'MT', because there is a lot of related data in the samples and I was getting a memory error when trying to load `devtools` after running with it included.
Originally the commands had indeed included it:
chrs_all = c('I', 'II','III','IV', 'IX', 'MT', 'V', 'VI', 'VII', 'VIII', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI')
And I saw the same error that I am seeing when gets to `tenth` chromosome would arise when evaluating got 'MT'. And the `traceback()` result also looked the same.
Thanks,
Wayne

THIS IS MEANT TO REFERENCE A COMMENT I MADE IN MY QUESTION ABOUT WHAT I WAS ORIGINALLY TRYING TO DO:
By the way, for anybody who came here because you are also trying this with yeast, or other non-human model organism, by providing the correct species name, but still encountering the error:
I think you get that because your chromosomes supplied are out of order, and the solution to address this was in Leonardo's reply -> C: derfinder: fullCoverage() problems. You need to find out the order and formats `GenomeInfoDb` has by installing, loading, and querying `GenomeInfoDb` in R.
Leonardo's reply showed it for a couple of chromosomes for Populus trichocarpa .
Here I'll show the updated result for Populus and illustrate doing that for Saccharomyces cervisiae with the evaluated R code below:
> source("https://bioconductor.org/biocLite.R") > biocLite("GenomeInfoDbData") > library(GenomeInfoDb) ... > genomeStyles()$Populus_trichocarpa[c(2, 11), ] circular auto sex NCBI JGI2.F Ensembl 2 FALSE TRUE FALSE LGII 2 2 11 FALSE TRUE FALSE LGXI 11 11 > genomeStyles()$Saccharomyces_cerevisiae circular auto sex NCBI UCSC Ensembl 1 FALSE TRUE FALSE I chrI I 2 FALSE TRUE FALSE II chrII II 3 FALSE TRUE FALSE III chrIII III 4 FALSE TRUE FALSE IV chrIV IV 5 FALSE TRUE FALSE V chrV V 6 FALSE TRUE FALSE VI chrVI VI 7 FALSE TRUE FALSE VII chrVII VII 8 FALSE TRUE FALSE VIII chrVIII VIII 9 FALSE TRUE FALSE IX chrIX IX 10 FALSE TRUE FALSE X chrX X 11 FALSE TRUE FALSE XI chrXI XI 12 FALSE TRUE FALSE XII chrXII XII 13 FALSE TRUE FALSE XIII chrXIII XIII 14 FALSE TRUE FALSE XIV chrXIV XIV 15 FALSE TRUE FALSE XV chrXV XV 16 FALSE TRUE FALSE XVI chrXVI XVI 17 TRUE FALSE FALSE MT chrM Mito 18 TRUE FALSE TRUE 2uM 2micron <NA> >This will tell you what order to supply them and allow you to determine what style matches your chromosome names. See unevaluated code example here --> (derfinder fullCoverage() changes chromosome style and then regionMatrix() complains and ignores provided 'currentStyle').
The rest of this post is about getting things to behave when you are providing a non-model organism name.