Hi,
I am trying to use derfinder to analyze samples from Saccharomyces cerevisiae. Something about the chromosome names is causing me issues when I get to building a regionMatrix.
However, in trying to bypass that issue to see if I could even get some expressed regions data, I think I uncovered an issue people working with organisms not presently in the "GenomeInfoDbData" may encounter.
In trying to troubleshoot I saw that A: derfinder: fullCoverage() problems that Leonardo had: "Introduced function extendedMapSeqlevels() for using GenomeInfoDb when there is information regarding the species and naming style of interest. Otherwise sequence names are left unchanged." - Source.
Reading, "otherwise sequence names are left unchanged", I thought rather than take the time to figure out how to make my chromosome names be in the same order as GenomeInfoDb needs them, if I simply insure my source organism is not in the database, I'll be able to at leat build a regionMatrix because it won't touch my sequence names. So I thought I could just get `derfinder` to not try and touch my chromosome names by putting a species that it doesn't know and then have it build the regionMatrix.
The unevaluated code I used is:
## Load libraries library('derfinder') library('GenomicRanges') file_designations <- c("WT1","WT2","WT3", "MT1", "MT2", "MT3") chrs_all = c('I', 'II','III','IV', 'IX', 'V', 'VI', 'VII', 'VIII', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI') ## Load the data from disk files <- file.path("/data/alignments", file_designations , paste0(file_designations,"Aligned.sorted.bam")) system.time(fullCov <- fullCoverage(files = files, chrs = chrs_all, totalMapped = sapply(files,getTotalMapped, chrs = chrs_all), targetSize = 8e+07, species = "ReallySaccharomyces_cerevisiaeBUTiWANTtoSEEifCANgetREGIONMATRIXtoWORK")) system.time(regionMat <- regionMatrix(fullCov, cutoff = 30, L = 75))
However, after working for many of them on the tenth chromosome data it ends with this error below.
...left off many previous lines... 2017-05-03 17:22:52 filterData: originally there were 562643 rows, now there are 523276 rows. Meaning that 7 percent was filtered. 2017-05-03 17:22:52 findRegions: identifying potential segments 2017-05-03 17:22:52 findRegions: segmenting information 2017-05-03 17:22:52 findRegions: identifying candidate regions extendedMapSeqlevels: the current naming style could not be guessed. Consider supplying 'currentStyle'. 2017-05-03 17:22:52 findRegions: identifying region clusters extendedMapSeqlevels: the current naming style could not be guessed. Consider supplying 'currentStyle'. extendedMapSeqlevels: the current naming style could not be guessed. Consider supplying 'currentStyle'. 2017-05-03 17:22:53 getRegionCoverage: processing VIII 2017-05-03 17:22:53 getRegionCoverage: done processing VIII 2017-05-03 17:22:53 regionMatrix: calculating coverageMatrix 2017-05-03 17:22:53 regionMatrix: adjusting coverageMatrix for 'L' 2017-05-03 17:22:53 regionMatrix: processing X 2017-05-03 17:22:54 filterData: originally there were 745751 rows, now there are 696110 rows. Meaning that 6.66 percent was filtered. 2017-05-03 17:22:54 findRegions: identifying potential segments 2017-05-03 17:22:54 findRegions: segmenting information 2017-05-03 17:22:54 findRegions: identifying candidate regions extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens 2017-05-03 17:22:54 findRegions: identifying region clusters Error in .normargSeqlengths(value, seqnames(x)) : when the supplied 'seqlengths' vector is named, the names must match the seqnames Timing stopped at: 18.168 0.468 18.64
I'd guess it might still crash if it is given an unknown species with a tenth chromosome represented 'X' chromsome because it seems it is not leaving the chromosome names unchanged.
I have posted the full details of the evaluated code, traceback, and R session information in this gist. Below is the most pertinent parts for reproducibility:
Session info ----------------------------------------------------------------------------- setting value version R version 3.3.1 (2016-06-21) system x86_64, linux-gnu ui X11 language (EN) collate en_US.UTF-8 tz <NA> date 2017-05-03 derfinder * 1.8.5 2017-05-01 Bioconductor derfinderHelper 1.8.1 2017-05-01 Bioconductor
It shouldn't relate, but since it may come up as it will be listed as a yeast chromosome. For this test, I have left off the mitochondrial DNA chromosome, 'MT', because there is a lot of related data in the samples and I was getting a memory error when trying to load `devtools` after running with it included.
Originally the commands had indeed included it:
chrs_all = c('I', 'II','III','IV', 'IX', 'MT', 'V', 'VI', 'VII', 'VIII', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI')
And I saw the same error that I am seeing when gets to `tenth` chromosome would arise when evaluating got 'MT'. And the `traceback()` result also looked the same.
Thanks,
Wayne
THIS IS MEANT TO REFERENCE A COMMENT I MADE IN MY QUESTION ABOUT WHAT I WAS ORIGINALLY TRYING TO DO:
By the way, for anybody who came here because you are also trying this with yeast, or other non-human model organism, by providing the correct species name, but still encountering the error:
I think you get that because your chromosomes supplied are out of order, and the solution to address this was in Leonardo's reply -> C: derfinder: fullCoverage() problems. You need to find out the order and formats `GenomeInfoDb` has by installing, loading, and querying `GenomeInfoDb` in R.
Leonardo's reply showed it for a couple of chromosomes for Populus trichocarpa .
Here I'll show the updated result for Populus and illustrate doing that for Saccharomyces cervisiae with the evaluated R code below:
This will tell you what order to supply them and allow you to determine what style matches your chromosome names. See unevaluated code example here --> (derfinder fullCoverage() changes chromosome style and then regionMatrix() complains and ignores provided 'currentStyle').
The rest of this post is about getting things to behave when you are providing a non-model organism name.