Question

regionMatrix for derfinder crashes with seqnames issue on non-human data

0

Entering edit mode

Wayne • 0

@wayne-12952

Last seen 8.8 years ago

Hi,
I am trying to use derfinder to analyze samples from Saccharomyces cerevisiae. Something about the chromosome names is causing me issues when I get to building a regionMatrix.
However, in trying to bypass that issue to see if I could even get some expressed regions data, I think I uncovered an issue people working with organisms not presently in the "GenomeInfoDbData" may encounter.
In trying to troubleshoot I saw that A: derfinder: fullCoverage() problems that Leonardo had: "Introduced function extendedMapSeqlevels() for using GenomeInfoDb when there is information regarding the species and naming style of interest. Otherwise sequence names are left unchanged." - Source.

Reading, "otherwise sequence names are left unchanged", I thought rather than take the time to figure out how to make my chromosome names be in the same order as GenomeInfoDb needs them, if I simply insure my source organism is not in the database, I'll be able to at leat build a regionMatrix because it won't touch my sequence names. So I thought I could just get `derfinder` to not try and touch my chromosome names by putting a species that it doesn't know and then have it build the regionMatrix.
The unevaluated code I used is:

## Load libraries
library('derfinder')
library('GenomicRanges')
file_designations <- c("WT1","WT2","WT3", "MT1", "MT2", "MT3")
chrs_all = c('I', 'II','III','IV', 'IX', 'V', 'VI', 'VII', 'VIII', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI')
## Load the data from disk
files <- file.path("/data/alignments", file_designations , paste0(file_designations,"Aligned.sorted.bam"))
system.time(fullCov <- fullCoverage(files = files, chrs = chrs_all,
    totalMapped = sapply(files,getTotalMapped, chrs = chrs_all), targetSize = 8e+07, species = "ReallySaccharomyces_cerevisiaeBUTiWANTtoSEEifCANgetREGIONMATRIXtoWORK")) 
system.time(regionMat <- regionMatrix(fullCov, cutoff = 30, L = 75))

However, after working for many of them on the tenth chromosome data it ends with this error below.

...left off many previous lines...
2017-05-03 17:22:52 filterData: originally there were 562643 rows, now there are 523276 rows. Meaning that 7 percent was filtered.
2017-05-03 17:22:52 findRegions: identifying potential segments
2017-05-03 17:22:52 findRegions: segmenting information
2017-05-03 17:22:52 findRegions: identifying candidate regions
extendedMapSeqlevels: the current naming style could not be guessed. Consider supplying 'currentStyle'.
2017-05-03 17:22:52 findRegions: identifying region clusters
extendedMapSeqlevels: the current naming style could not be guessed. Consider supplying 'currentStyle'.
extendedMapSeqlevels: the current naming style could not be guessed. Consider supplying 'currentStyle'.
2017-05-03 17:22:53 getRegionCoverage: processing VIII
2017-05-03 17:22:53 getRegionCoverage: done processing VIII
2017-05-03 17:22:53 regionMatrix: calculating coverageMatrix
2017-05-03 17:22:53 regionMatrix: adjusting coverageMatrix for 'L'
2017-05-03 17:22:53 regionMatrix: processing X
2017-05-03 17:22:54 filterData: originally there were 745751 rows, now there are 696110 rows. Meaning that 6.66 percent was filtered.
2017-05-03 17:22:54 findRegions: identifying potential segments
2017-05-03 17:22:54 findRegions: segmenting information
2017-05-03 17:22:54 findRegions: identifying candidate regions
extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
2017-05-03 17:22:54 findRegions: identifying region clusters
Error in .normargSeqlengths(value, seqnames(x)) :
when the supplied 'seqlengths' vector is named, the names must match the seqnames
Timing stopped at: 18.168 0.468 18.64

I'd guess it might still crash if it is given an unknown species with a tenth chromosome represented 'X' chromsome because it seems it is not leaving the chromosome names unchanged.

I have posted the full details of the evaluated code, traceback, and R session information in this gist. Below is the most pertinent parts for reproducibility:

Session info -----------------------------------------------------------------------------
setting  value                       
version  R version 3.3.1 (2016-06-21)
system   x86_64, linux-gnu           
ui       X11                         
language (EN)                        
collate  en_US.UTF-8                 
tz       <NA>                        
date     2017-05-03     
derfinder            * 1.8.5    2017-05-01 Bioconductor  
derfinderHelper        1.8.1    2017-05-01 Bioconductor

It shouldn't relate, but since it may come up as it will be listed as a yeast chromosome. For this test, I have left off the mitochondrial DNA chromosome, 'MT', because there is a lot of related data in the samples and I was getting a memory error when trying to load `devtools` after running with it included.

Originally the commands had indeed included it:

chrs_all = c('I', 'II','III','IV', 'IX', 'MT', 'V', 'VI', 'VII', 'VIII', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI')

And I saw the same error that I am seeing when gets to `tenth` chromosome would arise when evaluating got 'MT'. And the `traceback()` result also looked the same.

Thanks,
Wayne

derfinder GenomeInfoDb GenomicRanges • 3.1k views

ADD COMMENT • link 8.8 years ago Wayne • 0

0

Entering edit mode

THIS IS MEANT TO REFERENCE A COMMENT I MADE IN MY QUESTION ABOUT WHAT I WAS ORIGINALLY TRYING TO DO:

By the way, for anybody who came here because you are also trying this with yeast, or other non-human model organism, by providing the correct species name, but still encountering the error:

Error in .normargSeqlengths(value, seqnames(x)) : 
  when the supplied 'seqlengths' vector is named, the names must match the seqnames

I think you get that because your chromosomes supplied are out of order, and the solution to address this was in Leonardo's reply -> C: derfinder: fullCoverage() problems. You need to find out the order and formats `GenomeInfoDb` has by installing, loading, and querying `GenomeInfoDb` in R.

Leonardo's reply showed it for a couple of chromosomes for Populus trichocarpa .

Here I'll show the updated result for Populus and illustrate doing that for Saccharomyces cervisiae with the evaluated R code below:

> source("https://bioconductor.org/biocLite.R")
> biocLite("GenomeInfoDbData")
> library(GenomeInfoDb)
...
> genomeStyles()$Populus_trichocarpa[c(2, 11), ]
   circular auto   sex NCBI JGI2.F Ensembl
2     FALSE TRUE FALSE LGII      2       2
11    FALSE TRUE FALSE LGXI     11      11

> genomeStyles()$Saccharomyces_cerevisiae
   circular  auto   sex NCBI    UCSC Ensembl
1     FALSE  TRUE FALSE    I    chrI       I
2     FALSE  TRUE FALSE   II   chrII      II
3     FALSE  TRUE FALSE  III  chrIII     III
4     FALSE  TRUE FALSE   IV   chrIV      IV
5     FALSE  TRUE FALSE    V    chrV       V
6     FALSE  TRUE FALSE   VI   chrVI      VI
7     FALSE  TRUE FALSE  VII  chrVII     VII
8     FALSE  TRUE FALSE VIII chrVIII    VIII
9     FALSE  TRUE FALSE   IX   chrIX      IX
10    FALSE  TRUE FALSE    X    chrX       X
11    FALSE  TRUE FALSE   XI   chrXI      XI
12    FALSE  TRUE FALSE  XII  chrXII     XII
13    FALSE  TRUE FALSE XIII chrXIII    XIII
14    FALSE  TRUE FALSE  XIV  chrXIV     XIV
15    FALSE  TRUE FALSE   XV   chrXV      XV
16    FALSE  TRUE FALSE  XVI  chrXVI     XVI
17     TRUE FALSE FALSE   MT    chrM    Mito
18     TRUE FALSE  TRUE  2uM 2micron    <NA>
>

This will tell you what order to supply them and allow you to determine what style matches your chromosome names. See unevaluated code example here --> (derfinder fullCoverage() changes chromosome style and then regionMatrix() complains and ignores provided 'currentStyle').

The rest of this post is about getting things to behave when you are providing a non-model organism name.

ADD REPLY • link 8.8 years ago Wayne • 0

score 0 · Answer 1 · 2017-05-04

0

Entering edit mode

Leonardo Collado Torres ★ 1.1k

@lcolladotor

Last seen 7 months ago

United States

EDIT 2017-05-08: correct way to disable `extendedMapSeqlevels()` is using `chrsStyle = NULL`. See comments farther below for more information. Versions 1.10.3 and 1.11.3 have the correct documentation.

Hi,

If you want to disable the automatic behavior of `extendedMapSeqlevels()` use the same `chrsStyle` and `currentStyle`. The defaults are `chrStyle = 'UCSC'` and `currentStyle = NULL`. So specifying `currentStyle = 'UCSC'` turns this function off. I know that it's a bit tricky to find this, but if you look at the help file of `fullCoverage()` you'll notice that `...` gets passed to `extendedMapSeqlevels()`. I'll try to make this a bit clearer in the documentation.

Best,

Leonardo

ADD COMMENT • link 8.8 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Versions 1.10.1 and 1.11.1 now have better documentation for extendedMapSeqlevels(). See https://github.com/lcolladotor/derfinder/commit/9913ff01fd129705e5fcf9e410c51fc26fc2d546 for details.

ADD REPLY • link 8.8 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Okay I tried this command below:

## Load libraries
library('derfinder')
library('GenomicRanges')
file_designations <- c("WT1","WT2","WT3", "MT1", "MT2", "MT3")
chrs_all = c('I', 'II','III','IV', 'IX', 'V', 'VI', 'VII', 'VIII', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII')
## Load the data from disk
files <- file.path("/data/alignments", file_designations , paste0(file_designations,"Aligned.sorted.bam"))
system.time(fullCov <- fullCoverage(files = files, chrs = chrs_all, totalMapped = sapply(files,getTotalMapped, chrs = chrs_all), targetSize = 8e+07, currentStyle = 'UCSC'))
system.time(regionMat <- regionMatrix(fullCov, cutoff = 30, L = 75))

Still got same issue when it hit the tenth chromosome in the regionMatrix() command:

...lots of lines before this of it working....
2017-05-04 19:23:41 getRegionCoverage: processing VIII
2017-05-04 19:23:41 getRegionCoverage: done processing VIII
2017-05-04 19:23:41 regionMatrix: calculating coverageMatrix
2017-05-04 19:23:42 regionMatrix: adjusting coverageMatrix for 'L'
2017-05-04 19:23:42 regionMatrix: processing X
2017-05-04 19:23:43 filterData: originally there were 745751 rows, now there are 694290 rows. Meaning that 6.90000000000001 percent was filtered.
2017-05-04 19:23:43 findRegions: identifying potential segments
2017-05-04 19:23:43 findRegions: segmenting information
2017-05-04 19:23:43 findRegions: identifying candidate regions
extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
2017-05-04 19:23:43 findRegions: identifying region clusters
Error in .normargSeqlengths(value, seqnames(x)) :
  when the supplied 'seqlengths' vector is named, the names must match the seqnames
Timing stopped at: 19.4 0.064 19.463
>

Even with that setting, regionMatrix() seems to ignore the fact it shouldn't touch the chromosome names when it gets to that one.
I just tried using the test data and put `fullCov <- fullCoverage(files=files[1], chrs=c('21','X', '22'), currentStyle = 'UCSC')` to see if the regionMatrix() command. `system.time(regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76))` then says `extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens`, but it didn't. It also find no regions and so I don't know if it is a valid comparison?
Unevaluated code:

library(derfinder)
datadir <- system.file('extdata', 'genomeData', package='derfinder')
files <- rawFiles(datadir=datadir, samplepatt='*accepted_hits.bam$',
     fileterm=NULL)
## Shorten the column names
names(files) <- gsub('_accepted_hits.bam', '', names(files))
## Read and filter the data, only for 1 file
fullCov <- fullCoverage(files=files[1], chrs=c('21','X','22'), currentStyle = 'UCSC')
system.time(regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76))

Here is the result:

> system.time(regionMat <- regionMatrix(fullCov, cutoff = 30, L = 76))
2017-05-04 21:05:05 regionMatrix: processing 21
2017-05-04 21:05:05 filterData: originally there were 48129895 rows, now there are  rows. Meaning that  percent was filtered.
2017-05-04 21:05:05 findRegions: identifying potential segments
2017-05-04 21:05:05 regionMatrix: processing X
2017-05-04 21:05:05 filterData: originally there were 155270560 rows, now there are  rows. Meaning that  percent was filtered.
2017-05-04 21:05:05 findRegions: identifying potential segments
2017-05-04 21:05:05 regionMatrix: processing 22
2017-05-04 21:05:05 filterData: originally there were 51304566 rows, now there are  rows. Meaning that  percent was filtered.
2017-05-04 21:05:05 findRegions: identifying potential segments
   user  system elapsed 
  0.052   0.000   0.049 
Warning messages:
1: In findRegions(position = filt$position, fstats = filt$meanCoverage,  :
  Found no regions
2: In findRegions(position = filt$position, fstats = filt$meanCoverage,  :
  Found no regions
3: In findRegions(position = filt$position, fstats = filt$meanCoverage,  :
  Found no regions

>

ADD REPLY • link 8.8 years ago Wayne • 0

0

Entering edit mode

Try specifying currentStyle = 'UCSC' in the regionMatrix() call too. The ... argument gets passed around quite a bit.

ADD REPLY • link 8.8 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Thanks, Leonardo. That eliminated the `Error in .normargSeqlengths(value, seqnames(x)) : when the supplied 'seqlengths' vector is named, the names must match the seqnames` error that was stopping the `regionMartrix()` command at the tenth chromosome, producing no results. Now I do indeed produce a regionMatrix.

I should have tried that but in addition to not being on my typical machine at the time time, I had read in the documentation for `fullCoverage` that the `...` settings were "Passed to loadCoverage, define_cluster and extendedMapSeqlevels." So I thought it was a one time thing that one sets.

Beside that issue, I did note though that for the tenth chromosome it still said it was 'touching' the name. Below is the pertinent section with similar lines before and after replaced with `...`, prefaced by the unevaluated code to get to that point:

## Load libraries
library('derfinder')
library('GenomicRanges')
file_designations <- c("WT1","WT2","WT3", "MT1", "MT2", "MT3")
chrs_all = c('I', 'II','III','IV', 'IX', 'V', 'VI', 'VII', 'VIII', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII')
## Load the data from disk
files <- file.path("/data/alignments", file_designations , paste0(file_designations,"Aligned.sorted.bam"))
system.time(fullCov <- fullCoverage(files = files, chrs = chrs_all, totalMapped = sapply(files,getTotalMapped, chrs = chrs_all), targetSize = 8e+07, currentStyle = 'UCSC'))
system.time(regionMat <- regionMatrix(fullCov, cutoff = 30, L = 75, currentStyle = 'UCSC'))

...
2017-05-08 14:39:32 getRegionCoverage: processing VIII
2017-05-08 14:39:32 getRegionCoverage: done processing VIII
2017-05-08 14:39:32 regionMatrix: calculating coverageMatrix
2017-05-08 14:39:32 regionMatrix: adjusting coverageMatrix for 'L'
2017-05-08 14:39:32 regionMatrix: processing X
2017-05-08 14:39:33 filterData: originally there were 745751 rows, now there are 694290 rows. Meaning that 6.90000000000001 percent was filtered.
2017-05-08 14:39:33 findRegions: identifying potential segments
2017-05-08 14:39:33 findRegions: segmenting information
2017-05-08 14:39:33 findRegions: identifying candidate regions
2017-05-08 14:39:33 findRegions: identifying region clusters
extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
2017-05-08 14:39:34 getRegionCoverage: processing chrX
2017-05-08 14:39:34 getRegionCoverage: done processing chrX
2017-05-08 14:39:34 regionMatrix: calculating coverageMatrix
2017-05-08 14:39:35 regionMatrix: adjusting coverageMatrix for 'L'
2017-05-08 14:39:35 regionMatrix: processing XI
2017-05-08 14:39:35 filterData: originally there were 666816 rows, now there are 619790 rows. Meaning that 7.05 percent was filtered.
2017-05-08 14:39:35 findRegions: identifying potential segments
2017-05-08 14:39:35 findRegions: segmenting information
2017-05-08 14:39:35 findRegions: identifying candidate regions
2017-05-08 14:39:35 findRegions: identifying region clusters
extendedMapSeqlevels: the current naming style could not be guessed. Consider supplying 'currentStyle'.
extendedMapSeqlevels: the current naming style could not be guessed. Consider supplying 'currentStyle'.
...

See how for the tenth chromosome it is still reporting `extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens`. I only point it out because it didn't happen in A: regionMatrix for derfinder crashes with seqnames issue on non-human data even though a similar change `sequence names mapped from NCBI to UCSC for species homo_sapiens` did when you didn't have `currentStyle = 'UCSC'` in the command call. And the code you deposited at the time you introduced the `extendedMapSeqlevels` said "Introduced function extendedMapSeqlevels() for using GenomeInfoDb when there is information regarding the species and naming style of interest. Otherwise sequence names are left unchanged." So am wondering if for chromosome X it may be still doing something you didn't intend?

ADD REPLY • link 8.8 years ago Wayne • 0

0

Entering edit mode

Hi,

Sorry, the correct way to disable `extendedMapSeqlevels()` is by using `chrsStyle = NULL`. I have updated derfinder's documentation (1.10.3 and 1.11.3) to reflect this. So if you use regionMatrix(chrsStyle = NULL) it should all work without giving you those messages. In any case, you don't need to re-run your code since the regionMat object should be complete.

The reason why `currentStyle = 'UCSC'` didn't work, is because that argument is not passed to getRegionCoverage() in https://github.com/lcolladotor/derfinder/blob/abd435d56ce1dcc4828c1a75b166e930e2ec879b/R/regionMatrix.R#L194-L196. Only `chrsStyle` is passed.

Best,

Leonardo

ADD REPLY • link 8.8 years ago Leonardo Collado Torres ★ 1.1k