Question

IRanges error from derfinder analyzeChr with custom TxDb?

0

Entering edit mode

fruce_ki ▴ 20

@fruce_ki-13220

Last seen 6.5 years ago

Austria/Vienna/Research Institute for M…

My ultimate goal is to be able to run derfinder on a custom GTF. But to get a hang of how to do it, I am first trying to succeed at it using a completely stock Ensembl GTF file (so no biomaRt suggestions please).

First I create a custom TxDb:

txdb <- makeTxDbFromGFF(file='~/devtestdata/Arabidopsis_thaliana.TAIR10.36.gtf',
                        organism='Arabidopsis thaliana', 
                        chrominfo=Seqinfo(seqnames=c('1','2','3','4','5','Mt','Pt'),          seqlengths=c(30427671,19698289,23459830,18585056,26975502,366924,154478), isCircular=c(FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE), genome='TAIR10') )

No errors reported up to here.
Then (following coverage loading from 4 BAM files using loadCoverage, setting up a dataframe with two conditions, calculating the sample depths and building the models) I call derfinder:

analyzeChr(chr='4', 
           coverageInfo=chr4Cov, 
           txdb=txdb, 
           models=fwdModels, 
           groupInfo=covariates$condition, 
           returnOutput=TRUE, 
           chrsStyle=NULL, 
           cutoffPre=3, 
           cutoffFstat=5e-02, 
           nPermute=20, 
           seeds=NULL, 
           writeOutput=FALSE, 
           lowMemDir=NULL, 
           smooth=FALSE, 
           weights=NULL, 
           smoothFunction=bumphunter::locfitByCluster )

And I get this error:

2017-08-18 15:02:32 analyzeChr: Pre-processing the coverage data
2017-08-18 15:02:35 analyzeChr: Calculating statistics
2017-08-18 15:02:35 calculateStats: calculating the F-statistics
2017-08-18 15:02:46 analyzeChr: Calculating pvalues
2017-08-18 15:02:46 analyzeChr: Using the following theoretical cutoff for the F-statistics 161.447638797588
2017-08-18 15:02:46 calculatePvalues: identifying data segments
Error in attr(x, "tsp") <- c(1, NROW(x), 1) : 
  attempt to set an attribute on NULL

This is the stack trace:

9: hasTsp(x)
8: start.default(position)
7: start(position)
6: start(position)
5: is(start, "Ranges")
4: IRanges(start = start(position)[runValue(position)], end = end(position)[runValue(position)])
3: .clusterMakerRle(position = position, maxGap = .advanced_argument("maxRegionGap", 
       0L, ...), ranges = TRUE)
2: calculatePvalues(coveragePrep = prep, models = models, fstats = fstats, 
       nPermute = nPermute, seeds = seeds, chr = chr, cutoff = cutoff, 
       lowMemDir = lowMemDir, smooth = smooth, smoothFunction = smoothFunction, 
       weights = weights, ...)
1: analyzeChr(chr = "4", coverageInfo = fwdCov[["4"]], txdb = txdb, 
       models = fwdModels, groupInfo = experiment$covariates$condition[fwd], 
       returnOutput = TRUE, chrsStyle = NULL, cutoffPre = 3, cutoffFstat = 0.05, 
       nPermute = 20, seeds = NULL, writeOutput = FALSE, lowMemDir = NULL, 
       smooth = FALSE, weights = NULL, smoothFunction = bumphunter::locfitByCluster)

And this is my session:

R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ensembldb_2.0.4                         AnnotationFilter_1.0.0                  biomaRt_2.32.1                          TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [5] BiocInstaller_1.26.0                    Rsamtools_1.28.0                        Biostrings_2.44.2                       XVector_0.16.0                         
 [9] GenomicFeatures_1.28.4                  AnnotationDbi_1.38.2                    treat_0.0.1                             DESeq2_1.16.1                          
[13] SummarizedExperiment_1.6.3              DelayedArray_0.2.7                      matrixStats_0.52.2                      Biobase_2.36.2                         
[17] GenomicRanges_1.28.4                    GenomeInfoDb_1.12.2                     IRanges_2.10.2                          S4Vectors_0.14.3                       
[21] BiocGenerics_0.22.0                     data.table_1.10.4                       derfinder_1.10.5                       

loaded via a namespace (and not attached):
 [1] ProtGenerics_1.8.0            bitops_1.0-6                  bit64_0.9-7                   httr_1.3.0                    RColorBrewer_1.1-2           
 [6] GenomicFiles_1.12.0           tools_3.4.1                   backports_1.1.0               doRNG_1.6.6                   R6_2.2.2                     
[11] rpart_4.1-11                  Hmisc_4.0-3                   DBI_0.7                       lazyeval_0.2.0                colorspace_1.3-2             
[16] nnet_7.3-12                   gridExtra_2.2.1               curl_2.8.1                    bit_1.1-12                    compiler_3.4.1               
[21] htmlTable_1.9                 pkgmaker_0.22                 rtracklayer_1.36.4            scales_0.4.1                  checkmate_1.8.3              
[26] genefilter_1.58.1             stringr_1.2.0                 digest_0.6.12                 foreign_0.8-69                base64enc_0.1-3              
[31] pkgconfig_2.0.1               htmltools_0.3.6               BSgenome_1.44.0               htmlwidgets_0.9               rlang_0.1.2                  
[36] RSQLite_2.0                   shiny_1.0.4                   BiocParallel_1.10.1           acepack_1.4.1                 VariantAnnotation_1.22.3     
[41] RCurl_1.95-4.8                magrittr_1.5                  GenomeInfoDbData_0.99.0       Formula_1.2-2                 Matrix_1.2-11                
[46] Rcpp_0.12.12                  munsell_0.4.3                 yaml_2.1.14                   stringi_1.1.5                 zlibbioc_1.22.0              
[51] plyr_1.8.4                    qvalue_2.8.0                  bumphunter_1.16.0             AnnotationHub_2.8.2           grid_3.4.1                   
[56] blob_1.1.0                    crayon_1.3.2                  lattice_0.20-35               splines_3.4.1                 annotate_1.54.0              
[61] derfinderHelper_1.10.0        locfit_1.5-9.1                knitr_1.17                    rngtools_1.2.4                geneplotter_1.54.0           
[66] reshape2_1.4.2                codetools_0.2-15              XML_3.98-1.9                  latticeExtra_0.6-28           httpuv_1.3.5                 
[71] foreach_1.4.3                 testthat_1.0.2                gtable_0.2.0                  ggplot2_2.2.1                 mime_0.5                     
[76] xtable_1.8-2                  survival_2.41-3               tibble_1.3.3                  iterators_1.0.8               GenomicAlignments_1.12.2     
[81] registry_0.3                  memoise_1.1.0                 cluster_2.0.6                 interactiveDisplayBase_1.14.0

I am assuming I'm not setting up the TxDb properly but it is not obvious to me what is going on.
(The BAM files are aligned to the TAIR10 assembly, so there should be no mismatch from that)
Any help much appreciated!

derfinder genomicfeatures genomicranges • 3.7k views

ADD COMMENT • link updated 8.4 years ago by Leonardo Collado Torres ★ 1.1k • written 8.4 years ago by fruce_ki ▴ 20

0

Entering edit mode

Hi,

The TxDb is not an issue here since the TxDb is used later by analyzeChr().

It looks like the F-stat cutoff might be too high, leaving no regions. But that use case should work. Here I run the example code with a very high F-stat cutoff:

```

> ## Collapse the coverage information
> collapsedFull <- collapseFullCoverage(list(genomeData$coverage), 
+     verbose = TRUE)
2017-08-18 12:11:42 collapseFullCoverage: Sorting fullCov
2017-08-18 12:11:42 collapseFullCoverage: Collapsing chromosomes information by sample
> 
> ## Calculate library size adjustments
> sampleDepths <- sampleDepth(collapsedFull, probs = c(0.5), nonzero=TRUE, 
+     verbose=TRUE)
2017-08-18 12:11:42 sampleDepth: Calculating sample quantiles
2017-08-18 12:11:42 sampleDepth: Calculating sample adjustments
> 
> ## Build the models
> groupInfo <- genomeInfo$pop
> adjustvars <- data.frame(genomeInfo$gender)
> models <- makeModels(sampleDepths, testvars=groupInfo, adjustvars=adjustvars)
> 
> ## Analyze the chromosome
> results <- analyzeChr(chr='21', coverageInfo=genomeData, models=models, 
+     cutoffFstat=1e10, cutoffType='manual', groupInfo=groupInfo, mc.cores=1, 
+     writeOutput=FALSE, returnOutput=TRUE, runAnnotation = FALSE)
extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
2017-08-18 12:11:43 analyzeChr: Pre-processing the coverage data
2017-08-18 12:11:43 analyzeChr: Calculating statistics
2017-08-18 12:11:43 calculateStats: calculating the F-statistics
2017-08-18 12:11:43 analyzeChr: Calculating pvalues
2017-08-18 12:11:43 analyzeChr: Using the following manual cutoff for the F-statistics 1e+10
2017-08-18 12:11:43 calculatePvalues: identifying data segments
2017-08-18 12:11:43 findRegions: segmenting information
2017-08-18 12:11:43 findRegions: found no segments to work with!!
2017-08-18 12:11:43 analyzeChr: Annotating regions
> results$regions
$regions
NULL

$nullStats
NULL

$nullWidths
NULL

$nullPermutation
NULL

> packageVersion('derfinder')
[1] ‘1.11.6’

```

I wonder how the output of the following function call looks like:

```

prep <- preprocessCoverage(coverageInfo = chr4Cov,  groupInfo = covariates$condition, cutoff = 3, lowMemDir = NULL)

```

Best,

Leonardo

ADD REPLY • link 8.4 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

Hi Leonardo,

preprocessCoverage() seems to be working fine:

$coverageProcessed
DataFrame with 18585056 rows and 4 columns
         col0_Sample_01-Col-0_fwd.bam col0_Sample_02-Col-0_fwd.bam fpa8_Sample_19-fpa-8_fwd.bam fpa8_Sample_20-fpa-8_fwd.bam
                                <Rle>                        <Rle>                        <Rle>                        <Rle>
1                                   5                            5                            5                            5
2                                   5                            5                            5                            5
3                                   5                            5                            5                            5
4                                   5                            5                            5                            5
5                                   5                            5                            5                            5
...                               ...                          ...                          ...                          ...
18585052                            5                            5                            5                            5
18585053                            5                            5                            5                            5
18585054                            5                            5                            5                            5
18585055                            5                            5                            5                            5
18585056                            5                            5                            5                            5

$mclapplyIndex
$mclapplyIndex[[1]]
logical-Rle of length 18585056 with 2 runs
  Lengths:  5000000 13585056
  Values :     TRUE    FALSE

$mclapplyIndex[[2]]
logical-Rle of length 18585056 with 3 runs
  Lengths: 5000000 5000000 8585056
  Values :   FALSE    TRUE   FALSE

$mclapplyIndex[[3]]
logical-Rle of length 18585056 with 3 runs
  Lengths: 10000000  5000000  3585056
  Values :    FALSE     TRUE    FALSE

$mclapplyIndex[[4]]
logical-Rle of length 18585056 with 2 runs
  Lengths: 15000000  3585056
  Values :    FALSE     TRUE

$position
NULL

$meanCoverage
numeric-Rle of length 18585056 with 2343537 runs
  Lengths:   1419      9   2356     13    293      7     86      7     26     47      1     16      8     11      9      8     15      2      7     17      9     11     22 ...      2      1      1      1      7     25     67      1      4   1860    109    606    118    197     56     25    125     95      7     85     63   1008
  Values :      0   0.25      0   0.25      0   0.25      0   0.25      1   1.25   3.75   3.25    3.5   3.75      4   4.25   6.75   7.75    7.5   7.25   7.75    7.5      8 ...      4   3.75      3    2.5   2.25   4.75      4   3.75    2.5      0      5      0      1      0   0.25      0    0.5      0   0.25    0.5   0.25      0

$groupMeans
$groupMeans$col0
numeric-Rle of length 18585056 with 1326213 runs
  Lengths:  4190    73     1    52    15    75     7    44     9    15    64    11  1354    49    98    52  2178    11  1411    51    87   134    12    51   592     9 ...     3     1     2     8     2     1     1     1     1     1     1     1     1     1    40     1     3     1     8    25    67     1     4  1860   109  2385
  Values :     0     1     6     5    10    12    14     9    14     9     7     2     0     5    10     5     0   0.5     0     2     4     0     4     2     0   0.5 ...  91.5    92    86  86.5  80.5    64    58  53.5    44  43.5  28.5    20    18  18.5    19   9.5   6.5     5   4.5   9.5     8   7.5     5     0    10     0

$groupMeans$fpa8
numeric-Rle of length 18585056 with 2163864 runs
  Lengths:  1419     9  2356    13   293     7    86     7    26    64     8    11     9    25     7    17     9    11    22     7     7    17     8    11     8    30 ...     1     1     1     2     1     2     3     6     6    14    12     1     2     2     1  2679   118   197    56    25   125    95     7    85    63  1008
  Values :     0   0.5     0   0.5     0   0.5     0   0.5     1   1.5     2   2.5     3   3.5     3   2.5   3.5     3     4     5   5.5   9.5     9   8.5     8   7.5 ...    15    13  12.5   7.5     6     4   4.5     5   4.5   3.5   2.5     2   1.5     1   0.5     0     2     0   0.5     0     1     0   0.5     1   0.5     0

ADD REPLY • link 8.4 years ago fruce_ki ▴ 20

1

Entering edit mode

How did you create

chr4Cov

and can you show me the head of it?

Thanks. It looks like it has to do with position = NULL.

ADD REPLY • link 8.4 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

BINGO!
I was passing cutoff=NULL to loadCoverage() in an attempt to disable filtering at that stage (since analyzeChr() also has a cutoffPre) while keeping the output structure in the format recognised by analyzeChr(). Since the function did not complain, I assumed it was a valid value. I now tried passing cutoff=1 instead, which is probably more sensible overall, and analyzeChr() does not crash.

What is the difference between cutoff in filterData() and cutoffPre in analyzeChr()? Why are two cutoffs needed?

ADD REPLY • link 8.4 years ago fruce_ki ▴ 20

0

Entering edit mode

Although I'm using the latest R and Bioconductor available for MacOSX, I noticed your version of derfinder is newer. Is there any chance that you recently fixed something relevant to this?

ADD REPLY • link 8.4 years ago fruce_ki ▴ 20

0

Entering edit mode

I'm using Bioc-devel but your version of derfinder and mine are the same in terms of these functions.

ADD REPLY • link 8.4 years ago Leonardo Collado Torres ★ 1.1k

0

Entering edit mode

I've tried different values for the F cutoff from 0.9 to 0.0005 and all of them crashed. It doesn't seem to depend on the value.

ADD REPLY • link 8.4 years ago fruce_ki ▴ 20

0

Entering edit mode

It looks like this is a dispatch issue, perhaps caused by some weird corruption of the methods cache. It might help to reinstall some of the packages involved, like BiocGenerics and S4Vectors.

ADD REPLY • link 8.4 years ago Michael Lawrence ★ 11k

0

Entering edit mode

Hi Michael,

I nuked all of my R and re-installed everything fresh. The error persists.

ADD REPLY • link 8.4 years ago fruce_ki ▴ 20

score 2 · Accepted Answer · 2017-08-21

Hi,

The following code reproduces the problem.

library('derfinder')

datadir <- system.file('extdata', 'genomeData', package='derfinder')
files <- rawFiles(datadir = datadir, samplepatt = '*accepted_hits.bam$', 
    fileterm = NULL)
## Shorten the column names
names(files) <- gsub('_accepted_hits.bam', '', names(files))

## Read and filter the data
dataSmall <- loadCoverage(files = files, chr = '21', cutoff = NULL)

## Collapse the coverage information
collapsedFull <- collapseFullCoverage(list('chr21' = dataSmall), verbose = TRUE)

## Calculate library size adjustments
sampleDepths <- sampleDepth(collapsedFull, probs = c(0.5), nonzero=TRUE, 
    verbose=TRUE)

## Build the models
groupInfo <- genomeInfo$pop
adjustvars <- data.frame(genomeInfo$gender)
models <- makeModels(sampleDepths, testvars=groupInfo, adjustvars=adjustvars)

## Analyze the chromosome
results <- analyzeChr(chr='21', coverageInfo= dataSmall, models=models, 
    cutoffFstat=1, cutoffType='manual', groupInfo=groupInfo, mc.cores=1, 
    writeOutput=FALSE, returnOutput=TRUE, runAnnotation = FALSE, cutoffPre = 0)
names(results)

The evaluated code would then look like this:

> ## Analyze the chromosome
> results <- analyzeChr(chr='21', coverageInfo= dataSmall, models=models,
+     cutoffFstat=1, cutoffType='manual', groupInfo=groupInfo, mc.cores=1,
+     writeOutput=FALSE, returnOutput=TRUE, runAnnotation = FALSE, cutoffPre = 0)
extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
2017-08-21 13:31:30 analyzeChr: Pre-processing the coverage data
2017-08-21 13:31:32 analyzeChr: Calculating statistics
2017-08-21 13:31:32 calculateStats: calculating the F-statistics
2017-08-21 13:33:04 analyzeChr: Calculating pvalues
2017-08-21 13:33:04 analyzeChr: Using the following manual cutoff for the F-statistics 1
2017-08-21 13:33:04 calculatePvalues: identifying data segments
Error in attr(x, "tsp") <- c(1, NROW(x), 1) :
  attempt to set an attribute on NULL
> packageVersion('derfinder')
[1] ‘1.11.6’

The issue is that preprocessCoverage() re-filters the data using cutoff = cutoffPre only when colsubset is specified. The original idea was that users would load a filtered version of the data from the get go, that is, specifying some non-NULL value for cutoff in loadCoverage(). This would reduce the memory burden. I have made changes to the documentation to make this point clear. Now the error looks like this:

> ## Analyze the chromosome
> results <- analyzeChr(chr='21', coverageInfo= dataSmall, models=models,
+     cutoffFstat=1, cutoffType='manual', groupInfo=groupInfo, mc.cores=1,
+     writeOutput=FALSE, returnOutput=TRUE, runAnnotation = FALSE, cutoffPre = 0)
extendedMapSeqlevels: sequence names mapped from NCBI to UCSC for species homo_sapiens
2017-08-21 14:08:41 analyzeChr: Pre-processing the coverage data
Error in preprocessCoverage(coverageInfo = coverageInfo, groupInfo = groupInfo,  :
  'coverageInfo$position' is NULL when it shouldn't be. Use a cutoff when loading the data with loadCoverage() or fullCoverage(). Alternatively, you can specify 'colsubset' such that preproccessCoverage() will run filterData() again.

The new changes are part of derfinder 1.10.6 (bioc-release) and 1.11.8 (bioc-devel).

Best, Leonardo