CAGEr : problem while filtering the bamFile
1
2
Entering edit mode
Vivek.b ▴ 100
@vivekb-7661
Last seen 4.3 years ago
Germany

Dear CAGEr team

I have been trying to run CAGEr on my test dataset and having problem in reading in the BAM files to make CAGEset object.

> input = list.files("test",pattern = "test-.*sorted.bam$",full.names = TRUE)
> mycage <- new("CAGEset", genomeName = "BSgenome.Dmelanogaster.UCSC.dm6",
+                 inputFiles = input, inputFilesType = "bam",
+                 sampleLabels = c("test-C", "test-D", "test-E", "test-G"))
> ctss <- getCTSS(mycage,removeFirstG = TRUE,correctSystematicG = TRUE)

Gives me :

Reading in file: test/test-C_sorted.bam...
	-> Filtering out low quality reads...
	-> Removing the first base of the reads if 'G' and not aligned to the genome...
	-> Estimating the frequency of adding a 'G' nucleotide and correcting the systematic bias...
Error in setnames(CTSS, c("chr", "pos", "strand", sample.labels[i])) : 
  Can't assign 4 names to a 1 column data.table

After looking into the code I thought I should try not using the removeFirstG option.. so I tried :

ctss <- getCTSS(mycage,removeFirstG = FALSE,correctSystematicG = FALSE)

and got :

Reading in file: test/test-C_sorted.bam...
	-> Filtering out low quality reads...
Error in `$<-.data.frame`(`*tmp*`, "tag_count", value = 1) : 
  replacement has 1 row, data has 0

So basically got stuck at the same place in the source code, but with a different error.

If I turn only correctSystematicG false, i get the same error :

ctss <- getCTSS(mycage, removeFirstG = TRUE, correctSystematicG = FALSE)

Reading in file: ../rawdata_results/fetish-C_sorted.bam...
	-> Filtering out low quality reads...
	-> Removing the first base of the reads if 'G' and not aligned to the genome...
Error in `$<-.data.frame`(`*tmp*`, "tag_count", value = 1) : 
  replacement has 1 row, data has 0

 

I would appreciate any help with this problem..

Thanks

Vivek

CAGEr cage • 2.3k views
ADD COMMENT
2
Entering edit mode
Vivek.b ▴ 100
@vivekb-7661
Last seen 4.3 years ago
Germany

I solved the problem today. Although it was a silly thing but was hard to catch. The problem was that I mapped my data using ENSEMBL annotation, and had no "chr" in my seqnames. Since the getCTSS methods uses this : 

reads.GRanges <- reads.GRanges[seqnames(reads.GRanges) %in% seqnames(genome)]

It removed all my entries and gives empty Granges. The method throws no warning here and gets stuck at the next step.

I want to request the authors to add a warning here, and also add a way to catch and solve for this chr vs no-chr problem while making the comparison, since others might also face the same issue.

Best,

Vivek

 

ADD COMMENT
0
Entering edit mode

hi, I am using CAGEr too. And  I download the bam file from FANTOM5 website. I got the same problem like you. Can you tell me how to solve the problem?

> input.file <- c("G:/CAGEr1110/Mouse%20Neurons%20-%20striatal%2c%20donor3.CNhs12111.11646-122D8.mm10.nobarcode.bam")
> samples <- "test"
> myCAGEset <- new("CAGEset", genomeName = "BSgenome.Mmusculus.UCSC.mm10",
+                  inputFiles = input.file, inputFilesType = "bam",
+                  sampleLabels = samples)
> getCTSS(myCAGEset)

Reading in file: G:/CAGEr1110/Mouse%20Neurons%20-%20striatal%2c%20donor3.CNhs12111.11646-122D8.mm10.nobarcode.bam...
-> Filtering out low quality reads...
-> Removing the first base of the reads if 'G' and not aligned to the genome...
-> Estimating the frequency of adding a 'G' nucleotide and correcting the systematic bias...
Error in setnames(CTSS, c("chr", "pos", "strand", sample.label)) : 
  Can't assign 4 names to a 1 column data.table

the website I download the  bam file is http://fantom.gsc.riken.jp/5/datafiles/latest/basic/mouse.primary_cell.hCAGE/

by the way the info  of my R is 

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936    LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C                               LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome.Mmusculus.UCSC.mm10_1.4.0 BSgenome_1.50.0                    rtracklayer_1.42.0                
 [4] Biostrings_2.50.1                  XVector_0.22.0                     GenomicRanges_1.34.0              
 [7] GenomeInfoDb_1.18.0                IRanges_2.16.0                     S4Vectors_0.20.0                  
[10] BiocGenerics_0.28.0                CAGEr_1.24.0                      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0                        stringdist_0.9.5.1                BSgenome.Mmusculus.UCSC.mm9_1.4.0
 [4] lattice_0.20-38                   formula.tools_1.7.1               Rsamtools_1.34.0                 
 [7] gtools_3.8.1                      assertthat_0.2.0                  digest_0.6.18                    
[10] R6_2.3.0                          plyr_1.8.4                        ggplot2_3.1.0                    
[13] pillar_1.3.0                      zlibbioc_1.28.0                   rlang_0.3.0.1                    
[16] lazyeval_0.2.1                    rstudioapi_0.8                    data.table_1.11.8                
[19] vegan_2.5-3                       Matrix_1.2-15                     splines_3.5.1                    
[22] BiocParallel_1.16.0               RCurl_1.95-4.11                   munsell_0.5.0                    
[25] DelayedArray_0.8.0                compiler_3.5.1                    pkgconfig_2.0.2                  
[28] mgcv_1.8-25                       tidyselect_0.2.5                  SummarizedExperiment_1.12.0      
[31] tibble_1.4.2                      GenomeInfoDbData_1.2.0            matrixStats_0.54.0               
[34] XML_3.98-1.16                     reshape_0.8.8                     permute_0.9-4                    
[37] crayon_1.3.4                      dplyr_0.7.7                       GenomicAlignments_1.18.0         
[40] MASS_7.3-51.1                     bitops_1.0-6                      grid_3.5.1                       
[43] nlme_3.1-137                      gtable_0.2.0                      magrittr_1.5                     
[46] scales_1.0.0                      KernSmooth_2.23-15                stringi_1.2.4                    
[49] som_0.3-5.1                       MultiAssayExperiment_1.8.1        bindrcpp_0.2.2                   
[52] tools_3.5.1                       Biobase_2.42.0                    glue_1.3.0                       
[55] purrr_0.2.5                       colorspace_1.3-2                  cluster_2.0.7-1                  
[58] operator.tools_1.6.3              beanplot_1.2                      memoise_1.1.0                    
[61] VGAM_1.0-6                        bindr_0.1.1

thanks  

zhong

ADD REPLY
0
Entering edit mode

Hi

It's been a while since I had this, but I think the problem was solved by switching the gene annotation to ensembl syle.

seqlevelsStyle(BSgenome.Mmusculus.UCSC.mm10) <- "ENSEMBL"

ADD REPLY
0
Entering edit mode

Good point with seqlevelsStyle!

I just pushed versions 1.33.1 and 1.32.1 to Bioconductor, which should address these problems.

ADD REPLY
0
Entering edit mode

Hi, I am also troubled with the problem. If I did not make it wrong, the solution is to redo the mapping of the UCSC GTF file?

ADD REPLY

Login before adding your answer.

Traffic: 463 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6