Question: Indexing large chromosomes (like wheat) bam files in Rsamtools
0
6 months ago by
saripalligautam8610 wrote:

Hello All,

Is there a possible way to index bam files for large chromosomes(like wheat) using Rsamtools? Like in case of samtools using linux, .CSI (coordinate sorted index) is possible whereby .bam files for large chromosomes can be indexed. However, the .CSI index generated using samtools (through linux) is not compatible with MEDIPS or QSEA packages which are used to calculate CpG densities for MeDIP-Seq data.

Regards,

rsamtools csi index wheat • 211 views
written 6 months ago by saripalligautam8610
Rsamtools is based on the same htslib library as the regular samtools application, so I imagine if there's a fundamental issue then it will be the same in both cases. That said, perhaps you can update your post to include an example of what happens when you use a .CSI index with those packages to demonstrate the incompatibility.
1

Actually Rsamtools is based on an old code base; it is being updated to htslib this release.

Thanks for your responses. Below is the output which I receive when I try to analyse CpG density using CS code of MEDIPS. Similarly, instead of MEDIPS, when I use QSEA, i get the output error that the index is not found and hence the process terminates. Anyways, since, Martin has replied that Rsamtools is being updated, i think this issue will be probably resolved in the updated version. Could you please update whenever the RSamtools updated version is released?

CS = MEDIPS.couplingVector(pattern = "CG", refObj = medipHD0)

Get genomic sequence pattern positions...

Searching in chr1A ...[ 23590660 ] found.
Searching in chr1B ...[ 27672978 ] found.
Searching in chr1D ...[ 20896261 ] found.
Searching in chr2A ...[ 31176892 ] found.
Searching in chr2B ...[ 32155496 ] found.
Searching in chr2D ...[ 27560821 ] found.
Searching in chr3A ...[ 29825837 ] found.
Searching in chr3B ...[ 33423157 ] found.
Searching in chr3D ...[ 25973362 ] found.
Searching in chr4A ...[ 29541297 ] found.
Searching in chr4B ...[ 27007257 ] found.
Searching in chr4D ...[ 21603933 ] found.
Searching in chr5A ...[ 28266938 ] found.
Searching in chr5B ...[ 101245 ] found.
Searching in chr5D ...[ 175696 ] found.
Searching in chr6A ...[ 300449 ] found.
Searching in chr6B ...[ 436766 ] found.
Searching in chr6D ...[ 528029 ] found.
Searching in chr7A ...Error in .local(con, format, text, ...) : UCSC library operation failed
In .local(con, format, text, ...) :
twoBitReadSeqFrag in chr7A end (15566174) >= seqSize (15508575)

I don't know much about MEDIPS or the wheat genome, but this looks like the sort of error you see when using a different version of a genome for alignment compared to downstream operations.

Looking at the code it looks like this error occurs while calling MEDIPS.getPositions() which doesn't actually require the bam file. Out of interest, what is the output to MEDIPS:::genome_name(medipHD0) ?

I agree with Mike that this doesn't look like an issue related to Rsamtools. Please understand that It is very hard for the people watching this support site to help you if they cannot reproduce the error you're getting. So it would be of great help if you could provide a self-contained example that reproduces the error and that people can run. Also providing the output of your sessionInfo() can be useful and is highly recommended.

As for updating Rsamtools to use Rhtslib (which is a bundle of htslib 1.7): this is still a work-in-progress and, unfortunately, is unlikely to be ready for the BioC 3.8 release (scheduled in 9 days). We're planning to make this available at the beginning of the BioC 3.9 development cycle. This means that, until we release BioC 3.9 in Spring 2019, only people using BioC devel (which will require R 3.6) will be able to use it.

H.

Thanks Mike.

As Herve Pages and you pointed out, I too believe that this is the error due to reference genome which I am using.I also tried other packages to count the number of CpGs in the reference genome (irrespective of .bam files); there too I am getting the similar type of error message.Now, I am trying to do CpG density/coverage analysis using some codes in Linux/ubuntu. I will see if I succeed. However, you may suggest if I can make some changes to do the same in R.

Below is the output of MEDIPS:::genome_name(medipHD0).

This is the BSgenome package which I created using forgeBSgenome since the genome of wheat was not present in th available genomes in BSgenome package.Also the output of sessInfo() is shown below.

> MEDIPS:::genome_name(medipHD0)
[1] "BSgenome.Taestivum.Refseqv1.0"
> sessioInfo()
Error in sessioInfo() : could not find function "sessioInfo"
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ramwas_1.4.0                      filematrix_1.3                    Repitools_1.26.0                  MEDIPS_1.32.0                     Rsamtools_1.32.3                  BSgenome.Taestivum.Refseqv1.0_1.0
[7] BSgenome_1.48.0                   rtracklayer_1.40.6                Biostrings_2.48.0                 XVector_0.20.0                    GenomicRanges_1.32.7              GenomeInfoDb_1.16.0
[13] IRanges_2.14.12                   S4Vectors_0.18.3                  BiocGenerics_0.26.0               BiocInstaller_1.30.0