Question

negative one "-1" values in index files using Rsubread

0

Entering edit mode

peter.r.hoyt • 0

@peterrhoyt-22385

Last seen 21 months ago

United States

Hi all, I'm running Rsubread on a Win10 machine. When running Rsubread (v. 2.0.1) I need to index a cat genome and my output files are wrong. The .reads file is correct, but the .files file works for the first 15 chromosomes, then every entry after that is "-1" in column three.

I'm stumped and have searched for days. Is it because I'm using a reference genome with "N" padding? The repeat Threshold for Rsubread defaults to 100, but shouldn't it just skip the Ns and move to genomic sequence? Why does it fail for EVERY chromosome following the first failure?

The results are the same whether I use the cat genome from Ensemble or from NCBI genome, so it isn't the chromosome names. My code is:

> ref <- "cat9.fasta"
> buildindex(basename = "cats_index", reference = ref, memory = 10000)

The only messages from Rsubread are:

Check the integrity of provided reference sequences ...                    ||
|| No format issues were found                                                ||
|| Scan uninformative subreads in reference sequences ...                     ||
|| **454688 uninformative subreads were found.**                                  ||
|| These subreads were excluded from index building.

The output `cat_index.files` looks like:
A1  .//subread-index-sam-010708-934819  67
A2  .//subread-index-sam-010708-934819  246136063
A3  .//subread-index-sam-010708-934819  420465740
B1  .//subread-index-sam-010708-934819  566054919
B2  .//subread-index-sam-010708-934819  777738090
B3  .//subread-index-sam-010708-934819  935629173
B4  .//subread-index-sam-010708-934819  1087876913
C1  .//subread-index-sam-010708-934819  1234814487
C2  .//subread-index-sam-010708-934819  1461317866
D1  .//subread-index-sam-010708-934819  1625197636
D2  .//subread-index-sam-010708-934819  1744806531
D3  .//subread-index-sam-010708-934819  1836496368
D4  .//subread-index-sam-010708-934819  1934995377
E1  .//subread-index-sam-010708-934819  2033125790
E2  .//subread-index-sam-010708-934819  2097678790
E3  .//subread-index-sam-010708-934819  -1
F1  .//subread-index-sam-010708-934819  -1
F2  .//subread-index-sam-010708-934819  -1
MT  .//subread-index-sam-010708-934819  -1
X   .//subread-index-sam-010708-934819  -1

Thanks for any help!

I edited the title from "index reads" to "index files"

Pete

software error rsubread • 2.3k views

ADD COMMENT • link updated 5.8 years ago by Yang Liao ▴ 450 • written 5.8 years ago by peter.r.hoyt • 0

0

Entering edit mode

It sounds the index excessed the 31-bit limit. Can you run sessionInfo() to see what is the version and the architecture of R (eg Windows or Linux, 32-bit or 64-bit)?

ADD REPLY • link 5.8 years ago Yang Liao ▴ 450

0

Entering edit mode

Thanks so much Yang, Here is the sessionInfo:

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

ADD REPLY • link 5.8 years ago peter.r.hoyt • 0

0

Entering edit mode

In case anyone might be able to help, i have explored this further by using a few chromosomes concatenated together that span the location between chromosomes E2 and E3, and this smaller set of chromosomes will index with no problems and the files file does not give a -1 value for chromosome E3. This suggests it is not specific formatting of the chromosome files.

I could really use some help!

Thanks, Peter

ADD REPLY • link 5.8 years ago peter.r.hoyt • 0

0

Entering edit mode

This is not solved for me yet. I used a random genome from NCBI last night (Anole lizard) and indexed it last night. The .files file was perfect. Could someone test the cat genome on their system?

wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Felis_catus/latest_assembly_versions/GCF_000181335.3_Felis_catus_9.0/GCF_000181335.3_Felis_catus_9.0_genomic.fna.gz

ftp://ftp.ensembl.org/pub/release-99/fasta/felis_catus/dna/Felis_catus.Felis_catus_9.0.dna.toplevel.fa.gz

Thanks,

Peter

ADD REPLY • link 5.8 years ago peter.r.hoyt • 0

0

Entering edit mode

This still is a problem and it would be a HUGE help if someone could let me know If I've done something wrong, or this genome has a peculiar feature.

Thanks,

Peter

ADD REPLY • link 5.8 years ago peter.r.hoyt • 0

0

Entering edit mode

It looks very much like 32-bit integer overflow -- the maximum signed 32 bit integer value is 2147483647, which is just above the last number before -1 printed above. In the short term it looks like you'll get around this by using a 64-bit platform; both linux & mac worked for me. I used the Bioconductor docker container to do this on my mac; maybe similar steps could work on Windows?

On my mac I created a little directory hierarchy

/Users/maXXX$ tree tmp/cats
tmp/cats
├── data
│   └── GCF_000181335.3_Felis_catus_9.0_genomic.fna.gz
└── libraries

(maXXX is my user account, obfuscated) I then ran docker following the instructions at http://bioconductor.org/help/docker/, especially the 'Mounting Additional Volumes' section. Specifically I ran the command

docker run -it \
       -v /Users/maXXX/tmp/cats/libraries:/usr/local/lib/R/host-site-library \
       -v /Users/maXXX/tmp/cats/data:/home/rstudio/data \
       -e PASSWORD=meow \
       -p 8787:8787 \
       bioconductor/bioconductor_docker:RELEASE_3_10

The first -v says make my mac's directory tmp/cats/libraries available as the 'local' library where R will install new packages The second -v says to share the place where I want my analysis data to be made available with the docker container under the rstudio account. The -e and -p commands set up the environment in which the docker container's RStudio will run.

After running the command, I can go to my web browser and visit http://127.0.0.1:8787. I'm asked for user name (rstudio) and password (meow, as above!) and then I'm in RStudio. I changed into the data directory

setwd("/home/rstudio/data")

install RSubread

BiocManager::install("Rsubread")

and then went about the business of building the index.

Maybe a little intimidating at first, and maybe some challenges on Windows that I don't have any experience with, but actually not too bad once you get the hang of it.

ADD REPLY • link 5.8 years ago Martin Morgan 25k

0

Entering edit mode

Martin that is a great write-up and I really appreciate it. My system is 64-bit and (as far as I know) I've never run up against a 32-bit integer overflow in my entire life (and I'm old!). I need some Docker experience so this is a great time. And you revealed another 32-bit integer overflow in the GTF file where some exon boundaries were giving me -2147483647 and crashing my suerat workflow. Windows does have some challenges, but I've been having pretty good luck with R and RStudio.

ADD REPLY • link 5.8 years ago peter.r.hoyt • 0

0

Entering edit mode

> download.file("https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Felis_catus/latest_assembly_versions/GCF_000181335.3_Felis_catus_9.0/GCF_000181335.3_Felis_catus_9.0_genomic.fna.gz","cat.fna.gz")
> buildindex("whatevs/cat", "cat.fna.gz", gappedIndex = TRUE)
<SNIP>
       Rsubread 2.0.1

//================================= setting ==================================\\
||                                                                            ||
||                Index name : cat                                            ||
||               Index space : base space                                     ||
||               Index split : no-split                                       ||
||          Repeat threshold : 100 repeats                                    ||
||              Gapped index : yes                                            ||
||                                                                            ||
||       Free / total memory : 12.2GB / 15.9GB                                ||
||                                                                            ||
||               Input files : 1 file in total                                ||
||                             o cat.fna.gz                                   ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Check the integrity of provided reference sequences ...                    ||
|| No format issues were found                                                ||
|| Scan uninformative subreads in reference sequences ...                     ||
|| 139772 uninformative subreads were found.                                  ||
|| These subreads were excluded from index building.                          ||
|| Estimate the index size...                                                 ||
||    8%,   0 mins elapsed, rate=6912.7k bps/s                                ||
||   16%,   0 mins elapsed, rate=7346.8k bps/s                                ||
||   24%,   1 mins elapsed, rate=7454.2k bps/s                                ||
<SNIP>
||   91%,   4 mins elapsed, rate=7739.3k bps/s                                ||
|| 5.2 GB of memory is needed for index building.                             ||
|| Build the index...                                                         ||
||    8%,   0 mins elapsed, rate=4409.5k bps/s                                ||
||   16%,   1 mins elapsed, rate=4595.6k bps/s                                ||
<SNIP>
||   91%,   8 mins elapsed, rate=4759.9k bps/s                                ||
|| Save current index block...                                                ||
||  [ 0.0% finished ]                                                         ||
||  [ 10.0% finished ]                                                        ||
||  [ 20.0% finished ]                                                        ||
<SNIP>
||  [ 100.0% finished ]                                                       ||
||                                                                            ||
||                     Total running time: 24.5 minutes.                      ||
||     Index C:\Users\jmacdon\Desktop\whatevs\cat was successfully built!     ||
||                                                                            ||
\\============================================================================//

> system("ls -lah whatevs")
total 4.9G
drwxr-xr-x 1 jmacdon None    0 Mar 16 14:55 .
drwxr-xr-x 1 jmacdon None    0 Mar 16 14:55 ..
-rw-r--r-- 1 jmacdon None 604M Mar 16 14:55 cat.00.b.array
-rw-r--r-- 1 jmacdon None 4.3G Mar 16 14:55 cat.00.b.tab
-rw-r--r-- 1 jmacdon None 240K Mar 16 14:53 cat.files
-rw-r--r-- 1 jmacdon None    0 Mar 16 14:31 cat.log
-rw-r--r-- 1 jmacdon None 115K Mar 16 14:55 cat.reads
[1] 0

Had to use gappedIndex because my crappy little desktop only has 16 Gb RAM, but otherwise seems fine?

ADD REPLY • link 5.8 years ago James W. MacDonald 68k

0

Entering edit mode

Had to snip out extraneous cruft to get past 5000 character limit. But I think it's pretty clear?

ADD REPLY • link 5.8 years ago James W. MacDonald 68k

0

Entering edit mode

Oh, and

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_2.0.1

loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1

ADD REPLY • link 5.8 years ago James W. MacDonald 68k

0

Entering edit mode

James, I sincerely appreciate this. It's been frustrating me for days. This lets me know the problem is my setup and your information was very helpful steering me to R 3.6.1 which I'm hoping will change my outputs. You (and Martin) really went out of your ways to help when I was completely out of ideas earlier today. Thanks!

Pete

ADD REPLY • link 5.8 years ago peter.r.hoyt • 0

score 1 · Answer 1 · 2020-03-17

Hi Peter,

Thanks for the bug report. The "files" file in a Subread index has been deprecated and its content isn't used in any task. This file used to record the offsets of each contig in the input FASTA file, but we then changed our strategy to create a temporary FASTA file ("subread-index-sam-010708-934819" in your case) and then delete it after the index is built. This makes the offsets in the "files" file useless because the offsets describe byte-locations in the deleted temporary file.

I checked the code and found that the "-1" offsets are caused by a 31-bit offset limit in Windows. This will not cause any problem in other parts in the index, hence read mapping should work well on this index.

Cheers, Yang