Hi all,
I'm running Rsubread on a Win10 machine. When running Rsubread (v. 2.0.1) I need to index a cat genome and my output files are wrong. The .reads file is correct, but the .files file works for the first 15 chromosomes, then every entry after that is "-1" in column three. 
I'm stumped and have searched for days. Is it because I'm using a reference genome with "N" padding? The repeat Threshold for Rsubread defaults to 100, but shouldn't it just skip the Ns and move to genomic sequence? Why does it fail for EVERY chromosome following the first failure?
The results are the same whether I use the cat genome from Ensemble or from NCBI genome, so it isn't the chromosome names. My code is:
> ref <- "cat9.fasta"
> buildindex(basename = "cats_index", reference = ref, memory = 10000)
The only messages from Rsubread are:
Check the integrity of provided reference sequences ...                    ||
|| No format issues were found                                                ||
|| Scan uninformative subreads in reference sequences ...                     ||
|| **454688 uninformative subreads were found.**                                  ||
|| These subreads were excluded from index building. 
The output `cat_index.files` looks like:
A1  .//subread-index-sam-010708-934819  67
A2  .//subread-index-sam-010708-934819  246136063
A3  .//subread-index-sam-010708-934819  420465740
B1  .//subread-index-sam-010708-934819  566054919
B2  .//subread-index-sam-010708-934819  777738090
B3  .//subread-index-sam-010708-934819  935629173
B4  .//subread-index-sam-010708-934819  1087876913
C1  .//subread-index-sam-010708-934819  1234814487
C2  .//subread-index-sam-010708-934819  1461317866
D1  .//subread-index-sam-010708-934819  1625197636
D2  .//subread-index-sam-010708-934819  1744806531
D3  .//subread-index-sam-010708-934819  1836496368
D4  .//subread-index-sam-010708-934819  1934995377
E1  .//subread-index-sam-010708-934819  2033125790
E2  .//subread-index-sam-010708-934819  2097678790
E3  .//subread-index-sam-010708-934819  -1
F1  .//subread-index-sam-010708-934819  -1
F2  .//subread-index-sam-010708-934819  -1
MT  .//subread-index-sam-010708-934819  -1
X   .//subread-index-sam-010708-934819  -1
Thanks for any help!
I edited the title from "index reads" to "index files"
Pete

It sounds the index excessed the 31-bit limit. Can you run sessionInfo() to see what is the version and the architecture of R (eg Windows or Linux, 32-bit or 64-bit)?
Thanks so much Yang, Here is the sessionInfo:
In case anyone might be able to help, i have explored this further by using a few chromosomes concatenated together that span the location between chromosomes E2 and E3, and this smaller set of chromosomes will index with no problems and the
filesfile does not give a-1value for chromosome E3. This suggests it is not specific formatting of the chromosome files.I could really use some help!
Thanks, Peter
This is not solved for me yet. I used a random genome from NCBI last night (Anole lizard) and indexed it last night. The
.filesfile was perfect. Could someone test the cat genome on their system?Thanks,
Peter
This still is a problem and it would be a HUGE help if someone could let me know If I've done something wrong, or this genome has a peculiar feature.
Thanks,
Peter
It looks very much like 32-bit integer overflow -- the maximum signed 32 bit integer value is 2147483647, which is just above the last number before -1 printed above. In the short term it looks like you'll get around this by using a 64-bit platform; both linux & mac worked for me. I used the Bioconductor docker container to do this on my mac; maybe similar steps could work on Windows?
On my mac I created a little directory hierarchy
(maXXX is my user account, obfuscated) I then ran docker following the instructions at http://bioconductor.org/help/docker/, especially the 'Mounting Additional Volumes' section. Specifically I ran the command
The first
-vsays make my mac's directory tmp/cats/libraries available as the 'local' library where R will install new packages The second-vsays to share the place where I want my analysis data to be made available with the docker container under the rstudio account. The-eand-pcommands set up the environment in which the docker container's RStudio will run.After running the command, I can go to my web browser and visit http://127.0.0.1:8787. I'm asked for user name (rstudio) and password (meow, as above!) and then I'm in RStudio. I changed into the data directory
install RSubread
and then went about the business of building the index.
Maybe a little intimidating at first, and maybe some challenges on Windows that I don't have any experience with, but actually not too bad once you get the hang of it.
Martin that is a great write-up and I really appreciate it. My system is 64-bit and (as far as I know) I've never run up against a 32-bit integer overflow in my entire life (and I'm old!). I need some Docker experience so this is a great time. And you revealed another 32-bit integer overflow in the GTF file where some exon boundaries were giving me -2147483647 and crashing my suerat workflow. Windows does have some challenges, but I've been having pretty good luck with R and RStudio.
Had to use gappedIndex because my crappy little desktop only has 16 Gb RAM, but otherwise seems fine?
Had to snip out extraneous cruft to get past 5000 character limit. But I think it's pretty clear?
Oh, and
James, I sincerely appreciate this. It's been frustrating me for days. This lets me know the problem is my setup and your information was very helpful steering me to R 3.6.1 which I'm hoping will change my outputs. You (and Martin) really went out of your ways to help when I was completely out of ideas earlier today. Thanks!
Pete