I'm running Rsubread on a Win10 machine. When running Rsubread (v. 2.0.1) I need to index a cat genome and my output files are wrong. The
.reads file is correct, but the
.files file works for the first 15 chromosomes, then every entry after that is "-1" in column three.
I'm stumped and have searched for days. Is it because I'm using a reference genome with "N" padding? The repeat Threshold for Rsubread defaults to 100, but shouldn't it just skip the Ns and move to genomic sequence? Why does it fail for EVERY chromosome following the first failure?
The results are the same whether I use the cat genome from Ensemble or from NCBI genome, so it isn't the chromosome names. My code is:
> ref <- "cat9.fasta" > buildindex(basename = "cats_index", reference = ref, memory = 10000)
The only messages from Rsubread are:
Check the integrity of provided reference sequences ... || || No format issues were found || || Scan uninformative subreads in reference sequences ... || || **454688 uninformative subreads were found.** || || These subreads were excluded from index building.
The output `cat_index.files` looks like: A1 .//subread-index-sam-010708-934819 67 A2 .//subread-index-sam-010708-934819 246136063 A3 .//subread-index-sam-010708-934819 420465740 B1 .//subread-index-sam-010708-934819 566054919 B2 .//subread-index-sam-010708-934819 777738090 B3 .//subread-index-sam-010708-934819 935629173 B4 .//subread-index-sam-010708-934819 1087876913 C1 .//subread-index-sam-010708-934819 1234814487 C2 .//subread-index-sam-010708-934819 1461317866 D1 .//subread-index-sam-010708-934819 1625197636 D2 .//subread-index-sam-010708-934819 1744806531 D3 .//subread-index-sam-010708-934819 1836496368 D4 .//subread-index-sam-010708-934819 1934995377 E1 .//subread-index-sam-010708-934819 2033125790 E2 .//subread-index-sam-010708-934819 2097678790 E3 .//subread-index-sam-010708-934819 -1 F1 .//subread-index-sam-010708-934819 -1 F2 .//subread-index-sam-010708-934819 -1 MT .//subread-index-sam-010708-934819 -1 X .//subread-index-sam-010708-934819 -1
Thanks for any help!
I edited the title from "index reads" to "index files"
It sounds the index excessed the 31-bit limit. Can you run sessionInfo() to see what is the version and the architecture of R (eg Windows or Linux, 32-bit or 64-bit)?
Thanks so much Yang, Here is the sessionInfo:
In case anyone might be able to help, i have explored this further by using a few chromosomes concatenated together that span the location between chromosomes E2 and E3, and this smaller set of chromosomes will index with no problems and the
filesfile does not give a
-1value for chromosome E3. This suggests it is not specific formatting of the chromosome files.
I could really use some help!
This is not solved for me yet. I used a random genome from NCBI last night (Anole lizard) and indexed it last night. The
.filesfile was perfect. Could someone test the cat genome on their system?
This still is a problem and it would be a HUGE help if someone could let me know If I've done something wrong, or this genome has a peculiar feature.
It looks very much like 32-bit integer overflow -- the maximum signed 32 bit integer value is 2147483647, which is just above the last number before -1 printed above. In the short term it looks like you'll get around this by using a 64-bit platform; both linux & mac worked for me. I used the Bioconductor docker container to do this on my mac; maybe similar steps could work on Windows?
On my mac I created a little directory hierarchy
(maXXX is my user account, obfuscated) I then ran docker following the instructions at http://bioconductor.org/help/docker/, especially the 'Mounting Additional Volumes' section. Specifically I ran the command
-vsays make my mac's directory tmp/cats/libraries available as the 'local' library where R will install new packages The second
-vsays to share the place where I want my analysis data to be made available with the docker container under the rstudio account. The
-pcommands set up the environment in which the docker container's RStudio will run.
After running the command, I can go to my web browser and visit http://127.0.0.1:8787. I'm asked for user name (rstudio) and password (meow, as above!) and then I'm in RStudio. I changed into the data directory
and then went about the business of building the index.
Maybe a little intimidating at first, and maybe some challenges on Windows that I don't have any experience with, but actually not too bad once you get the hang of it.
Martin that is a great write-up and I really appreciate it. My system is 64-bit and (as far as I know) I've never run up against a 32-bit integer overflow in my entire life (and I'm old!). I need some Docker experience so this is a great time. And you revealed another 32-bit integer overflow in the GTF file where some exon boundaries were giving me -2147483647 and crashing my suerat workflow. Windows does have some challenges, but I've been having pretty good luck with R and RStudio.
Had to use gappedIndex because my crappy little desktop only has 16 Gb RAM, but otherwise seems fine?
Had to snip out extraneous cruft to get past 5000 character limit. But I think it's pretty clear?
James, I sincerely appreciate this. It's been frustrating me for days. This lets me know the problem is my setup and your information was very helpful steering me to R 3.6.1 which I'm hoping will change my outputs. You (and Martin) really went out of your ways to help when I was completely out of ideas earlier today. Thanks!