Hi all,
I'm running Rsubread on a Win10 machine. When running Rsubread (v. 2.0.1) I need to index a cat genome and my output files are wrong. The .reads
file is correct, but the .files
file works for the first 15 chromosomes, then every entry after that is "-1" in column three.
I'm stumped and have searched for days. Is it because I'm using a reference genome with "N" padding? The repeat Threshold for Rsubread defaults to 100, but shouldn't it just skip the Ns and move to genomic sequence? Why does it fail for EVERY chromosome following the first failure?
The results are the same whether I use the cat genome from Ensemble or from NCBI genome, so it isn't the chromosome names. My code is:
> ref <- "cat9.fasta"
> buildindex(basename = "cats_index", reference = ref, memory = 10000)
The only messages from Rsubread are:
Check the integrity of provided reference sequences ... ||
|| No format issues were found ||
|| Scan uninformative subreads in reference sequences ... ||
|| **454688 uninformative subreads were found.** ||
|| These subreads were excluded from index building.
The output `cat_index.files` looks like:
A1 .//subread-index-sam-010708-934819 67
A2 .//subread-index-sam-010708-934819 246136063
A3 .//subread-index-sam-010708-934819 420465740
B1 .//subread-index-sam-010708-934819 566054919
B2 .//subread-index-sam-010708-934819 777738090
B3 .//subread-index-sam-010708-934819 935629173
B4 .//subread-index-sam-010708-934819 1087876913
C1 .//subread-index-sam-010708-934819 1234814487
C2 .//subread-index-sam-010708-934819 1461317866
D1 .//subread-index-sam-010708-934819 1625197636
D2 .//subread-index-sam-010708-934819 1744806531
D3 .//subread-index-sam-010708-934819 1836496368
D4 .//subread-index-sam-010708-934819 1934995377
E1 .//subread-index-sam-010708-934819 2033125790
E2 .//subread-index-sam-010708-934819 2097678790
E3 .//subread-index-sam-010708-934819 -1
F1 .//subread-index-sam-010708-934819 -1
F2 .//subread-index-sam-010708-934819 -1
MT .//subread-index-sam-010708-934819 -1
X .//subread-index-sam-010708-934819 -1
Thanks for any help!
I edited the title from "index reads" to "index files"
Pete
It sounds the index excessed the 31-bit limit. Can you run sessionInfo() to see what is the version and the architecture of R (eg Windows or Linux, 32-bit or 64-bit)?
Thanks so much Yang, Here is the sessionInfo:
In case anyone might be able to help, i have explored this further by using a few chromosomes concatenated together that span the location between chromosomes E2 and E3, and this smaller set of chromosomes will index with no problems and the
files
file does not give a-1
value for chromosome E3. This suggests it is not specific formatting of the chromosome files.I could really use some help!
Thanks, Peter
This is not solved for me yet. I used a random genome from NCBI last night (Anole lizard) and indexed it last night. The
.files
file was perfect. Could someone test the cat genome on their system?Thanks,
Peter
This still is a problem and it would be a HUGE help if someone could let me know If I've done something wrong, or this genome has a peculiar feature.
Thanks,
Peter
It looks very much like 32-bit integer overflow -- the maximum signed 32 bit integer value is 2147483647, which is just above the last number before -1 printed above. In the short term it looks like you'll get around this by using a 64-bit platform; both linux & mac worked for me. I used the Bioconductor docker container to do this on my mac; maybe similar steps could work on Windows?
On my mac I created a little directory hierarchy
(maXXX is my user account, obfuscated) I then ran docker following the instructions at http://bioconductor.org/help/docker/, especially the 'Mounting Additional Volumes' section. Specifically I ran the command
The first
-v
says make my mac's directory tmp/cats/libraries available as the 'local' library where R will install new packages The second-v
says to share the place where I want my analysis data to be made available with the docker container under the rstudio account. The-e
and-p
commands set up the environment in which the docker container's RStudio will run.After running the command, I can go to my web browser and visit http://127.0.0.1:8787. I'm asked for user name (rstudio) and password (meow, as above!) and then I'm in RStudio. I changed into the data directory
install RSubread
and then went about the business of building the index.
Maybe a little intimidating at first, and maybe some challenges on Windows that I don't have any experience with, but actually not too bad once you get the hang of it.
Martin that is a great write-up and I really appreciate it. My system is 64-bit and (as far as I know) I've never run up against a 32-bit integer overflow in my entire life (and I'm old!). I need some Docker experience so this is a great time. And you revealed another 32-bit integer overflow in the GTF file where some exon boundaries were giving me -2147483647 and crashing my suerat workflow. Windows does have some challenges, but I've been having pretty good luck with R and RStudio.
Had to use gappedIndex because my crappy little desktop only has 16 Gb RAM, but otherwise seems fine?
Had to snip out extraneous cruft to get past 5000 character limit. But I think it's pretty clear?
Oh, and
James, I sincerely appreciate this. It's been frustrating me for days. This lets me know the problem is my setup and your information was very helpful steering me to R 3.6.1 which I'm hoping will change my outputs. You (and Martin) really went out of your ways to help when I was completely out of ideas earlier today. Thanks!
Pete