getSeq failing with DNAStringSet and GRanges
1
0
Entering edit mode
Mog • 0
@32b6b9ff
Last seen 13 months ago
United Kingdom

Hi all, I'm trying to run ORFik on a folder of .fastas. These fastas are outputted by an earlier python script, they're HTS contigs marked as viral in origin. Everything runs fine until I need to extract just the ORFs from the fastas. Following the manual, I ryn findORFsFasta. The for loop is just to find each file in turn. It finds the ORFs fine, and the BioStringSet generates fine, I can print them.

library(ORFik)
library(GenomicRanges)
library(BSgenome)
library(Biostrings)

all_fastas <- list.files("rma_fastas")
setwd("rma_fastas")

for (fasta in all_fastas)
    {
    fasta_dss <- readDNAStringSet(fasta, "fasta")
    ORFs <- findORFsFasta(fasta, startCodon = startDefinition(1), stopCodon = stopDefinition(2), longestORF = TRUE, minimumLength = 50, is.circular = FALSE)
    extracted_ORFs <- getSeq(fasta_dss, names = ORFs)
    }

But extracted_ORFs won't generate. It creates a massive error log that I just can't understand. I double checked the manual and it says for GRanges, names = (GRanges) is just fine. No clue what the problem is. I updated my R, my Bioconductor manager, and the packages, to no success. I'm running in Jupyter Lab because I don't have admin privs on this machine and Anaconda doesn't have the latest R Studio versions (or base R versions) for me to get around that problem.

Error in h(simpleError(msg, call)): error in evaluating the argument 'value' in selecting a method for function 'unsplit': error in evaluating the argument 'x' in selecting a method for function 'extractAt': subscript contains invalid names
Traceback:

1. getSeq(fasta_dss, names = ORFs)
2. getSeq(fasta_dss, names = ORFs)
3. .local(x, ...)
4. unsplit(extractAt(x[names(rl)], unname(rl)), seqnames(gr))
5. extractAt(x[names(rl)], unname(rl))
6. x[names(rl)]
7. x[names(rl)]
8. subset_along_ROWS(x, i, drop = drop)
9. extractROWS(x, i)
10. extractROWS(x, i)
11. .dropUnusedPoolElts(callNextMethod())
12. callNextMethod()
13. .nextMethod(x = x, i = i)
14. normalizeSingleBracketSubscript(i, x, as.NSBS = TRUE)
15. NSBS(i, x, exact = exact, strict.upper.bound = !allow.append, 
  .     allow.NAs = allow.NAs)
16. NSBS(i, x, exact = exact, strict.upper.bound = !allow.append, 
  .     allow.NAs = allow.NAs)
17. .subscript_error("subscript contains invalid ", what)
18. stop(wmsg(...), call. = FALSE)
19. .handleSimpleError(function (cond) 
  . .Internal(C_tryCatchHelper(addr, 1L, cond)), "subscript contains invalid names", 
  .     base::quote(NULL))
20. h(simpleError(msg, call))
21. .handleSimpleError(function (cond) 
  . .Internal(C_tryCatchHelper(addr, 1L, cond)), "error in evaluating the argument 'x' in selecting a method for function 'extractAt': subscript contains invalid names", 
  .     base::quote(h(simpleError(msg, call))))
22. h(simpleError(msg, call))

Here's the specs, but I get the same problem running an anaconda build with a more outdated set.

R version 4.2.1 (2022-06-23 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale: [1] LC_COLLATE=English_United Kingdom.utf8 [2] LC_CTYPE=English_United Kingdom.utf8
[3] LC_MONETARY=English_United Kingdom.utf8 [4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.utf8

attached base packages: [1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] fansi_1.0.3 crayon_1.5.2 digest_0.6.29 utf8_1.2.2
[5] IRdisplay_1.1 repr_1.1.4 lifecycle_1.0.2 jsonlite_1.8.0 [9] evaluate_0.16 pillar_1.8.1 rlang_1.0.6 cli_3.4.1
[13] uuid_1.1-0 vctrs_0.4.1 IRkernel_1.3 tools_4.2.1
[17] glue_1.6.2 fastmap_1.1.0 compiler_4.2.1 base64enc_0.1-3 [21] pbdZMQ_0.3-7 htmltools_0.5.3

Any help would be appreciated. I've been fighting with this all day and I do really need ORFik for what I want to do; no other package really does what I require.

EDIT: I have access to a Linux server but as far as actually writing and testing goes, it's a method of last resort! I prefer to get things working on Windows then stitch them in since the server has no graphics. EDIT 2: This also fails. What is my error here? It's not a version problem I suppose?

ORFik Bioconductor GenomicRanges Biostrings BSgenome • 2.5k views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 13 minutes ago
United States

The example for findORFsFasta has this

Examples:

     # location of the example fasta file
     example_genome <- system.file("extdata/Danio_rerio_sample", "genome_dummy.fasta",
      package = "ORFik")
     orfs <- findORFsFasta(example_genome)
     # To store ORF sequences (you need indexed genome .fai file):
     fa <- FaFile(example_genome)
     names(orfs) <- paste0("ORF_", seq.int(length(orfs)), "_", seqnames(orfs))
     orf_seqs <- getSeq(fa, orfs)
     # You sequences (fa), needs to have isCircular(fa) == TRUE for it to work
     # on circular wrapping ranges!

     # writeXStringSet(DNAStringSet(orf_seqs), "orfs.fasta")

And note that fa in this example is a FaFile, not a DNAStringSet. If I run the example

> getSeq(fa, orfs)
DNAStringSet object of length 1235:
       width seq                                            names               
   [1]    33 CTGCAACGGGCAATATGTCTCTGTGTGGATTAA              Chromosome
   [2]    42 CTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGA     Chromosome
   [3]    66 ATGAAACGCATTAGCACCACCA...ACAGGTAACGGTGCGGGCTGA Chromosome
   [4]  2463 ATGCGAGTGTTGAAGTTCGGCG...TCATGGAAGTTAGGAGTCTGA Chromosome
   [5]    30 CTGGGAGCGTTTTTGCCAGGAACTGGGTAA                 Chromosome
   ...   ... ...
[1231]    12 TTGGCGATGTGA                                   Chromosome
[1232]    24 TTGTCCATCCCAGTGCCGTCTTGA                       Chromosome
[1233]    12 TTGAGGGGTTAA                                   Chromosome
[1234]   894 ATGAAACAGCAGGGAACTACCC...CCGGTGTGGAAAGGACGTTAA Chromosome
[1235]   375 CTGGAAGCCGATGGCTGGCTGC...ATTAGAAAGAATCTGAAATAA Chromosome

But if I do what you have done

> z <- readDNAStringSet(path(fa))
> getSeq(z, orfs)
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'value' in selecting a method for function 'unsplit': error in evaluating the argument 'x' in selecting a method for function 'extractAt': subscript contains invalid names

Which shows why perusing the help page, including examples, is a good idea.

ADD COMMENT
0
Entering edit mode

Ah thanks for that. I was just using the manual .pdf, I misread what it said about DNAStringSets as being an input for some reason. I haven't used R in a while. Thanks so much for getting back to me so quickly and pointing me in the right direction despite my rookie error!

ADD REPLY

Login before adding your answer.

Traffic: 893 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6