How to use subSeq with DNAStringSet and information of GroupedIRanges?
1
0
Entering edit mode
Sophie • 0
@sophie-13085
Last seen 7.2 years ago

Hi,

I'm totally new working with Bioconductor and hope you can help me with my following problem.

My data is a "Large DataFrame" including different DNA sequences (seq) in every row. These are from the "DNAStringSet" class from "Biostring"-package. One variable (pos) contains the information of the position of the first nucleobase. The goal is to filter out one nucleobase at one specific position. This position is not included in every row and each row does not start at the same position. So the distance between the starting position of the row and the position I'm looking for is varying. The position information is as well stored in the @ranges, which is from the class "GroupedIRanges" of the "XVector"-package. So I tryed using the subseq-function:

data$subseq_test <- subseq(data$seq@ranges, start = 6, end = 6)   # not working

Fehler in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘subseq’ for signature ‘"GroupedIRanges"’

Without @ranges:

data$subseq_test <- subseq(data$seq, start = 6, end = 6) 

# working, but not the way I want. It gives me the 6th nucleobase of every row counting from 1 from the beginning

As I read in another post, @ranges should not be used. My question is, how I can get this one position?

Here you can see some information of the data:

https://www.dropbox.com/s/ktffp9s7mm183hu/data.png?dl=0

And my sessioninfo:

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X Yosemite 10.10.5

locale:
[1] de_DE.UTF-8/de_DE.UTF-8/de_DE.UTF-8/C/de_DE.UTF-8/de_DE.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Biostrings_2.42.1    XVector_0.14.1       IRanges_2.8.2        S4Vectors_0.12.2     BiocGenerics_0.20.0  BiocInstaller_1.24.0
subseq Xvector dnastringset • 4.9k views
ADD COMMENT
0
Entering edit mode

If 'pos' is a vector telling you the location of the base you're interested in, can you achieve what you want with something along the lines of subseq(data$seq, start=pos, width=1) or, if pos is only defined for certain rows, then you may need to subset data first to be those rows with a defined pos.

ADD REPLY
0
Entering edit mode

Thank you, Gavin, for your answer! The idea with using width was a good one. But for start I needed to calculate a new variable, that calculates me the actual position for every line, starting counting from one. As I recognized, this is also not the „real“ solution, cause there are some insertions and delations that needs to be consider as well for the position calculating.

ADD REPLY
0
Entering edit mode

In general this solution is working, but my start position is a position, which is not in every DNAString of every data row. So the start position in my case has NA values. If I just use the complied cases, so dropping all the rows with NA in the start position variable, than the code works fine. For my further analysis I need the other rows as well, so I am not allowed to drop the NA cases. That is why I tried the following:

data$subseq <- ifelse(is.na(data$pos), NA, subseq(data$seq, start = pos, width = 1))
# Error in .Call2("solve_user_SEW", refwidths, start, end, width, translate.negative.coord,  :
  solving row 1: either the supplied start or the supplied end (but not both) must be NA when the supplied width is not NA

I do not get the error here. The ifelse, should skip the the NA rows for the subseq-function, but as it seams (in row 1 pos is NA), it is not doing, what it should do. Would be great, if somebody has a reason or solution for my problem.

ADD REPLY
2
Entering edit mode
@herve-pages-1542
Last seen 9 hours ago
Seattle, WA, United States

Hi Sophie,

An important thing to keep in mind is that trying to access the internals of an S4 object with @ is almost always a bad idea. In Bioconductor, all S4 objects should be accessed/manipulated via accessor functions or other functions or methods. In the case of DNAStringSet objects, the only accessors that are provided are length(), names()width(), and [[. Think of these objects as the equivalent of ordinary character vectors in base R. How the sequences are stored inside a DNAStringSet object is really the internal business of the Biostrings package and the end user should not worry or care about this. There are many functions to operate on these objects. Please see a summary of all supported operations in the Biostrings Quick Overview document listed on the landing page of the Biostrings package:

https://bioconductor.org/packages/Biostrings

For your particular problem, I would suggest that you take a look at extractAt() and replaceAt() in table 3 (Sequence transformation and editing). Make sure to check the examples: it's very likely that one of them is doing almost exactly what you need.

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Hi Hervé,

sorry for the late response. Thank you for the detailed description for not using the internals of a S4 object. It helped me understanding more about R and Bioconductor.
My data is more complex, than I thought in the beginning. So I first need to figure out, if extractAT() could be a solution for me now. But it seems, it could be.

ADD REPLY

Login before adding your answer.

Traffic: 708 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6