From a DNAStringSet to a data.frame
@ramirobarrantes-7796

I think this is a very easy question. I have a DNAStringSet, for example:

DNAStringSet object of length 5:
    width seq                                                                         names
[1]  1701 ATGAAGGCAAACCTACTGGTCCTGTTATGTGCACTT...CTAATGGATCTTTGCAGTGCAGAATATGCATCTGA Seq1
[2]  1701 ATGAAGGCAAACCTACTGGTCCTGTTATGTGCACTT...CTAATGGATCTTTGCAGTGCAGAATATGCATCTGA Seq2
[3]  1701 ATGAAGGCAAACCTACTGGTCCTGTTATGTGCACTT...CTAATGGATCTTTGCAGTGCAGAATATGCATCTGA Seq3
[4]  1701 ATGAAGGCAATACTAGTAGTTCTGCTATATACATTT...CTAATGGGTCTCTACAGTGTAGAATATGTATTTAA Seq4
[5]  1701 ATGAAGGCAAACCTACTGGTCCTGTTATGTGCACTT...CTAATGGATCTTTGCAGTGCAGAATATGCATCTGA Seq5

And I would like to transform that into a data frame, e.g.:

Pos Seq1 Seq2 Seq3 Seq4 Seq5
1   A    A    A    A    A
2   T    T    T    T    T
....etc

But I can't seem to find a way to do it efficiently. Any suggestions? I am using a combination of extractAt and other things, but everything seems very cumbersome.

(Ultimately, I also want to remove identical columns and export it all to a CSV file, but I haven't been able to get past this step to my satisfaction.)

Thank you

@herve-pages-1542

Hi,

Normally as.data.frame() will do and will produce a data.frame with 1 row per sequence in the DNAStringSet object.

However it seems that you want to make a data.frame with 1 _column_ per sequence in the DNAStringSet, which is only possible here because all the sequences have the same length (constant-width DNAStringSet). You can do this by first turning your DNAStringSet object into a matrix with as.matrix(), then transposing, then turning the transposed matrix into a data.frame with as.data.frame():

library(Biostrings)

dna <- DNAStringSet(c(Seq1="ACCAA", Seq2="TTTTA", Seq3="AACGG", Seq4="CTCGT"))

dna
# DNAStringSet object of length 4:
#     width seq                                               names               
# [1]     5 ACCAA                                             Seq1
# [2]     5 TTTTA                                             Seq2
# [3]     5 AACGG                                             Seq3
# [4]     5 CTCGT                                             Seq4

as.matrix(dna)
#      [,1] [,2] [,3] [,4] [,5]
# Seq1 "A"  "C"  "C"  "A"  "A" 
# Seq2 "T"  "T"  "T"  "T"  "A" 
# Seq3 "A"  "A"  "C"  "G"  "G" 
# Seq4 "C"  "T"  "C"  "G"  "T" 

as.data.frame(t(as.matrix(dna)))
#   Seq1 Seq2 Seq3 Seq4
# 1    A    T    A    C
# 2    C    T    A    T
# 3    C    T    C    C
# 4    A    T    G    G
# 5    A    A    G    T
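
For the follow-up steps mentioned in the question (removing identical columns and exporting to CSV), here is a minimal base-R sketch. It assumes "identical columns" means sequences that are exact duplicates of an earlier one. The data frame is typed in by hand to match the output above, with a hypothetical Seq5 added as a copy of Seq1 so there is something to remove; the output file name is also just an illustration.

```r
# Data frame shaped like the output above: one column per sequence,
# one row per position. Seq5 is a hypothetical duplicate of Seq1.
df <- data.frame(
  Seq1 = c("A", "C", "C", "A", "A"),
  Seq2 = c("T", "T", "T", "T", "A"),
  Seq3 = c("A", "A", "C", "G", "G"),
  Seq4 = c("C", "T", "C", "G", "T"),
  Seq5 = c("A", "C", "C", "A", "A"),
  stringsAsFactors = FALSE
)

# duplicated() on a list flags every element that is identical to an
# earlier one, so this keeps only the first copy of each distinct column.
dedup <- df[, !duplicated(as.list(df)), drop = FALSE]

# Add an explicit position column, then write the result as CSV.
# The file name is an assumption; a temporary directory keeps the sketch tidy.
out <- cbind(Pos = seq_len(nrow(dedup)), dedup)
out_file <- file.path(tempdir(), "sequences_by_position.csv")
write.csv(out, out_file, row.names = FALSE)
```

With real data, df would be the result of as.data.frame(t(as.matrix(dna))). Dropping duplicates at the sequence level before the conversion, for example with unique() on the DNAStringSet, may be a simpler route if that is what is wanted.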

Hope this helps,

H.

Thank you Hervé, this is exactly what I needed.
