Question

From a DNAStringSet to a data.frame


ramiro.barrantes ▴ 10

@ramirobarrantes-7796


United States

I think I have a very easy question. I have a DNAStringSet, for example:

DNAStringSet object of length 5:
    width seq                                                                        names
[1]  1701 ATGAAGGCAAACCTACTGGTCCTGTTATGTGCACTT...CTAATGGATCTTTGCAGTGCAGAATATGCATCTGA Seq1
[2]  1701 ATGAAGGCAAACCTACTGGTCCTGTTATGTGCACTT...CTAATGGATCTTTGCAGTGCAGAATATGCATCTGA Seq2
[3]  1701 ATGAAGGCAAACCTACTGGTCCTGTTATGTGCACTT...CTAATGGATCTTTGCAGTGCAGAATATGCATCTGA Seq3
[4]  1701 ATGAAGGCAATACTAGTAGTTCTGCTATATACATTT...CTAATGGGTCTCTACAGTGTAGAATATGTATTTAA Seq4
[5]  1701 ATGAAGGCAAACCTACTGGTCCTGTTATGTGCACTT...CTAATGGATCTTTGCAGTGCAGAATATGCATCTGA Seq5

And I would like to transform that into a data frame, e.g.:

Pos Seq1 Seq2 Seq3 Seq4 Seq5
1   A    A    A    A    A
2   T    T    T    T    T
....etc

But I can't seem to find a way to do it efficiently. Any suggestions? I am using a combination of extractAt and other things, but everything seems very cumbersome.

(Ultimately, I also want to remove identical columns and export it all to a CSV file, but I haven't been able to get past this step to my satisfaction.)

Thank you

Biostrings • 2.4k views

ADD COMMENT • link 21 months ago ramiro.barrantes ▴ 10

score 2 · Accepted Answer · 2024-03-25

Hi,

Normally as.data.frame() will do and will produce a data.frame with 1 row per sequence in the DNAStringSet object.

However it seems that you want to make a data.frame with 1 _column_ per sequence in the DNAStringSet, which is only possible here because all the sequences have the same length (constant-width DNAStringSet). You can do this by first turning your DNAStringSet object into a matrix with as.matrix(), then transposing, then turning the transposed matrix into a data.frame with as.data.frame():

library(Biostrings)

dna <- DNAStringSet(c(Seq1="ACCAA", Seq2="TTTTA", Seq3="AACGG", Seq4="CTCGT"))

dna
# DNAStringSet object of length 4:
#     width seq                                               names               
# [1]     5 ACCAA                                             Seq1
# [2]     5 TTTTA                                             Seq2
# [3]     5 AACGG                                             Seq3
# [4]     5 CTCGT                                             Seq4

as.matrix(dna)
#      [,1] [,2] [,3] [,4] [,5]
# Seq1 "A"  "C"  "C"  "A"  "A" 
# Seq2 "T"  "T"  "T"  "T"  "A" 
# Seq3 "A"  "A"  "C"  "G"  "G" 
# Seq4 "C"  "T"  "C"  "G"  "T" 

as.data.frame(t(as.matrix(dna)))
#   Seq1 Seq2 Seq3 Seq4
# 1    A    T    A    C
# 2    C    T    A    T
# 3    C    T    C    C
# 4    A    T    G    G
# 5    A    A    G    T
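
For the follow-up steps mentioned in the question (dropping identical columns and exporting to CSV), something along these lines might work. This is only a sketch: it assumes "identical columns" means sequences whose content is exactly the same, the toy data (with Seq2 deliberately duplicating Seq1) and the output path are made up for illustration, and the Pos column is added by hand.

```r
library(Biostrings)

# Toy input (hypothetical); Seq2 duplicates Seq1 on purpose
dna <- DNAStringSet(c(Seq1="ACCAA", Seq2="ACCAA", Seq3="AACGG", Seq4="CTCGT"))

# The matrix route only makes sense for a constant-width set, so check first
stopifnot(length(unique(width(dna))) == 1)

# One column per sequence, one row per position
df <- as.data.frame(t(as.matrix(dna)), stringsAsFactors = FALSE)

# Keep only the first occurrence of each distinct column
df_unique <- df[, !duplicated(as.list(df)), drop = FALSE]

# Add an explicit position column and write everything to a CSV file
out <- cbind(Pos = seq_len(nrow(df_unique)), df_unique)
out_file <- tempfile(fileext = ".csv")  # hypothetical path, replace with your own
write.csv(out, out_file, row.names = FALSE)
```

If what you actually want is to drop positions where all sequences agree, that would be a filter on the rows of df rather than on its columns.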

Hope this helps,

H.