Search
Question: Handling DNA sequence loaded from a file
0
2.6 years ago by
India
Agaz Hussain Wani260 wrote:

I have a file with DNA sequence of length 900 characters. I use Biostrings package to read the sequence from the file into R.

library(Biostrings)

ref <- readDNAStringSet(filepath)

I get something like

A DNAStringSet instance of length 1
width seq                               names
[1]   900 AACTGGTTACCTGCCGTGAGTAAATTAAAATT...GACGCAACGGTTCCGACTACTCTGCTGCGGTG AGCTTTTCATTCTGACT...

I would like to know, how to get the full DNA sequence into other variable. How to access the full sequence.

written 2.6 years ago by Agaz Hussain Wani260

It is there, only some of the letters are showing. What do you want to do with your sequence now that you have read it in?

I want to have the full sequence as a character string and do some operation on that. Suppose I want to display the first 500 characters or do some string operation on that.

1

On the one hand you could display them with as.character(ref) but if you wanted to do something you'd typically use a function, e.g., reverseComplement(ref) or narrow(ref, 500). This would return another DNAStringSet, which you could then coerce to character if you wanted, but really there's not much to be gained by looking at 500 letters (or 3 billion, or 10 million 100-mers) with the naked eye. Check out the help page ?DNAStringSet including the examples to get some ideas; the infrastructure is very flexible.

Looking at your DNAStringSet, I wonder what the file that you are reading from looks like? readDNAStringSet is expecting a FASTA file

> An.identifer
AAACAACCA

and the identifier is used as the 'name'; it seems like you have a 'name' that is a sequence. If you have a plain text file with 1 DNA sequence per line, you could do something like

DNAStringSet(readLines(filepath))
Thanks for your comments. When I try as.character(ref), I get something like

ATGGCCCCCTTTGCCCAGGGAAAGGGCCCCTGGGGGGGGTTTTTTTGCCCACGTGCAGTCAGTCAGTAAGTCAAGTCAATGACGTCAATGCAACGTGCAAAAAAAAAAAAAGGAAATTAAAGGGCCCTGACGTGCAGCCAGTCAGTCAAAGGG
"" 

When I try to find the number of characters

nchar(ref)

It returns 0. What is going wrong??.

Do you mean nchar(ref) ? Please cut-and-paste from your R session into the support site, because it is very important to be precise. I have

> ref = DNAStringSet(c("AAATT", "AAA"))
> nchar(ref)
[1] 5 3

using the current version of software. What is the output of your sessionInfo()?

> sessionInfo()
R version 3.2.4 Patched (2016-03-16 r70348)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] Biostrings_2.38.4    XVector_0.10.0       IRanges_2.4.8
[4] S4Vectors_0.8.11     BiocGenerics_0.16.1  BiocInstaller_1.20.1

loaded via a namespace (and not attached):
[1] zlibbioc_1.16.0