transformation of csv file to fasta format with biostrings and r studio
1
0
Entering edit mode
HELEN • 0
@ba287559
Last seen 21 months ago
Greece

hello, i would like to transform a csv file which include dna sequences to fasta format, using r studio and the library biostring. The csv file has 4 columns, the name of the chromosome, the start and end of dna sequence and the strand (missing or +). What code should i use to transform this file.

DNASeq rstudio fasta Biostrings csv • 3.7k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 1 hour ago
United States

Here's a fake example.

> starts <- floor(runif(10)*1e5)

> ends <- starts + 50

> d.f <- data.frame(chr = paste0("chr", 1:10), starts = starts, ends = ends, strand = c("*", "+")[sample(1:2, 10, TRUE)])
> d.f
     chr starts  ends strand
1   chr1  17882 17932      *
2   chr2  79746 79796      +
3   chr3  41976 42026      *
4   chr4  96256 96306      +
5   chr5  79374 79424      *
6   chr6  25500 25550      *
7   chr7  54683 54733      +
8   chr8  38826 38876      +
9   chr9  18686 18736      +
10 chr10  22916 22966      *

## Err, skipped a step

> gr <- GRanges(d.f$chr, IRanges(d.f$starts, d.f$ends), d.f$strand)

> gr
GRanges object with 10 ranges and 0 metadata columns:
       seqnames      ranges strand
          <Rle>   <IRanges>  <Rle>
   [1]     chr1 17882-17932      *
   [2]     chr2 79746-79796      +
   [3]     chr3 41976-42026      *
   [4]     chr4 96256-96306      +
   [5]     chr5 79374-79424      *
   [6]     chr6 25500-25550      *
   [7]     chr7 54683-54733      +
   [8]     chr8 38826-38876      +
   [9]     chr9 18686-18736      +
  [10]    chr10 22916-22966      *
  -------
  seqinfo: 10 sequences from an unspecified genome; no seqlengths
> seqs <- getSeq(BSgenome.Hsapiens.UCSC.hg19, gr)
Error in .starfreeStrand(strand(names)) : 
  cannot mix "*" with other strand values

## URP can't use missing strand. Let's just get the forward strand

> gr <- GRanges(d.f$chr, IRanges(d.f$starts, d.f$ends))
> seqs <- getSeq(BSgenome.Hsapiens.UCSC.hg19, gr)
> seqs
DNAStringSet object of length 10:
     width seq
 [1]    51 GTGTGTGACAGGCTATATGCGCGGCCAGCAGACCTGCAGGGCCCGCTCGTC
 [2]    51 TACAGCCCCTCCAAAAAACAAAGACAGTTGGGAAGGTGTCAAATGGAGGAT
 [3]    51 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 [4]    51 AGAGAAAGACTCCATCTCAAAAAAATAATAATAAAATAGATTTTGTAAGAA
 [5]    51 TGGCCTCGGGTTAGTAAAGAAAGAAGGAAATGCAGCAGCCTATGACCAAGA
 [6]    51 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
 [7]    51 CTGGCCAGGCTGAGGCCCACAGTGAGTTCGTGATAAGGTAGGACCAGAGCC
 [8]    51 GGATGGGACAAGAAGAAGCTGGGCTGACAAGCCCAACATAGTGGAGCCAGC
 [9]    51 GACGGTGCTGAGTTCCCTGCACTCTCAGAAGGGACAGGCCCTATGCTGCCA
[10]    51 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

> writeXStringSet(seqs, "whatevs.fa")
> readLines("whatevs.fa")
 [1] ">"                                                  
 [2] "GTGTGTGACAGGCTATATGCGCGGCCAGCAGACCTGCAGGGCCCGCTCGTC"
 [3] ">"                                                  
 [4] "TACAGCCCCTCCAAAAAACAAAGACAGTTGGGAAGGTGTCAAATGGAGGAT"
 [5] ">"                                                  
 [6] "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN"
 [7] ">"                                                  
 [8] "AGAGAAAGACTCCATCTCAAAAAAATAATAATAAAATAGATTTTGTAAGAA"
 [9] ">"                                                  
[10] "TGGCCTCGGGTTAGTAAAGAAAGAAGGAAATGCAGCAGCCTATGACCAAGA"
[11] ">"                                                  
[12] "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN"
[13] ">"                                                  
[14] "CTGGCCAGGCTGAGGCCCACAGTGAGTTCGTGATAAGGTAGGACCAGAGCC"
[15] ">"                                                  
[16] "GGATGGGACAAGAAGAAGCTGGGCTGACAAGCCCAACATAGTGGAGCCAGC"
[17] ">"                                                  
[18] "GACGGTGCTGAGTTCCCTGCACTCTCAGAAGGGACAGGCCCTATGCTGCCA"
[19] ">"                                                  
[20] "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN"

You can add names for each FASTA entry

> names(seqs) <- paste0("Sequence_", 1:10)
> writeXStringSet(seqs, "whatevs.fa")
> readLines("whatevs.fa")
 [1] ">Sequence_1"                                        
 [2] "GTGTGTGACAGGCTATATGCGCGGCCAGCAGACCTGCAGGGCCCGCTCGTC"
 [3] ">Sequence_2"                                        
 [4] "TACAGCCCCTCCAAAAAACAAAGACAGTTGGGAAGGTGTCAAATGGAGGAT"
 [5] ">Sequence_3"                                        
 [6] "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN"
 [7] ">Sequence_4"                                        
 [8] "AGAGAAAGACTCCATCTCAAAAAAATAATAATAAAATAGATTTTGTAAGAA"
 [9] ">Sequence_5"                                        
[10] "TGGCCTCGGGTTAGTAAAGAAAGAAGGAAATGCAGCAGCCTATGACCAAGA"
[11] ">Sequence_6"                                        
[12] "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN"
[13] ">Sequence_7"                                        
[14] "CTGGCCAGGCTGAGGCCCACAGTGAGTTCGTGATAAGGTAGGACCAGAGCC"
[15] ">Sequence_8"                                        
[16] "GGATGGGACAAGAAGAAGCTGGGCTGACAAGCCCAACATAGTGGAGCCAGC"
[17] ">Sequence_9"                                        
[18] "GACGGTGCTGAGTTCCCTGCACTCTCAGAAGGGACAGGCCCTATGCTGCCA"
[19] ">Sequence_10"                                       
[20] "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN"
>
ADD COMMENT
0
Entering edit mode

Thank you very much. Your answer was really helpfull but i have a problem on using the GRanges and IRages seq from Bioconductor. The error that is appeared is: e(s) not installed when version(s) same as current; use force = TRUE to re-install: 'GenomeRanges'. Do you have any suggestion how to solve it. I tried different things such as dowload devtools and genefilter but nothing changed. Also, do you know some other sequence which can replace the GRanges and IRanges. Thank you.

ADD REPLY
0
Entering edit mode

It's better for us if you just copy and paste the R output rather than just a snippet like that. Do note that there isn't a package called GenomeRanges so that might be your problem. You want GenomicRanges instead. The 'error' you post isn't an error, but instead a message telling you that you already have the current version of a package installed, and if you want to re-install you need to add force = TRUE to BiocManager::install in order to make it re-install. Otherwise it will just tell you that you already have the current version.

If you do paste code, either enclose between a set of triple backticks (```) so it is formatted correctly, or alternatively paste the code, highlight the code, and then click on the button above the reply box that says 'code'.

ADD REPLY
0
Entering edit mode

My problem about the GRanges and IRanges was solved. But another problem was appeared. When i run this code: gr <- GRanges(chr, IRanges(start = start,end = end)), it tells that:

 Error in .new_IRanges_from_start_end(start, end) : 
  'start' and 'end' must be numeric vectors

what did i do wrong? Thank you.

ADD REPLY

Login before adding your answer.

Traffic: 686 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6