Extarcting large number of sequences using BSgenome package

0

Entering edit mode

viritha kaza ▴ 580

@viritha-kaza-4318

Last seen 10.6 years ago

Hi group, I am interested in retrieving about 2000 sequences with the specific chromosome number,start and end site. I was thinking of using BSgenome package for this. >source("http://bioconductor.org/biocLite.R") > biocLite("BSgenome","BSgenome.Hsapiens.UCSC.hg19") > library(BSgenome) > library("BSgenome.Hsapiens.UCSC.hg19") >myseq<- getSeq(Hsapiens,"chr2",start=10000,end=10020) # this would work for specific values. #So I tried to use a dataframe where it would retrieve chromosome number from by using >full<-as.matrix(read.table(test_seq.txt,sep=\t,quote=,header=Tr ue, as.is=TRUE)) >full.df<-data.frame(full) # test_seq contains this information #chromosome Start End #chr2 10000 10020 #chr3 10000 10020 > myseq<- getSeq(Hsapiens,full.df$chromosome,start=10000,end=10020) #but then when I use start=full.df$Start. It naturally throws an error saying 'start' must be a vector of integers Questions: How Do I handle this? Does start here mean that each chromosome numbering starts from 1? How do I split each sequence retrieved and create as fasta format (>)with sequence name attached to them retrieved from my input file? >sessionInfo() R version 2.13.0 (2011-04-13) Platform: i386-pc-mingw32/i386 (32-bit) locale: [1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252 [3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] BSgenome.Hsapiens.UCSC.hg19_1.3.17 BSgenome_1.20.0 [3] Biostrings_2.20.1 GenomicRanges_1.4.6 [5] IRanges_1.10.4 loaded via a namespace (and not attached): [1] tools_2.13.0 Waiting for your suggestions, Thank you, Viritha [[alternative HTML version deleted]]

BSgenome BSgenome BSgenome BSgenome • 2.0k views

ADD COMMENT • link updated 13.6 years ago by Hervé Pagès 16k • written 13.6 years ago by viritha kaza ▴ 580

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 2 days ago

Seattle, WA, United States

Hi Viritha, On 11-08-16 08:01 AM, viritha kaza wrote: > Hi group, > I am interested in retrieving about 2000 sequences with the specific > chromosome number,start and end site. > I was thinking of using BSgenome package for this. > > >> source("http://bioconductor.org/biocLite.R") > >> biocLite("BSgenome","BSgenome.Hsapiens.UCSC.hg19") > >> library(BSgenome) > >> library("BSgenome.Hsapiens.UCSC.hg19") > >> myseq<- getSeq(Hsapiens,"chr2",start=10000,end=10020) > > # this would work for specific values. > > #So I tried to use a dataframe where it would retrieve chromosome number > from by using > >> full<-as.matrix(read.table(?test_seq.txt?,sep=?\t?,quote=??,header= True, > as.is=TRUE)) Why are you doing as.matrix here? Do you *really* want all the columns to be coerced to the same type? (which in your case is probably character because you have already one column of that type). > >> full.df<-data.frame(full) So you're back to a data.frame, except that now all the columns are character vectors :-/ > > # test_seq contains this information > > #chromosome Start End > #chr2 10000 10020 > #chr3 10000 10020 Why not copy and paste what you get when you just type full.df? That way would know exactly what you ended up with (if full.df is too big, just show us the output of head(full.df)). Anyway, it looks like you'll have to get rid of those #. Otherwise passing Hsapiens,full.df$chromosome to getSeq() with those # in it will surely cause problems. > >> myseq<- getSeq(Hsapiens,full.df$chromosome,start=10000,end=10020) > > #but then when I use start=full.df$Start. It naturally throws an error > saying 'start' must be a vector of integers > > Questions: > > How Do I handle this? By having full.df$Start be a vector of integers and also by reading the man page for getSeq(). > > Does start here mean that each chromosome numbering starts from 1? Yes, the first base at the 5' end of the + strand (or 3' end of the - strand) is considered to be position 1. > > How do I split each sequence retrieved and create as fasta format (>)with > sequence name attached to them retrieved from my input file? How do you want to "split" each sequence retrieved? As a good starting point you could have a look at write.XStringSet(). Note that the names you put on your DNAStringSet object will be used as the record description lines in the resulting FASTA file. Cheers, H. > > > >> sessionInfo() > > R version 2.13.0 (2011-04-13) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 > [2] LC_CTYPE=English_United States.1252 > [3] LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] BSgenome.Hsapiens.UCSC.hg19_1.3.17 BSgenome_1.20.0 > [3] Biostrings_2.20.1 GenomicRanges_1.4.6 > [5] IRanges_1.10.4 > > loaded via a namespace (and not attached): > [1] tools_2.13.0 > > > > Waiting for your suggestions, > > Thank you, > > Viritha > > [[alternative HTML version deleted]] > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD COMMENT • link 13.6 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve, Thanks for your response. System is considering as characters in the dataframe.So I was thinking of using them as vectors.But if I scan as seperate elements and convert them to numeric using lapply.It becomes atomic numeric.How do I make them as vectors without using combining fuction as c(1,2,3)?I am basically new to sequences using R.So will read write.XStringSet(). Thanks, Viritha 2011/8/16 Hervé Pagès <hpages@fhcrc.org> > Hi Viritha, > > > On 11-08-16 08:01 AM, viritha kaza wrote: > >> Hi group, >> I am interested in retrieving about 2000 sequences with the specific >> chromosome number,start and end site. >> I was thinking of using BSgenome package for this. >> >> >> source("http://bioconductor.**org/biocLite.R<http: bioconductor.or="" g="" bioclite.r=""> >>> ") >>> >> >> biocLite("BSgenome","BSgenome.**Hsapiens.UCSC.hg19") >>> >> >> library(BSgenome) >>> >> >> library("BSgenome.Hsapiens.**UCSC.hg19") >>> >> >> myseq<- getSeq(Hsapiens,"chr2",start=**10000,end=10020) >>> >> >> # this would work for specific values. >> >> #So I tried to use a dataframe where it would retrieve chromosome number >> from by using >> >> full<-as.matrix(read.table(**test_seq.txt,sep=\t,quote=** >>> ,header=True, >>> >> as.is=TRUE)) >> > > Why are you doing as.matrix here? Do you *really* want all the columns > to be coerced to the same type? (which in your case is probably > character because you have already one column of that type). > > > >> full.df<-data.frame(full) >>> >> > So you're back to a data.frame, except that now all the columns are > character vectors :-/ > > > >> # test_seq contains this information >> >> #chromosome Start End >> #chr2 10000 10020 >> #chr3 10000 10020 >> > > Why not copy and paste what you get when you just type full.df? That > way would know exactly what you ended up with (if full.df is too big, > just show us the output of head(full.df)). > Anyway, it looks like you'll have to get rid of those #. Otherwise > passing Hsapiens,full.df$chromosome to getSeq() with those # in it > will surely cause problems. > > > >> myseq<- getSeq(Hsapiens,full.df$**chromosome,start=10000,end=**10020) >>> >> >> #but then when I use start=full.df$Start. It naturally throws an error >> saying 'start' must be a vector of integers >> >> Questions: >> >> How Do I handle this? >> > > By having full.df$Start be a vector of integers and also by reading > the man page for getSeq(). > > > >> Does start here mean that each chromosome numbering starts from 1? >> > > Yes, the first base at the 5' end of the + strand (or 3' end of the - > strand) is considered to be position 1. > > > >> How do I split each sequence retrieved and create as fasta format (>)with >> sequence name attached to them retrieved from my input file? >> > > How do you want to "split" each sequence retrieved? As a good starting > point you could have a look at write.XStringSet(). Note that > the names you put on your DNAStringSet object will be used as the > record description lines in the resulting FASTA file. > > Cheers, > H. > > >> >> >> sessionInfo() >>> >> >> R version 2.13.0 (2011-04-13) >> Platform: i386-pc-mingw32/i386 (32-bit) >> >> locale: >> [1] LC_COLLATE=English_United States.1252 >> [2] LC_CTYPE=English_United States.1252 >> [3] LC_MONETARY=English_United States.1252 >> [4] LC_NUMERIC=C >> [5] LC_TIME=English_United States.1252 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods base >> >> other attached packages: >> [1] BSgenome.Hsapiens.UCSC.hg19_1.**3.17 BSgenome_1.20.0 >> [3] Biostrings_2.20.1 GenomicRanges_1.4.6 >> [5] IRanges_1.10.4 >> >> loaded via a namespace (and not attached): >> [1] tools_2.13.0 >> >> >> >> Waiting for your suggestions, >> >> Thank you, >> >> Viritha >> >> [[alternative HTML version deleted]] >> >> >> >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.6 years ago viritha kaza ▴ 580

0

Entering edit mode

Viritha, On 11-08-16 10:54 AM, viritha kaza wrote: > Hi Herve, > Thanks for your response. System is considering as characters in the > dataframe.So I was thinking of using them as vectors.But if I scan as > seperate elements and convert them to numeric using lapply.It becomes > atomic numeric.How do I make them as vectors without using combining > fuction as c(1,2,3)?I am basically new to sequences using R.So will read > write.XStringSet(). Your question is particularly unclear and it seems like you would benefit a lot from reading a little bit about data frames and read.table(). There are a lot of arguments to this function that give you a lot of control on how the data will actually be loaded into the data frame. In particular look at the 'colClasses' arg: it lets you control the types of the columns of the data frame. That will help you solve the first step which is loading the file into a data frame. I don't think you need any scan/lappl/combining for this step. Once you've solved this, step 2 will be to use getSeq() to extract your sequences and step 3 will be to write your sequences to a FASTA file with write.XStringSet(). Right now, it's almost impossible to know where you are encountering difficulties. Please have a look at our posting guide before your next post: http://bioconductor.org/help/mailing-list/posting-guide/ Thanks! H. > Thanks, > Viritha > 2011/8/16 Hervé Pagès <hpages at="" fhcrc.org="" <mailto:hpages="" at="" fhcrc.org="">> > > Hi Viritha, > > > On 11-08-16 08:01 AM, viritha kaza wrote: > > Hi group, > I am interested in retrieving about 2000 sequences with the specific > chromosome number,start and end site. > I was thinking of using BSgenome package for this. > > > source("http://bioconductor.__org/biocLite.R > <http: bioconductor.org="" bioclite.r="">") > > > biocLite("BSgenome","BSgenome.__Hsapiens.UCSC.hg19") > > > library(BSgenome) > > > library("BSgenome.Hsapiens.__UCSC.hg19") > > > myseq<- getSeq(Hsapiens,"chr2",start=__10000,end=10020) > > > # this would work for specific values. > > #So I tried to use a dataframe where it would retrieve > chromosome number > from by using > > full<-as.matrix(read.table(?__test_seq.txt?,sep=?\t?,quo te=?__?,header=True, > > as.is <http: as.is=""/>=TRUE)) > > > Why are you doing as.matrix here? Do you *really* want all the columns > to be coerced to the same type? (which in your case is probably > character because you have already one column of that type). > > > > full.df<-data.frame(full) > > > So you're back to a data.frame, except that now all the columns are > character vectors :-/ > > > > # test_seq contains this information > > #chromosome Start End > #chr2 10000 10020 > #chr3 10000 10020 > > > Why not copy and paste what you get when you just type full.df? That > way would know exactly what you ended up with (if full.df is too big, > just show us the output of head(full.df)). > Anyway, it looks like you'll have to get rid of those #. Otherwise > passing Hsapiens,full.df$chromosome to getSeq() with those # in it > will surely cause problems. > > > > myseq<- > getSeq(Hsapiens,full.df$__chromosome,start=10000,end=__10020) > > > #but then when I use start=full.df$Start. It naturally throws an > error > saying 'start' must be a vector of integers > > Questions: > > How Do I handle this? > > > By having full.df$Start be a vector of integers and also by reading > the man page for getSeq(). > > > > Does start here mean that each chromosome numbering starts from 1? > > > Yes, the first base at the 5' end of the + strand (or 3' end of the - > strand) is considered to be position 1. > > > > How do I split each sequence retrieved and create as fasta > format (>)with > sequence name attached to them retrieved from my input file? > > > How do you want to "split" each sequence retrieved? As a good starting > point you could have a look at write.XStringSet(). Note that > the names you put on your DNAStringSet object will be used as the > record description lines in the resulting FASTA file. > > Cheers, > H. > > > > > sessionInfo() > > > R version 2.13.0 (2011-04-13) > Platform: i386-pc-mingw32/i386 (32-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 > [2] LC_CTYPE=English_United States.1252 > [3] LC_MONETARY=English_United States.1252 > [4] LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] BSgenome.Hsapiens.UCSC.hg19_1.__3.17 BSgenome_1.20.0 > [3] Biostrings_2.20.1 GenomicRanges_1.4.6 > [5] IRanges_1.10.4 > > loaded via a namespace (and not attached): > [1] tools_2.13.0 > > > > Waiting for your suggestions, > > Thank you, > > Viritha > > [[alternative HTML version deleted]] > > > > > _________________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org <mailto:bioconductor at="" r-project.org=""> > https://stat.ethz.ch/mailman/__listinfo/bioconductor > <https: stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: > http://news.gmane.org/gmane.__science.biology.informatics.__conductor > <http: news.gmane.org="" gmane.science.biology.informatics.conductor=""> > > > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages at fhcrc.org <mailto:hpages at="" fhcrc.org=""> > Phone: (206) 667-5791 <tel:%28206%29%20667-5791> > Fax: (206) 667-1319 <tel:%28206%29%20667-1319> > > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

ADD REPLY • link 13.6 years ago Hervé Pagès 16k

0

Entering edit mode

Hi Herve, Thanks for your suggestions and comments.Sorry for my unclear questions.I am learing R as I am trying to build the code. I wrote the code with asper your suggestions.Its working perfectly fine. >source("http://bioconductor.org/biocLite.R") >biocLite(c("BSgenome","BSgenome.Hsapiens.UCSC.hg19")) >library(BSgenome) >library("BSgenome.Hsapiens.UCSC.hg19") > tb<-read.table("test.txt",header=TRUE,sep="\t",colClasses=c('character ','character','integer','integer')) >myseq<- getSeq(Hsapiens,tb$chr,start=tb$Start,end=tb$End) > x <-DNAStringSet(myseq) > names(x)<-tb$Name > write.XStringSet(x,"test_seq1.txt",format="fasta") #my test file looks like this Name chr Start End chr2:10000-10020 chr2 10000 10020 chr19:1560295-1560450 chr19 1560295 1560450 I would be using this to extarct 2000 sequences.So please comment on the efficiency,or other suggestions. Thanks once again, Viritha 2011/8/18 Hervé Pagès <hpages@fhcrc.org> > Viritha, > > > On 11-08-16 10:54 AM, viritha kaza wrote: > >> Hi Herve, >> Thanks for your response. System is considering as characters in the >> dataframe.So I was thinking of using them as vectors.But if I scan as >> seperate elements and convert them to numeric using lapply.It becomes >> atomic numeric.How do I make them as vectors without using combining >> fuction as c(1,2,3)?I am basically new to sequences using R.So will read >> write.XStringSet(). >> > > Your question is particularly unclear and it seems like you would > benefit a lot from reading a little bit about data frames and > read.table(). There are a lot of arguments to this function that > give you a lot of control on how the data will actually be loaded > into the data frame. In particular look at the 'colClasses' arg: > it lets you control the types of the columns of the data frame. > > That will help you solve the first step which is loading the > file into a data frame. I don't think you need any > scan/lappl/combining for this step. > > Once you've solved this, step 2 will be to use getSeq() to extract > your sequences and step 3 will be to write your sequences to a FASTA > file with write.XStringSet(). > > Right now, it's almost impossible to know where you are encountering > difficulties. Please have a look at our posting guide before your > next post: > > http://bioconductor.org/help/**mailing-list/posting- guide/<http: bioconductor.org="" help="" mailing-list="" posting-guide=""/> > > Thanks! > H. > > > Thanks, >> Viritha >> 2011/8/16 Hervé Pagès <hpages@fhcrc.org <mailto:hpages@fhcrc.org="">> >> >> >> Hi Viritha, >> >> >> On 11-08-16 08:01 AM, viritha kaza wrote: >> >> Hi group, >> I am interested in retrieving about 2000 sequences with the >> specific >> chromosome number,start and end site. >> I was thinking of using BSgenome package for this. >> >> >> source("http://bioconductor.__**org/biocLite.R >> <http: bioconductor.org="" **bioclite.r<http:="" bioconducto="" r.org="" bioclite.r=""> >> >") >> >> >> biocLite("BSgenome","BSgenome.**__Hsapiens.UCSC.hg19") >> >> >> library(BSgenome) >> >> >> library("BSgenome.Hsapiens.__**UCSC.hg19") >> >> >> myseq<- getSeq(Hsapiens,"chr2",start=_**_10000,end=10020) >> >> >> # this would work for specific values. >> >> #So I tried to use a dataframe where it would retrieve >> chromosome number >> from by using >> >> full<-as.matrix(read.table(__**test_seq.txt,sep=\t,quote= >> **__,header=True, >> >> as.is <http: as.is=""/>=TRUE)) >> >> >> >> Why are you doing as.matrix here? Do you *really* want all the columns >> to be coerced to the same type? (which in your case is probably >> character because you have already one column of that type). >> >> >> >> full.df<-data.frame(full) >> >> >> So you're back to a data.frame, except that now all the columns are >> character vectors :-/ >> >> >> >> # test_seq contains this information >> >> #chromosome Start End >> #chr2 10000 10020 >> #chr3 10000 10020 >> >> >> Why not copy and paste what you get when you just type full.df? That >> way would know exactly what you ended up with (if full.df is too big, >> just show us the output of head(full.df)). >> Anyway, it looks like you'll have to get rid of those #. Otherwise >> passing Hsapiens,full.df$chromosome to getSeq() with those # in it >> will surely cause problems. >> >> >> >> myseq<- >> getSeq(Hsapiens,full.df$__**chromosome,start=10000,end=__** >> 10020) >> >> >> #but then when I use start=full.df$Start. It naturally throws an >> error >> saying 'start' must be a vector of integers >> >> Questions: >> >> How Do I handle this? >> >> >> By having full.df$Start be a vector of integers and also by reading >> the man page for getSeq(). >> >> >> >> Does start here mean that each chromosome numbering starts from 1? >> >> >> Yes, the first base at the 5' end of the + strand (or 3' end of the - >> strand) is considered to be position 1. >> >> >> >> How do I split each sequence retrieved and create as fasta >> format (>)with >> sequence name attached to them retrieved from my input file? >> >> >> How do you want to "split" each sequence retrieved? As a good starting >> point you could have a look at write.XStringSet(). Note that >> the names you put on your DNAStringSet object will be used as the >> record description lines in the resulting FASTA file. >> >> Cheers, >> H. >> >> >> >> >> sessionInfo() >> >> >> R version 2.13.0 (2011-04-13) >> Platform: i386-pc-mingw32/i386 (32-bit) >> >> locale: >> [1] LC_COLLATE=English_United States.1252 >> [2] LC_CTYPE=English_United States.1252 >> [3] LC_MONETARY=English_United States.1252 >> [4] LC_NUMERIC=C >> [5] LC_TIME=English_United States.1252 >> >> attached base packages: >> [1] stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] BSgenome.Hsapiens.UCSC.hg19_1.**__3.17 BSgenome_1.20.0 >> [3] Biostrings_2.20.1 GenomicRanges_1.4.6 >> [5] IRanges_1.10.4 >> >> loaded via a namespace (and not attached): >> [1] tools_2.13.0 >> >> >> >> Waiting for your suggestions, >> >> Thank you, >> >> Viritha >> >> [[alternative HTML version deleted]] >> >> >> >> >> ______________________________**___________________ >> Bioconductor mailing list >> Bioconductor@r-project.org <mailto:bioconductor@r-**project.org<bioconductor@r-project.org>> >> >> >> https://stat.ethz.ch/mailman/_**_listinfo/bioconductor<https :="" stat.ethz.ch="" mailman="" __listinfo="" bioconductor=""> >> <https: stat.ethz.ch="" mailman="" **listinfo="" bioconductor<https:="" stat.ethz.ch="" mailman="" listinfo="" bioconductor=""> >> > >> Search the archives: >> http://news.gmane.org/gmane.__**science.biology.informatics.__** >> conductor<http: news.gmane.org="" gmane.__science.biology.informatics="" .__conductor=""> >> <http: news.gmane.org="" gmane.**science.biology.informatics.**="">> conductor<http: news.gmane.org="" gmane.science.biology.informatics.c="" onductor=""> >> > >> >> >> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages@fhcrc.org <mailto:hpages@fhcrc.org> >> Phone: (206) 667-5791 <tel:%28206%29%20667-5791> >> Fax: (206) 667-1319 <tel:%28206%29%20667-1319> >> >> >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]

ADD REPLY • link 13.6 years ago viritha kaza ▴ 580

Login before adding your answer.