Small bug in function 'countskip.FASTA.entries' from package altcdfenvs

Norman Pavelka ▴ 190

Hi Lingsheng,

On 15 Nov 2005, at 19:05, Lingsheng Dong wrote:

> Hi, Norman,
>
> Nice to see you are working on a project similar to mine.
>
> Another bug I found was in the function "get.RNA.ID":
>
> get.RNA.IDs <- function(x) {
>   reg <- regexpr("(Hs#|NM)[^[:blank:]|]+", x)
>   r <- substr(my.entries$headers, reg, reg + attr(reg, "match.length") - 1)
>   return(r)
> }
>
> I am not sure how to correct it yet, but it couldn't get an ID for
> sequences without a "NMxxxxxx" ID in the header.

I wouldn't call that a bug. You simply have to change the regular expression so that it matches the IDs in your particular FASTA file. I'm using the following function, which takes the first string after the ">" sign in a FASTA header and strips the first space character together with everything that follows it. This way you get any ID, regardless of how it begins. You only have to check whether the space character is the right separator in your situation, or whether another one would be more appropriate. Often "|" or ";" signs are used to separate the pieces of information in a FASTA header.

    get.transcript.ids <- function(x) {
      tmpstring <- sub("^>", "", x)
      tmpstring <- sub(" .+", "", tmpstring)
      return(tmpstring)
    }

> Still another problem you may want to consider:
> The "matchprobes" function gives all possible matches. In my case, a
> lot of probes match hundreds of target sequences. It means there will
> be too many cross-hybridizing probes if you put all probes
> matching a target sequence into one probe set.
> I couldn't find a ready-to-use function to solve this problem yet. I
> am thinking of exporting the matching result into database software and
> manually deleting the cross-hybridizing probes.
> Not sure if this is a quick solution.
> Hope you can give some suggestions.

I also thought of that problem, but Laurent Gautier already gave some clues on how to handle this situation in his BMC Bioinformatics paper. Although I haven't tried it yet, I guess that everything could be done very quickly inside R, without the need to export into an external database. If you like, I can share my experience with you as soon as I have done some trials...

> Thanks.
> Lingsheng

Good luck!

Norman
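The separator check Norman mentions can be folded into the function itself. The following is a minimal sketch in base R, not part of altcdfenvs: the name get.first.field and its seps argument are hypothetical, and it assumes the separators are plain characters that need no escaping inside a regular-expression bracket (so not "]", "^" or "\").

```r
# Hypothetical helper (not part of altcdfenvs): return the first field of each
# FASTA header, where a field ends at any single character listed in `seps`.
# Assumption: `seps` contains no characters that are special inside [...].
get.first.field <- function(headers, seps = " |;") {
  ids <- sub("^>", "", headers)            # drop the leading ">"
  pattern <- paste0("[", seps, "].*$")     # first separator and everything after it
  sub(pattern, "", ids)                    # headers without a separator are left as-is
}

headers <- c(">ENSMUST00000000001 cdna:known",
             ">NM_000546|TP53",
             ">Hs#S123;foo",
             ">plainid")
ids <- get.first.field(headers)
# ids is c("ENSMUST00000000001", "NM_000546", "Hs#S123", "plainid")
```

Passing seps = " " restricts the split to the space character, which matches the behaviour described above for headers where "|" or ";" are part of the ID.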


Written 18.4 years ago by Norman Pavelka; updated 18.4 years ago by lgautier@altern.org

lgautier@altern.org ▴ 950

> Hi Lingsheng,
>
> On 15 Nov 2005, at 19:05, Lingsheng Dong wrote:

<snip>

>> Still another problem you may want to consider:
>> The "matchprobes" function gives all possible matches. In my case, a
>> lot of probes match hundreds of target sequences. It means there will
>> be too many cross-hybridizing probes if you put all probes
>> matching a target sequence into one probe set.
>> I couldn't find a ready-to-use function to solve this problem yet. I
>> am thinking of exporting the matching result into database software and
>> manually deleting the cross-hybridizing probes.
>> Not sure if this is a quick solution.
>> Hope you can give some suggestions.
>
> I also thought of that problem, but Laurent Gautier already gave some
> clues on how to handle this situation in his BMC Bioinformatics paper.
> Although I haven't tried it yet, I guess that everything could be done very
> quickly inside R, without the need to export into an external
> database. If you like, I can share my experience with you as soon as I
> have done some trials...

The functions "countduplicated", "removeIndex", and "unique.CdfEnvAffy" are your friends.

Hoping this helps,

Laurent
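To illustrate the underlying idea only, here is a base-R sketch of dropping probes that match more than one target before probe sets are built. It does not call "countduplicated", "removeIndex" or "unique.CdfEnvAffy", whose interfaces are not shown in this thread; the match table and all variable names are hypothetical toy data.

```r
# Toy match result (hypothetical data): one row per probe-to-transcript match.
matches <- data.frame(
  probe      = c("p1", "p2", "p2", "p3", "p4", "p4", "p4"),
  transcript = c("tA", "tA", "tB", "tB", "tA", "tB", "tC"),
  stringsAsFactors = FALSE
)

# Count how many distinct transcripts each probe matches.
n.targets <- tapply(matches$transcript, matches$probe,
                    function(tr) length(unique(tr)))

# Keep only probes that match exactly one transcript.
specific <- names(n.targets)[n.targets == 1]
filtered <- matches[matches$probe %in% specific, ]

# Group the remaining probes into one probe set per transcript.
probesets <- split(filtered$probe, filtered$transcript)
```

In this toy table, p2 and p4 match several transcripts and are discarded, so the whole filtering step stays inside R without an external database.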

Posted 18.4 years ago by lgautier@altern.org

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20051116/91c5639f/attachment.pl

Posted 18.4 years ago by Norman Pavelka

> Dear Laurent,
>
> <snip>
>
> Thanks for pointing out these functions! I will give them a try as soon as
> the 'matchprobes' routine is over...
>
> BTW, I launched the script 150 hours ago, but it's still not finished.
> How much computational time should I expect to need on my standard Mac
> G4 machine (OS X Panther)?
>
> Here are some numbers to give an idea: I'm remapping the MOE430 v2.0
> arrays (approximately 1 million probes) against roughly 38000 unique
> EnsEMBL transcripts...

No idea. I just know that it takes a long time. I remember that the HG-U133A against RefSeq (human only) took many days on SGI processors (I don't remember the specs of the processors). I ended up implementing a rough way of doing parallel processing (the work is easy to split up by dividing the reference sequences in the FASTA file into chunks). The functions 'skip.FASTA.entries' and 'read.n.FASTA.entries' were written primarily for that.

Hoping this helps,

Laurent

> Thank you in advance for your feedback!
>
> Best,
> Norman
>
> Norman Pavelka
> Department of Biotechnology and Bioscience
> University of Milano-Bicocca
> Piazza della Scienza, 2
> 20126 Milan, Italy
>
> Phone: +39 02 6448 3556
> Fax: +39 02 6448 3552
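The chunking idea can be sketched in base R as follows. This is not the altcdfenvs implementation and does not use 'skip.FASTA.entries' or 'read.n.FASTA.entries'; the helper name chunk.fasta and the toy sequences are hypothetical, and the sketch assumes the FASTA text fits in memory and starts with a header line.

```r
# Hypothetical helper: split the lines of a FASTA file into chunks holding at
# most `n` entries each, so every chunk can be matched in a separate R process.
# Assumption: the first line is a ">" header.
chunk.fasta <- function(lines, n) {
  entry <- cumsum(grepl("^>", lines))   # entry number of every line
  chunk <- (entry - 1) %/% n + 1        # chunk number of every line
  unname(split(lines, chunk))
}

fasta <- c(">t1 desc", "ACGT", "GGCC",
           ">t2", "TTAA",
           ">t3", "CCGG", "AATT")
chunks <- chunk.fasta(fasta, n = 2)
# chunks[[1]] holds entries t1 and t2, chunks[[2]] holds entry t3
```

Each chunk could then be written out with writeLines() and processed by its own job. Because every reference sequence is matched independently, the per-chunk results can simply be concatenated afterwards.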

Posted 18.4 years ago by lgautier@altern.org

Dr. Gautier,

Could you please give us a clue as to where we can find the description of these functions, or even the source code?

Thanks a lot.

Lingsheng

The fear of the LORD is the beginning of wisdom, and knowledge of the Holy One is understanding. --Proverbs 10:10

>From: lgautier at altern.org
>To: "Norman Pavelka" <norman.pavelka at unimib.it>
>CC: "Lingsheng Dong" <dong_lsh at hotmail.com>, bioconductor at stat.math.ethz.ch
>Subject: Re: [BioC] Small bug in function 'countskip.FASTA.entries' from package altcdfenvs
>Date: Wed, 16 Nov 2005 16:55:21 +0100 (CET)
>
>> Hi Lingsheng,
>>
>> On 15 Nov 2005, at 19:05, Lingsheng Dong wrote:
>
><snip>
>
>>> Still another problem you may want to consider:
>>> The "matchprobes" function gives all possible matches. In my case, a
>>> lot of probes match hundreds of target sequences. It means there will
>>> be too many cross-hybridizing probes if you put all probes
>>> matching a target sequence into one probe set.
>>> I couldn't find a ready-to-use function to solve this problem yet. I
>>> am thinking of exporting the matching result into database software and
>>> manually deleting the cross-hybridizing probes.
>>> Not sure if this is a quick solution.
>>> Hope you can give some suggestions.
>>
>> I also thought of that problem, but Laurent Gautier already gave some
>> clues on how to handle this situation in his BMC Bioinformatics paper.
>> Although I haven't tried it yet, I guess that everything could be done very
>> quickly inside R, without the need to export into an external
>> database. If you like, I can share my experience with you as soon as I
>> have done some trials...
>
>The functions "countduplicated", "removeIndex", and "unique.CdfEnvAffy"
>are your friends.
>
>Hoping this helps,
>
>Laurent

Posted 18.4 years ago by Lingsheng Dong

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20051117/dc949339/attachment.pl

Posted 18.4 years ago by Norman Pavelka