Question

subseq with iranges on reverse strand

0

Entering edit mode

Hervé Pagès 16k

@herve-pages-1542

Last seen 17 hours ago

Seattle, WA, United States

Hi Carol, Sorry for the late answer, I was on vacations. I hope you don't mind that I'm cc'ing the Bioconductor mailing list. On 08/08/2014 05:07 AM, carol white wrote: > Hi, > If I want to take from the end of nucleotide seq of a gene that is on > the reverse strand a variable number of bp with iranges, does it not > take from the end of seq for the number of bp that is fixed? because it > seems that it takes differently that I can't find out how. So if I take > the nt seq of the gene on the ncbi web site and extract for ex 10 or 20 > bp from the end, i don't get the same as I do with iranges. As I have > already given the strand as the parameter to the iranges function, I > assume that it has already reverse-complemented. > > How can I figure out how iranges has taken the subsequence? A concrete example and showing the code would help understand exactly what you are trying to do. Note that subseq() is not strand-aware i.e. it doesn't let the user specify the strand so the extracted sequence is never reverse- complemented. To extract the nucleotide sequences corresponding to a given set of genomic ranges, it's better to use the getSeq() function, which is strand-aware. getSeq() typically takes 2 arguments, a BSgenome object and a GRanges object. The BSgenome object contains the full genome sequences of the organism. The GRanges object contains the genomic ranges of your regions of interest. These can be the regions corresponding to the entire genes, or to their flanking sequences, or to some upstream sequences, or to the last 10 or 20 nucleotides of each genes. Each genomic range is defined by a sequence name (e.g. chr1), a start position, an end position, and a strand. Here is an example of how to extract the last 10 nucleotides of all Human genes: library(TxDb.Hsapiens.UCSC.hg19.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene gn <- genes(txdb) min(width(gn)) # check length of shortest gene gn_last10 <- flank(gn, -10, start=FALSE) library(BSgenome.Hsapiens.UCSC.hg19) genome <- BSgenome.Hsapiens.UCSC.hg19 gn_last10_seqs <- getSeq(genome, gn_last10) Then: > gn_last10_seqs A DNAStringSet instance of length 23056 width seq names [1] 10 GTGTCCTCAA 1 [2] 10 AATATTGTGG 10 [3] 10 GTGGCATGCA 100 [4] 10 TTGGATCTGG 1000 [5] 10 TCCAGGCTAC 10000 ... ... ... [23052] 10 GGTATTTTTA 9991 [23053] 10 AAATTTGAAG 9992 [23054] 10 CCGGAAGAGG 9993 [23055] 10 TTTTGTTGTA 9994 [23056] 10 CTGCGTTTAA 9997 The names on this DNAStringSet object are the Entrez Gene ids. If you want to use NCBI genes instead of UCSC genes, you could make your own TxDb object by pointing the makeTranscriptDbFromUCSC() function to the RefSeq Genes track (refGene table): txdb <- makeTranscriptDbFromUCSC("hg19", "refGene") This requires an internet connection and will take a few minutes to complete. The short sequences you will obtain should match the ones you get by querying directly the NCBI web site. Let me know if they don't. Finally note that in the above example, we represented each gene with by a single genomic range, that is, we ignored splicing (and alternate splicing). If you wanted to take them into consideration, you would need to work at the transcript level because "the last 10 or 20 nucleotides of a gene" is not necessarily well defined when intron sequences need to be skipped. Hope this helps, H. > > Thanks > > Carol > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319

Cancer BSgenome BSgenome IRanges Cancer BSgenome BSgenome IRanges • 1.2k views

ADD COMMENT • link 9.7 years ago Hervé Pagès 16k