converting position from '-' strand to '+' strand
3
0
Entering edit mode
Tim Smith ★ 1.1k
@tim-smith-1532
Last seen 7.9 years ago
Apologies if this seems like a trivial question. I wanted to have a consistent set of locations and wanted to all the locations to begin from the 5' end. How can I convert locations that are given for the '-' strand? For example: ----------------------------- library(biomaRt) mart.obj <- useMart(biomart = 'ensembl', dataset = 'hsapiens_gene_ensembl') atb <- c('ensembl_gene_id', 'chromosome_name', 'start_position', 'end_position', 'strand') mir.locs <- getBM(attributes=atb, filters="biotype", values="miRNA", mart=mart.obj) mir.locs[1:5,] > ensembl_gene_id chromosome_name start_position end_position strand 1 ENSG00000222732 5 171706206 171706319 1 2 ENSG00000207864 9 97847727 97847823 1 3 ENSG00000221173 9 129338809 129338909 -1 4 ENSG00000222961 5 32379501 32379581 -1 5 ENSG00000221058 18 51612956 51613026 -1 ---------------------------- Is there a quick way that I can convert the last 3 rows so that they reflect positions from the 5' strand? many thanks! [[alternative HTML version deleted]]
convert convert • 1.5k views
0
Entering edit mode
@herve-pages-1542
Last seen 1 day ago
Seattle, WA, United States
Hi Tim, Tim Smith wrote: > Apologies if this seems like a trivial question. > > I wanted to have a consistent set of locations and wanted to all the locations to begin from the 5' end. How can I convert locations that are given for the '-' strand? > For example: > ----------------------------- > library(biomaRt) > > mart.obj <- useMart(biomart = 'ensembl', dataset = 'hsapiens_gene_ensembl') > > atb <- c('ensembl_gene_id', 'chromosome_name', 'start_position', 'end_position', 'strand') > > mir.locs <- getBM(attributes=atb, filters="biotype", values="miRNA", mart=mart.obj) > mir.locs[1:5,] >> ensembl_gene_id chromosome_name start_position end_position strand > 1 ENSG00000222732 5 171706206 171706319 1 > 2 ENSG00000207864 9 97847727 97847823 1 > 3 ENSG00000221173 9 129338809 129338909 -1 > 4 ENSG00000222961 5 32379501 32379581 -1 > 5 ENSG00000221058 18 51612956 51613026 -1 > > > ---------------------------- > Is there a quick way that I can convert the last 3 rows so that they reflect positions from the 5' strand? > many thanks! Note that Ensembl here is using the widely adopted convention (at UCSC, Ensembl, in GFF files, in Bioconductor, etc...) to report feature coordinates relatively to the 5' end of the plus strand (and to call "start position" the position of their leftmost base) even for features that belong to the minus strand. There are a lot of advantages to keep that representation all the way down your analysis. Anyway, if you really want to do that conversion, then you need to know the chromosome lengths for the current Hsapiens genome at Ensembl. You can easily get them with the devel version of the GenomicFeatures package (you need R-2.11 + BioC devel): Make a TranscriptDb object from the 'hsapiens_gene_ensembl' dataset: library(GenomicFeatures) txdb <- makeTranscriptDbFromBiomart() Then: > seqlengths(txdb)[1:6] 5 19 10 4 8 20 180915260 59128983 135534747 191154276 146364022 63025520 Then use something like (untested): idx <- which(mir.locs$strand == -1) idx2seqlengths <- seqlengths(txdb)[mir.locs$chromosome_name[idx]] tmp <- mir.locs$start_position[idx] mir.locs$start_position[idx] <- idx2seqlengths - mir.locs$end_position[idx] + 1L mir.locs$end_position[idx] <- idx2seqlengths - tmp + 1L Cheers, H. > > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
0
Entering edit mode
The IRanges::reflect() function is relevant here, I think. 2010/4/5 Hervé Pagès <hpages@fhcrc.org> > Hi Tim, > > Tim Smith wrote: > >> Apologies if this seems like a trivial question. >> >> I wanted to have a consistent set of locations and wanted to all the >> locations to begin from the 5' end. How can I convert locations that are >> given for the '-' strand? >> For example: >> ----------------------------- >> library(biomaRt) >> >> mart.obj <- useMart(biomart = 'ensembl', dataset = >> 'hsapiens_gene_ensembl') >> >> atb <- c('ensembl_gene_id', 'chromosome_name', 'start_position', >> 'end_position', 'strand') >> >> mir.locs <- getBM(attributes=atb, filters="biotype", values="miRNA", >> mart=mart.obj) >> mir.locs[1:5,] >> >>> ensembl_gene_id chromosome_name start_position end_position strand >>> >> 1 ENSG00000222732 5 171706206 171706319 1 >> 2 ENSG00000207864 9 97847727 97847823 1 >> 3 ENSG00000221173 9 129338809 129338909 -1 >> 4 ENSG00000222961 5 32379501 32379581 -1 >> 5 ENSG00000221058 18 51612956 51613026 -1 >> >> >> ---------------------------- >> Is there a quick way that I can convert the last 3 rows so that they >> reflect positions from the 5' strand? >> many thanks! >> > > Note that Ensembl here is using the widely adopted convention (at > UCSC, Ensembl, in GFF files, in Bioconductor, etc...) to report > feature coordinates relatively to the 5' end of the plus strand > (and to call "start position" the position of their leftmost base) > even for features that belong to the minus strand. There are a lot > of advantages to keep that representation all the way down your > analysis. > > Anyway, if you really want to do that conversion, then you need to > know the chromosome lengths for the current Hsapiens genome at > Ensembl. You can easily get them with the devel version of the > GenomicFeatures package (you need R-2.11 + BioC devel): > > Make a TranscriptDb object from the 'hsapiens_gene_ensembl' dataset: > > library(GenomicFeatures) > txdb <- makeTranscriptDbFromBiomart() > > Then: > > > seqlengths(txdb)[1:6] > 5 19 10 4 8 20 > 180915260 59128983 135534747 191154276 146364022 63025520 > > Then use something like (untested): > > idx <- which(mir.locs$strand == -1) > idx2seqlengths <- seqlengths(txdb)[mir.locs$chromosome_name[idx]] > tmp <- mir.locs$start_position[idx] > mir.locs$start_position[idx] <- > idx2seqlengths - mir.locs$end_position[idx] + 1L > mir.locs$end_position[idx] <- idx2seqlengths - tmp + 1L > > Cheers, > H. > > > >> >> [[alternative HTML version deleted]] >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor@stat.math.ethz.ch >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor >> > > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M2-B876 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > [[alternative HTML version deleted]]
0
Entering edit mode
Thanks Herve (and James & others) for the clarification. My mistake - I was confused on the convention. ________________________________ From: Hervé Pagès <hpages@fhcrc.org> Cc: bioc <bioconductor@stat.math.ethz.ch> Sent: Mon, April 5, 2010 3:57:37 PM Subject: Re: [BioC] converting position from '-' strand to '+' strand Hi Tim, Tim Smith wrote: > Apologies if this seems like a trivial question. > > I wanted to have a consistent set of locations and wanted to all the locations to begin from the 5' end. How can I convert locations that are given for the '-' strand? > For example: > ----------------------------- > library(biomaRt) > > mart.obj <- useMart(biomart = 'ensembl', dataset = 'hsapiens_gene_ensembl') > > atb <- c('ensembl_gene_id', 'chromosome_name', 'start_position', 'end_position', 'strand') > > mir.locs <- getBM(attributes=atb, filters="biotype", values="miRNA", mart=mart.obj) > mir.locs[1:5,] >> ensembl_gene_id chromosome_name start_position end_position strand > 1 ENSG00000222732 5 171706206 171706319 1 > 2 ENSG00000207864 9 97847727 97847823 1 > 3 ENSG00000221173 9 129338809 129338909 -1 > 4 ENSG00000222961 5 32379501 32379581 -1 > 5 ENSG00000221058 18 51612956 51613026 -1 > > > ---------------------------- > Is there a quick way that I can convert the last 3 rows so that they reflect positions from the 5' strand? > many thanks! Note that Ensembl here is using the widely adopted convention (at UCSC, Ensembl, in GFF files, in Bioconductor, etc...) to report feature coordinates relatively to the 5' end of the plus strand (and to call "start position" the position of their leftmost base) even for features that belong to the minus strand. There are a lot of advantages to keep that representation all the way down your analysis. Anyway, if you really want to do that conversion, then you need to know the chromosome lengths for the current Hsapiens genome at Ensembl. You can easily get them with the devel version of the GenomicFeatures package (you need R-2.11 + BioC devel): Make a TranscriptDb object from the 'hsapiens_gene_ensembl' dataset: library(GenomicFeatures) txdb <- makeTranscriptDbFromBiomart() Then: > seqlengths(txdb)[1:6] 5 19 10 4 8 20 180915260 59128983 135534747 191154276 146364022 63025520 Then use something like (untested): idx <- which(mir.locs$strand == -1) idx2seqlengths <- seqlengths(txdb)[mir.locs$chromosome_name[idx]] tmp <- mir.locs$start_position[idx] mir.locs$start_position[idx] <- idx2seqlengths - mir.locs$end_position[idx] + 1L mir.locs$end_position[idx] <- idx2seqlengths - tmp + 1L Cheers, H. > > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor@stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M2-B876 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages@fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319 [[alternative HTML version deleted]]
0
Entering edit mode

Hi Hervé, Thanks for the explanation. I am a bit confused about the conventions used by Ensembl. In some cases (insertions), the coordinates in biomart are represented like this:

ID Chr start end strand
rs1554119750 1 5 156594898 156594897 1

Which convention is used in this case? Why is start > end?

0
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States
Hi Tim, Tim Smith wrote: > Apologies if this seems like a trivial question. > > I wanted to have a consistent set of locations and wanted to all the > locations to begin from the 5' end. How can I convert locations that > are given for the '-' strand? For example: > ----------------------------- library(biomaRt) > > mart.obj <- useMart(biomart = 'ensembl', dataset = > 'hsapiens_gene_ensembl') > > atb <- c('ensembl_gene_id', 'chromosome_name', 'start_position', > 'end_position', 'strand') > > mir.locs <- getBM(attributes=atb, filters="biotype", values="miRNA", > mart=mart.obj) mir.locs[1:5,] >> ensembl_gene_id chromosome_name start_position end_position strand > 1 ENSG00000222732 5 171706206 171706319 1 > 2 ENSG00000207864 9 97847727 97847823 1 > 3 ENSG00000221173 9 129338809 129338909 -1 > 4 ENSG00000222961 5 32379501 32379581 -1 > 5 ENSG00000221058 18 51612956 51613026 -1 > > > ---------------------------- Is there a quick way that I can convert > the last 3 rows so that they reflect positions from the 5' strand? > many thanks! I'm confused. You want the positions to be from the 5' end regardless of the strand? Wouldn't that make things less consistent? It seems counterintuitive to be counting from different ends of the chromosome. Anyway, if you really want something like that, you could use a combination of tapply() and mapply(). ## split data by chrom mir.list <- tapply(1:dim(mir.locs)[1], mir.locs$chromosome_name, function(x) mir.locs[x,]) ## reverse counting for '-' strand ## first make a list containing the length of each chromosome in your output - I leave that to the reader to figure out how to do that. Let's say it is called 'len.list' ## now a little function to do the reversing little.fun <- function(df, len){ df$start_position <- ifelse(df$strand == -1,len- df$end_position, df$start_position) df$end_position <- ifelse(df$strand == -1,len- df$start_position, df\$end_position) df } ## and use mapply to do the work mir.rev <- mapply(function(x,y) little.fun(mir.list, len.list), SIMPLIFY = FALSE) Note that this isn't tested, but it should be pretty close. Best, Jim > > > > [[alternative HTML version deleted]] > > _______________________________________________ Bioconductor mailing > list Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the > archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor -- James W. MacDonald, M.S. Biostatistician Douglas Lab University of Michigan Department of Human Genetics 5912 Buhl 1241 E. Catherine St. Ann Arbor MI 48109-5618 734-615-7826 ********************************************************** Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
0
Entering edit mode
@michael-watson-iah-c-378
Last seen 7.9 years ago
Begin from the 5' end of what? Chromosomes don't have a 5' "end" ________________________________________ From: bioconductor-bounces@stat.math.ethz.ch [bioconductor- bounces@stat.math.ethz.ch] On Behalf Of Tim Smith [tim_smith_666@yahoo.com] Sent: 05 April 2010 20:07 To: bioc Subject: [BioC] converting position from '-' strand to '+' strand Apologies if this seems like a trivial question. I wanted to have a consistent set of locations and wanted to all the locations to begin from the 5' end. How can I convert locations that are given for the '-' strand? For example: ----------------------------- library(biomaRt) mart.obj <- useMart(biomart = 'ensembl', dataset = 'hsapiens_gene_ensembl') atb <- c('ensembl_gene_id', 'chromosome_name', 'start_position', 'end_position', 'strand') mir.locs <- getBM(attributes=atb, filters="biotype", values="miRNA", mart=mart.obj) mir.locs[1:5,] > ensembl_gene_id chromosome_name start_position end_position strand 1 ENSG00000222732 5 171706206 171706319 1 2 ENSG00000207864 9 97847727 97847823 1 3 ENSG00000221173 9 129338809 129338909 -1 4 ENSG00000222961 5 32379501 32379581 -1 5 ENSG00000221058 18 51612956 51613026 -1 ---------------------------- Is there a quick way that I can convert the last 3 rows so that they reflect positions from the 5' strand? many thanks! [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor